---
type: math
---

## Data Basics

- **Variable Types:**
  - **Numeric**: Variables with numerical values.
  - **Categorical**: Variables with non-numerical values representing different categories.

## Mean vs. Median vs. Mode

### Mean

- **Formula:**

  $$ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i $$

  where:
  - $\bar{x}$ represents the mean.
  - $N$ is the number of data points in the set.
  - $x_i$ represents each individual data point.
- **Use Cases:** The mean works best when you want a single number that represents the typical value of a dataset and the data is **not heavily skewed or affected by outliers.** For example, it is often used to summarize heights or test scores.
- **Limitations:** The mean is sensitive to extreme values (outliers): a few very high or very low values can shift it substantially.

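
As a minimal sketch of the mean formula and of its sensitivity to outliers, the snippet below computes the mean of a small dataset with and without one extreme value. The function name `mean` and the sample scores are illustrative assumptions, not part of the original notes.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count N."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Hypothetical test scores, invented for illustration.
scores = [70, 72, 75, 78, 80]
print(mean(scores))        # 75.0

# A single extreme value pulls the mean far away from the typical score.
with_outlier = scores + [300]
print(mean(with_outlier))  # 112.5
```

One outlier moves the mean from 75.0 to 112.5, even though five of the six values lie between 70 and 80.
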
### Median

- **Definition:** The median is the middle value in a sorted dataset. If the dataset has an even number of values, the median is the average of the two middle values.
- **How do we find it?** Sort the $N$ values. If $N$ is odd, take the value at position $(N+1)/2$; if $N$ is even, average the values at positions $N/2$ and $N/2 + 1$.
- **Use Cases:** The median is a robust measure of central tendency and is preferred when dealing with datasets that **contain outliers or have a skewed distribution.** It is often used to report housing prices or income distributions, where a few extreme values can significantly influence the mean.
- **Limitations:** The median may not accurately represent the center of a dataset if the distribution is bimodal or multimodal, because the middle value can fall between the peaks, where few data points lie.

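
The definition above can be sketched as a short sort-based function. The name `median` and the sample house prices are assumptions made for illustration.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if N is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd N: exactly one middle element.
        return ordered[mid]
    # Even N: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical house prices (in thousands), with one extreme value.
prices = [250, 180, 210, 1900, 230]
print(median(prices))        # 230
print(median([4, 1, 3, 2]))  # 2.5
```

The extreme price of 1900 does not move the median, which is why it is described as robust. Sorting makes this approach $\mathcal{O}(n \log n)$.
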
### Mode

- **Definition:** The mode is the most frequent value in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or more (multimodal).
- **How do we find it?** Count how often each value occurs and pick the most frequent one. With a hash map this takes $\mathcal{O}(n)$ time.
- **Use Cases:** The mode is suitable for both **numeric and categorical data** and is particularly useful for identifying the most common category or value. For example, the mode can be used to determine the most popular color of a product or the most frequent response in a survey.
- **Limitations:** The mode may not be a good representation of the center of a dataset when data is evenly distributed or when there are multiple modes with similar frequencies.

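
The counting approach can be sketched with `collections.Counter` from the Python standard library. The helper name `modes` is an assumption; it returns every value tied for the highest count, so bimodal and multimodal data are handled too.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most often (one pass to count, one to select)."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Works for categorical data (hypothetical product colors)...
print(modes(["red", "blue", "red", "green"]))  # ['red']

# ...and for numeric data, including a bimodal case.
print(modes([1, 1, 2, 2, 3]))                  # [1, 2]
```
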
### Choosing the Right Measure

- **Symmetrical data:** If your data is symmetrical, has a single peak, and has no outliers, the **mean, median, and mode will be similar**, and any of them can be used.
- **Skewed data with outliers:** If your data is skewed or contains outliers, the **median is a better measure** of central tendency than the mean.
- **Categorical data:** If your data is categorical with no natural order, the **mode is the only appropriate measure** of central tendency.

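
To make these guidelines concrete, here is a small comparison using Python's standard `statistics` module. All three datasets are invented for illustration.

```python
import statistics

# Symmetric, single-peaked data: all three measures agree.
symmetric = [1, 2, 3, 3, 3, 4, 5]
print(statistics.mean(symmetric), statistics.median(symmetric), statistics.mode(symmetric))
# 3 3 3

# Skewed data with an outlier: the mean is pulled toward the extreme value.
skewed = [1, 2, 2, 3, 4, 5, 60]
print(statistics.mean(skewed), statistics.median(skewed), statistics.mode(skewed))
# 11 3 2

# Unordered categorical data: only the mode is meaningful.
colors = ["red", "blue", "red", "green"]
print(statistics.mode(colors))  # red
```
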
## Data Summary

**Data summaries help to understand the main features of a dataset.**

- **Univariate Summary**: Summarizing a single variable.
  - **Quantiles/Percentiles:** Values that divide a sorted dataset into equal-sized parts.
  - **Quartiles:** The 25th, 50th, and 75th percentiles. Together with the minimum (0%) and maximum (100%), they split the sorted data into four equal-sized parts.
  - **Interquartile Range (IQR):** The difference between the 3rd and 1st quartiles.
    - Measures data spread.
  - **Standard Deviation:** The square root of the variance; roughly, the typical distance of data points from the mean.
  - **Variance:** The average squared deviation of data points from the mean, i.e. the square of the standard deviation.
  - **Visual Methods:**
    - **Bar Plot:** Represents categorical data with bars of varying heights.
    - **Pie Chart:** Represents categorical data as slices of a pie.
      - **Prefer bar plots to pie charts: slice sizes are generally harder to compare than bar heights.**
    - **Histogram:** Shows the frequency distribution of a numeric variable.
    - **Box Plot:** Visualizes quartiles, the interquartile range, and outliers.
      - Useful for comparing several summary statistics across variables.
- **Multivariate Summary**: Summarizing the relationship between two or more variables.
  - **Numeric Methods:**
    - **Covariance:** Measures the joint variability of two numeric variables.
    - **Correlation:** Measures the strength and direction of the linear relationship between two numeric variables.
      - Values range from -1 to 1.
      - Closer to 1 or -1 indicates a stronger linear relationship.
      - Closer to 0 indicates a weaker linear relationship.
    - **Contingency Tables (Cross Tables):** Explore relationships between two categorical variables by showing absolute or conditional frequencies.
  - **Visual Methods:**
    - **Scatter Plot:** Displays the relationship between two numeric variables.
    - **Heat Map:** Visualizes a correlation matrix.
    - **Dodged, Stacked, Filled Bar Plots:** Represent categorical data with different bar arrangements.
    - **Mosaic Plot:** Visualizes contingency tables.

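
The numeric summaries listed above can be sketched with Python's standard `statistics` module. Several choices here are assumptions: the notes do not say whether population or sample formulas are intended (this sketch divides by $N$), quartile conventions differ between tools (`method="inclusive"` is one of them), and the helper names and sample data are invented for illustration.

```python
import math
import statistics


def covariance(xs, ys):
    """Population covariance: the mean product of paired deviations from the means."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / (statistics.pstdev(xs) * statistics.pstdev(ys))


# Hypothetical paired observations: hours studied and test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

# Univariate summary of `hours`.
q1, q2, q3 = statistics.quantiles(hours, n=4, method="inclusive")
iqr = q3 - q1                            # spread of the middle half of the data
variance = statistics.pvariance(hours)   # average squared deviation from the mean
std_dev = statistics.pstdev(hours)       # square root of the variance
print(q1, q2, q3, iqr, variance, round(std_dev, 3))

# Multivariate summary of `hours` against `scores`.
cov = covariance(hours, scores)
corr = correlation(hours, scores)
print(round(cov, 2), round(corr, 3))
```

Covariance depends on the units of both variables, while correlation is unit-free and always falls between -1 and 1, which makes it easier to compare across pairs of variables.
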
## Series

- **Series**: A numerical variable with an order induced by another variable (often time).
- **Time Series**: Series where the order variable is time.
- **Line Plot:** Visualizes series.
- **Moving Average**: A common numerical summary method for series. Each output value is the mean of the values inside a sliding window, which smooths out short-term fluctuations.
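
A minimal sketch of a simple moving average, assuming a fixed-size window that slides one step at a time. The notes do not specify a variant, so the function name, the `window` parameter, and the sample data are illustrative.

```python
def moving_average(series, window):
    """Simple moving average: the mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    # Start with the sum of the first window, then slide it one step at a time.
    window_sum = sum(series[:window])
    averages = [window_sum / window]
    for i in range(window, len(series)):
        # Add the value entering the window and drop the one leaving it.
        window_sum += series[i] - series[i - window]
        averages.append(window_sum / window)
    return averages


# Hypothetical daily measurements.
print(moving_average([1, 2, 3, 4, 5], window=2))  # [1.5, 2.5, 3.5, 4.5]
```

The result has `len(series) - window + 1` values, since the first window needs `window` observations before it can produce an average.
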