2024-12-07 21:07:38 +01:00


---
type: math
---
## Data Basics
- **Variable Types:**
- **Numeric**: Variables with numerical values.
- **Categorical**: Variables with non-numerical values representing different categories.
## Mean vs. Median vs. Mode
### Mean
- **Formula:**
$$ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i $$
where:
- $\bar{x}$ represents the mean.
- $N$ is the number of data points in the set.
- $x_i$ represents each individual data point.
- **Use Cases:** The mean is best used when you want a single number that represents the typical value of a dataset and the data is **not heavily skewed by outliers.** For example, the mean works well for heights or test scores, which are often roughly symmetric.
- **Limitations:** The mean is sensitive to extreme values (outliers), meaning that a few very high or very low values can significantly affect the mean.
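
The formula and the outlier sensitivity described above can be sketched in a few lines of Python. The function name `mean` and the test scores are made up for illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count N."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Hypothetical test scores, invented for this example.
scores = [70, 72, 75, 78, 80]
print(mean(scores))          # 75.0

# A single extreme value pulls the mean far away from the typical score.
print(mean(scores + [300]))  # 112.5
```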
### Median
- **Definition:** The median is the middle value in a sorted dataset. If the dataset has an even number of values, the median is the average of the two middle values.
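- **How do we find it?** Sort the data, for example with quicksort (average $\mathcal{O}(n \log n)$), then pick the middle element: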
![quick_sort](quick_sort.gif)
- **Use Cases:** The median is a robust measure of central tendency and is preferred when dealing with datasets that **contain outliers or have a skewed distribution.** It is often used to report housing prices or income distributions, where a few extreme values can significantly influence the mean.
- **Limitations:** The median may not accurately represent the center of a dataset if the distribution is bimodal or multimodal.
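
A minimal sketch of the sort-then-pick-the-middle procedure follows. It relies on Python's built-in `sorted` instead of a hand-written quicksort, and the function name and data are illustrative assumptions.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if N is even."""
    if not values:
        raise ValueError("the median of an empty dataset is undefined")
    ordered = sorted(values)  # sorting dominates the cost: O(n log n)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical test scores, invented for this example.
print(median([70, 72, 75, 78, 80]))       # 75
# The same outlier that distorts the mean barely moves the median.
print(median([70, 72, 75, 78, 80, 300]))  # 76.5
```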
### Mode
- **Definition:** The mode is the most frequent value in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or more (multimodal).
- **How?** Count the occurrences of each value and take the most frequent one; with a hash map this takes $\mathcal{O}(n)$ time.
- **Use Cases:** The mode is suitable for both **numeric and categorical data** and is particularly useful for identifying the most common category or value. For example, the mode can be used to determine the most popular color of a product or the most frequent response in a survey.
- **Limitations:** The mode may not be a good representation of the center of a dataset when data is evenly distributed or when there are multiple modes with similar frequencies.
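
The counting approach can be sketched with `collections.Counter`. Returning a list of all tied values is a design choice made here so that bimodal and multimodal data are handled too; the name `modes` and the sample data are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return every most-frequent value (one for unimodal data, several otherwise)."""
    if not values:
        raise ValueError("the mode of an empty dataset is undefined")
    counts = Counter(values)  # one pass over the data: O(n)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Works for categorical data...
print(modes(["red", "blue", "red", "green"]))  # ['red']
# ...and for numeric data with more than one mode.
print(modes([1, 1, 2, 2, 3]))                  # [1, 2]
```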
### Choosing the Right Measure
![](Pasted%20image%2020241121122306.png)
![](Pasted%20image%2020241121122345.png)
- **Symmetrical data:** If your data is symmetrical, unimodal, and has no outliers, the **mean, median, and mode will be similar**, and any of them can be used.
- **Skewed data with outliers:** If your data is skewed or contains outliers, the **median is a better measure** of central tendency than the mean.
- **Categorical data:** If your data is categorical with no natural order (nominal), the **mode is the only appropriate measure** of central tendency; ordered categories also admit a median.
## Data Summary
**Data summaries help to understand the main features of a dataset.**
- **Univariate Summary**: Summarizing a single variable.
- **Quantiles/Percentiles:** Values that divide a sorted dataset into equal parts.
- **Quartiles:** The 25th, 50th, and 75th percentiles (Q1, Q2 = median, Q3). The 0th and 100th percentiles are the minimum and maximum.
- **Interquartile Range (IQR):** The difference between the 3rd and 1st quartiles.
- Measures data spread.
- **Variance:** The average squared deviation of the data points from the mean.
- **Standard Deviation:** The square root of the variance; it expresses the typical deviation from the mean in the same units as the data.
- **Visual Methods:**
- **Bar Plot:** Represents categorical data with bars of varying heights.
- **Pie Chart:** Represents categorical data as slices of a pie.
- **Prefer bar plots over pie charts:** bar lengths are easier to compare than slice angles or areas.
- **Histogram:** Shows the frequency distribution of a numeric variable.
- **Box Plot:** Visualizes the quartiles, the interquartile range, and outliers.
- Useful for comparing distributions across variables or groups.
- **Multivariate Summary**: Summarizing the relationship between two or more variables.
- **Numeric Methods:**
- **Covariance:** Measures the joint variability of two numeric variables.
- **Correlation:** Measures the strength and direction of the linear relationship between two numeric variables.
- Values range from -1 to 1.
- Closer to 1 or -1 indicates a stronger linear relationship.
- Closer to 0 indicates a weaker linear relationship; a nonlinear relationship may still exist.
- **Contingency Tables (Cross Tables):** Explore relationships between two categorical variables by showing absolute or conditional frequencies.
- **Visual Methods:**
- **Scatter Plot:** Displays the relationship between two numeric variables.
- **Heat Map:** Visualizes a correlation matrix.
- **Dodged, Stacked, Filled Bar Plots:** Represent categorical data with different bar arrangements.
- **Mosaic Plot:** Visualizes contingency tables.
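
The numeric summaries listed above can be sketched with Python's standard `statistics` module. Several choices here are assumptions, not something these notes prescribe: the population formulas (dividing by $N$, matching the mean formula), the `"inclusive"` quantile method, the helper names `covariance` and `correlation`, and the made-up data.

```python
import statistics


def covariance(xs, ys):
    """Population covariance: mean product of the deviations from each mean."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.pstdev(xs) * statistics.pstdev(ys))


# Hypothetical data: hours studied and the resulting test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

# Univariate summary of the scores.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1
variance = statistics.pvariance(scores)
std_dev = statistics.pstdev(scores)
print(q1, q2, q3, iqr)
print(variance, round(std_dev, 3))

# Multivariate summary of hours vs. scores.
print(round(covariance(hours, scores), 3))
print(round(correlation(hours, scores), 3))
```

With sample formulas (dividing by $N - 1$) the variance and covariance would come out slightly larger, while the correlation stays the same because the factor cancels.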
## Series
- **Series**: A numerical variable with an order induced by another variable (often time).
- **Time Series**: Series where the order variable is time.
- **Line Plot:** Visualizes series.
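- **Moving Average**: A common numerical summary method for series; it replaces each value with the mean of the values inside a sliding window, smoothing out short-term fluctuations.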
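
As a rough illustration, a simple moving average can be computed by averaging each run of consecutive values. The function name, the `window` parameter, and the temperature readings are assumptions made for this sketch; other variants (centered or weighted windows) exist.

```python
def moving_average(series, window):
    """Simple moving average: the mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical daily temperatures, ordered by time.
temps = [10, 12, 11, 16, 15, 14]
print(moving_average(temps, window=3))  # [11.0, 13.0, 14.0, 15.0]
```

The result has `len(series) - window + 1` entries, since the first full window only becomes available after `window` observations.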