---
type: math
---

## Data Basics

- **Variable Types:**
  - **Numeric**: Variables with numerical values.
  - **Categorical**: Variables with non-numerical values representing different categories.

## Mean vs. Median vs. Mode

### Mean

- **Formula:**

  $$
  \hat{x} = \frac{1}{N} \sum_{i=1}^{N} x_i
  $$

  where:
  - $\hat{x}$ represents the mean.
  - $N$ is the number of data points in the set.
  - $x_i$ represents each individual data point.
- **Use Cases:** The mean is best used when you want a single number that represents the typical value of a dataset and the data is **not heavily skewed by outliers.** For example, the mean is often used to report the average height or test score of a group.
- **Limitations:** The mean is sensitive to extreme values (outliers): a few very high or very low values can shift it significantly.

### Median

- **Definition:** The median is the middle value in a sorted dataset. If the dataset has an even number of values, the median is the average of the two middle values.
- **How do we find it?**: Sort the data and take the middle value (or the average of the two middle values).

  ![quick_sort](quick_sort.gif)
- **Use Cases:** The median is a robust measure of central tendency and is preferred when dealing with datasets that **contain outliers or have a skewed distribution.** It is often used to report housing prices or income distributions, where a few extreme values can significantly influence the mean.
- **Limitations:** The median may not accurately represent the center of a dataset if the distribution is bimodal or multimodal.

### Mode

- **Definition:** The mode is the most frequent value in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or more (multimodal).
- **How do we find it?**: Count the occurrences of each value and take the most frequent one, which takes $\mathcal{O}(n)$ time.
- **Use Cases:** The mode is suitable for both **numeric and categorical data** and is particularly useful for identifying the most common category or value. For example, the mode can be used to determine the most popular color of a product or the most frequent response in a survey.
- **Limitations:** The mode may not be a good representation of the center of a dataset when data is evenly distributed or when there are multiple modes with similar frequencies.

### Choosing the Right Measure

![](Pasted%20image%2020241121122306.png)

![](Pasted%20image%2020241121122345.png)

- **Symmetrical data:** If your data is symmetrical, unimodal, and has no outliers, the **mean, median, and mode will be similar**, and any of them can be used.
- **Skewed data with outliers:** If your data is skewed or contains outliers, the **median is a better measure** of central tendency than the mean.
- **Categorical data:** If your data is categorical, the **mode is the only appropriate measure** of central tendency.

## Data Summary

**Data summaries help to understand the main features of a dataset.**

- **Univariate Summary**: Summarizing a single variable.
  - **Quantiles/Percentiles:** Values that divide a sorted dataset into equal parts.
  - **Quartiles:** The 25th, 50th, and 75th percentiles (Q1, Q2, Q3). Together with the minimum (0%) and maximum (100%), they form the five-number summary.
  - **Interquartile Range (IQR):** The difference between the 3rd and 1st quartiles.
    - Measures data spread.
  - **Variance:** The average squared deviation of data points from the mean.
  - **Standard Deviation:** The square root of the variance; it expresses the typical deviation from the mean in the same units as the data.
  - **Visual Methods:**
    - **Bar Plot:** Represents categorical data with bars of varying heights.
    - **Pie Chart:** Represents categorical data as slices of a pie.
      - **Avoid using pie charts, as they are less effective than bar plots.**
    - **Histogram:** Shows the frequency distribution of a numeric variable.
    - **Box Plot:** Visualizes the quartiles, the interquartile range, and outliers.
      - Useful for comparing distributions across variables.
- **Multivariate Summary**: Summarizing the relationship between two or more variables.
  - **Numeric Methods:**
    - **Covariance:** Measures the joint variability of two numeric variables.
    - **Correlation:** Measures the strength and direction of the linear relationship between two numeric variables.
      - Values range from -1 to 1.
      - Closer to 1 or -1 indicates a stronger linear relationship.
      - Closer to 0 indicates a weaker linear relationship.
    - **Contingency Tables (Cross Tables):** Explore relationships between two categorical variables by showing absolute or conditional frequencies.
  - **Visual Methods:**
    - **Scatter Plot:** Displays the relationship between two numeric variables.
    - **Heat Map:** Visualizes a correlation matrix.
    - **Dodged, Stacked, Filled Bar Plots:** Represent categorical data with different bar arrangements.
    - **Mosaic Plot:** Visualizes contingency tables.

## Series

- **Series**: A numerical variable with an order induced by another variable (often time).
- **Time Series**: A series where the order variable is time.
- **Line Plot:** Visualizes a series.
- **Moving Average**: A common numerical summary method for series. Each value is replaced by the average of the values in a fixed-size window around it, which smooths out short-term fluctuations.
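
As a rough illustration of the moving average mentioned above, here is a minimal Python sketch of a simple (unweighted) moving average. The function name `moving_average`, the `window` parameter, and the sample `series` are assumptions made up for this example; the notes do not prescribe a window size or say whether the window is trailing or centered. This version averages each full window of `window` consecutive values.

```python
def moving_average(values, window):
    """Return the simple moving average of `values`.

    The i-th output is the mean of values[i : i + window], so the result
    has len(values) - window + 1 entries (only full windows are used).
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    if window > len(values):
        return []  # no full window fits into the series

    # Sum of the first full window.
    running_sum = sum(values[:window])
    averages = [running_sum / window]

    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest one.
        running_sum += values[i] - values[i - window]
        averages.append(running_sum / window)
    return averages


# Hypothetical series, chosen only to make the arithmetic easy to follow.
series = [2, 4, 6, 8, 10, 12]
smoothed = moving_average(series, window=3)
print(smoothed)  # [4.0, 6.0, 8.0, 10.0]
```

Keeping a running sum avoids re-adding every window from scratch, so the whole pass is $\mathcal{O}(n)$. A larger `window` gives a smoother result but reacts more slowly to changes in the series.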