2024-12-07 21:07:38 +01:00

type
math

Data Basics

  • Variable Types:
    • Numeric: Variables with numerical values.
    • Categorical: Variables with non-numerical values representing different categories.

Mean vs. Median vs. Mode

Mean

  • Formula:

    \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i

    where:

    • \bar{x} represents the mean.
    • N is the number of data points in the set.
    • x_i represents each individual data point.
  • Use Cases: The mean works best as a single number for the typical value of a dataset when the data is roughly symmetric and not distorted by outliers. For example, the mean is often used to report average height or test scores.

  • Limitations: The mean is sensitive to extreme values (outliers), meaning that a few very high or very low values can significantly affect the mean.
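
To make the formula and the outlier limitation concrete, here is a minimal Python sketch. The function name `mean` and the test scores are made up for illustration; they are not part of the original notes.

```python
def mean(values):
    """Arithmetic mean: (1/N) * sum of all x_i."""
    if not values:
        raise ValueError("mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Hypothetical test scores, invented for this example.
scores = [70, 72, 68, 74, 71]
mean_scores = mean(scores)              # 355 / 5 = 71.0

# A single very low value pulls the mean far away from the bulk of the data.
with_outlier = scores + [5]
mean_with_outlier = mean(with_outlier)  # 360 / 6 = 60.0
```

One extra low score moves the mean from 71.0 to 60.0, although five of the six values lie between 68 and 74.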

Median

  • Definition: The median is the middle value in a sorted dataset. If the dataset has an even number of values, the median is the average of the two middle values.
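  • How do we find it?: Sort the data (e.g. with quicksort, \mathcal{O}(n \log n) on average) and take the middle element. A selection algorithm such as quickselect finds the median in \mathcal{O}(n) expected time without a full sort.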
  • Use Cases: The median is a robust measure of central tendency and is preferred when dealing with datasets that contain outliers or have a skewed distribution. It is often used to report housing prices or income distributions, where a few extreme values can significantly influence the mean.
  • Limitations: The median may not represent the center of a dataset well if the distribution is bimodal or multimodal, because the middle value can fall in a sparse region between the peaks.
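
The two ways of finding the median can be sketched as follows. This is an illustrative sketch, not an optimized implementation: the function names and the sample prices are assumptions, and the quickselect variant uses extra lists for clarity instead of partitioning in place.

```python
import random


def median_by_sorting(values):
    """Sort, then take the middle value (or average the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quickselect(values, k):
    """Return the k-th smallest element (0-based) in expected linear time."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        smaller = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        larger = [x for x in items if x > pivot]
        if k < len(smaller):
            items = smaller
        elif k < len(smaller) + len(equal):
            return pivot
        else:
            k -= len(smaller) + len(equal)
            items = larger


def median_by_quickselect(values):
    """Median without a full sort, using one or two selections."""
    n = len(values)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        return quickselect(values, n // 2)
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2


# Hypothetical housing prices (in thousands) with one extreme value.
prices = [250, 310, 275, 290, 5000]
median_price = median_by_sorting(prices)  # 290, unaffected by the 5000
```

Both functions return the same result; the sorting version is simpler, while quickselect avoids ordering the whole dataset.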

Mode

  • Definition: The mode is the most frequent value in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or more (multimodal).
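  • How?: Count how often each value occurs (e.g. with a hash map) and take the value with the highest count, which takes \mathcal{O}(n) expected time.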
  • Use Cases: The mode is suitable for both numeric and categorical data and is particularly useful for identifying the most common category or value. For example, the mode can be used to determine the most popular color of a product or the most frequent response in a survey.
  • Limitations: The mode may not be a good representation of the center of a dataset when data is evenly distributed or when there are multiple modes with similar frequencies.
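
The counting approach can be sketched with `collections.Counter` from the Python standard library. The helper name `modes` and the sample data are assumptions made for this example; the helper returns every value tied for the highest count, so bimodal and multimodal data are handled too.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency."""
    counts = Counter(values)          # one pass over the data
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Works for categorical data ...
colors = ["red", "blue", "red", "green", "blue", "red"]
color_modes = modes(colors)           # ["red"]

# ... and for numeric data with more than one mode.
bimodal = modes([1, 1, 2, 2, 3])      # [1, 2]
```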

Choosing the Right Measure

  • Symmetrical data: If your data is symmetrical, unimodal, and has no outliers, the mean, median, and mode will be similar, and any of them can be used.
  • Skewed data or outliers: If your data is skewed or contains outliers, the median is a better measure of central tendency than the mean.
  • Categorical data: If your data is categorical without a natural order (nominal), the mode is the only appropriate measure of central tendency. For ordered categories (ordinal), the median can also be used.

Data Summary

Data summaries help to understand the main features of a dataset.

  • Univariate Summary: Summarizing a single variable.
    • Numeric Methods:
      • Quantiles/Percentiles: Cut points that divide a sorted dataset into equal-sized parts.
      • Quartiles: The 25%, 50%, and 75% percentiles; the 0% and 100% percentiles are the minimum and maximum.
      • Interquartile Range (IQR): The difference between the 3rd and 1st quartiles.
        • Measures the spread of the middle 50% of the data.
      • Variance: The average squared deviation of data points from the mean.
      • Standard Deviation: The square root of the variance; a typical distance of data points from the mean, in the same units as the data.
    • Visual Methods:
      • Bar Plot: Represents categorical data with bars of varying heights.
      • Pie Chart: Represents categorical data as slices of a pie.
        • Prefer bar plots over pie charts: angles and areas are harder to compare than bar heights.
      • Histogram: Shows the frequency distribution of a numeric variable.
      • Box Plot: Visualizes the quartiles, the interquartile range, and outliers.
        • Useful for comparing distributions across several variables or groups.
  • Multivariate Summary: Summarizing the relationship between two or more variables.
    • Numeric Methods:
      • Covariance: Measures the joint variability of two numeric variables.
      • Correlation: Measures the strength and direction of the linear relationship between two numeric variables.
        • Values range from -1 to 1.
        • Closer to 1 or -1 indicates a stronger relationship.
        • Closer to 0 indicates a weaker relationship.
      • Contingency Tables (Cross Tables): Explore relationships between two categorical variables by showing absolute or conditional frequencies.
    • Visual Methods:
      • Scatter Plot: Displays the relationship between two numeric variables.
      • Heat Map: Visualizes a correlation matrix.
      • Dodged, Stacked, Filled Bar Plots: Represent categorical data with different bar arrangements.
      • Mosaic Plot: Visualizes contingency tables.
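
The numeric summaries above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the data is made up, the helper names `covariance` and `correlation` are introduced here for clarity, and the sample (N - 1) versions of variance and covariance are used. Quartile conventions differ between tools; `statistics.quantiles` with `method="inclusive"` is one of several common choices.

```python
import math
import statistics

# Hypothetical paired measurements, invented for this example.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Univariate summaries of x.
q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1                        # spread of the middle 50%
variance_x = statistics.variance(x)  # sample variance (divides by N - 1)
std_x = statistics.stdev(x)          # square root of the sample variance


def covariance(a, b):
    """Sample covariance: average product of deviations from the means."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    products = ((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return sum(products) / (len(a) - 1)


def correlation(a, b):
    """Pearson correlation: covariance rescaled to the range [-1, 1]."""
    return covariance(a, b) / (statistics.stdev(a) * statistics.stdev(b))


cov_xy = covariance(x, y)    # positive: x and y increase together
corr_xy = correlation(x, y)  # close to 1: y is an exact linear function of x
```

Covariance depends on the units of both variables, whereas correlation is unit-free, which is why correlation is the easier of the two to compare across variable pairs.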

Series

  • Series: A numerical variable with an order induced by another variable (often time).
    • Time Series: Series where the order variable is time.
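    • Line Plot: Visualizes a series by connecting consecutive values in order.
    • Moving Average: A common numerical summary method for series; it replaces each window of k consecutive values with their mean, which smooths out short-term fluctuations.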
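
A simple (unweighted, trailing) moving average can be sketched as below. The function name, the window size, and the sample series are assumptions for illustration; other variants such as centered or weighted windows exist and are not shown.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values, in order."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    window_sum = sum(series[:window])
    averages = [window_sum / window]
    for i in range(window, len(series)):
        # Slide the window: add the new value, drop the oldest one.
        window_sum += series[i] - series[i - window]
        averages.append(window_sum / window)
    return averages


# Hypothetical daily measurements.
daily = [10, 12, 14, 13, 15, 17]
smoothed = moving_average(daily, window=3)  # [12.0, 13.0, 14.0, 15.0]
```

The result has `len(series) - window + 1` values, because the first full window only becomes available at position `window`.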