This commit is contained in:
2024-12-07 21:07:38 +01:00
parent 2fded76a5c
commit a9676272f2
120 changed files with 15925 additions and 1 deletions


@ -0,0 +1,87 @@
---
type: math
---
## Data Basics
- **Variable Types:**
- **Numeric**: Variables with numerical values.
- **Categorical**: Variables with non-numerical values representing different categories.
## Mean vs. Median vs. Average
### Mean
- **Formula:**
$$ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i $$
where:
	- $\bar{x}$ represents the mean.
	- $N$ is the number of data points in the set.
	- $x_i$ represents each individual data point.
- **Use Cases:** The mean is best used when you want a single number that represents the typical value of a dataset and the data is **not heavily skewed by outliers.** For example, the mean is often reported for heights or test scores.
- **Limitations:** The mean is sensitive to extreme values (outliers): a few very high or very low values can pull it far from the typical value (see the sketch after the Mode section).
### Median
- **Definition:** The median is the middle value in a sorted dataset. If the dataset has an even number of values, the median is the average of the two middle values.
- **How do we find it?** Sort the data (e.g., with quicksort, $\mathcal{O}(n \log n)$ on average) and take the middle element:
![quick_sort](quick_sort.gif)
- **Use Cases:** The median is a robust measure of central tendency and is preferred when dealing with datasets that **contain outliers or have a skewed distribution.** It is often used to report housing prices or income distributions, where a few extreme values can significantly influence the mean.
- **Limitations:** The median may not accurately represent the center of a dataset if the distribution is bimodal or multimodal.
### Mode
- **Definition:** The mode is the most frequent value in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or more (multimodal).
- **How?** Count the occurrences of each value in a single pass: $\mathcal{O}(n)$.
- **Use Cases:** The mode is suitable for both **numeric and categorical data** and is particularly useful for identifying the most common category or value. For example, the mode can be used to determine the most popular color of a product or the most frequent response in a survey.
- **Limitations:** The mode may not be a good representation of the center of a dataset when data is evenly distributed or when there are multiple modes with similar frequencies.
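A minimal sketch of all three measures using Python's standard `statistics` module; the income-like numbers are made up, chosen so that a single outlier drags the mean while the median stays put:

```python
import statistics

# Hypothetical incomes with one extreme outlier at the end.
incomes = [30_000, 32_000, 35_000, 38_000, 40_000, 1_000_000]

print(statistics.mean(incomes))    # 195833.33... -- dragged up by the outlier
print(statistics.median(incomes))  # 36500.0      -- robust to the outlier

# The mode also works for categorical data; multimode returns every mode.
print(statistics.multimode(["red", "blue", "blue", "green", "red"]))  # ['red', 'blue']
```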
### Choosing the Right Measure
![](Pasted%20image%2020241121122306.png)
![](Pasted%20image%2020241121122345.png)
- **Symmetrical data:** If your data is symmetrical and has no outliers, the **mean, median, and mode will be similar**, and any of them can be used.
- **Skewed data with outliers:** If your data is skewed or contains outliers, the **median is a better measure** of central tendency than the mean.
- **Categorical data:** If your data is categorical, the **mode is the only appropriate measure** of central tendency.
## Data Summary
**Data summaries help us understand the main features of a dataset.**
- **Univariate Summary**: Summarizing a single variable.
- **Quantiles/Percentiles:** Values that divide a sorted dataset into equal parts.
- **Quartiles:** Specific percentiles (0%, 25%, 50%, 75%, 100%).
- **Interquartile Range (IQR):** The difference between the 3rd and 1st quartiles.
- Measures data spread.
- **Standard Deviation:** The square root of the average squared deviation of data points from the mean (computed in the sketch at the end of this section).
- **Variance:** The average squared deviation itself, i.e., the square of the standard deviation.
- **Visual Methods:**
- **Bar Plot:** Represents categorical data with bars of varying heights.
- **Pie Chart:** Represents categorical data as slices of a pie.
- **Avoid using pie charts as they are less effective than bar plots.**
- **Histogram:** Shows the frequency distribution of a numeric variable.
- **Box Plot:** Visualizes the quartiles, the interquartile range, and outliers.
- Useful for comparing multiple statistics across variables.
- **Multivariate Summary**: Summarizing the relationship between two or more variables.
- **Numeric Methods:**
- **Covariance:** Measures the joint variability of two numeric variables.
- **Correlation:** Measures the strength and direction of the linear relationship between two numeric variables.
- Values range from -1 to 1.
- Closer to 1 or -1 indicates a stronger relationship.
- Closer to 0 indicates a weaker relationship.
- **Contingency Tables (Cross Tables):** Explore relationships between two categorical variables by showing absolute or conditional frequencies.
- **Visual Methods:**
- **Scatter Plot:** Displays the relationship between two numeric variables.
- **Heat Map:** Visualizes a correlation matrix.
- **Dodged, Stacked, Filled Bar Plots:** Represent categorical data with different bar arrangements.
- **Mosaic Plot:** Visualizes contingency tables.
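A compact sketch of both kinds of numeric summary: the standard `statistics` module for the univariate side (`statistics.quantiles` needs Python 3.8+) and NumPy for the bivariate side. All numbers are made up:

```python
import statistics
import numpy as np

# --- Univariate: one made-up numeric variable ---
data = [2, 4, 4, 4, 5, 5, 7, 9]
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
print(q3 - q1)                    # IQR: spread of the middle 50% of the data
print(statistics.stdev(data))     # sample standard deviation
print(statistics.variance(data))  # its square

# --- Multivariate: two made-up variables with a near-linear relationship ---
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
print(np.cov(x, y)[0, 1])       # covariance: joint variability (unit-dependent)
print(np.corrcoef(x, y)[0, 1])  # correlation: scaled to [-1, 1], here close to 1
```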
## Series
- **Series**: A numerical variable with an order induced by another variable (often time).
- **Time Series**: Series where the order variable is time.
- **Line Plot:** Visualizes series.
- **Moving Average**: A common numerical summary method for series.
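A moving average can be sketched as a convolution with a uniform kernel; the short series below is made up, and a window of 3 replaces each position with the mean of its window:

```python
import numpy as np

series = np.array([3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0])  # made-up series
window = 3

# Convolving with a uniform kernel averages each length-3 window.
kernel = np.ones(window) / window
print(np.convolve(series, kernel, mode="valid"))  # [4. 5. 6. 7. 8.]
```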


@ -0,0 +1,119 @@
---
type: math
---
### What is Probability?
Probability measures uncertainty and is used to create mathematical models for events with uncertain outcomes. While probability theory focuses on building these models, statistics deals with collecting data and comparing it to the models to assess how well they align with reality.
---
### Key Milestones in Probability
Here are some major milestones in the history of probability:
- **Gerolamo Cardano (16th century):** Introduced basic probability concepts in the context of gambling.
- **Blaise Pascal & Pierre de Fermat (17th century):** Developed foundational principles of probability, also inspired by games of chance.
- **Jacob Bernoulli (17th century):** Pioneered ideas in statistical inference and introduced Bernoulli trials (experiments with two outcomes).
- **Abraham de Moivre & Pierre-Simon Laplace (18th century):** Developed the normal distribution and central limit theorem, critical tools in modern probability and statistics.
- **Thomas Bayes (18th century):** Formulated Bayes' Theorem, a key method for updating beliefs based on new evidence.
- **Andrey Kolmogorov (20th century):** Formalized probability theory using set theory, creating the modern framework we use today.
---
### Sample Space and Events
- **Sample Space (Ω):** The set of all possible outcomes in an experiment.
- **Event (ε):** A subset of the sample space, representing a specific outcome or group of outcomes.
**Example:**
For a single roll of a die:
- Sample Space: Ω = {1, 2, 3, 4, 5, 6}
- Event (e.g., rolling an even number): ε = {2, 4, 6}
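Since events are just subsets of the sample space, plain Python sets can model them, with set operations giving complements, intersections, and unions:

```python
omega = {1, 2, 3, 4, 5, 6}  # sample space for one roll of a die
even = {2, 4, 6}            # event: roll an even number
big = {4, 5, 6}             # event: roll at least 4

print(omega - even)  # complement: {1, 3, 5}
print(even & big)    # intersection: {4, 6}
print(even | big)    # union: {2, 4, 5, 6}
```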
---
### Probability Frameworks
There are three main approaches to defining probability:
1. **Classical Probability:**
Assumes all outcomes are equally likely. The probability of an outcome is $\frac{1}{\text{total outcomes}}$.
**Example:** Rolling a fair die: each number has a probability of $\frac{1}{6}$.
2. **Frequentist Probability:**
Defines probability based on the relative frequency of an outcome in repeated trials.
**Example:** If you flip a coin many times, the proportion of heads approximates the probability of heads (simulated in the sketch after this list).
3. **Bayesian Probability:**
Views probability as a degree of belief, incorporating prior knowledge and updating it based on new evidence.
**Example:** Using weather forecasts and personal experience to estimate the chance of rain tomorrow.
**Limitations:**
- Frequentist methods don't work for one-time events (e.g., predicting the chance of life on Mars).
- Classical probability is unsuitable for infinite or unequal sample spaces.
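A small simulation of the frequentist idea referenced above: as the number of coin flips grows, the relative frequency of heads settles near the true probability 0.5 (the seed is arbitrary, just for reproducible output):

```python
import random

random.seed(0)  # arbitrary seed for reproducibility

for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)  # relative frequency drifts toward 0.5
```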
---
### Axioms of Probability
Probability is formally defined using these axioms:
1. **Non-negativity:** $0 \leq P(ε) \leq 1$.
2. **Certainty:** $P(Ω) = 1$.
3. **Additivity:** For mutually exclusive events, $P(ε_1 \cup ε_2) = P(ε_1) + P(ε_2)$.
These axioms ensure that probabilities are consistent and logically sound.
---
### Key Properties of Probability
From the axioms, we can derive useful properties:
- $P(\emptyset) = 0$: The probability of an impossible event is zero.
- $P(ε^c) = 1 - P(ε)$: The probability of an event not happening is 1 minus the probability of it happening.
- $P(ε_1 \cup ε_2) = P(ε_1) + P(ε_2) - P(ε_1 \cap ε_2)$: The probability of either event occurring accounts for their overlap.
- **Monotonicity:** If $ε_1$ is a subset of $ε_2$, then $P(ε_1) \leq P(ε_2)$.
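These properties can be checked mechanically on the die events from earlier; a short sketch verifying the union and complement rules under the classical (equally likely) framework:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
e1, e2 = {2, 4, 6}, {4, 5, 6}  # even; at least 4

def p(event):
    # Classical probability: |event| / |omega|
    return Fraction(len(event), len(omega))

# P(e1 ∪ e2) = P(e1) + P(e2) - P(e1 ∩ e2)
print(p(e1 | e2))                  # 2/3
print(p(e1) + p(e2) - p(e1 & e2))  # 2/3
print(p(omega - e1) == 1 - p(e1))  # complement rule: True
```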
---
### Conditional Probability
Conditional probability examines the likelihood of an event given that another event has occurred.
The formula is:
$$
P(ε|H) = \frac{P(ε \cap H)}{P(H)}
$$
It satisfies all the axioms of probability and forms the basis of important rules, such as:
- **Law of Total Probability:** Breaks down the probability of an event into parts based on a partition of the sample space: for a partition $H_1, \dots, H_n$ of $Ω$, $P(ε) = \sum_{i=1}^{n} P(ε|H_i)P(H_i)$.
[Conditional Probability Visualization](https://setosa.io/conditional/)
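Continuing the die example: conditioning on $H$ (the roll is at least 4) shrinks the sample space to $H$, and the formula above reduces to counting within it:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
h = {4, 5, 6}  # given: the roll is at least 4

def p(event):
    return Fraction(len(event), len(omega))

# P(even | H) = P(even ∩ H) / P(H)
print(p(even & h) / p(h))  # 2/3: two of the three outcomes in H are even
```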
---
### Bayes' Theorem
Bayes' Theorem updates the probability of an event based on new information:
$$
P(H|ε) = \frac{P(ε|H)P(H)}{P(ε)}
$$
Where:
- $P(H)$: Prior belief about event $H$.
- $P(ε|H)$: Likelihood of observing $ε$ if $H$ is true.
- $P(H|ε)$: Updated belief after observing $ε$.
**Bayes' Theorem Formula Visualization:**
![Bayes' Theorem](Bayes_Theorem-1813835086.gif)
This is particularly useful for analyzing rare events and understanding false positives in testing.
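A sketch of exactly that false-positive situation; the prevalence, sensitivity, and false-positive rate below are invented for illustration:

```python
p_h = 0.01              # P(H): prior -- prevalence of the condition (assumed)
p_e_given_h = 0.99      # P(e|H): positive test given the condition (assumed)
p_e_given_not_h = 0.05  # false-positive rate (assumed)

# Law of total probability: overall chance of a positive test.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes' theorem: posterior probability of the condition given a positive test.
print(p_e_given_h * p_h / p_e)  # ~0.167: most positives are false positives
```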
---
### Independence of Events
Events are **independent** if one event occurring does not affect the probability of the other. Mathematically:
$$
P(ε_1 \cap ε_2) = P(ε_1)P(ε_2)
$$
This concept can extend to multiple events:
- **Pairwise Independence:** Every pair of events in the set is independent.
- **Mutual Independence:** Every combination of events in the set satisfies the product rule, not just pairs; mutual independence implies pairwise independence, but not conversely (see the sketch below).
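A classic two-coin sketch of the distinction just described: the three events below are independent in every pair, yet not mutually independent:

```python
from fractions import Fraction
from itertools import product

omega = set(product("HT", repeat=2))    # two fair coin flips
a = {o for o in omega if o[0] == "H"}   # first flip is heads
b = {o for o in omega if o[1] == "H"}   # second flip is heads
c = {o for o in omega if o[0] == o[1]}  # both flips match

def p(event):
    return Fraction(len(event), len(omega))

# Every pair satisfies the product rule...
print(p(a & b) == p(a) * p(b),
      p(a & c) == p(a) * p(c),
      p(b & c) == p(b) * p(c))  # True True True
# ...but the triple does not: P(a ∩ b ∩ c) = 1/4, while the product is 1/8.
print(p(a & b & c) == p(a) * p(b) * p(c))  # False
```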
