diff --git a/.obsidian/workspace.json b/.obsidian/workspace.json
index d4b2c51..c19a931 100644
--- a/.obsidian/workspace.json
+++ b/.obsidian/workspace.json
@@ -201,6 +201,11 @@
   },
   "active": "96f5fe23af86a273",
   "lastOpenFiles": [
+    "Pasted image 20250113151159.png",
+    "Advanced Algorithms/Graph Algorithms.md",
+    "Advanced Algorithms/Graphs.md",
+    "Introduction to Machine Learning/Introductory lecture.md",
+    "Introduction to Machine Learning/image.png",
     "Extracurricular/Circuitree/Committee Market/Macro pad.md",
     "Extracurricular/Circuitree/Committee Market/discussion/Committee market ideas.md",
     "Extracurricular/Circuitree/Committee Market/discussion/CA.md",
@@ -208,7 +213,6 @@
     "Extracurricular/Misc/Proposed Routine Plan.canvas",
     "Extracurricular/Misc/Ideas.md",
     "Functional Programming/Eq and Num.md",
-    "Introduction to Machine Learning/Introductory lecture.md",
     "Functional Programming/Proofs.md",
     "Operating Systems/Introductory lecture.md",
     "Discrete Structures/Relations and Digraphs.md",
@@ -221,7 +225,6 @@
     "Operating Systems/assets/image.png",
     "Operating Systems/image.png",
     "Operating Systems/assets",
-    "Pasted image 20250113151159.png",
     "conflict-files-obsidian-git.md",
     "Statistics and Probability/Mock exam run 1.md",
     "Operating Systems",
@@ -234,17 +237,13 @@
     "Discrete Structures/Midterm/attempt 2.md",
     "Discrete Structures/Midterm/attempt 1.md",
     "Discrete Structures/Midterm/Untitled.md",
-    "Discrete Structures/Midterm/Midterm prep.md",
     "Discrete Structures/Midterm",
     "Extracurricular/satQuest/img/Pasted image 20241206134156.png",
-    "Extracurricular/satQuest/Parts Proposal.md",
     "Untitled.canvas",
-    "Discrete Structures/Mathematical Data Structures.md",
     "Advanced Algorithms/Pasted image 20241203234600.png",
     "Excalidraw",
     "Extracurricular/satQuest/img/Pasted image 20241206134213.png",
     "Extracurricular/satQuest/img/Pasted image 20241206134207.png",
-    "Extracurricular/satQuest/img/Pasted image 20241206133007.png",
     "Extracurricular/satQuest/img",
     "Advanced Algorithms/assets/pnp",
     "Advanced Algorithms/assets/graph",
diff --git a/Introduction to Machine Learning/Introductory lecture.md b/Introduction to Machine Learning/Introductory lecture.md
index 634c559..f322f48 100644
--- a/Introduction to Machine Learning/Introductory lecture.md
+++ b/Introduction to Machine Learning/Introductory lecture.md
@@ -117,9 +117,95 @@ graph TD
 ```
 
+### Mathematical Formulation
+A [Markov Decision Process](https://en.wikipedia.org/wiki/Markov_decision_process)[^5] (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
+
+An MDP consists of:
+
+- A set of states $S$
+- A set of actions $A$
+- A reward function $R$
+- A transition function $P$
+- A discount factor $\gamma$
+
+It can be represented as a tuple $(S, A, R, P, \gamma)$, or as a graph:
+
+```mermaid
+graph TD
+    A[States] --> B[Actions]
+    B --> C[Reward function]
+    C --> D[Transition function]
+    D --> E[Discount factor]
+```
+
+The process itself can be represented as a sequence of states, actions, and rewards:
+
+$(s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots)$
+
+The goal is to learn a policy $\pi$ that maps states to actions, i.e. $\pi(s) = a$.
+
+The policy can be deterministic or stochastic[^4].
+
+1. At time step $t=0$, the agent observes the current state $s_0$.
+2. For $t=0$ until the end of the episode:
+   - The agent selects an action $a_t$ based on the policy $\pi$.
+   - The environment grants a reward $r_t$ and transitions to the next state $s_{t+1}$.
+   - The agent updates its policy based on the reward and the next state.
+
+To summarize, the agent tries to maximize the discounted return:
+
+$$
+G_t = \sum_{k \geq 0} \gamma^k r_{t+k} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots
+$$
+
+where $G_t$ is the return at time step $t$, $r_t$ is the reward at time step $t$, and $\gamma$ is the discount factor.
+
+## The value function
+
+The value function $V_\pi(s)$ is the expected return when starting from state $s$ and following policy $\pi$:
+
+$$
+V_\pi(s) = \mathbb{E}_\pi(G_t | s_t = s)
+$$
+
+Similarly, the action-value function $Q_\pi(s, a)$ is the expected return when starting from state $s$, taking action $a$, and then following policy $\pi$:
+
+$$
+Q_\pi(s, a) = \mathbb{E}_\pi(G_t | s_t = s, a_t = a)
+$$
+
+### Bellman equation
+
+Named after Richard Bellman, the same Bellman as in [Graph Algorithms](Graph%20Algorithms.md).
+
+It states that the value of a state is the expected immediate reward plus the discounted value of the next state:
+
+$$
+V_\pi(s) = \mathbb{E}_\pi(r_{t+1} + \gamma V_\pi(s_{t+1}) | s_t = s)
+$$
+
+## Q-learning
+Something makes me feel like this will be in the exam.
+
+The goal of Q-learning is to find the optimal policy by learning the optimal Q-values for each state-action pair.
+
+What's a Q-value? It's the expected return starting from state $s$, taking action $a$, and following policy $\pi$.
+
+$$
+Q^*(s, a) = \max_\pi Q_\pi(s, a)
+$$
+
+The optimal Q-value $Q^*(s, a)$ is the maximum Q-value for state $s$ and action $a$ over all policies. The algorithm iteratively updates the Q-values based on the Bellman equation. This is called **value iteration**.
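+
+As a sanity check (not from the lecture), here is a minimal sketch of tabular Q-learning on a made-up two-state MDP. The toy transitions, the learning rate `alpha`, and the $\epsilon$-greedy exploration are illustrative assumptions; the notes only describe the idea at a high level, so treat this as one possible concrete version of it:
+
+```python
+import random
+
+# Toy MDP (made up for illustration): 2 states, 2 actions.
+# transitions[state][action] = (next_state, reward)
+transitions = {
+    0: {0: (0, 0.0), 1: (1, 1.0)},
+    1: {0: (0, 0.0), 1: (1, 2.0)},
+}
+
+gamma, alpha, epsilon = 0.9, 0.1, 0.1  # discount, learning rate, exploration rate
+Q = {(s, a): 0.0 for s in transitions for a in (0, 1)}
+
+state = 0
+for step in range(10_000):
+    # Epsilon-greedy action selection (a simple stochastic policy).
+    if random.random() < epsilon:
+        action = random.choice([0, 1])
+    else:
+        action = max((0, 1), key=lambda a: Q[(state, a)])
+
+    next_state, reward = transitions[state][action]
+
+    # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a').
+    best_next = max(Q[(next_state, a)] for a in (0, 1))
+    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
+
+    state = next_state
+
+print(Q)  # Q[(s, 1)] should dominate Q[(s, 0)] in both states: action 1 pays off.
+```
+
+Acting greedily with respect to the learned $Q$ then recovers the (near-)optimal policy for this toy example.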
+
 ## Conclusion
+
+As with every other course that deals with graphs in any way, shape, or form, we have to wade through a mountain of hard-to-read notation <3.
 
 ![Comparison](assets/image.png)
 
 [^1]: Prototypes in this context means a representative sample of the data. For example, if we have a dataset of images of cats and dogs, we can represent the dataset by a few images of cats and dogs that are representative of the whole dataset.
@@ -127,4 +213,8 @@ graph TD
 
 [^2]: Parametrization is the process of defining a model in terms of its parameters. For example, in the model $m = \gamma ( \beta_0 + \beta_1 x_1)$, $\beta_0$ and $\beta_1$ are the parameters of the model.
 
-[^3]: An agent is an entity that interacts with the environment. For example, a self-driving car is an agent that interacts with the environment (the road, other cars, etc.) to achieve a goal (e.g. reach a destination).
\ No newline at end of file
+[^3]: An agent is an entity that interacts with the environment. For example, a self-driving car is an agent that interacts with the environment (the road, other cars, etc.) to achieve a goal (e.g. reach a destination).
+
+[^4]: A deterministic policy maps each state to a single action, while a stochastic policy maps each state to a probability distribution over actions. For example, in a given state a deterministic policy always picks the same action, while a stochastic policy might pick action $a_1$ with probability 0.7 and action $a_2$ with probability 0.3.
+
+[^5]: https://en.wikipedia.org/wiki/Markov_chain
\ No newline at end of file
diff --git a/Introduction to Machine Learning/image.png b/Introduction to Machine Learning/image.png
new file mode 100644
index 0000000..d509455
Binary files /dev/null and b/Introduction to Machine Learning/image.png differ
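
For reference, tying back to the return $G_t = \sum_{k \geq 0} \gamma^k r_{t+k}$ defined in the notes above, a tiny numeric sketch; the reward sequence and $\gamma = 0.9$ are made-up values:

```python
# Discounted return G_t = sum_k gamma^k * r_{t+k} for a made-up episode.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]  # r_t, r_{t+1}, r_{t+2}, r_{t+3}

G = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```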