vault backup: 2025-02-04 12:36:44
This commit is contained in:
parent 03342ea07a
commit 14b3afb020
11 .obsidian/workspace.json (vendored)
@@ -201,6 +201,11 @@
},
"active": "96f5fe23af86a273",
"lastOpenFiles": [
"Pasted image 20250113151159.png",
"Advanced Algorithms/Graph Algorithms.md",
"Advanced Algorithms/Graphs.md",
"Introduction to Machine Learning/Introductory lecture.md",
"Introduction to Machine Learning/image.png",
"Extracurricular/Circuitree/Committee Market/Macro pad.md",
"Extracurricular/Circuitree/Committee Market/discussion/Committee market ideas.md",
"Extracurricular/Circuitree/Committee Market/discussion/CA.md",
@@ -208,7 +213,6 @@
"Extracurricular/Misc/Proposed Routine Plan.canvas",
"Extracurricular/Misc/Ideas.md",
"Functional Programming/Eq and Num.md",
"Introduction to Machine Learning/Introductory lecture.md",
"Functional Programming/Proofs.md",
"Operating Systems/Introductory lecture.md",
"Discrete Structures/Relations and Digraphs.md",
@@ -221,7 +225,6 @@
"Operating Systems/assets/image.png",
"Operating Systems/image.png",
"Operating Systems/assets",
"Pasted image 20250113151159.png",
"conflict-files-obsidian-git.md",
"Statistics and Probability/Mock exam run 1.md",
"Operating Systems",
@@ -234,17 +237,13 @@
"Discrete Structures/Midterm/attempt 2.md",
"Discrete Structures/Midterm/attempt 1.md",
"Discrete Structures/Midterm/Untitled.md",
"Discrete Structures/Midterm/Midterm prep.md",
"Discrete Structures/Midterm",
"Extracurricular/satQuest/img/Pasted image 20241206134156.png",
"Extracurricular/satQuest/Parts Proposal.md",
"Untitled.canvas",
"Discrete Structures/Mathematical Data Structures.md",
"Advanced Algorithms/Pasted image 20241203234600.png",
"Excalidraw",
"Extracurricular/satQuest/img/Pasted image 20241206134213.png",
"Extracurricular/satQuest/img/Pasted image 20241206134207.png",
"Extracurricular/satQuest/img/Pasted image 20241206133007.png",
"Extracurricular/satQuest/img",
"Advanced Algorithms/assets/pnp",
"Advanced Algorithms/assets/graph",
@@ -117,9 +117,95 @@ graph TD

```

### Mathematical Formulation

A [Markov Decision Process](https://en.wikipedia.org/wiki/Markov_decision_process)[^5] (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.

An MDP consists of:

- A set of states $S$
- A set of actions $A$
- A reward function $R$
- A transition function $P$
- A discount factor $\gamma$

It can be represented as a tuple $(S, A, R, P, \gamma)$.

Or a graph:

```mermaid
graph TD
A[States] --> B[Actions]
B --> C[Reward function]
C --> D[Transition function]
D --> E[Discount factor]
```
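
To make the tuple concrete, here is a minimal sketch (not from the lecture) of a tiny two-state MDP written out as plain Python dictionaries; every name and number in it is invented for illustration.

```python
# Hypothetical toy MDP: two states, two actions.
S = ["s0", "s1"]        # set of states
A = ["a0", "a1"]        # set of actions
gamma = 0.9             # discount factor

# Transition function P[s][a] = {next state: probability}
P = {
    "s0": {"a0": {"s0": 0.8, "s1": 0.2}, "a1": {"s0": 0.1, "s1": 0.9}},
    "s1": {"a0": {"s0": 0.5, "s1": 0.5}, "a1": {"s0": 0.0, "s1": 1.0}},
}

# Reward function R[s][a] = immediate reward for taking action a in state s
R = {
    "s0": {"a0": 1.0, "a1": 0.0},
    "s1": {"a0": -1.0, "a1": 2.0},
}
```

The later sketches in this note refer back to these made-up dictionaries.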

The process itself can be represented as a sequence of states, actions, and rewards:

$(s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots)$

The goal is to learn a policy $\pi$ that maps states to actions, i.e. $\pi(s) = a$.

The policy can be deterministic or stochastic[^4].
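
One way to see the difference in code (again a sketch, reusing the made-up state and action names from the toy MDP above): a deterministic policy can be a plain state-to-action dictionary, while a stochastic one stores a probability distribution per state.

```python
import random

# Deterministic policy: each state maps to exactly one action.
pi_det = {"s0": "a1", "s1": "a0"}

# Stochastic policy: each state maps to a distribution over actions.
pi_stoch = {
    "s0": {"a0": 0.3, "a1": 0.7},
    "s1": {"a0": 0.5, "a1": 0.5},
}

def act(policy, s):
    """Pick an action in state s under either kind of policy."""
    choice = policy[s]
    if isinstance(choice, dict):     # stochastic: sample from the distribution
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs, k=1)[0]
    return choice                    # deterministic: always the same action
```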

1. At time step $t=0$, the agent observes the initial state $s_0$.
2. For each $t = 0, 1, 2, \ldots$ until the episode ends (see the sketch after this list):
    - The agent selects an action $a_t$ based on the policy $\pi$.
    - The environment returns a reward $r_t$ and transitions to the next state $s_{t+1}$.
    - The agent updates its policy based on the reward and the next state.
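
A minimal sketch of this loop, under the same assumptions as before (toy MDP given as dictionaries, a fixed deterministic policy, and a fixed horizon standing in for a real termination condition); `run_episode` and its arguments are invented names, not course API.

```python
import random

def run_episode(P, R, policy, s0, horizon=10):
    """Roll out one episode of the agent-environment loop in a toy MDP given
    as dictionaries (like the hypothetical P and R sketched earlier).
    Returns the trajectory as a list of (s_t, a_t, r_t) triples."""
    s = s0
    trajectory = []
    for t in range(horizon):       # "until end" approximated by a fixed horizon
        a = policy[s]              # the agent selects a_t = pi(s_t)
        r = R[s][a]                # the environment grants reward r_t ...
        next_states, probs = zip(*P[s][a].items())
        s_next = random.choices(next_states, weights=probs, k=1)[0]
        trajectory.append((s, a, r))
        s = s_next                 # ... and transitions to s_{t+1}
        # (no policy update here: the policy is fixed; learning comes with Q-learning)
    return trajectory

# e.g.: run_episode(P, R, {"s0": "a1", "s1": "a0"}, "s0")
```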

To summarize, the (discounted) return from time step $t$ is:

$$
G_t = \sum_{k \geq 0} \gamma^k r_{t+k} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots
$$

where $G_t$ is the return at time step $t$, $r_t$ is the reward at time step $t$, and $\gamma$ is the discount factor.
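
The sum translates directly into code; the function name and the example numbers below are made up:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma**k * r_{t+k} for a finite reward list
    [r_t, r_{t+1}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# e.g. rewards [1, 0, 2] with gamma = 0.9:
# G = 1 + 0.9 * 0 + 0.81 * 2 = 2.62
assert abs(discounted_return([1, 0, 2], 0.9) - 2.62) < 1e-9
```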

## The value function

The value function $V(s)$ is the expected return starting from state $s$ and following policy $\pi$.

$$
V_\pi(s) = \mathbb{E}_\pi(G_t | s_t = s)
$$

Similarly, the action-value function $Q(s, a)$ is the expected return starting from state $s$, taking action $a$, and following policy $\pi$.

$$
Q_\pi(s, a) = \mathbb{E}_\pi(G_t | s_t = s, a_t = a)
$$
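
The expectation in these definitions can be read operationally: average the return over many rollouts that all start from the same state (and, for $Q$, take the same first action). A rough Monte Carlo sketch; the names are invented, and the earlier toy-MDP helpers are only referenced in a comment.

```python
def mc_value_estimate(sample_rewards, gamma, n_episodes=1000):
    """Monte Carlo estimate of V_pi(s): average the discounted return over
    many sampled episodes that all start in the same state s.
    `sample_rewards` is any callable returning one episode's reward
    sequence [r_t, r_{t+1}, ...] under policy pi."""
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_rewards()
        total += sum((gamma ** k) * r for k, r in enumerate(rewards))
    return total / n_episodes

# e.g., plugging in the run_episode sketch from above:
# mc_value_estimate(lambda: [r for (_, _, r) in run_episode(P, R, pi_det, "s0")], gamma)
```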

### Bellman equation

Named after the same Richard Bellman as in [Graph Algorithms](Graph%20Algorithms.md).

It states that the value of a state is the expected immediate reward plus the discounted value of the next state:

$$
V_\pi(s) = \mathbb{E}_\pi(r_{t+1} + \gamma V_\pi(s_{t+1}) | s_t = s)
$$
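
Read as an update rule rather than an identity, the Bellman equation gives a simple way to evaluate a fixed policy by sweeping over the states. A sketch in the same made-up dictionary format as the earlier toy MDP (the function and argument names are mine, not the course's):

```python
def policy_evaluation(S, P, R, gamma, policy, sweeps=100):
    """Iteratively apply the Bellman equation as an update:
    V(s) <- R[s][pi(s)] + gamma * sum_s' P[s][pi(s)][s'] * V(s').
    Assumes a deterministic policy and the toy-MDP dictionaries sketched earlier."""
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        # Synchronous sweep: build the new V from the old V in one go.
        V = {
            s: R[s][policy[s]]
               + gamma * sum(p * V[s2] for s2, p in P[s][policy[s]].items())
            for s in S
        }
    return V

# e.g. policy_evaluation(S, P, R, gamma, {"s0": "a1", "s1": "a0"})
```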

## Q-learning

Something makes me feel like this will be in the exam.

The goal of Q-learning is to find the optimal policy by learning the optimal Q-value for each state-action pair.

What's a Q-value? It's the expected return starting from state $s$, taking action $a$, and following policy $\pi$, i.e. the $Q_\pi(s, a)$ defined above. The optimal Q-value is the best achievable over all policies:

$$
Q^*(s, a) = \max_\pi Q_\pi(s, a)
$$

The optimal Q-value $Q^*(s, a)$ is the maximum Q-value for state $s$ and action $a$ over all policies. The algorithm iteratively updates the Q-values using the Bellman equation; this is called **value iteration** (sketched below).
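
A sketch of that value-iteration style update on the made-up dictionary MDP from earlier; note that this version assumes the transition function $P$ and reward function $R$ are known, and all names are invented.

```python
def q_value_iteration(S, A, P, R, gamma, sweeps=100):
    """Iteratively apply the Bellman optimality backup:
    Q(s, a) <- R[s][a] + gamma * sum_s' P[s][a][s'] * max_a' Q(s', a').
    Uses the toy-MDP dictionary format from the earlier sketches."""
    Q = {s: {a: 0.0 for a in A} for s in S}
    for _ in range(sweeps):
        # Synchronous sweep over all state-action pairs.
        Q = {
            s: {
                a: R[s][a]
                   + gamma * sum(p * max(Q[s2].values()) for s2, p in P[s][a].items())
                for a in A
            }
            for s in S
        }
    # The greedy policy reads off the best action in each state.
    policy = {s: max(Q[s], key=Q[s].get) for s in S}
    return Q, policy

# e.g. Q, pi = q_value_iteration(S, A, P, R, gamma)
```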

## Conclusion

As with every other fucking course that deals with graphs in any way, shape, or form, we have to deal with A FUCK TON of hard-to-read notation <3.



[^1]: Prototypes in this context are a representative sample of the data. For example, if we have a dataset of images of cats and dogs, we can represent the dataset by a few images of cats and dogs that are representative of the whole dataset.
@@ -127,4 +213,8 @@ graph TD

[^2]: Parametrization is the process of defining a model in terms of its parameters. For example, in the model $m = \gamma ( \beta_0 + \beta_1 x_1)$, $\beta_0$ and $\beta_1$ are the parameters of the model.

[^3]: An agent is an entity that interacts with the environment. For example, a self-driving car is an agent that interacts with the environment (the road, other cars, etc.) to achieve a goal (e.g. reach a destination).

[^4]: A deterministic policy maps each state to a single action, while a stochastic policy maps each state to a probability distribution over actions. For example, in state $s$ a deterministic policy always picks the same action $a$, whereas a stochastic policy might pick $a_1$ with probability 0.7 and $a_2$ with probability 0.3.

[^5]: https://en.wikipedia.org/wiki/Markov_chain
BIN
Introduction to Machine Learning/image.png (new file)
Binary file not shown. (After: 52 KiB)