vault backup: 2025-02-04 12:36:44
This commit is contained in:
parent 03342ea07a
commit 14b3afb020
11 .obsidian/workspace.json (vendored)
@@ -201,6 +201,11 @@
   },
   "active": "96f5fe23af86a273",
   "lastOpenFiles": [
+    "Pasted image 20250113151159.png",
+    "Advanced Algorithms/Graph Algorithms.md",
+    "Advanced Algorithms/Graphs.md",
+    "Introduction to Machine Learning/Introductory lecture.md",
+    "Introduction to Machine Learning/image.png",
     "Extracurricular/Circuitree/Committee Market/Macro pad.md",
     "Extracurricular/Circuitree/Committee Market/discussion/Committee market ideas.md",
     "Extracurricular/Circuitree/Committee Market/discussion/CA.md",
@@ -208,7 +213,6 @@
     "Extracurricular/Misc/Proposed Routine Plan.canvas",
     "Extracurricular/Misc/Ideas.md",
     "Functional Programming/Eq and Num.md",
-    "Introduction to Machine Learning/Introductory lecture.md",
     "Functional Programming/Proofs.md",
     "Operating Systems/Introductory lecture.md",
     "Discrete Structures/Relations and Digraphs.md",
@@ -221,7 +225,6 @@
     "Operating Systems/assets/image.png",
     "Operating Systems/image.png",
     "Operating Systems/assets",
-    "Pasted image 20250113151159.png",
     "conflict-files-obsidian-git.md",
     "Statistics and Probability/Mock exam run 1.md",
     "Operating Systems",
@@ -234,17 +237,13 @@
     "Discrete Structures/Midterm/attempt 2.md",
     "Discrete Structures/Midterm/attempt 1.md",
     "Discrete Structures/Midterm/Untitled.md",
-    "Discrete Structures/Midterm/Midterm prep.md",
     "Discrete Structures/Midterm",
     "Extracurricular/satQuest/img/Pasted image 20241206134156.png",
-    "Extracurricular/satQuest/Parts Proposal.md",
     "Untitled.canvas",
-    "Discrete Structures/Mathematical Data Structures.md",
     "Advanced Algorithms/Pasted image 20241203234600.png",
     "Excalidraw",
     "Extracurricular/satQuest/img/Pasted image 20241206134213.png",
     "Extracurricular/satQuest/img/Pasted image 20241206134207.png",
-    "Extracurricular/satQuest/img/Pasted image 20241206133007.png",
     "Extracurricular/satQuest/img",
     "Advanced Algorithms/assets/pnp",
     "Advanced Algorithms/assets/graph",
@@ -117,9 +117,95 @@ graph TD
```

### Mathematical Formulation

A [Markov Decision Process](https://en.wikipedia.org/wiki/Markov_decision_process)[^5] (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.

An MDP consists of:

- A set of states $S$
- A set of actions $A$
- A reward function $R$
- A transition function $P$
- A discount factor $\gamma$

It can be represented as a tuple $(S, A, R, P, \gamma)$.
Or as a graph:

```mermaid
graph TD
A[States] --> B[Actions]
B --> C[Reward function]
C --> D[Transition function]
D --> E[Discount factor]
```
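
To make the tuple concrete, here is a minimal sketch (not from the lecture) of a made-up two-state MDP written out as plain Python dictionaries; the state names, action names, and every number are invented for illustration.

```python
# A toy MDP (S, A, R, P, gamma) with invented states, actions, and numbers.
S = ["sunny", "rainy"]                      # set of states
A = ["walk", "bus"]                         # set of actions

# Reward function R(s, a): expected immediate reward.
R = {
    ("sunny", "walk"): 2.0, ("sunny", "bus"): 1.0,
    ("rainy", "walk"): -1.0, ("rainy", "bus"): 0.5,
}

# Transition function P(s' | s, a): probability of each next state.
P = {
    ("sunny", "walk"): {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "bus"):  {"sunny": 0.7, "rainy": 0.3},
    ("rainy", "walk"): {"sunny": 0.3, "rainy": 0.7},
    ("rainy", "bus"):  {"sunny": 0.5, "rainy": 0.5},
}

gamma = 0.9                                 # discount factor

mdp = (S, A, R, P, gamma)                   # the tuple (S, A, R, P, gamma)
```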

The process itself can be represented as a sequence of states, actions, and rewards:

$(s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots)$

The goal is to learn a policy $\pi$ that maps states to actions, i.e. $\pi(s) = a$.

The policy can be deterministic or stochastic[^4].

1. At time step $t=0$, the agent observes the current state $s_0$.
2. For each time step $t = 0, 1, 2, \ldots$ until the episode ends:
    - The agent selects an action $a_t$ based on the policy $\pi$.
    - The environment grants a reward $r_t$ and transitions to the next state $s_{t+1}$.
    - The agent updates its policy based on the reward and the next state (see the sketch below).
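
A rough sketch of this interaction loop, reusing the invented two-state MDP from above (redefined so the snippet runs on its own) and a uniformly random policy; the policy-update step is deliberately left out here and picked up again under Q-learning:

```python
import random

# The invented two-state MDP from the earlier sketch, redefined so this runs standalone.
S = ["sunny", "rainy"]
A = ["walk", "bus"]
R = {("sunny", "walk"): 2.0, ("sunny", "bus"): 1.0,
     ("rainy", "walk"): -1.0, ("rainy", "bus"): 0.5}
P = {("sunny", "walk"): {"sunny": 0.8, "rainy": 0.2},
     ("sunny", "bus"):  {"sunny": 0.7, "rainy": 0.3},
     ("rainy", "walk"): {"sunny": 0.3, "rainy": 0.7},
     ("rainy", "bus"):  {"sunny": 0.5, "rainy": 0.5}}

def policy(state):
    """A stochastic policy: here simply uniform over the actions."""
    return random.choice(A)

state = "sunny"                       # t = 0: the agent observes s_0
trajectory = []
for t in range(5):                    # a fixed horizon stands in for "until end"
    action = policy(state)                           # agent picks a_t from pi
    reward = R[(state, action)]                      # environment grants r_t
    next_probs = P[(state, action)]
    next_state = random.choices(list(next_probs),    # ...and samples s_{t+1}
                                weights=list(next_probs.values()))[0]
    trajectory.append((state, action, reward))
    state = next_state
    # (the policy-update step would go here; see Q-learning below)

print(trajectory)                     # (s_0, a_0, r_0), (s_1, a_1, r_1), ...
```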

To summarize:

$$
G_t = \sum_{k \geq 0} \gamma^k r_{t+k} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots
$$

where $G_t$ is the (discounted) return at time step $t$, $r_t$ is the reward at time step $t$, and $\gamma$ is the discount factor.
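
A tiny sanity check of that formula (rewards and $\gamma$ made up): with $\gamma = 0.9$ and rewards $(1, 0, 2)$, $G_0 = 1 + 0.9 \cdot 0 + 0.81 \cdot 2 = 2.62$. In Python:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_{k >= 0} gamma^k * r_{t+k}, over a finite list of rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # ~ 2.62
```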

## The value function

The value function $V_\pi(s)$ is the expected return starting from state $s$ and then following policy $\pi$:

$$
V_\pi(s) = \mathbb{E}_\pi(G_t | s_t = s)
$$

Similarly, the action-value function $Q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$$
Q_\pi(s, a) = \mathbb{E}_\pi(G_t | s_t = s, a_t = a)
$$
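
One hedged way to read these definitions: if we could sample many episodes that start in state $s$ and follow $\pi$, averaging their returns gives a Monte Carlo estimate of $V_\pi(s)$. A minimal sketch with made-up reward sequences (the `discounted_return` helper is repeated so this runs on its own):

```python
def discounted_return(rewards, gamma):
    """G_0 = sum_k gamma^k * r_k over one episode's reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical reward sequences from three episodes that all start in the same
# state s and follow the same policy pi (numbers invented).
episodes_from_s = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [2.0, 0.0, 0.0],
]
gamma = 0.9

# Monte Carlo estimate: V_pi(s) ~ average of the observed returns.
v_estimate = sum(discounted_return(rs, gamma) for rs in episodes_from_s) / len(episodes_from_s)
print(round(v_estimate, 2))   # ~ 2.11 for these made-up numbers
```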

### Bellman equation

Named after Richard Bellman, the same Bellman as in [Graph Algorithms](Graph%20Algorithms.md).

It states that the value of a state is the expected immediate reward plus the discounted value of the next state:

$$
V_\pi(s) = \mathbb{E}_\pi(r_{t+1} + \gamma V_\pi(s_{t+1}) | s_t = s)
$$
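
A minimal sketch of the Bellman equation in use: iterative policy evaluation on the invented two-state MDP from earlier, under a uniform policy (this goes slightly beyond what the note itself states, and all numbers are made up):

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation
#   V(s) <- sum_a pi(a|s) * [ R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]
# on the toy two-state MDP, with pi(a|s) uniform over actions.
S = ["sunny", "rainy"]
A = ["walk", "bus"]
R = {("sunny", "walk"): 2.0, ("sunny", "bus"): 1.0,
     ("rainy", "walk"): -1.0, ("rainy", "bus"): 0.5}
P = {("sunny", "walk"): {"sunny": 0.8, "rainy": 0.2},
     ("sunny", "bus"):  {"sunny": 0.7, "rainy": 0.3},
     ("rainy", "walk"): {"sunny": 0.3, "rainy": 0.7},
     ("rainy", "bus"):  {"sunny": 0.5, "rainy": 0.5}}
gamma = 0.9

V = {s: 0.0 for s in S}
for _ in range(200):                                  # sweep until ~converged
    V = {s: sum((1 / len(A)) *                        # pi(a|s) = uniform
                (R[(s, a)] + gamma * sum(P[(s, a)][s2] * V[s2] for s2 in S))
                for a in A)
         for s in S}

print(V)   # expected return of the uniform policy from each state
```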

## Q-learning

Something makes me feel like this will be in the exam.

The goal of Q-learning is to find the optimal policy by learning the optimal Q-values for each state-action pair.

What's a Q-value? It's the expected return starting from state $s$, taking action $a$, and following policy $\pi$.

$$
Q^*(s, a) = \max_\pi Q_\pi(s, a)
$$

The optimal Q-value $Q^*(s, a)$ is the maximum Q-value for state $s$ and action $a$ over all policies. The algorithm iteratively updates the Q-values based on the Bellman equation; this is called **value iteration**.
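
The note doesn't spell out the update rule, but the standard tabular Q-learning update is $Q(s,a) \leftarrow Q(s,a) + \alpha\big(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big)$. A rough sketch on the same invented two-state MDP, with arbitrary learning rate, exploration rate, and step count:

```python
import random

# Tabular Q-learning with epsilon-greedy exploration on the toy MDP (invented numbers).
S = ["sunny", "rainy"]
A = ["walk", "bus"]
R = {("sunny", "walk"): 2.0, ("sunny", "bus"): 1.0,
     ("rainy", "walk"): -1.0, ("rainy", "bus"): 0.5}
P = {("sunny", "walk"): {"sunny": 0.8, "rainy": 0.2},
     ("sunny", "bus"):  {"sunny": 0.7, "rainy": 0.3},
     ("rainy", "walk"): {"sunny": 0.3, "rainy": 0.7},
     ("rainy", "bus"):  {"sunny": 0.5, "rainy": 0.5}}
gamma, alpha, epsilon = 0.9, 0.1, 0.1

Q = {(s, a): 0.0 for s in S for a in A}
state = random.choice(S)
for step in range(20_000):
    # epsilon-greedy: explore sometimes, otherwise act greedily w.r.t. Q
    if random.random() < epsilon:
        action = random.choice(A)
    else:
        action = max(A, key=lambda a: Q[(state, a)])
    reward = R[(state, action)]
    next_state = random.choices(S, weights=[P[(state, action)][s] for s in S])[0]
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    target = reward + gamma * max(Q[(next_state, a)] for a in A)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    state = next_state

print({sa: round(q, 2) for sa, q in Q.items()})          # learned Q-values
print({s: max(A, key=lambda a: Q[(s, a)]) for s in S})   # greedy policy per state
```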

## Conclusion

As with every other fucking course that deals with graphs in any way, shape, or form, we have to deal with A FUCK TON of hard-to-read notation <3.



[^1]: Prototypes in this context are a representative sample of the data. For example, if we have a dataset of images of cats and dogs, we can represent the dataset by a few images of cats and dogs that are representative of the whole dataset.
@@ -128,3 +214,7 @@ graph TD

[^3]: An agent is an entity that interacts with the environment. For example, a self-driving car is an agent that interacts with the environment (the road, other cars, etc.) to achieve a goal (e.g. reach a destination).

[^4]: A deterministic policy maps each state to a single action, while a stochastic policy maps each state to a probability distribution over actions. For example, in state $s$ a deterministic policy always picks the same action $a$, while a stochastic policy might pick $a_1$ with probability $0.8$ and $a_2$ with probability $0.2$.

[^5]: https://en.wikipedia.org/wiki/Markov_chain
BIN Introduction to Machine Learning/image.png (normal file, 52 KiB)
Binary file not shown.