vault backup: 2025-02-04 12:36:44

This commit is contained in:
Boyan 2025-02-04 12:36:44 +01:00
parent 03342ea07a
commit 14b3afb020
3 changed files with 96 additions and 7 deletions


@ -201,6 +201,11 @@
},
"active": "96f5fe23af86a273",
"lastOpenFiles": [
"Pasted image 20250113151159.png",
"Advanced Algorithms/Graph Algorithms.md",
"Advanced Algorithms/Graphs.md",
"Introduction to Machine Learning/Introductory lecture.md",
"Introduction to Machine Learning/image.png",
"Extracurricular/Circuitree/Committee Market/Macro pad.md",
"Extracurricular/Circuitree/Committee Market/discussion/Committee market ideas.md",
"Extracurricular/Circuitree/Committee Market/discussion/CA.md",
@ -208,7 +213,6 @@
"Extracurricular/Misc/Proposed Routine Plan.canvas",
"Extracurricular/Misc/Ideas.md",
"Functional Programming/Eq and Num.md",
"Introduction to Machine Learning/Introductory lecture.md",
"Functional Programming/Proofs.md",
"Operating Systems/Introductory lecture.md",
"Discrete Structures/Relations and Digraphs.md",
@ -221,7 +225,6 @@
"Operating Systems/assets/image.png",
"Operating Systems/image.png",
"Operating Systems/assets",
"Pasted image 20250113151159.png",
"conflict-files-obsidian-git.md",
"Statistics and Probability/Mock exam run 1.md",
"Operating Systems",
@ -234,17 +237,13 @@
"Discrete Structures/Midterm/attempt 2.md",
"Discrete Structures/Midterm/attempt 1.md",
"Discrete Structures/Midterm/Untitled.md",
"Discrete Structures/Midterm/Midterm prep.md",
"Discrete Structures/Midterm",
"Extracurricular/satQuest/img/Pasted image 20241206134156.png",
"Extracurricular/satQuest/Parts Proposal.md",
"Untitled.canvas",
"Discrete Structures/Mathematical Data Structures.md",
"Advanced Algorithms/Pasted image 20241203234600.png",
"Excalidraw",
"Extracurricular/satQuest/img/Pasted image 20241206134213.png",
"Extracurricular/satQuest/img/Pasted image 20241206134207.png",
"Extracurricular/satQuest/img/Pasted image 20241206133007.png",
"Extracurricular/satQuest/img",
"Advanced Algorithms/assets/pnp",
"Advanced Algorithms/assets/graph",


@ -117,9 +117,95 @@ graph TD
```
### Mathematical Formulation
[Markov Decision Process](https://en.wikipedia.org/wiki/Markov_decision_process)[^5] (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
An MDP consists of:
- A set of states $S$
- A set of actions $A$
- A reward function $R$
- A transition function $P$
- A discount factor $\gamma$
It can be represented as a tuple $(S, A, R, P, \gamma)$.
Or, as a graph:
```mermaid
graph TD
A[States] --> B[Actions]
B --> C[Reward function]
C --> D[Transition function]
D --> E[Discount factor]
```
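As a concrete (made-up) example of the tuple $(S, A, R, P, \gamma)$, here is a minimal Python sketch of a toy two-state MDP stored as plain dictionaries; the state and action names and all the numbers are invented for illustration, not taken from the lecture.
```python
# Toy two-state MDP as plain dicts (illustrative names, not from the lecture).
# P[s][a] -> list of (next_state, probability); R[s][a] -> immediate reward.
P = {
    "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
    "s1": {"stay": [("s1", 1.0)], "move": [("s0", 1.0)]},
}
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.5, "move": 0.0},
}
gamma = 0.9  # discount factor

# The MDP as the tuple (S, A, R, P, gamma)
mdp = (set(P), {"stay", "move"}, R, P, gamma)
```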
The process itself can be represented as a sequence of states, actions, and rewards:
$(s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots)$
The goal is to learn a policy $\pi$ that maps states to actions, i.e. $\pi(s) = a$.
The policy can be deterministic or stochastic[^4].
1. At time step $t=0$, the agent observes the current state $s_0$.
2. For each time step $t$, starting at $t=0$, until the episode ends:
- The agent selects an action $a_t$ based on the policy $\pi$.
- Environment grants reward $r_t$ and transitions to the next state $s_{t+1}$.
- Agent updates its policy based on the reward and the next state.
To summarize, the agent tries to maximize the discounted return:
$$
G_t = \sum_{k \geq 0} \gamma^k r_{t+k} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots
$$
where $G_t$ is the return at time step $t$, $r_t$ is the reward at time step $t$, and $\gamma$ is the discount factor.
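A tiny sketch of this computation for a finite reward sequence (finite so the sum terminates; the example numbers are arbitrary):
```python
def discounted_return(rewards, gamma):
    """G_t = sum_{k >= 0} gamma^k * r_{t+k} for a finite list of rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: rewards 1, 0, 2 with gamma = 0.9 give 1 + 0 + 0.81 * 2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```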
## The value function
The value function $V(s)$ is the expected return starting from state $s$ and following policy $\pi$.
$$
V_\pi(s) = \mathbb{E}_\pi(G_t | s_t = s)
$$
Similarly, the action-value function $Q(s, a)$ is the expected return starting from state $s$, taking action $a$, and following policy $\pi$.
$$
Q_\pi(s, a) = \mathbb{E}_\pi(G_t | s_t = s, a_t = a)
$$
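Since both $V_\pi$ and $Q_\pi$ are just expectations of $G_t$, one way to estimate them (an assumption for illustration, not something the lecture prescribes) is Monte Carlo: average the discounted return over many rollouts. A rough sketch, truncating episodes at a fixed horizon and assuming hypothetical `env_step(s, a) -> (next_state, reward)` and `policy(s) -> a` interfaces:
```python
def mc_value_estimate(env_step, policy, state, gamma, episodes=1000, horizon=50):
    """Monte Carlo estimate of V_pi(s): average truncated discounted return
    over rollouts that start in `state` and follow `policy`.
    `env_step` and `policy` are assumed interfaces for illustration."""
    total = 0.0
    for _ in range(episodes):
        s, g, discount = state, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = env_step(s, a)
            g += discount * r
            discount *= gamma
        total += g
    return total / episodes
```
Estimating $Q_\pi(s, a)$ works the same way, except the first action is fixed to $a$ instead of being chosen by the policy.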
### Bellman equation
Named after Richard Bellman, the same Bellman as in [Graph Algorithms](Graph%20Algorithms.md).
It states that the value of a state is the expected immediate reward plus the discounted value of the next state.
$$
V_\pi(s) = \mathbb{E}_\pi(r_{t+1} + \gamma V_\pi(s_{t+1}) | s_t = s)
$$
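For a fixed deterministic policy, the Bellman equation can be turned into an update rule by repeatedly replacing $V(s)$ with its right-hand side (iterative policy evaluation). A sketch that reuses the toy `P`, `R`, and `gamma` dictionaries from the MDP example above (again an illustration, not the course's notation):
```python
def policy_evaluation(P, R, policy, gamma, states, iters=100):
    """Iterative policy evaluation for a deterministic policy:
    repeatedly apply V(s) <- R(s, pi(s)) + gamma * sum_{s'} P(s'|s, pi(s)) * V(s')."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: R[s][policy[s]]
            + gamma * sum(p * V[s2] for s2, p in P[s][policy[s]])
            for s in states
        }
    return V

# Reusing the toy P, R, gamma defined in the MDP sketch above:
pi = {"s0": "move", "s1": "stay"}
print(policy_evaluation(P, R, pi, gamma, states=["s0", "s1"]))
```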
## Q-learning
Something makes me feel like this will be in the exam.
The goal of Q-learning is to find the optimal policy by learning the optimal Q-values for each state-action pair.
What's a Q-value? It's the expected return starting from state $s$, taking action $a$, and following policy $\pi$.
$$
Q^*(s, a) = \max_\pi Q_\pi(s, a)
$$
The optimal Q-value $Q^*(s, a)$ is the maximum Q-value achievable for state $s$ and action $a$ over all policies. The algorithm iteratively updates the Q-values based on the Bellman equation; this iterative scheme is called **value iteration**.
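A minimal tabular Q-learning sketch, assuming the same hypothetical `env_step(s, a) -> (next_state, reward)` interface as above; the $\epsilon$-greedy exploration and the hyperparameter values are standard defaults, not something specified in the notes:
```python
import random
from collections import defaultdict

def q_learning(env_step, states, actions, gamma,
               alpha=0.1, epsilon=0.1, episodes=500, horizon=100):
    """Tabular Q-learning with epsilon-greedy exploration.
    `env_step(s, a) -> (next_state, reward)` is an assumed interface."""
    Q = defaultdict(float)  # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s = random.choice(states)  # arbitrary start state for this sketch
        for _ in range(horizon):
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r = env_step(s, a)
            # Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')
            target = r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```
The greedy policy $\pi(s) = \arg\max_a Q(s, a)$ read off the learned table then approximates the optimal policy.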
## Conclusion
As with every other fucking course that deals with graphs in any way shape or form, we have to deal with A FUCK TON of hard-to-read notation <3.
![Comparison](assets/image.png)
[^1]: Prototypes in this context means a representative sample of the data. For example, if we have a dataset of images of cats and dogs, we can represent the dataset by a few images of cats and dogs that are representative of the whole dataset.
@ -128,3 +214,7 @@ graph TD
[^3]: An agent is an entity that interacts with the environment. For example, a self-driving car is an agent that interacts with the environment (the road, other cars, etc.) to achieve a goal (e.g. reach a destination).
[^4]: A deterministic policy maps each state to a single action, while a stochastic policy maps each state to a probability distribution over actions. For example, a deterministic policy might map state $s$ to action $a$, while a stochastic policy might map state $s$ to a probability distribution over actions.
[^5]: https://en.wikipedia.org/wiki/Markov_chain

Binary file not shown.
