vault backup: 2025-02-04 12:36:44
This commit is contained in:
parent 03342ea07a
commit 14b3afb020
11 .obsidian/workspace.json (vendored)
@@ -201,6 +201,11 @@
   },
   "active": "96f5fe23af86a273",
   "lastOpenFiles": [
+    "Pasted image 20250113151159.png",
+    "Advanced Algorithms/Graph Algorithms.md",
+    "Advanced Algorithms/Graphs.md",
+    "Introduction to Machine Learning/Introductory lecture.md",
+    "Introduction to Machine Learning/image.png",
     "Extracurricular/Circuitree/Committee Market/Macro pad.md",
     "Extracurricular/Circuitree/Committee Market/discussion/Committee market ideas.md",
     "Extracurricular/Circuitree/Committee Market/discussion/CA.md",
@@ -208,7 +213,6 @@
     "Extracurricular/Misc/Proposed Routine Plan.canvas",
     "Extracurricular/Misc/Ideas.md",
     "Functional Programming/Eq and Num.md",
-    "Introduction to Machine Learning/Introductory lecture.md",
     "Functional Programming/Proofs.md",
     "Operating Systems/Introductory lecture.md",
     "Discrete Structures/Relations and Digraphs.md",
@@ -221,7 +225,6 @@
     "Operating Systems/assets/image.png",
     "Operating Systems/image.png",
     "Operating Systems/assets",
-    "Pasted image 20250113151159.png",
     "conflict-files-obsidian-git.md",
     "Statistics and Probability/Mock exam run 1.md",
     "Operating Systems",
@@ -234,17 +237,13 @@
     "Discrete Structures/Midterm/attempt 2.md",
     "Discrete Structures/Midterm/attempt 1.md",
     "Discrete Structures/Midterm/Untitled.md",
-    "Discrete Structures/Midterm/Midterm prep.md",
     "Discrete Structures/Midterm",
     "Extracurricular/satQuest/img/Pasted image 20241206134156.png",
-    "Extracurricular/satQuest/Parts Proposal.md",
     "Untitled.canvas",
-    "Discrete Structures/Mathematical Data Structures.md",
     "Advanced Algorithms/Pasted image 20241203234600.png",
     "Excalidraw",
     "Extracurricular/satQuest/img/Pasted image 20241206134213.png",
     "Extracurricular/satQuest/img/Pasted image 20241206134207.png",
-    "Extracurricular/satQuest/img/Pasted image 20241206133007.png",
     "Extracurricular/satQuest/img",
     "Advanced Algorithms/assets/pnp",
     "Advanced Algorithms/assets/graph",
@@ -117,9 +117,95 @@ graph TD
```

### Mathematical Formulation

A [Markov Decision Process](https://en.wikipedia.org/wiki/Markov_decision_process)[^5] (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.

An MDP consists of:

- A set of states $S$
- A set of actions $A$
- A reward function $R$
- A transition function $P$
- A discount factor $\gamma$

It can be represented as a tuple $(S, A, R, P, \gamma)$.
Or as a graph:

```mermaid
graph TD
A[States] --> B[Actions]
B --> C[Reward function]
C --> D[Transition function]
D --> E[Discount factor]
```
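
To make the tuple concrete, here is a minimal sketch (not from the lecture) of a made-up two-state MDP written out as plain Python dictionaries; the state names, action names, and every number are invented for illustration.

```python
# A toy MDP (S, A, R, P, gamma) with invented states, actions, and numbers.
S = ["sunny", "rainy"]                      # set of states
A = ["walk", "bus"]                         # set of actions

# Reward function R(s, a): expected immediate reward.
R = {
    ("sunny", "walk"): 2.0, ("sunny", "bus"): 1.0,
    ("rainy", "walk"): -1.0, ("rainy", "bus"): 0.5,
}

# Transition function P(s' | s, a): probability of each next state.
P = {
    ("sunny", "walk"): {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "bus"):  {"sunny": 0.7, "rainy": 0.3},
    ("rainy", "walk"): {"sunny": 0.3, "rainy": 0.7},
    ("rainy", "bus"):  {"sunny": 0.5, "rainy": 0.5},
}

gamma = 0.9                                 # discount factor

mdp = (S, A, R, P, gamma)                   # the tuple (S, A, R, P, gamma)
```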

The process itself can be represented as a sequence of states, actions, and rewards:

$(s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots)$

The goal is to learn a policy $\pi$ that maps states to actions, i.e. $\pi(s) = a$.

The policy can be deterministic or stochastic[^4].

1. At time step $t=0$, the agent observes the current state $s_0$.
2. For each time step $t = 0, 1, 2, \ldots$ until the episode ends:
    - The agent selects an action $a_t$ based on the policy $\pi$.
    - The environment grants a reward $r_t$ and transitions to the next state $s_{t+1}$.
    - The agent updates its policy based on the reward and the next state (see the sketch below).
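
A rough sketch of this interaction loop, reusing the invented two-state MDP from above (redefined so the snippet runs on its own) and a uniformly random policy; the policy-update step is deliberately left out here and picked up again under Q-learning:

```python
import random

# The invented two-state MDP from the earlier sketch, redefined so this runs standalone.
S = ["sunny", "rainy"]
A = ["walk", "bus"]
R = {("sunny", "walk"): 2.0, ("sunny", "bus"): 1.0,
     ("rainy", "walk"): -1.0, ("rainy", "bus"): 0.5}
P = {("sunny", "walk"): {"sunny": 0.8, "rainy": 0.2},
     ("sunny", "bus"):  {"sunny": 0.7, "rainy": 0.3},
     ("rainy", "walk"): {"sunny": 0.3, "rainy": 0.7},
     ("rainy", "bus"):  {"sunny": 0.5, "rainy": 0.5}}

def policy(state):
    """A stochastic policy: here simply uniform over the actions."""
    return random.choice(A)

state = "sunny"                       # t = 0: the agent observes s_0
trajectory = []
for t in range(5):                    # a fixed horizon stands in for "until end"
    action = policy(state)                           # agent picks a_t from pi
    reward = R[(state, action)]                      # environment grants r_t
    next_probs = P[(state, action)]
    next_state = random.choices(list(next_probs),    # ...and samples s_{t+1}
                                weights=list(next_probs.values()))[0]
    trajectory.append((state, action, reward))
    state = next_state
    # (the policy-update step would go here; see Q-learning below)

print(trajectory)                     # (s_0, a_0, r_0), (s_1, a_1, r_1), ...
```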

To summarize:

$$
G_t = \sum_{k \geq 0} \gamma^k r_{t+k} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots
$$

where $G_t$ is the (discounted) return at time step $t$, $r_t$ is the reward at time step $t$, and $\gamma$ is the discount factor.
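
A tiny sanity check of that formula (rewards and $\gamma$ made up): with $\gamma = 0.9$ and rewards $(1, 0, 2)$, $G_0 = 1 + 0.9 \cdot 0 + 0.81 \cdot 2 = 2.62$. In Python:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_{k >= 0} gamma^k * r_{t+k}, over a finite list of rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # ~ 2.62
```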

## The value function

The value function $V_\pi(s)$ is the expected return starting from state $s$ and then following policy $\pi$:

$$
V_\pi(s) = \mathbb{E}_\pi(G_t | s_t = s)
$$

Similarly, the action-value function $Q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$$
Q_\pi(s, a) = \mathbb{E}_\pi(G_t | s_t = s, a_t = a)
$$
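
One hedged way to read these definitions: if we could sample many episodes that start in state $s$ and follow $\pi$, averaging their returns gives a Monte Carlo estimate of $V_\pi(s)$. A minimal sketch with made-up reward sequences (the `discounted_return` helper is repeated so this runs on its own):

```python
def discounted_return(rewards, gamma):
    """G_0 = sum_k gamma^k * r_k over one episode's reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical reward sequences from three episodes that all start in the same
# state s and follow the same policy pi (numbers invented).
episodes_from_s = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [2.0, 0.0, 0.0],
]
gamma = 0.9

# Monte Carlo estimate: V_pi(s) ~ average of the observed returns.
v_estimate = sum(discounted_return(rs, gamma) for rs in episodes_from_s) / len(episodes_from_s)
print(round(v_estimate, 2))   # ~ 2.11 for these made-up numbers
```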

### Bellman equation

Named after Richard Bellman, the same Bellman as in [Graph Algorithms](Graph%20Algorithms.md).

It states that the value of a state is the expected immediate reward plus the discounted value of the next state:

$$
V_\pi(s) = \mathbb{E}_\pi(r_{t+1} + \gamma V_\pi(s_{t+1}) | s_t = s)
$$
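
A minimal sketch of the Bellman equation in use: iterative policy evaluation on the invented two-state MDP from earlier, under a uniform policy (this goes slightly beyond what the note itself states, and all numbers are made up):

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation
#   V(s) <- sum_a pi(a|s) * [ R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]
# on the toy two-state MDP, with pi(a|s) uniform over actions.
S = ["sunny", "rainy"]
A = ["walk", "bus"]
R = {("sunny", "walk"): 2.0, ("sunny", "bus"): 1.0,
     ("rainy", "walk"): -1.0, ("rainy", "bus"): 0.5}
P = {("sunny", "walk"): {"sunny": 0.8, "rainy": 0.2},
     ("sunny", "bus"):  {"sunny": 0.7, "rainy": 0.3},
     ("rainy", "walk"): {"sunny": 0.3, "rainy": 0.7},
     ("rainy", "bus"):  {"sunny": 0.5, "rainy": 0.5}}
gamma = 0.9

V = {s: 0.0 for s in S}
for _ in range(200):                                  # sweep until ~converged
    V = {s: sum((1 / len(A)) *                        # pi(a|s) = uniform
                (R[(s, a)] + gamma * sum(P[(s, a)][s2] * V[s2] for s2 in S))
                for a in A)
         for s in S}

print(V)   # expected return of the uniform policy from each state
```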

## Q-learning

Something makes me feel like this will be in the exam.

The goal of Q-learning is to find the optimal policy by learning the optimal Q-values for each state-action pair.

What's a Q-value? It's the expected return starting from state $s$, taking action $a$, and following policy $\pi$.

$$
Q^*(s, a) = \max_\pi Q_\pi(s, a)
$$

The optimal Q-value $Q^*(s, a)$ is the maximum Q-value for state $s$ and action $a$ over all policies. The algorithm iteratively updates the Q-values based on the Bellman equation; this is called **value iteration**.
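
The note doesn't spell out the update rule, but the standard tabular Q-learning update is $Q(s,a) \leftarrow Q(s,a) + \alpha\big(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big)$. A rough sketch on the same invented two-state MDP, with arbitrary learning rate, exploration rate, and step count:

```python
import random

# Tabular Q-learning with epsilon-greedy exploration on the toy MDP (invented numbers).
S = ["sunny", "rainy"]
A = ["walk", "bus"]
R = {("sunny", "walk"): 2.0, ("sunny", "bus"): 1.0,
     ("rainy", "walk"): -1.0, ("rainy", "bus"): 0.5}
P = {("sunny", "walk"): {"sunny": 0.8, "rainy": 0.2},
     ("sunny", "bus"):  {"sunny": 0.7, "rainy": 0.3},
     ("rainy", "walk"): {"sunny": 0.3, "rainy": 0.7},
     ("rainy", "bus"):  {"sunny": 0.5, "rainy": 0.5}}
gamma, alpha, epsilon = 0.9, 0.1, 0.1

Q = {(s, a): 0.0 for s in S for a in A}
state = random.choice(S)
for step in range(20_000):
    # epsilon-greedy: explore sometimes, otherwise act greedily w.r.t. Q
    if random.random() < epsilon:
        action = random.choice(A)
    else:
        action = max(A, key=lambda a: Q[(state, a)])
    reward = R[(state, action)]
    next_state = random.choices(S, weights=[P[(state, action)][s] for s in S])[0]
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    target = reward + gamma * max(Q[(next_state, a)] for a in A)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    state = next_state

print({sa: round(q, 2) for sa, q in Q.items()})          # learned Q-values
print({s: max(A, key=lambda a: Q[(s, a)]) for s in S})   # greedy policy per state
```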

## Conclusion

As with every other fucking course that deals with graphs in any way, shape, or form, we have to deal with A FUCK TON of hard-to-read notation <3.



[^1]: Prototypes in this context are a representative sample of the data. For example, if we have a dataset of images of cats and dogs, we can represent the dataset by a few images of cats and dogs that are representative of the whole dataset.
@@ -128,3 +214,7 @@ graph TD

[^3]: An agent is an entity that interacts with the environment. For example, a self-driving car is an agent that interacts with the environment (the road, other cars, etc.) to achieve a goal (e.g. reach a destination).

[^4]: A deterministic policy maps each state to a single action, while a stochastic policy maps each state to a probability distribution over actions. For example, in state $s$ a deterministic policy always picks the same action $a$, while a stochastic policy might pick $a_1$ with probability $0.8$ and $a_2$ with probability $0.2$.

[^5]: https://en.wikipedia.org/wiki/Markov_chain
BIN Introduction to Machine Learning/image.png (normal file, 52 KiB)
Binary file not shown.