vault backup: 2025-02-04 12:36:44
This commit is contained in:
parent 03342ea07a
commit 14b3afb020
11 .obsidian/workspace.json (vendored)
@@ -201,6 +201,11 @@
},
"active": "96f5fe23af86a273",
"lastOpenFiles": [
"Pasted image 20250113151159.png",
"Advanced Algorithms/Graph Algorithms.md",
"Advanced Algorithms/Graphs.md",
"Introduction to Machine Learning/Introductory lecture.md",
"Introduction to Machine Learning/image.png",
"Extracurricular/Circuitree/Committee Market/Macro pad.md",
"Extracurricular/Circuitree/Committee Market/discussion/Committee market ideas.md",
"Extracurricular/Circuitree/Committee Market/discussion/CA.md",
@@ -208,7 +213,6 @@
"Extracurricular/Misc/Proposed Routine Plan.canvas",
"Extracurricular/Misc/Ideas.md",
"Functional Programming/Eq and Num.md",
"Introduction to Machine Learning/Introductory lecture.md",
"Functional Programming/Proofs.md",
"Operating Systems/Introductory lecture.md",
"Discrete Structures/Relations and Digraphs.md",
@@ -221,7 +225,6 @@
"Operating Systems/assets/image.png",
"Operating Systems/image.png",
"Operating Systems/assets",
"Pasted image 20250113151159.png",
"conflict-files-obsidian-git.md",
"Statistics and Probability/Mock exam run 1.md",
"Operating Systems",
@@ -234,17 +237,13 @@
"Discrete Structures/Midterm/attempt 2.md",
"Discrete Structures/Midterm/attempt 1.md",
"Discrete Structures/Midterm/Untitled.md",
"Discrete Structures/Midterm/Midterm prep.md",
"Discrete Structures/Midterm",
"Extracurricular/satQuest/img/Pasted image 20241206134156.png",
"Extracurricular/satQuest/Parts Proposal.md",
"Untitled.canvas",
"Discrete Structures/Mathematical Data Structures.md",
"Advanced Algorithms/Pasted image 20241203234600.png",
"Excalidraw",
"Extracurricular/satQuest/img/Pasted image 20241206134213.png",
"Extracurricular/satQuest/img/Pasted image 20241206134207.png",
"Extracurricular/satQuest/img/Pasted image 20241206133007.png",
"Extracurricular/satQuest/img",
"Advanced Algorithms/assets/pnp",
"Advanced Algorithms/assets/graph",
@@ -117,9 +117,95 @@ graph TD

```

### Mathematical Formulation

A [Markov Decision Process](https://en.wikipedia.org/wiki/Markov_decision_process)[^5] (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.

An MDP consists of:

- A set of states $S$
- A set of actions $A$
- A reward function $R$
- A transition function $P$
- A discount factor $\gamma$

It can be represented as a tuple $(S, A, R, P, \gamma)$.

Or a graph:

```mermaid
graph TD
A[States] --> B[Actions]
B --> C[Reward function]
C --> D[Transition function]
D --> E[Discount factor]
```
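
To make the tuple concrete, here is a minimal sketch (not from the lecture) of a tiny two-state MDP written out as plain Python dictionaries; every name and number in it is invented for illustration.

```python
# Hypothetical toy MDP: two states, two actions.
S = ["s0", "s1"]        # set of states
A = ["a0", "a1"]        # set of actions
gamma = 0.9             # discount factor

# Transition function P[s][a] = {next state: probability}
P = {
    "s0": {"a0": {"s0": 0.8, "s1": 0.2}, "a1": {"s0": 0.1, "s1": 0.9}},
    "s1": {"a0": {"s0": 0.5, "s1": 0.5}, "a1": {"s0": 0.0, "s1": 1.0}},
}

# Reward function R[s][a] = immediate reward for taking action a in state s
R = {
    "s0": {"a0": 1.0, "a1": 0.0},
    "s1": {"a0": -1.0, "a1": 2.0},
}
```

The later sketches in this note refer back to these made-up dictionaries.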

The process itself can be represented as a sequence of states, actions, and rewards:

$(s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots)$

The goal is to learn a policy $\pi$ that maps states to actions, i.e. $\pi(s) = a$.

The policy can be deterministic or stochastic[^4].
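
One way to see the difference in code (again a sketch, reusing the made-up state and action names from the toy MDP above): a deterministic policy can be a plain state-to-action dictionary, while a stochastic one stores a probability distribution per state.

```python
import random

# Deterministic policy: each state maps to exactly one action.
pi_det = {"s0": "a1", "s1": "a0"}

# Stochastic policy: each state maps to a distribution over actions.
pi_stoch = {
    "s0": {"a0": 0.3, "a1": 0.7},
    "s1": {"a0": 0.5, "a1": 0.5},
}

def act(policy, s):
    """Pick an action in state s under either kind of policy."""
    choice = policy[s]
    if isinstance(choice, dict):     # stochastic: sample from the distribution
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs, k=1)[0]
    return choice                    # deterministic: always the same action
```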

1. At time step $t=0$, the agent observes the initial state $s_0$.
2. For each $t = 0, 1, 2, \ldots$ until the episode ends (see the sketch after this list):
    - The agent selects an action $a_t$ based on the policy $\pi$.
    - The environment returns a reward $r_t$ and transitions to the next state $s_{t+1}$.
    - The agent updates its policy based on the reward and the next state.
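
A minimal sketch of this loop, under the same assumptions as before (toy MDP given as dictionaries, a fixed deterministic policy, and a fixed horizon standing in for a real termination condition); `run_episode` and its arguments are invented names, not course API.

```python
import random

def run_episode(P, R, policy, s0, horizon=10):
    """Roll out one episode of the agent-environment loop in a toy MDP given
    as dictionaries (like the hypothetical P and R sketched earlier).
    Returns the trajectory as a list of (s_t, a_t, r_t) triples."""
    s = s0
    trajectory = []
    for t in range(horizon):       # "until end" approximated by a fixed horizon
        a = policy[s]              # the agent selects a_t = pi(s_t)
        r = R[s][a]                # the environment grants reward r_t ...
        next_states, probs = zip(*P[s][a].items())
        s_next = random.choices(next_states, weights=probs, k=1)[0]
        trajectory.append((s, a, r))
        s = s_next                 # ... and transitions to s_{t+1}
        # (no policy update here: the policy is fixed; learning comes with Q-learning)
    return trajectory

# e.g.: run_episode(P, R, {"s0": "a1", "s1": "a0"}, "s0")
```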

To summarize, the (discounted) return from time step $t$ is:

$$
G_t = \sum_{k \geq 0} \gamma^k r_{t+k} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots
$$

where $G_t$ is the return at time step $t$, $r_t$ is the reward at time step $t$, and $\gamma$ is the discount factor.
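
The sum translates directly into code; the function name and the example numbers below are made up:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma**k * r_{t+k} for a finite reward list
    [r_t, r_{t+1}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# e.g. rewards [1, 0, 2] with gamma = 0.9:
# G = 1 + 0.9 * 0 + 0.81 * 2 = 2.62
assert abs(discounted_return([1, 0, 2], 0.9) - 2.62) < 1e-9
```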

## The value function

The value function $V(s)$ is the expected return starting from state $s$ and following policy $\pi$.

$$
V_\pi(s) = \mathbb{E}_\pi(G_t | s_t = s)
$$

Similarly, the action-value function $Q(s, a)$ is the expected return starting from state $s$, taking action $a$, and following policy $\pi$.

$$
Q_\pi(s, a) = \mathbb{E}_\pi(G_t | s_t = s, a_t = a)
$$
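
The expectation in these definitions can be read operationally: average the return over many rollouts that all start from the same state (and, for $Q$, take the same first action). A rough Monte Carlo sketch; the names are invented, and the earlier toy-MDP helpers are only referenced in a comment.

```python
def mc_value_estimate(sample_rewards, gamma, n_episodes=1000):
    """Monte Carlo estimate of V_pi(s): average the discounted return over
    many sampled episodes that all start in the same state s.
    `sample_rewards` is any callable returning one episode's reward
    sequence [r_t, r_{t+1}, ...] under policy pi."""
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_rewards()
        total += sum((gamma ** k) * r for k, r in enumerate(rewards))
    return total / n_episodes

# e.g., plugging in the run_episode sketch from above:
# mc_value_estimate(lambda: [r for (_, _, r) in run_episode(P, R, pi_det, "s0")], gamma)
```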

### Bellman equation

Named after the same Richard Bellman as in [Graph Algorithms](Graph%20Algorithms.md).

It states that the value of a state is the expected immediate reward plus the discounted value of the next state:

$$
V_\pi(s) = \mathbb{E}_\pi(r_{t+1} + \gamma V_\pi(s_{t+1}) | s_t = s)
$$
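
Read as an update rule rather than an identity, the Bellman equation gives a simple way to evaluate a fixed policy by sweeping over the states. A sketch in the same made-up dictionary format as the earlier toy MDP (the function and argument names are mine, not the course's):

```python
def policy_evaluation(S, P, R, gamma, policy, sweeps=100):
    """Iteratively apply the Bellman equation as an update:
    V(s) <- R[s][pi(s)] + gamma * sum_s' P[s][pi(s)][s'] * V(s').
    Assumes a deterministic policy and the toy-MDP dictionaries sketched earlier."""
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        # Synchronous sweep: build the new V from the old V in one go.
        V = {
            s: R[s][policy[s]]
               + gamma * sum(p * V[s2] for s2, p in P[s][policy[s]].items())
            for s in S
        }
    return V

# e.g. policy_evaluation(S, P, R, gamma, {"s0": "a1", "s1": "a0"})
```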

## Q-learning

Something makes me feel like this will be in the exam.

The goal of Q-learning is to find the optimal policy by learning the optimal Q-value for each state-action pair.

What's a Q-value? It's the expected return starting from state $s$, taking action $a$, and following policy $\pi$, i.e. the $Q_\pi(s, a)$ defined above. The optimal Q-value is the best achievable over all policies:

$$
Q^*(s, a) = \max_\pi Q_\pi(s, a)
$$

The optimal Q-value $Q^*(s, a)$ is the maximum Q-value for state $s$ and action $a$ over all policies. The algorithm iteratively updates the Q-values using the Bellman equation; this is called **value iteration** (sketched below).
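
A sketch of that value-iteration style update on the made-up dictionary MDP from earlier; note that this version assumes the transition function $P$ and reward function $R$ are known, and all names are invented.

```python
def q_value_iteration(S, A, P, R, gamma, sweeps=100):
    """Iteratively apply the Bellman optimality backup:
    Q(s, a) <- R[s][a] + gamma * sum_s' P[s][a][s'] * max_a' Q(s', a').
    Uses the toy-MDP dictionary format from the earlier sketches."""
    Q = {s: {a: 0.0 for a in A} for s in S}
    for _ in range(sweeps):
        # Synchronous sweep over all state-action pairs.
        Q = {
            s: {
                a: R[s][a]
                   + gamma * sum(p * max(Q[s2].values()) for s2, p in P[s][a].items())
                for a in A
            }
            for s in S
        }
    # The greedy policy reads off the best action in each state.
    policy = {s: max(Q[s], key=Q[s].get) for s in S}
    return Q, policy

# e.g. Q, pi = q_value_iteration(S, A, P, R, gamma)
```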

## Conclusion

As with every other fucking course that deals with graphs in any way, shape, or form, we have to deal with A FUCK TON of hard-to-read notation <3.



[^1]: Prototypes in this context are a representative sample of the data. For example, if we have a dataset of images of cats and dogs, we can represent the dataset by a few images of cats and dogs that are representative of the whole dataset.
@@ -127,4 +213,8 @@ graph TD

[^2]: Parametrization is the process of defining a model in terms of its parameters. For example, in the model $m = \gamma ( \beta_0 + \beta_1 x_1)$, $\beta_0$ and $\beta_1$ are the parameters of the model.

[^3]: An agent is an entity that interacts with the environment. For example, a self-driving car is an agent that interacts with the environment (the road, other cars, etc.) to achieve a goal (e.g. reach a destination).

[^4]: A deterministic policy maps each state to a single action, while a stochastic policy maps each state to a probability distribution over actions. For example, in state $s$ a deterministic policy always picks the same action $a$, whereas a stochastic policy might pick $a_1$ with probability 0.7 and $a_2$ with probability 0.3.

[^5]: https://en.wikipedia.org/wiki/Markov_chain
BIN
Introduction to Machine Learning/image.png (new file)
Binary file not shown. (After: 52 KiB)