Notes/Advanced Algorithms/Pattern matching.md
2024-12-07 21:07:38 +01:00

---
type: mixed
---
## Prefix Function ($\pi$)
The prefix function is a tool used in pattern matching algorithms, particularly in the **Knuth-Morris-Pratt (KMP) algorithm**. It is designed to preprocess a pattern to facilitate efficient searching.
### Definition
For a string $P$ of length $m$, the prefix function $\pi[i]$ for $i = 1, 2, \ldots, m$ is the length of the longest proper prefix of the substring $P[1 \ldots i]$ that is also a suffix of this substring.
### Key Points
1. A proper prefix of a string is a prefix that is not equal to the entire string. [^1]
2. $\pi[i]$ helps skip unnecessary comparisons in pattern matching by indicating the next position to check after a mismatch.
3. $\pi[1] = 0$ always, since no proper prefix of a single character can also be a suffix.
### Example
For the pattern $P = "ababcab"$:
- $P[1] = "a"$: $\pi[1] = 0$.
- $P[1 \ldots 2] = "ab"$: No prefix matches the suffix, so $\pi[2] = 0$.
- $P[1 \ldots 3] = "aba"$: Prefix "a" matches suffix "a", so $\pi[3] = 1$.
- $P[1 \ldots 4] = "abab"$: Prefix "ab" matches suffix "ab", so $\pi[4] = 2$.
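- $P[1 \ldots 5] = "ababc"$: No non-empty proper prefix is also a suffix, so $\pi[5] = 0$.
- $P[1 \ldots 6] = "ababca"$: Prefix "a" matches suffix "a", so $\pi[6] = 1$.
- $P[1 \ldots 7] = "ababcab"$: Prefix "ab" matches suffix "ab", so $\pi[7] = 2$.
- Overall, $\pi = [0, 0, 1, 2, 0, 1, 2]$.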
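
The computation in the example above can be sketched in Python. This is a minimal sketch, not a reference implementation: the function name `prefix_function` is an assumption, and the list is 0-indexed, so `pi[i]` here corresponds to $\pi[i+1]$ in the notation of these notes.

```python
def prefix_function(pattern: str) -> list[int]:
    """Return pi, where pi[i] is the length of the longest proper prefix
    of pattern[:i + 1] that is also a suffix of it (0-indexed)."""
    pi = [0] * len(pattern)
    k = 0  # length of the longest prefix-suffix match found so far
    for i in range(1, len(pattern)):
        # Fall back to shorter matches until one can be extended by pattern[i].
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


print(prefix_function("ababcab"))  # [0, 0, 1, 2, 0, 1, 2]
```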
---
## Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm is a pattern matching algorithm that uses the prefix function $\pi$ to efficiently search for occurrences of a pattern $P$ in a text $T$.
### Key Idea
When a mismatch occurs during the comparison of $P$ with $T$, use the prefix function $\pi$ to determine the next position in $P$ to continue matching, rather than restarting from the beginning.
### Steps
1. Compute the prefix function $\pi$ for the pattern $P$.
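2. Search:
    - Compare $P$ with substrings of $T$, moving through $T$ one character at a time without ever going back.
    - If there's a mismatch at $P[j]$ and $T[i]$ after $P[1 \ldots j-1]$ has matched, continue by comparing $T[i]$ with $P[\pi[j-1] + 1]$ rather than restarting at $P[1]$. If $j = 1$, move on to $T[i+1]$.
3. The algorithm runs in $O(n + m)$ time [complexity](Complexity.md), where $n$ is the length of $T$ and $m$ is the length of $P$.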
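
The steps above can be sketched as follows. This is a hedged illustration: the names `prefix_function` and `kmp_search` are assumptions, indices are 0-based (unlike the 1-based notation in these notes), and returning no matches for an empty pattern is a choice made for this sketch.

```python
def prefix_function(pattern: str) -> list[int]:
    """pi[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 0-indexed start positions of all occurrences of pattern in text."""
    if not pattern:
        return []  # assumption of this sketch: an empty pattern has no matches
    pi = prefix_function(pattern)
    matches = []
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, reuse the longest prefix that is still matched;
        # the text index i never moves backward.
        while j > 0 and ch != pattern[j]:
            j = pi[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = pi[j - 1]  # continue, allowing overlapping occurrences
    return matches


print(kmp_search("abababcabab", "abab"))  # [0, 2, 7]
```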
---
## Rabin-Karp Algorithm
The Rabin-Karp algorithm is another pattern matching algorithm, notable for using hashing to identify potential matches.
### Key Idea
Instead of comparing substrings character by character, the algorithm compares hash values of the pattern and substrings of the text.
### Steps
1. Compute the hash value of the pattern $P$ and the first substring of the text $T$ of length $m$.
2. Slide the window over $T$ and compute hash values for the next substrings in constant time using a rolling hash. [^2]
3. If the hash value of a substring matches the hash value of $P$, compare the actual strings to confirm the match.
### Hash Function
The hash function is typically chosen such that it is fast to compute and minimizes collisions:
$$
h(s) = (s[1] \cdot p^{m-1} + s[2] \cdot p^{m-2} + \ldots + s[m] \cdot p^0) \bmod q,
$$
where:
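- $s[k]$ is the numeric code of the $k$-th character of $s$,
- $p$ is a base (e.g., a small prime number),
- $q$ is a large prime modulus: reducing modulo $q$ keeps the values bounded (avoiding overflow), and a large prime keeps collisions rare.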
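
With this definition, the constant-time update for the rolling hash follows directly. As a sketch in the notation above (the window indices $i$ and $i + m$ are introduced here for illustration), moving the window from $T[i \ldots i+m-1]$ to $T[i+1 \ldots i+m]$ removes the leading term, shifts the remaining terms by one power of $p$, and appends the new character:

$$
h(T[i+1 \ldots i+m]) = \big( (h(T[i \ldots i+m-1]) - T[i] \cdot p^{m-1}) \cdot p + T[i+m] \big) \bmod q
$$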
### Complexity
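- Best Case: $O(n + m)$, where $n$ is the length of the text and $m$ is the length of the pattern. This is also the expected running time when hash collisions are rare.
- Worst Case: $O(nm)$, when many windows share the hash of $P$ (through collisions or genuine matches) and each one has to be verified character by character.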
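
To see where both bounds come from, here is a minimal Python sketch of the algorithm. The function name `rabin_karp` and the default values for `base` and `mod` are assumptions made for illustration, positions are 0-indexed, and characters are mapped to numbers with `ord`. The string comparison after a hash hit is the verification step that leads to the $O(nm)$ worst case.

```python
def rabin_karp(text: str, pattern: str, base: int = 257,
               mod: int = 1_000_000_007) -> list[int]:
    """Return the 0-indexed start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # base^(m-1) mod q, weight of the leading character

    def poly_hash(s: str) -> int:
        # Polynomial hash evaluated with Horner's rule.
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    target = poly_hash(pattern)
    window = poly_hash(text[:m])
    matches = []
    for i in range(n - m + 1):
        # Equal hashes only suggest a match; compare the strings to rule out a collision.
        if window == target and text[i:i + m] == pattern:
            matches.append(i)
        if i + m < n:
            # Rolling update: drop text[i], shift by one power, append text[i + m].
            window = ((window - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches


print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```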
---
## KMP vs. Rabin-Karp
| Feature | Knuth-Morris-Pratt (KMP) | Rabin-Karp |
| ------------- | ------------------------ | --------------------------------------------------- |
| Technique | Prefix function | Hashing |
| Preprocessing | Compute $\pi$ array | Compute hash of $P$ |
| Efficiency | $O(n + m)$ | $O(n + m)$ (best), $O(nm)$ (worst) |
| Use Case      | Searching for a single pattern | Searching for multiple patterns at once        |

_This table is too AI generated for me._ In short: use KMP when looking for a single pattern, use Rabin-Karp when looking for multiple patterns.

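
One way to read the multiple-pattern case is sketched below. It rests on assumptions not stated in these notes: all patterns have the same length, the function name `rabin_karp_multi` is hypothetical, and `base` and `mod` are illustrative defaults. The text is scanned once, and each window hash is looked up in a table of pattern hashes.

```python
from collections import defaultdict


def rabin_karp_multi(text: str, patterns: list[str], base: int = 257,
                     mod: int = 1_000_000_007) -> dict[str, list[int]]:
    """Find all occurrences of several equal-length patterns in one pass over text."""
    found = {p: [] for p in patterns}
    if not patterns:
        return found
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("this sketch assumes all patterns have the same length")
    n = len(text)
    if m == 0 or m > n:
        return found

    def poly_hash(s: str) -> int:
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Map each hash value to the patterns that produce it.
    by_hash = defaultdict(set)
    for p in patterns:
        by_hash[poly_hash(p)].add(p)

    high = pow(base, m - 1, mod)
    window = poly_hash(text[:m])
    for i in range(n - m + 1):
        if window in by_hash:
            # Verify against the actual strings to rule out collisions.
            candidate = text[i:i + m]
            if candidate in by_hash[window]:
                found[candidate].append(i)
        if i + m < n:
            window = ((window - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return found


print(rabin_karp_multi("abracadabra", ["abr", "cad", "xyz"]))
# {'abr': [0, 7], 'cad': [4], 'xyz': []}
```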
---
## Footnotes
[^1]: A proper prefix of a string $s$ is any prefix of $s$ that is not equal to $s$ itself. For example, proper prefixes of "abc" are "", "a", and "ab".
[^2]: A rolling hash computes the hash of a new substring by updating the hash of the previous substring, avoiding the need to recompute from scratch.