Notes/Advanced Algorithms/Pattern matching.md

2024-12-07 21:07:38 +01:00
---
type: mixed
---
## Prefix Function ($\pi$)
The prefix function is a tool used in pattern matching algorithms, particularly in the **Knuth-Morris-Pratt (KMP) algorithm**. It is designed to preprocess a pattern to facilitate efficient searching.
### Definition
For a string $P$ of length $m$, the prefix function $\pi[i]$ for $i = 1, 2, \ldots, m$ is the length of the longest proper prefix of the substring $P[1 \ldots i]$ that is also a suffix of this substring.
### Key Points
1. A proper prefix of a string is a prefix that is not equal to the entire string. [^1]
2. $\pi[i]$ helps skip unnecessary comparisons in pattern matching by indicating the next position to check after a mismatch.
3. $\pi[1] = 0$ always, since the only proper prefix of a single character is the empty string.
### Example
For the pattern $P = "ababcab"$:
- $P[1] = "a"$: $\pi[1] = 0$.
- $P[1 \ldots 2] = "ab"$: No nonempty proper prefix matches a suffix, so $\pi[2] = 0$.
- $P[1 \ldots 3] = "aba"$: Prefix "a" matches suffix "a", so $\pi[3] = 1$.
- $P[1 \ldots 4] = "abab"$: Prefix "ab" matches suffix "ab", so $\pi[4] = 2$.
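- $P[1 \ldots 5] = "ababc"$: No nonempty proper prefix matches a suffix, so $\pi[5] = 0$.
- $P[1 \ldots 6] = "ababca"$: Prefix "a" matches suffix "a", so $\pi[6] = 1$.
- $P[1 \ldots 7] = "ababcab"$: Prefix "ab" matches suffix "ab", so $\pi[7] = 2$.
- Overall, $\pi = [0, 0, 1, 2, 0, 1, 2]$.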
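
The example can be checked with a minimal sketch that follows the definition literally. The function name `prefix_function_naive` is made up for illustration, and the result is a 0-indexed Python list, so `result[i - 1]` corresponds to $\pi[i]$. This brute-force version is only meant for verifying small examples, not for use inside KMP.

```python
def prefix_function_naive(pattern: str) -> list[int]:
    """Compute the prefix function directly from its definition.

    Returns a 0-indexed list: result[i - 1] holds pi[i] from the notes.
    Worst case O(m^3), so only suitable for checking small examples.
    """
    m = len(pattern)
    pi = [0] * m
    for i in range(1, m + 1):          # i is the 1-based prefix length
        sub = pattern[:i]              # corresponds to P[1..i]
        # Try proper prefix lengths from longest to shortest.
        for k in range(i - 1, 0, -1):
            if sub[:k] == sub[-k:]:    # prefix of length k equals suffix of length k
                pi[i - 1] = k
                break
    return pi


print(prefix_function_naive("ababcab"))  # [0, 0, 1, 2, 0, 1, 2]
```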
---
## Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm is a pattern matching algorithm that uses the prefix function $\pi$ to efficiently search for occurrences of a pattern $P$ in a text $T$.
### Key Idea
When a mismatch occurs during the comparison of $P$ with $T$, use the prefix function $\pi$ to determine the next position in $P$ to continue matching, rather than restarting from the beginning.
### Steps
1. Compute the prefix function $\pi$ for the pattern $P$.
2. Search:
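   - Compare $P$ with $T$ character by character, keeping track of the current positions $j$ in $P$ and $i$ in $T$.
   - If there's a mismatch at $P[j]$ and $T[i]$ with $j > 1$, continue by comparing $T[i]$ with $P[\pi[j-1] + 1]$ rather than restarting at $P[1]$; if $j = 1$, move on to $T[i+1]$.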
3. The algorithm runs in $O(n + m)$ time [complexity](Complexity.md), where $n$ is the length of $T$ and $m$ is the length of $P$.
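
The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `prefix_function` and `kmp_search` are hypothetical, indices are 0-based (unlike the 1-based notation in these notes), and returning no matches for an empty pattern is an assumption.

```python
def prefix_function(pattern: str) -> list[int]:
    """Linear-time prefix function; pi[i] refers to pattern[:i + 1] (0-indexed)."""
    pi = [0] * len(pattern)
    k = 0  # length of the longest proper prefix that is also a suffix so far
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]          # fall back to the next shorter candidate
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start indices of all occurrences of pattern in text."""
    if not pattern:
        return []                  # assumption: an empty pattern yields no matches
    pi = prefix_function(pattern)
    matches = []
    j = 0                          # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = pi[j - 1]          # mismatch: reuse the matched prefix, do not restart
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):      # full match ending at text[i]
            matches.append(i - j + 1)
            j = pi[j - 1]          # continue, so overlapping matches are found too
    return matches


print(kmp_search("abababcabababcab", "ababcab"))  # [2, 9]
```

The text index `i` never moves backwards, and `j` can only fall back as often as it has advanced, which is where the $O(n + m)$ bound comes from.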
---
## Rabin-Karp Algorithm
The Rabin-Karp algorithm is another pattern matching algorithm, notable for using hashing to identify potential matches.
### Key Idea
Instead of comparing substrings character by character, the algorithm compares hash values of the pattern and substrings of the text.
### Steps
1. Compute the hash value of the pattern $P$ and the first substring of the text $T$ of length $m$.
2. Slide the window over $T$ and compute hash values for the next substrings in constant time using a rolling hash. [^2]
3. If the hash value of a substring matches the hash value of $P$, compare the actual strings to confirm the match.
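
A minimal sketch of these three steps, assuming a polynomial hash with a base and a prime modulus. The function name `rabin_karp` and the default values `base=257` and `mod=1_000_000_007` are illustrative choices, not prescribed by the algorithm. Indices are 0-based.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 257, mod: int = 1_000_000_007) -> list[int]:
    """Return the 0-based start indices of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []                      # assumption: empty pattern yields no matches
    high = pow(base, m - 1, mod)       # weight of the leading character in a window

    def full_hash(s: str) -> int:
        # Polynomial hash evaluated with Horner's rule.
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    target = full_hash(pattern)        # step 1: hash of the pattern
    window = full_hash(text[:m])       # step 1: hash of the first window
    matches = []
    for start in range(n - m + 1):
        # Step 3: on a hash hit, compare the actual strings to rule out collisions.
        if window == target and text[start:start + m] == pattern:
            matches.append(start)
        if start + m < n:
            # Step 2: rolling update. Drop text[start], shift, append text[start + m].
            window = ((window - ord(text[start]) * high) * base
                      + ord(text[start + m])) % mod
    return matches


print(rabin_karp("abababcabababcab", "ababcab"))  # [2, 9]
```

The explicit string comparison keeps the result correct even with a deliberately poor hash (for example `base=2, mod=3`); only the running time suffers.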
### Hash Function
The hash function is typically chosen such that it is fast to compute and minimizes collisions:
$$
h(s) = (s[1] \cdot p^{m-1} + s[2] \cdot p^{m-2} + \ldots + s[m] \cdot p^0) \bmod q,
$$
where:
- $p$ is a base (e.g., a small prime number),
- $q$ is a large prime modulus that keeps hash values bounded (avoiding overflow) and makes collisions less likely.
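
With this hash, the rolling update can be written out explicitly. Moving the window from $T[i \ldots i+m-1]$ to $T[i+1 \ldots i+m]$ removes the leading term, multiplies the rest by $p$, and adds the new character:
$$
h(T[i+1 \ldots i+m]) = \big( (h(T[i \ldots i+m-1]) - T[i] \cdot p^{m-1}) \cdot p + T[i+m] \big) \bmod q.
$$
As a small worked example, assume digits are hashed by their numeric value with $p = 10$, $q = 13$, $T = "3141"$ and $m = 3$ (values chosen only for illustration):
$$
h("314") = 314 \bmod 13 = 2, \qquad h("141") = ((2 - 3 \cdot 10^{2}) \cdot 10 + 1) \bmod 13 = (-2979) \bmod 13 = 11,
$$
which agrees with the direct computation $141 \bmod 13 = 11$.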
### Complexity
- Best Case: $O(n + m)$, where $n$ is the length of the text and $m$ is the length of the pattern.
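- Worst Case: $O(nm)$, when many windows share the pattern's hash (collisions or genuine matches), since each hash hit triggers an $O(m)$ comparison.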
---
## KMP vs. Rabin-Karp
| Feature | Knuth-Morris-Pratt (KMP) | Rabin-Karp |
| ------------- | ------------------------ | --------------------------------------------------- |
| Technique | Prefix function | Hashing |
| Preprocessing | Compute $\pi$ array | Compute hash of $P$ |
| Efficiency | $O(n + m)$ | $O(n + m)$ (best), $O(nm)$ (worst) |
| Use Case | Best for exact matches | Useful for multiple patterns or approximate matches |

_This table is too AI generated for me._ In short: use KMP when looking for a single pattern, use Rabin-Karp when searching for multiple patterns.

---
## Footnotes
[^1]: A proper prefix of a string $s$ is any prefix of $s$ that is not equal to $s$ itself. For example, proper prefixes of "abc" are "", "a", and "ab".
[^2]: A rolling hash computes the hash of a new substring by updating the hash of the previous substring, avoiding the need to recompute from scratch.