---
type: mixed
---

## Prefix Function ($\pi$)

The prefix function is a preprocessing tool used in pattern matching algorithms, particularly in the **Knuth-Morris-Pratt (KMP) algorithm**. It is computed from the pattern alone, before any text is read, and records how the pattern overlaps with itself so that the search can skip comparisons.

### Definition
For a string $P$ of length $m$, the prefix function $\pi[i]$ for $i = 1, 2, \ldots, m$ is the length of the longest proper prefix of the substring $P[1 \ldots i]$ that is also a suffix of this substring.
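
The same definition can be restated as a formula. This is only a rewording under the 1-based indexing used here, with $k = 0$ standing for the empty prefix:

$$
\pi[i] = \max \{\, k : 0 \le k < i \ \text{and}\ P[1 \ldots k] = P[i - k + 1 \ldots i] \,\}
$$

The set always contains $k = 0$, so the maximum is well defined for every $i$.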

### Key Points
1. A proper prefix of a string is a prefix that is not equal to the entire string. [^1]
2. After a mismatch, $\pi$ tells the matcher how many pattern characters are still known to match, so it can resume from that position instead of starting over.
3. $\pi[1] = 0$ always, since the only proper prefix of a single character is the empty string.

### Example
For the pattern $P = "ababcab"$:
- $P[1] = "a"$: The only proper prefix is the empty string, so $\pi[1] = 0$.
- $P[1 \ldots 2] = "ab"$: No non-empty proper prefix is also a suffix, so $\pi[2] = 0$.
- $P[1 \ldots 3] = "aba"$: Prefix "a" matches suffix "a", so $\pi[3] = 1$.
- $P[1 \ldots 4] = "abab"$: Prefix "ab" matches suffix "ab", so $\pi[4] = 2$.
- $P[1 \ldots 5] = "ababc"$: No proper prefix ends in "c", so $\pi[5] = 0$.
- $P[1 \ldots 6] = "ababca"$: Prefix "a" matches suffix "a", so $\pi[6] = 1$.
- $P[1 \ldots 7] = "ababcab"$: Prefix "ab" matches suffix "ab", so $\pi[7] = 2$.
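
The values in this example can be checked with a short program. The following is a minimal Python sketch of the standard linear-time computation, not taken from these notes: the name `prefix_function` is an assumption, and the returned list is 0-based, so `pi[i]` in the code corresponds to $\pi[i + 1]$ in the text.

```python
def prefix_function(pattern: str) -> list[int]:
    """Return pi as a 0-based list; pi[i] is what the text calls pi[i + 1]."""
    pi = [0] * len(pattern)
    k = 0  # length of the longest proper prefix that is also a suffix so far
    for i in range(1, len(pattern)):
        # Fall back to shorter candidates until one can be extended by pattern[i].
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


print(prefix_function("ababcab"))  # [0, 0, 1, 2, 0, 1, 2]
```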

---

## Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm is a pattern matching algorithm that uses the prefix function $\pi$ to efficiently search for occurrences of a pattern $P$ in a text $T$.

### Key Idea
When a mismatch occurs during the comparison of $P$ with $T$, use the prefix function $\pi$ to determine the next position in $P$ to continue matching, rather than restarting from the beginning.

### Steps
1. Compute the prefix function $\pi$ for the pattern $P$.
2. Search:
   - Compare $P$ with $T$ from left to right, keeping track of how many characters of $P$ currently match.
   - If $P[j] \ne T[i]$ after $P[1 \ldots j-1]$ has matched, continue by comparing $T[i]$ with $P[\pi[j-1] + 1]$ rather than restarting at $P[1]$. If $j = 1$, move on to $T[i+1]$. The position in $T$ never moves backward.
3. The algorithm runs in $O(n + m)$ time [complexity](Complexity.md), where $n$ is the length of $T$ and $m$ is the length of $P$.
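
The steps above can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the names `prefix_function` and `kmp_search` are assumptions, indices are 0-based (so `j` counts matched characters and the fallback reads `pi[j - 1]`), and returning no matches for an empty pattern is a choice made here for simplicity.

```python
def prefix_function(pattern: str) -> list[int]:
    """Step 1: 0-based prefix function of the pattern."""
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_search(text: str, pattern: str) -> list[int]:
    """Step 2: return the 0-based start index of every occurrence."""
    if not pattern:
        return []  # assumption: an empty pattern reports no matches
    pi = prefix_function(pattern)
    matches = []
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, shrink the matched prefix instead of moving back in text.
        while j > 0 and ch != pattern[j]:
            j = pi[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = pi[j - 1]  # continue, so overlapping occurrences are found too
    return matches


print(kmp_search("abababcabab", "abab"))  # [0, 2, 7]
```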

---

## Rabin-Karp Algorithm

The Rabin-Karp algorithm is another pattern matching algorithm, notable for using hashing to identify potential matches.

### Key Idea
Instead of comparing substrings character by character, the algorithm compares hash values of the pattern and substrings of the text.

### Steps
1. Compute the hash value of the pattern $P$ and of the first substring of the text $T$ of length $m$.
2. Slide the window over $T$ one position at a time, computing the hash value of each next substring in constant time using a rolling hash. [^2]
3. If the hash value of a substring matches the hash value of $P$, compare the actual strings to confirm the match, since different strings can share a hash value.
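
The three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `rabin_karp`, the defaults `base=256` and `mod=1_000_000_007`, the use of `ord()` as the character value, and the 0-based indices are all choices made for this example, not something fixed by the algorithm.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = 1_000_000_007) -> list[int]:
    """Return the 0-based start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    # Step 1: hash the pattern and the first window of the text.
    p_hash = w_hash = 0
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % mod
        w_hash = (w_hash * base + ord(text[k])) % mod
    matches = []
    for start in range(n - m + 1):
        # Step 3: on a hash hit, confirm with a real string comparison.
        if w_hash == p_hash and text[start:start + m] == pattern:
            matches.append(start)
        # Step 2: roll the hash forward by one character in O(1).
        if start + m < n:
            w_hash = (w_hash - ord(text[start]) * high) % mod
            w_hash = (w_hash * base + ord(text[start + m])) % mod
    return matches


print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

The string comparison on a hash hit is what keeps the result correct when two different windows happen to share a hash value.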

### Hash Function
The hash function is typically chosen so that it is fast to compute and collisions are rare:
$$
h(s) = (s[1] \cdot p^{m-1} + s[2] \cdot p^{m-2} + \ldots + s[m] \cdot p^0) \bmod q,
$$
where:
- $p$ is a base (e.g., a small prime number),
- $q$ is the modulus, usually a large prime: reducing modulo $q$ keeps intermediate values bounded, and a large prime makes collisions less likely.
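
With this polynomial hash, the constant-time rolling update can be written out explicitly. As a sketch, write $h_i$ for the hash of the window $T[i \ldots i + m - 1]$ (the name $h_i$ is introduced here only for illustration):

$$
h_{i+1} = \big( (h_i - T[i] \cdot p^{m-1}) \cdot p + T[i+m] \big) \bmod q
$$

Subtracting $T[i] \cdot p^{m-1}$ drops the leading character, multiplying by $p$ shifts the remaining terms up by one power, and adding $T[i+m]$ brings in the new character. The factor $p^{m-1} \bmod q$ can be precomputed once, so each update costs $O(1)$ arithmetic operations.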
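
### Complexity
- Best Case: $O(n + m)$, where $n$ is the length of the text and $m$ is the length of the pattern. This holds when only a few windows need a character-by-character check.
- Worst Case: $O(nm)$, when nearly every window has the same hash as $P$ (through collisions or genuine matches) and must be compared character by character.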

---

## KMP vs. Rabin-Karp

| Feature         | Knuth-Morris-Pratt (KMP)       | Rabin-Karp                         |
| --------------- | ------------------------------ | ---------------------------------- |
| Technique       | Prefix function                | Hashing                            |
| Preprocessing   | Compute $\pi$ array            | Compute hash of $P$                |
| Time complexity | $O(n + m)$                     | $O(n + m)$ (best), $O(nm)$ (worst) |
| Use case        | Searching for a single pattern | Searching for multiple patterns    |

_This graphic is too AI generated for me_ -> Use KMP when looking for a single pattern, use Rabin-Karp when looking for multiple patterns.

---

## Footnotes

[^1]: A proper prefix of a string $s$ is any prefix of $s$ that is not equal to $s$ itself. For example, proper prefixes of "abc" are "", "a", and "ab".
[^2]: A rolling hash computes the hash of a new substring by updating the hash of the previous substring, avoiding the need to recompute from scratch.