---
type: mixed
---

## Prefix Function ($\pi$)

The prefix function is a preprocessing tool used in pattern matching algorithms, particularly in the **Knuth-Morris-Pratt (KMP) algorithm**. It is computed from the pattern alone and records how much of a partial match can be reused after a mismatch.

### Definition
For a string $P$ of length $m$, the prefix function $\pi[i]$ for $i = 1, 2, \ldots, m$ is the length of the longest proper prefix of the substring $P[1 \ldots i]$ that is also a suffix of this substring.

### Key Points
1. A proper prefix of a string is a prefix that is not equal to the entire string. [^1]
2. $\pi[i]$ helps skip unnecessary comparisons in pattern matching by indicating the next position to check after a mismatch.
3. $\pi[1] = 0$ always, since the only proper prefix of a single character is the empty string.

### Example
For the pattern $P = "ababcab"$:
- $P[1] = "a"$: $\pi[1] = 0$.
- $P[1 \ldots 2] = "ab"$: No non-empty proper prefix is also a suffix, so $\pi[2] = 0$.
- $P[1 \ldots 3] = "aba"$: Prefix "a" matches suffix "a", so $\pi[3] = 1$.
- $P[1 \ldots 4] = "abab"$: Prefix "ab" matches suffix "ab", so $\pi[4] = 2$.
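- $P[1 \ldots 5] = "ababc"$: No non-empty proper prefix ends in "c", so $\pi[5] = 0$.
- $P[1 \ldots 6] = "ababca"$: Prefix "a" matches suffix "a", so $\pi[6] = 1$.
- $P[1 \ldots 7] = "ababcab"$: Prefix "ab" matches suffix "ab", so $\pi[7] = 2$.

Altogether, $\pi = (0, 0, 1, 2, 0, 1, 2)$.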
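
The computation in the example can be sketched in Python. This is a minimal sketch, not a reference implementation: the function name `prefix_function` is an assumption, and the returned list is 0-indexed, so `pi[i]` here corresponds to $\pi[i+1]$ in the notation above.

```python
def prefix_function(pattern: str) -> list[int]:
    """Return the prefix function of `pattern` as a 0-indexed list."""
    pi = [0] * len(pattern)
    k = 0  # length of the longest proper prefix that is also a suffix so far
    for i in range(1, len(pattern)):
        # Fall back to shorter candidates until one can be extended.
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


print(prefix_function("ababcab"))  # [0, 0, 1, 2, 0, 1, 2]
```

Each iteration increases `k` by at most one and every fallback decreases it, so the whole loop takes $O(m)$ time.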

---

## Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm is a pattern matching algorithm that uses the prefix function $\pi$ to efficiently search for occurrences of a pattern $P$ in a text $T$.

### Key Idea
When a mismatch occurs during the comparison of $P$ with $T$, use the prefix function $\pi$ to determine the next position in $P$ to continue matching, rather than restarting from the beginning.

### Steps
1. Compute the prefix function $\pi$ for the pattern $P$.
2. Search:
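   - Scan $T$ from left to right, keeping track of $j$, the position in $P$ currently compared with $T[i]$.
   - If there’s a mismatch at $P[j]$ and $T[i]$ with $j > 1$, the first $j - 1$ characters of $P$ have matched, so continue by comparing $T[i]$ with $P[\pi[j-1] + 1]$ rather than restarting at $P[1]$. If $j = 1$, move on to $T[i+1]$.
   - When all $m$ characters match, report an occurrence and continue with $\pi[m]$ characters already matched.

The search never moves backwards in $T$, so the algorithm has $O(n + m)$ time [complexity](Complexity.md), where $n$ is the length of $T$ and $m$ is the length of $P$.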
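
The steps above can be sketched as follows. The names `prefix_function` and `kmp_search` are assumptions, indices are 0-based (unlike the 1-based notation in these notes), and returning an empty list for an empty pattern is an arbitrary choice made for this sketch.

```python
def prefix_function(pattern: str) -> list[int]:
    """0-indexed prefix function of `pattern`."""
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start index of every occurrence of `pattern` in `text`."""
    if not pattern:
        return []  # assumption: an empty pattern yields no matches
    pi = prefix_function(pattern)
    matches = []
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, reuse the longest border instead of restarting.
        while j > 0 and ch != pattern[j]:
            j = pi[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = pi[j - 1]  # keep going to find overlapping occurrences
    return matches


print(kmp_search("abababcabab", "abab"))  # [0, 2, 7]
```

The text index `i` only moves forward, which is where the linear running time comes from.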

---

## Rabin-Karp Algorithm

The Rabin-Karp algorithm is another pattern matching algorithm, notable for using hashing to identify potential matches.

### Key Idea
Instead of comparing substrings character by character, the algorithm compares hash values of the pattern and substrings of the text.

### Steps
1. Compute the hash value of the pattern $P$ and the first substring of the text $T$ of length $m$.
2. Slide the window over $T$ and compute hash values for the next substrings in constant time using a rolling hash. [^2]
3. If the hash value of a substring matches the hash value of $P$, compare the actual strings to confirm the match.
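
A minimal sketch of these three steps, assuming a polynomial hash over character codes. The function name `rabin_karp` and the default values of `base` and `mod` are illustrative assumptions, not values prescribed by the algorithm.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = 1_000_000_007) -> list[int]:
    """Return the 0-based start index of every occurrence of `pattern` in `text`."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character

    # Step 1: hash the pattern and the first window of the text.
    hp = ht = 0
    for k in range(m):
        hp = (hp * base + ord(pattern[k])) % mod
        ht = (ht * base + ord(text[k])) % mod

    matches = []
    for s in range(n - m + 1):
        # Step 3: equal hashes only suggest a match, so verify the characters.
        if hp == ht and text[s:s + m] == pattern:
            matches.append(s)
        # Step 2: roll the hash forward by one character in O(1).
        if s < n - m:
            ht = ((ht - ord(text[s]) * high) * base + ord(text[s + m])) % mod
    return matches


print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

The explicit string comparison is what keeps the result correct even when two different substrings share a hash value. For instance, calling the sketch with `mod=1` makes every window collide, and it still returns only the true matches.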

### Hash Function
The hash function is typically chosen such that it is fast to compute and minimizes collisions:
$$
h(s) = (s[1] \cdot p^{m-1} + s[2] \cdot p^{m-2} + \cdots + s[m] \cdot p^0) \bmod q,
$$
where:
- $p$ is a base (e.g., a small prime number),
- $q$ is a large prime modulus that keeps hash values within a fixed range (avoiding overflow) and makes collisions unlikely.
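
With this hash, the rolling update follows directly from the definition: remove the contribution of the leading character, multiply by $p$ to shift the remaining terms, and add the new trailing character. Writing $h_i$ for the hash of $T[i \ldots i+m-1]$ (a symbol introduced here for illustration only):
$$
h_{i+1} = \big( (h_i - T[i] \cdot p^{m-1}) \cdot p + T[i+m] \big) \bmod q.
$$
As a small check, assume digit characters with their numeric values, $p = 10$, $q = 13$, $m = 2$ and $T = "314"$. Then $h_1 = 31 \bmod 13 = 5$, and rolling gives
$$
h_2 = \big( (5 - 3 \cdot 10) \cdot 10 + 4 \big) \bmod 13 = (-246) \bmod 13 = 1,
$$
which agrees with computing $14 \bmod 13 = 1$ directly.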

### Complexity
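- Best and Expected Case: $O(n + m)$ when collisions are rare, where $n$ is the length of the text and $m$ is the length of the pattern.
- Worst Case: $O(nm)$, when many windows share the pattern's hash (through collisions or genuine matches) and each one triggers an $O(m)$ character comparison.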

---

## KMP vs. Rabin-Karp

| Feature       | Knuth-Morris-Pratt (KMP) | Rabin-Karp                                          |
| ------------- | ------------------------ | --------------------------------------------------- |
| Technique     | Prefix function          | Hashing                                             |
| Preprocessing | Compute $\pi$ array      | Compute hash of $P$                                 |
| Efficiency    | $O(n + m)$               | $O(n + m)$ (best), $O(nm)$ (worst)                  |
| Use Case      | Single-pattern exact matching | Searching for multiple patterns at once        |

_Takeaway:_ use KMP when searching for a single pattern, and Rabin-Karp when searching for multiple patterns.
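
The multiple-pattern use case can be sketched as below, under the assumption that all patterns have the same length, so a single rolling hash of the text can be checked against every pattern hash. The name `rabin_karp_multi` and the defaults for `base` and `mod` are hypothetical.

```python
def rabin_karp_multi(text: str, patterns: list[str],
                     base: int = 256, mod: int = 1_000_000_007) -> dict[str, list[int]]:
    """Map each pattern to the 0-based start indices of its occurrences in `text`.

    Assumption: all patterns share the same length m.
    """
    found: dict[str, list[int]] = {p: [] for p in patterns}
    if not patterns:
        return found
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("this sketch requires patterns of equal length")
    n = len(text)
    if m == 0 or m > n:
        return found

    def poly_hash(s: str) -> int:
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Group patterns by hash so each window needs only one dictionary lookup.
    by_hash: dict[int, list[str]] = {}
    for p in found:
        by_hash.setdefault(poly_hash(p), []).append(p)

    high = pow(base, m - 1, mod)
    ht = poly_hash(text[:m])
    for s in range(n - m + 1):
        for p in by_hash.get(ht, ()):
            if text[s:s + m] == p:  # verify to rule out collisions
                found[p].append(s)
        if s < n - m:
            ht = ((ht - ord(text[s]) * high) * base + ord(text[s + m])) % mod
    return found


print(rabin_karp_multi("abracadabra", ["abr", "cad", "xyz"]))
# {'abr': [0, 7], 'cad': [4], 'xyz': []}
```

The text is hashed once regardless of how many patterns there are, which is what makes hashing attractive here compared with running a single-pattern search once per pattern.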

---

## Footnotes

[^1]: A proper prefix of a string $s$ is any prefix of $s$ that is not equal to $s$ itself. For example, proper prefixes of "abc" are "", "a", and "ab".
[^2]: A rolling hash computes the hash of a new substring by updating the hash of the previous substring, avoiding the need to recompute from scratch.