Notes/Pattern matching.md at 8bf51950599db488a6a457c043da5b9a1b0778a1

2024-12-07 21:07:38 +01:00

type
mixed

Prefix Function (`\pi`)

The prefix function is a tool used in pattern matching algorithms, particularly in the Knuth-Morris-Pratt (KMP) algorithm. It is designed to preprocess a pattern to facilitate efficient searching.

Definition

For a string P of length m, the prefix function \pi[i] for i = 1, 2, \ldots, m is the length of the longest proper prefix of the substring P[1 \ldots i] that is also a suffix of this substring.

Key Points

A proper prefix of a string is a prefix that is not equal to the entire string. ¹
\pi[i] helps skip unnecessary comparisons in pattern matching by indicating the next position to check after a mismatch.
\pi[1] = 0 always, since no proper prefix of a single character can also be a suffix.

Example

For the pattern P = "ababcab":

P[1] = "a": \pi[1] = 0.
P[1 \ldots 2] = "ab": No prefix matches the suffix, so \pi[2] = 0.
P[1 \ldots 3] = "aba": Prefix "a" matches suffix "a", so \pi[3] = 1.
P[1 \ldots 4] = "abab": Prefix "ab" matches suffix "ab", so \pi[4] = 2.
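P[1 \ldots 5] = "ababc": No proper prefix matches a suffix, so \pi[5] = 0.
P[1 \ldots 6] = "ababca": Prefix "a" matches suffix "a", so \pi[6] = 1.
P[1 \ldots 7] = "ababcab": Prefix "ab" matches suffix "ab", so \pi[7] = 2.
Altogether, \pi = [0, 0, 1, 2, 0, 1, 2].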
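
The definition can be turned into a linear-time computation. The following is a minimal Python sketch, not taken from these notes: the function name `prefix_function` is an illustrative choice, and the code uses 0-based indices, so `pi[i]` here corresponds to \pi[i + 1] in the 1-based notation above.

```python
def prefix_function(pattern: str) -> list[int]:
    """Return pi, where pi[i] is the length of the longest proper prefix
    of pattern[:i + 1] that is also a suffix of it (0-based indices)."""
    pi = [0] * len(pattern)
    k = 0  # length of the longest prefix-suffix match found so far
    for i in range(1, len(pattern)):
        # Fall back to shorter candidates until one can be extended.
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


print(prefix_function("ababcab"))  # [0, 0, 1, 2, 0, 1, 2]
```

Each iteration increases `k` by at most one and the inner loop only ever decreases it, which is why the whole computation takes O(m) time.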

Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm is a pattern matching algorithm that uses the prefix function \pi to efficiently search for occurrences of a pattern P in a text T.

Key Idea

When a mismatch occurs during the comparison of P with T, use the prefix function \pi to determine the next position in P to continue matching, rather than restarting from the beginning.

Steps

Compute the prefix function \pi for the pattern P.
Search:
- Compare P with substrings of T.
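- If there’s a mismatch at P[j] and T[i] with j > 1, the first j - 1 characters of P have already matched, so use \pi[j - 1] to shift P and resume by comparing T[i] with P[\pi[j - 1] + 1] rather than restarting at P[1]. If j = 1, simply move on to T[i + 1].
The algorithm runs in O(n + m) time, where n is the length of T and m is the length of P.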
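
The two steps above can be sketched in Python as follows. This is an illustrative sketch under a few assumptions: the names `prefix_function` and `kmp_search` are hypothetical, indices are 0-based (so `j` counts how many pattern characters are currently matched), and an empty pattern is treated as having no occurrences.

```python
def prefix_function(pattern: str) -> list[int]:
    """pi[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start index of every occurrence of pattern in text."""
    if not pattern:
        return []  # assumption: an empty pattern yields no matches
    pi = prefix_function(pattern)
    matches = []
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # On a mismatch, reuse the longest prefix that is still a valid match.
        while j > 0 and ch != pattern[j]:
            j = pi[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = pi[j - 1]  # keep going to find overlapping occurrences
    return matches


print(kmp_search("abababcabab", "abab"))  # [0, 2, 7]
```

The text index `i` never moves backwards, and `j` can only fall as often as it has risen, which gives the O(n + m) bound.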

Rabin-Karp Algorithm

The Rabin-Karp algorithm is another pattern matching algorithm, notable for using hashing to identify potential matches.

Key Idea

Instead of comparing substrings character by character, the algorithm compares hash values of the pattern and substrings of the text.

Steps

Compute the hash value of the pattern P and the first substring of the text T of length m.
Slide the window over T and compute hash values for the next substrings in constant time using a rolling hash. ²
If the hash value of a substring matches the hash value of P, compare the actual strings to confirm the match.
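
As a rough illustration of these three steps, here is a minimal Python sketch. The function name `rabin_karp` and the default values `base=257` and `mod=1_000_000_007` are assumptions made for the example, not values prescribed by these notes; indices are 0-based.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 257, mod: int = 1_000_000_007) -> list[int]:
    """Return the 0-based start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character

    # Step 1: hash the pattern and the first window of the text.
    p_hash = w_hash = 0
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % mod
        w_hash = (w_hash * base + ord(text[k])) % mod

    matches = []
    for start in range(n - m + 1):
        # Step 3: on equal hashes, compare the strings to rule out a collision.
        if w_hash == p_hash and text[start:start + m] == pattern:
            matches.append(start)
        # Step 2: roll the hash, dropping text[start] and adding text[start + m].
        if start + m < n:
            w_hash = ((w_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % mod
    return matches


print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction in the rolling step needs no extra correction here; in languages where the remainder can be negative, `mod` would have to be added back first.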

Hash Function

The hash function is typically chosen such that it is fast to compute and minimizes collisions:


h(s) = (s[1] \cdot p^{m-1} + s[2] \cdot p^{m-2} + \ldots + s[m] \cdot p^0) \mod q,

where:

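p is a base (e.g., a small prime number),
q is a large prime modulus: reducing modulo q keeps hash values bounded (avoiding overflow), and a large prime makes collisions less likely.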
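
To make the formula concrete, the sketch below evaluates h(s) with Horner's rule and checks that the constant-time rolling update gives the same value as recomputing from scratch. The helper names `poly_hash` and `roll`, the values p = 31 and q = 1_000_000_007, and the use of `ord` as the character code are all assumptions chosen for illustration.

```python
def poly_hash(s: str, p: int, q: int) -> int:
    """h(s) = (s[1]*p^(m-1) + ... + s[m]*p^0) mod q, via Horner's rule."""
    h = 0
    for ch in s:
        h = (h * p + ord(ch)) % q
    return h


def roll(h: int, outgoing: str, incoming: str, p: int, q: int, m: int) -> int:
    """Update the hash of a length-m window after sliding it one position."""
    high = pow(p, m - 1, q)  # weight of the character leaving the window
    return ((h - ord(outgoing) * high) * p + ord(incoming)) % q


p, q = 31, 1_000_000_007
text, m = "abcde", 3

h = poly_hash(text[:m], p, q)
for start in range(1, len(text) - m + 1):
    h = roll(h, text[start - 1], text[start + m - 1], p, q, m)
    # The rolled value agrees with hashing the window from scratch.
    assert h == poly_hash(text[start:start + m], p, q)
```

In a real search loop `high` would be computed once outside the loop; it is recomputed inside `roll` here only to keep the helper self-contained.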

Complexity

Best Case: O(n + m), where n is the length of the text and m is the length of the pattern.
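Worst Case: O(nm), when many windows share the pattern's hash value (through collisions or many true matches) and each one must be verified character by character.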

KMP vs. Rabin-Karp

| Feature | Knuth-Morris-Pratt (KMP) | Rabin-Karp |
| --- | --- | --- |
| Technique | Prefix function | Hashing |
| Preprocessing | Compute `\pi` array | Compute hash of `P` |
| Efficiency | `O(n + m)` | `O(n + m)` (best), `O(nm)` (worst) |
| Use Case | Best for exact matches | Useful for multiple patterns or approximate matches |

This table feels too AI-generated to me. In short: use KMP when looking for a single pattern, and use Rabin-Karp (RK) when looking for multiple patterns.
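
One way to see why Rabin-Karp suits multiple patterns: the hashes of all patterns can be kept in a set, so each text window is checked against every pattern with a single lookup. The sketch below is an illustration under simplifying assumptions, not a method taken from these notes: all patterns must have the same length, and the name `rabin_karp_multi` and the defaults `base=257`, `mod=1_000_000_007` are hypothetical choices.

```python
def rabin_karp_multi(text: str, patterns: list[str],
                     base: int = 257, mod: int = 1_000_000_007) -> list[tuple[int, str]]:
    """Find every occurrence of any pattern; return (start, pattern) pairs.
    Assumption: all patterns have the same, non-zero length."""
    wanted = set(patterns)
    if not wanted:
        return []
    m = len(next(iter(wanted)))
    if m == 0 or any(len(p) != m for p in wanted):
        raise ValueError("patterns must share one non-zero length")
    n = len(text)
    if m > n:
        return []

    def poly_hash(s: str) -> int:
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    target_hashes = {poly_hash(p) for p in wanted}
    high = pow(base, m - 1, mod)
    w_hash = poly_hash(text[:m])

    found = []
    for start in range(n - m + 1):
        if w_hash in target_hashes:
            window = text[start:start + m]
            if window in wanted:  # confirm, in case of a hash collision
                found.append((start, window))
        if start + m < n:
            w_hash = ((w_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % mod
    return found


print(rabin_karp_multi("abracadabra", ["abr", "cad"]))
# [(0, 'abr'), (4, 'cad'), (7, 'abr')]
```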

Footnotes

1. A proper prefix of a string s is any prefix of s that is not equal to s itself. For example, proper prefixes of "abc" are "", "a", and "ab".
2. A rolling hash computes the hash of a new substring by updating the hash of the previous substring, avoiding the need to recompute from scratch.
