Notes/Advanced Algorithms/Pattern matching.md
2024-12-07 21:07:38 +01:00


type
mixed

Prefix Function (\pi)

The prefix function is a preprocessing tool for pattern matching, used most notably in the Knuth-Morris-Pratt (KMP) algorithm. It records how the pattern overlaps with itself, so that a search can reuse already-matched characters instead of comparing them again.

Definition

For a string P of length m, the prefix function \pi[i] for i = 1, 2, \ldots, m is the length of the longest proper prefix of the substring P[1 \ldots i] that is also a suffix of this substring.

Key Points

  1. A proper prefix of a string is a prefix that is not equal to the entire string. 1
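  2. \pi[i] helps skip unnecessary comparisons in pattern matching: if a mismatch occurs after i characters of P have matched, the first \pi[i] characters of P are already known to match, so matching resumes from there.
  3. \pi[1] = 0 always, since the only proper prefix of a single character is the empty string, which has length 0.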

Example

For the pattern P = "ababcab":

  • P[1] = "a": \pi[1] = 0.
  • P[1 \ldots 2] = "ab": No nonempty proper prefix is also a suffix, so \pi[2] = 0.
  • P[1 \ldots 3] = "aba": Prefix "a" matches suffix "a", so \pi[3] = 1.
  • P[1 \ldots 4] = "abab": Prefix "ab" matches suffix "ab", so \pi[4] = 2.
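  • P[1 \ldots 5] = "ababc": No nonempty proper prefix ends in "c", so \pi[5] = 0.
  • P[1 \ldots 6] = "ababca": Prefix "a" matches suffix "a", so \pi[6] = 1.
  • P[1 \ldots 7] = "ababcab": Prefix "ab" matches suffix "ab", so \pi[7] = 2.

Altogether, \pi = [0, 0, 1, 2, 0, 1, 2].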
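
The computation in the example can be sketched in Python. This is a minimal sketch, not a reference implementation: the function name `prefix_function` is my own choice, and the returned list is 0-indexed, so entry `i` of the list corresponds to \pi[i + 1] in the notation above.

```python
def prefix_function(pattern: str) -> list[int]:
    """Return pi as a 0-indexed list: pi[i] is the length of the longest
    proper prefix of pattern[:i + 1] that is also a suffix of it."""
    pi = [0] * len(pattern)
    k = 0  # length of the longest prefix-suffix found for the previous position
    for i in range(1, len(pattern)):
        # Fall back to shorter prefix-suffixes until one can be extended.
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


print(prefix_function("ababcab"))  # [0, 0, 1, 2, 0, 1, 2]
```

The `while` loop reuses values of \pi that are already computed, which is why the whole table is built in O(m) time rather than by checking every prefix against every suffix.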

Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm is a pattern matching algorithm that uses the prefix function \pi to efficiently search for occurrences of a pattern P in a text T.

Key Idea

When a mismatch occurs during the comparison of P with T, use the prefix function \pi to determine the next position in P to continue matching, rather than restarting from the beginning.

Steps

  1. Compute the prefix function \pi for the pattern P.
  2. Search:
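    • Scan T from left to right, comparing it with P character by character.
    • If there's a mismatch between P[j] and T[i] after j - 1 characters have matched, resume the comparison at P[\pi[j - 1] + 1] rather than restarting at P[1]. If j = 1, move on to T[i + 1]. The text index i never moves backward.
  3. The algorithm runs in O(n + m) time, where n is the length of T and m is the length of P.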
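
The steps above can be sketched as follows. This is an illustrative sketch under a few assumptions of mine: indices are 0-based (unlike the 1-based notation in these notes), the names `prefix_function` and `kmp_search` are hypothetical, and an empty pattern is treated as having no matches.

```python
def prefix_function(pattern: str) -> list[int]:
    """0-indexed prefix function of the pattern."""
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start indices of all occurrences of pattern in text."""
    if not pattern:
        return []  # assumption: an empty pattern yields no matches
    pi = prefix_function(pattern)
    matches = []
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, fall back in the pattern; i never moves backward.
        while j > 0 and ch != pattern[j]:
            j = pi[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = pi[j - 1]  # keep going to find overlapping occurrences
    return matches


print(kmp_search("abababcabab", "abab"))  # [0, 2, 7]
```

Each text character is read once, and `j` can only fall back as often as it has advanced, which gives the O(n + m) bound.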

Rabin-Karp Algorithm

The Rabin-Karp algorithm is another pattern matching algorithm, notable for using hashing to identify potential matches.

Key Idea

Instead of comparing substrings character by character, the algorithm compares hash values of the pattern and substrings of the text.

Steps

  1. Compute the hash value of the pattern P and the first substring of the text T of length m.
  2. Slide the window over T and compute hash values for the next substrings in constant time using a rolling hash. 2
  3. If the hash value of a substring matches the hash value of P, compare the actual strings to confirm the match.
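
A possible Python sketch of these three steps is shown below. The function name `rabin_karp`, the default base 256, the modulus 1 000 000 007, and the use of `ord` to turn characters into numbers are all assumptions made for illustration, not values taken from these notes.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = 1_000_000_007) -> list[int]:
    """Return the 0-based start indices of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []  # assumption: empty or too-long patterns yield no matches
    high = pow(base, m - 1, mod)  # weight of the window's leading character

    # Step 1: hash the pattern and the first window of the text.
    p_hash = t_hash = 0
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % mod
        t_hash = (t_hash * base + ord(text[k])) % mod

    matches = []
    for s in range(n - m + 1):
        # Step 3: on a hash hit, compare the strings to rule out a collision.
        if p_hash == t_hash and text[s:s + m] == pattern:
            matches.append(s)
        # Step 2: roll the hash: drop text[s], shift, append text[s + m].
        if s < n - m:
            t_hash = ((t_hash - ord(text[s]) * high) * base
                      + ord(text[s + m])) % mod
    return matches


print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

The explicit string comparison keeps the result correct even with a poor modulus; a small `mod` only makes the search slower because more windows need verification.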

Hash Function

The hash function is typically chosen such that it is fast to compute and minimizes collisions:


h(s) = (s[1] \cdot p^{m-1} + s[2] \cdot p^{m-2} + \ldots + s[m] \cdot p^0) \mod q,

where:

  • p is a base (e.g., a small prime number),
  • q is a large prime modulus that keeps hash values bounded (avoiding overflow) and makes collisions unlikely.
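
With this hash, sliding the window by one character does not require recomputing the sum: the leading term is subtracted, the rest is multiplied by p, and the new character is added, all modulo q. The snippet below is a small sketch that checks this rolling update against a full recomputation. The helper names `poly_hash` and `roll`, the values p = 31 and q = 1 000 000 007, and the mapping of characters to numbers via `ord` are assumptions for illustration.

```python
def poly_hash(s: str, p: int, q: int) -> int:
    """h(s) = (s[1]*p^(m-1) + ... + s[m]*p^0) mod q, evaluated by Horner's rule."""
    h = 0
    for ch in s:
        h = (h * p + ord(ch)) % q
    return h


def roll(h: int, old: str, new: str, m: int, p: int, q: int) -> int:
    """Hash of the length-m window after dropping `old` on the left
    and appending `new` on the right."""
    return ((h - ord(old) * pow(p, m - 1, q)) * p + ord(new)) % q


p, q = 31, 1_000_000_007
text, m = "abcde", 3
h = poly_hash(text[:m], p, q)            # hash of "abc"
h = roll(h, text[0], text[m], m, p, q)   # hash of "bcd", in constant time
print(h == poly_hash("bcd", p, q))       # True
```

In a real search, `pow(p, m - 1, q)` would be computed once up front rather than on every call to `roll`.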

Complexity

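  • Best Case: O(n + m), where n is the length of the text and m is the length of the pattern. This holds when few windows share the hash of P, so few full comparisons are needed.
  • Worst Case: O(nm), when many windows share the hash of P (through collisions or true matches) and each one costs an O(m) comparison.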

KMP vs. Rabin-Karp

| Feature | Knuth-Morris-Pratt (KMP) | Rabin-Karp |
|---|---|---|
| Technique | Prefix function | Hashing |
| Preprocessing | Compute \pi array | Compute hash of P |
| Efficiency | O(n + m) | O(n + m) (best), O(nm) (worst) |
| Use Case | Best for exact matches | Useful for multiple patterns or approximate matches |

This table is too AI-generated for my taste. In short: use KMP when searching for a single pattern, and Rabin-Karp when searching for multiple patterns.

Footnotes


  1. A proper prefix of a string s is any prefix of s that is not equal to s itself. For example, proper prefixes of "abc" are "", "a", and "ab". ↩︎

  2. A rolling hash computes the hash of a new substring by updating the hash of the previous substring, avoiding the need to recompute from scratch. ↩︎