中图收术大 算法基础 第八讲:串匹配算法 主讲:顾乃杰教授 单位:计算机科学技术学院 方創 天寰 学期:2016-2017(秋) 基下宇 港英 题才 學府一
算法基础 第八讲:串匹配算法 主 讲: 顾 乃 杰 教授 单 位: 计算机科学技术学院 学 期: 2016-2017 (秋)
SIVERSITY ScIE\CE TECH\OLoGY 主要内容 CHINA The Naive algorithm(Brute Force The Knuth-Morris-Pratt Algorithm ◆ The ShIFt-Or algorithm The boyer-Moore algorithm The boyer-Moore-Horspool algorithm The Karp-Rabin algorithm C onclusion 本教案参考了下述有关 String Searching Algorithm的教案,在此表示感谢: 1.中国台湾省国立中山大学黄三益教授的教案 2. Princeton University· Kevin Wayne· Theory of algorithms·COS42 021/2 &T
2021/2/4 Department of Computer Science & Technology 2 主要内容 ◆ The Naive Algorithm (Brute Force ) ◆ The Knuth-Morris-Pratt Algorithm ◆ The SHIFT-OR Algorithm ◆ The Boyer-Moore Algorithm ◆ The Boyer-Moore-Horspool Algorithm ◆ The Karp-Rabin Algorithm ◆ Conclusion 本教案参考了下述有关 String Searching Algorithm 的教案,在此表示感谢: 1. 中国台湾省 国立中山大学 黃三益教授的教案 2. Princeton University • Kevin Wayne • Theory of Algorithms • COS 42
SIVERSITY ScIE\CE TECH\OLoGY 8.0串匹配问题 CHINA a String-matching Problem 1. Find one occurrences of a pattern in a text 2. Find out all the occurrences of a pattern in a text a Applications require two kinds of solution depending on which string, the pattern or the text, is given first algorithms based on the use of automata or combinatorial properties of strings are commonly implemented to preprocess the pattern and sol ve the first kind of problem 2. The notion of indexes realized by trees or automata is used in the second kind of solutions 021/2 &T
2021/2/4 Department of Computer Science & Technology 3 8.0 串匹配问题 String-matching Problem: 1. Find one occurrences of a pattern in a text ; 2. Find out all the occurrences of a pattern in a text. Applications require two kinds of solution depending on which string, the pattern or the text, is given first. 1. Algorithms based on the use of automata or combinatorial properties of strings are commonly implemented to preprocess the pattern and solve the first kind of problem. 2. The notion of indexes realized by trees or automata is used in the second kind of solutions
SIVERSITY ScIE\CE TECH\OLoGY 串匹配及其应用 CHINA 口 Some applications Wo word processors Virus scanning Text information retrieval systems. Lexis, Nexis) Digital libraries Natural language processing Specialized databases Computational molecular biology Web eb search engines Bioinformatic 021/2 &T
2021/2/4 Department of Computer Science & Technology 4 串匹配及其应用 Some applications. ➢ Word processors. ➢ Virus scanning. ➢ Text information retrieval systems. (Lexis, Nexis) ➢ Digital libraries. ➢ Natural language processing. ➢ Specialized databases. ➢ Computational molecular biology. ➢ Web search engines. ➢ Bioinformatic
SIVERSITY ScIE\CE TECH\OLoGY 串匹配示例 CHINA Search Text n nee edeneenee dl e n l d Search Pattern needle Successful Search n nee nledeneeneed len l d Search Pattern ne e dle 021/2 &T
2021/2/4 Department of Computer Science & Technology 5 串匹配示例 n n e e n l Search Text e d e n e e n e e d l e n l d n e e d l e Search Pattern Successful Search n n e e n l e d e n e e n e e d l e n l d n e e d l e Search Pattern
SIVERSITY ScIE\CE TECH\OLoGY CHINA 常用述语和定义 Parameters.记文本串为T,模式串为P n:the length of the text m: the length of the pattern(string Typically. n>> e.g. n=1 million, m= 1 hundred o: the size of the alphabet ∑: the alphabet Cn: the expected number of comparisons performed by an algorithm while searching the pattern in a text of length n 021/2 &T
2021/2/4 Department of Computer Science & Technology 6 常用述语和定义 ◼ Parameters. 记文本串为 T,模式串为 P ◼ n: the length of the text. ◼ m : the length of the pattern (string). ◼ Typically, n >> m. ◼ e.g., n = 1 million, m = 1 hundred ◼ σ : the size of the alphabet. ◼ ∑ : the alphabet. ◼ Cn : the expected number of comparisons performed by an algorithm while searching the pattern in a text of length n
IVERSITY ScIE\CE TECH\OLoGY 串匹配算法概述 CHINA 口目前教科书上所介绍的串匹配算法基本原理是: 利用一个大小等同于模式长度的 windo对文本串进行扫描; 2.首先将模式串与文本串的左端对齐; 3.对模式串与文本串的对应字符进行对比--称为一次 attempt 4.在每次成功匹配或每次失配之后,将 window右移; 5.重复3,4两步直到 window的右端超出文本串的右端。 口这种方法称为 sliding window mechanism 在将文本串中的当前 indow部份与模式串对比时可以: 从左到右,也可以从右到左,甚至可以用特定次序。 021/2 &T
2021/2/4 Department of Computer Science & Technology 7 串匹配算法概述 目前教科书上所介绍的串匹配算法基本原理是: 1. 利用一个大小等同于模式长度的 window 对文本串进行扫描; 2. 首先将模式串与文本串的左端对齐; 3. 对模式串与文本串的对应字符进行对比----称为一次 attempt 4. 在每次成功匹配或每次失配之后,将 window 右移; 5. 重复3,4两步直到 window 的右端超出文本串的右端。 这种方法称为 sliding window mechanism. ➢ 在将文本串中的当前window部份与模式串对比时可以: 从左到右,也可以从右到左,甚至可以用特定次序
串匹配算法概述(续)8 VERSITI E\CE TECH\OLOGY CHINA 口 From left to right Karp and rabin Knuth, Morris and Pratt 口 From right to left Boyer and moore H orspoo In any order Brute Force Algorithms( Naive Algorithm 021/2 &T
2021/2/4 Department of Computer Science & Technology 8 串匹配算法概述 (续) From left to right ➢ Karp and Rabin ➢ Knuth,Morris and Pratt From right to left ➢ Boyer and Moore ➢ Horspool In any order ➢ Brute Force Algorithms( Naive Algorithm )
IVERSITY ScIE\CE TECH\OLoGY 8.1 Brute force算法 CHINA Brute force Check for pattern starting at every text position, trying to match any substring of length m in the text with the pattern Analysis of brute force Running time depends on pattern and text. can be slow when strings repeat themselves Worst case: O(MN) comparisons. too slow when M and N are large 021/2 &T
2021/2/4 Department of Computer Science & Technology 9 8.1 Brute Force 算法 ◼ Brute force: Check for pattern starting at every text position, trying to match any substring of length m in the text with the pattern 。 Analysis of brute force. ➢ Running time depends on pattern and text. ➢ can be slow when strings repeat themselves Worst case: O(MN) comparisons. ➢ too slow when M and N are large
SIVERSITY ScIE\CE TECH\OLoGY Brute force算法伪代码1 CHINA Brute-Force-I(T, P) while i<n-m do j=0 /*left to right scan ofP while j<m and Plj+1]=t[+j+l do j=j+1 if j=m then Report match at position(i-j+1) i=i+1: Return 021/2 &T
2021/2/4 Department of Computer Science & Technology 10 Brute Force算法伪代码1 Brute-Force-1 (T,P) ; i =0 ; while i≤n-m do j = 0; //* left to right scan of P while j < m and P[j+1] = T[i+j+1] do j = j+1; if j=m then Report_match_at_position(i-j+1); i = i+1; Return