Key Points of This Section
• Regular expressions are commonly used for text matching:
  – searching for a string
  – replacing a string
  – recognizing the input as a sequence of tokens
Applications of Regular Expressions
• Use #1: Text-processing the web
  – The web is full of data, but it is in text form for humans to read
• Screen scraping
  – extracting the data you want from screen output
  – these days, the output format is HTML
• Examples:
  – extract the tour schedule of your favorite bands from Ticketmaster
  – web sites as web services: convert an address to geo coordinates
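As a rough illustration of screen scraping, the sketch below pulls dates and venues out of a made-up HTML fragment. The markup, the class names, and the `extract_shows` helper are all assumptions for illustration, not the real format of any Ticketmaster page.

```python
import re

# Hypothetical HTML fragment; the tags and class names are invented.
PAGE = """
<ul>
  <li class="show"><span class="date">2024-05-01</span> <span class="venue">Fox Theater</span></li>
  <li class="show"><span class="date">2024-05-03</span> <span class="venue">The Fillmore</span></li>
</ul>
"""

# One group for the date, one for the venue text up to the next tag.
SHOW_RE = re.compile(
    r'<span class="date">(\d{4}-\d{2}-\d{2})</span>\s*'
    r'<span class="venue">([^<]+)</span>'
)

def extract_shows(html):
    """Return a list of (date, venue) tuples found in the HTML text."""
    return SHOW_RE.findall(html)

shows = extract_shows(PAGE)
for date, venue in shows:
    print(date, venue)
```

A regex like this is tied to the exact markup it was written for; for anything beyond quick extraction, an HTML parser is usually the more robust choice.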
Applications of Regular Expressions
• Use #2: Text processing in general
  – a spectrum of uses, from small to big
• Small fixes:
  – replacing "ugly quotes" with “smart quotes”
  – converting files between operating systems
• Bigger tasks:
  – spell checking formatted documents (HTML): must extract the text first
  – pretty printing code: find comments, etc.; add format directives
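The two small fixes can be sketched with `re.sub`. The helper names `to_unix_newlines` and `smarten_quotes` are hypothetical, and the quote rule is deliberately simplified: it assumes straight double quotes come in balanced pairs.

```python
import re

def to_unix_newlines(text):
    """Convert Windows (CRLF) and classic Mac (CR) line endings to LF."""
    return re.sub(r"\r\n?", "\n", text)

def smarten_quotes(text):
    """Replace each "straight-quoted" span with curly quotes (simplified)."""
    # \u201c and \u201d are the left and right curly double quotes.
    return re.sub(r'"([^"]*)"', "\u201c\\1\u201d", text)

print(repr(to_unix_newlines("a\r\nb\rc\n")))
print(smarten_quotes('replace "ugly quotes" here'))
```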
Applications of Regular Expressions
• Use #3: Program processing
  – especially on the web
• On the web, procedure calls = HTTP requests
  – "procedure arguments" are passed as strings
  – argument extraction can be done with regular expressions
• Other uses:
  – extract the components of an email address
  – obfuscation: we want to obfuscate all JS functions except those called from scripts embedded in HTML, so scan the web page for the names of functions called from HTML to avoid obfuscating them
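A minimal sketch of argument extraction and email splitting follows. The URL, its parameter names, and the helpers `extract_args` and `split_email` are invented for illustration; the email pattern is a simplification and does not cover the full address grammar.

```python
import re

# Hypothetical request URL; the host and parameter names are invented.
URL = "http://example.com/geocode?street=1+Main+St&city=Berkeley"

def extract_args(url):
    """Return the query-string arguments of url as a dict of raw strings."""
    query = url.partition("?")[2]          # text after the first "?"
    # Each argument is name=value; pairs are separated by "&".
    return dict(re.findall(r"([^&=]+)=([^&]*)", query))

# Simplified: one "@" with no whitespace on either side.
EMAIL_RE = re.compile(r"^([^@\s]+)@([^@\s]+)$")

def split_email(address):
    """Return (local_part, domain), or None if the address does not fit."""
    m = EMAIL_RE.match(address)
    return m.groups() if m else None

print(extract_args(URL))
print(split_email("alice@example.com"))
```

The values come back still URL-encoded (for example `1+Main+St`); in practice the standard library's `urllib.parse` handles both the splitting and the decoding.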
Regular Expression Tutorial
• Focus on two languages:
  – JavaScript
  – Python
• A key rule common to both: given a string and a regex, find the first position in the string where a match is possible (except for the match() function in Python, which must match at the beginning of the string).
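The first-position rule can be checked with a small Python sketch. The pattern below is chosen only for illustration: "in" is listed first, yet "tr" wins because it starts further to the left in the string.

```python
import re

text = "string"
pattern = re.compile(r"in|tr")

# search() scans left to right and stops at the first position
# where any alternative matches.
found = pattern.search(text)
first_start = found.start()
first_text = found.group(0)

# match() only tries position 0, where neither alternative fits.
anchored = pattern.match(text)

print(first_start, first_text, anchored)
```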
String search: from simple to regexp (JavaScript)
• Basic search methods, for String objects:
  – "string".indexOf("rin") → 2
  – "string".indexOf(new RegExp("r*n")) → -1
  – "string".search(new RegExp("r*n")) → 4
  – "string".search(new RegExp("r.*n")) → 2
  – "string".search(/r.*n/) → 2 (equivalent to the previous line)
  – "string".match(/tri|str/) → ["str"]
  – "string".match(/ri|st/g) → ["st", "ri"]
  – "string".match(/tri|str/g) → ["str"]
• See js.htm
String search: from simple to regexp (JavaScript)
• indexOf
  – Syntax: object.indexOf(searchValue[, fromIndex])
  – When called on a String object, this method returns the index of the first occurrence of searchValue, starting the search at fromIndex; it returns -1 if there is none.
• search
  – Syntax: object.search(regexp)
  – Searches the string for a match of the regular expression and returns the index of the first match, or -1 if there is none.
• RegExp
  – Syntax: new RegExp("pattern"[, "flags"]) or myReg = /pattern/flags
String search: from simple to regexp (JavaScript)
• match
  – Syntax: object.match(regexp)
  – This method matches a regular expression against a string.
  – If one or more matches are made, an array is returned that contains the matches; each entry is a copy of the matched text. If no match is made, null is returned.
  – To perform a global match (return all matches, not just the first), include the 'g' (global) flag in the regular expression; to perform a case-insensitive match, include the 'i' (ignore case) flag.
• Characters consumed by one match are not reused for later matches.
Same for Python
• Basic search functions (re module), applied to strings:
  – re.match(r"tri|rin", "string") → no match
  – re.search(r"tri|rin", "string").group(0) → 'tri'
  – re.compile(r"tri|str").findall("string") → ['str']
  – re.compile(r"ri|st").findall("string") → ['st', 'ri']
  – re.search(r"(tr)|(in)", "string").groups() → ('tr', None)
• The r prefix marks a raw string literal.
• Note: match() expects the match to start at index 0.
Python Regular Expressions
• Supports ".", "*", "+", "?", "|", "[ ]", "\"
• "^": matches the start of the string
• "$": matches the end of the string
• {m}: exactly m repetitions
• {m,n}: from m to n repetitions
• *?, +?, ??, {m,n}?: same meaning as the first symbol, but switch from greedy (longest) matching to minimal (shortest) matching
  – Example: matching a pattern such as <.*> against "<H1>title</H1>", the greedy form matches the whole string, while the minimal form <.*?> matches only "<H1>"
• (...): matches whatever regular expression is inside the parentheses; commonly used for grouping
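The operators above can be tried out in a short Python sketch. The sample strings are chosen only for illustration; the greedy and minimal patterns `<.*>` and `<.*?>` are applied to the string `<H1>title</H1>`.

```python
import re

html = "<H1>title</H1>"

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
greedy = re.match(r"<.*>", html).group(0)
# Minimal: .*? grabs as little as possible, stopping at the first ">".
minimal = re.match(r"<.*?>", html).group(0)

# {m} and {m,n} repetition counts, and the minimal form {m,n}?.
exactly_two = re.findall(r"a{2}", "a aa aaa")
two_to_three = re.findall(r"a{2,3}", "a aa aaa")
two_to_three_minimal = re.findall(r"a{2,3}?", "a aa aaa")

# ^ and $ anchor the match to the start and end of the string.
starts_with_str = re.search(r"^str", "string") is not None
ends_with_ing = re.search(r"ing$", "string") is not None
starts_with_ing = re.search(r"^ing", "string") is not None

# Parentheses group sub-patterns and capture what they matched.
page_range = re.search(r"(\d+)-(\d+)", "pages 10-20").groups()

print(greedy, minimal, two_to_three, page_range)
```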