Lex Overview Usage Paradigm of Lex Lex is a tool for creating lexical analyzers. Lex source Lex program lex.yy.c Lexical analyzers fokenize input streams. Tokens are the terminals of a language. lex.yy.c C compiler a.out Regular expressions define tokens. input a.out tokens To Use Lex and Yacc Together Lex Internals Mechanism Lex source Yacc source (Lexical Rules)(Grammar Rules) Converts regular expressions into DFAs. DFAs are implemented as table driven state machines. lex.yy.c v.tab.c call Input yylex() yyparse() ◆Parsed Input return token lex.yy.c:What it produces Running Lex dofine YYTYPE unsignod char struct yyvork YYTYPE verify,advance;yycrank.{ To run lex on a source file,use the 88 0.0, 0,0, 6888 0,0, command: g事 lex source.I yycrankct-1. 0。 yycrank+-3, 8p3: This produces the file lex.yy.c which is the yycrank+0, veet1. yyvetop+5. C source for the lexical analyzer. unaigned char yymateh{ ·To compile this,use: 898182880t ,01,01,01 8:8: cc -o prog-0 lex.yy.c-lI 4++ 1
1 Lex Overview • Lex is a tool for creating lexical analyzers. • Lexical analyzers tokenize input streams. • Tokens are the terminals of a language. • Regular expressions define tokens . Usage Paradigm of Lex To Use Lex and Yacc Together Lex Yacc yylex() yyparse() Lex source (Lexical Rules) Yacc source (Grammar Rules) Input Parsed Input lex.yy.c y.tab.c call return token Lex Internals Mechanism • Converts regular expressions into DFAs. • DFAs are implemented as table driven state machines. lex.yy.c : What it produces Running Lex • To run lex on a source file, use the command: lex source.l • This produces the file lex.yy.c which is the C source for the lexical analyzer. • To compile this, use: cc -o prog -O lex.yy.c -ll
Versions and Reference Books General Format of Lex Source 第《 AT&T lex,GNU flex,and Win32 version lex yacc,2/e by John R.Levine,Tony Cr纱gop> ≤口n难2≤尼gXP》 Mason Doug Brown,O'Reilly 4年4 公常 Mastering Regular Expressions,by Jeffrey regp>Cact1on》 ≤王得艺xP7 <aGe1⊙12 E.F.Friedl,O'Reilly 年。4 艺落 User Subroutines (c code) Input specification file is in 3 parts Regular Policy of None- -Declarations:Definitions translated Source -Rules:Token Descriptions and actions Remember that Lex is turning the rules -Auxiliary Procedures:User-Written code into a program.Any source not intercepted Three parts are separated by % by Lex is copied into the generated Tips:The first part defines patterns, the third part defines actions,the program. second part puts together to express -Any line which is not part of a Lex rule or "If we see some pattern,then we do action which begins with a blank or tab some action”. -Anything included between lines containing only %and % -Anything after the second %delimiter Position of Copied Source Regular Policy of Translated Source source input prior to the first % Various variables or tables whose name -external to any function in the generated code prefixed by yy after the first %and prior to the second -yyleng.yysvec.yywork %% Various functions whose name prefixed by -appropriate place for declarations in the y function generated by Lex which contains the -yyless(,yymore(),yywarp(),yylex().. actions Various definition whose name are capital ·after the second%% -BEGIN,INITIAL... -after the Lex generated output 2
2 Versions and Reference Books • AT&T lex, GNU flex, and Win32 version • lex & yacc ,2/e by John R.Levine, Tony Mason & Doug Brown, O’Reilly • Mastering Regular Expressions, by Jeffrey E.F. Friedl, O’Reilly General Format of Lex Source • Input specification file is in 3 parts – Declarations: Definitions – Rules: Token Descriptions and actions – Auxiliary Procedures: User-Written code • Three parts are separated by %% • Tips: The first part defines patterns, the third part defines actions, the second part puts together to express “If we see some pattern, then we do some action”. Regular Policy of Nonetranslated Source • Remember that Lex is turning the rules into a program. Any source not intercepted by Lex is copied into the generated program. – Any line which is not part of a Lex rule or action which begins with a blank or tab – Anything included between lines containing only %{ and %} – Anything after the second %% delimiter Position of Copied Source • source input prior to the first %% – external to any function in the generated code • after the first %% and prior to the second %% – appropriate place for declarations in the function generated by Lex which contains the actions • after the second %% – after the Lex generated output Regular Policy of Translated Source • Various variables or tables whose name prefixed by yy – yyleng, yysvec[], yywork[] • Various functions whose name prefixed by yy – yyless(), yymore(), yywarp(), yylex()… • Various definition whose name are capital – BEGIN, INITIAL…
Default Rules and Actions Default Input and Output The first and second part must exist,but If you don't write your own main()to deal may be empty,the third part and the with the input and the output of yylex(),the second %are optional. default input of default main()is stdin and If the third part dose not contain a main(),- the default output of default main()is ll will link a default main()which calls stdout. yylex()then exits. Unmatched patterns will perform a default -stdin usually is to be keyboard input stdout usually is to be screen output action,which consists of copying the input -cs20:%./a.out inputfile>outputfile to the output Some Simple Lex Source A General Lex Source Example Examples A minimum lex program: %% It only copies the input to the output unchanged. Example lex source file A trivial program to deletes three spacing This first section contains necessary characters: *c declarations and includes %% *to use throughout the lex specifications. [\tin]: +/ Another trivial example: #include %% [t]+S: bin digit【o1] It deletes from the input all blanks or tabs at the ends of lines. (bin digit){ 各名 /*match all strings of 0's and 1's/ /, /Print out message with matching Now this is where you want your main *text program */ printf("BINARY:&s\n",yytext) int main(int argc,char "argv[])( / ([ab]*aa[ab]*bb[ab]*)I([ab]*bb[ab]*aa [ab]*){ call yylex to use the generated lexer /match all strings over */ (a,b)containing aa and bb yylex(); */ /* printf("AABB\n■): make sure everything was printed */ \n /ignore newlines fflush(yyout); exit(0); 3
3 Default Rules and Actions • The first and second part must exist, but may be empty, the third part and the second %% are optional. • If the third part dose not contain a main(), - ll will link a default main() which calls yylex() then exits. • Unmatched patterns will perform a default action, which consists of copying the input to the output Default Input and Output • If you don’t write your own main() to deal with the input and the output of yylex(), the default input of default main() is stdin and the default output of default main() is stdout. – stdin usually is to be keyboard input stdout usually is to be screen output – cs20: %./a.out outputfile Some Simple Lex Source Examples • A minimum lex program: %% It only copies the input to the output unchanged. • A trivial program to deletes three spacing characters: %% [ \t\n]; • Another trivial example: %% [ \t]+$; It deletes from the input all blanks or tabs at the ends of lines. A General Lex Source Example %{ /* * Example lex source file * This first section contains necessary * C declarations and includes * to use throughout the lex specifications. */ #include %} bin_digit [01] %% {bin_digit}* { /* match all strings of 0's and 1's */ /* Print out message with matching * text */ printf("BINARY: %s\n", yytext); } ([ab]*aa[ab]*bb[ab]*)|([ab]*bb[ab]*aa[ab]*) { /* match all strings over * (a,b) containing aa and bb */ printf("AABB\n"); } \n ; /* ignore newlines */ %% /* * Now this is where you want your main program */ int main(int argc, char *argv[]) { /* * call yylex to use the generated lexer */ yylex(); /* * make sure everything was printed */ fflush(yyout); exit(0); }
Token Definitions Elementary Operations(cont.) Extended Regular Expression -NOTE:.matches any character except the Elementary Operations newline -single characters -*-Kleene Closure ·except\.s^[】-?·+I()1{}%<> -+--Positive Closure -concatenation (put characters together) -alternation (abc) ·Examples: ·ab]=alb -0-9]+"[0-9+ ·a-k==ablc.jk note:without the quotes it could be any ·【a-z0-9j=any letter or digit character [a]==any character but a -t]+--is whitespace ·(except CR). There is a blank space character before the \t Special Characters: Special Characters (cont.) -matches any single character -A --means at the beginning of the line (except newline) (unless it is inside of a[) -"and \-quote the part as text -S means at the end of the line,same as An -It -tab -]--means anything except -In -newline I"[I"]"1"is a double quoted string -lb --backspace -(n,m)-m through n occurrences -" -double quote ·a(1,3}is a or aa or aaa -M -1 -{definition)-translation from definition ~ --this means the preceding was -/--matches only if followed by right part of optional ·ab?==aab 0/1 means the 0 of 01 but not 02 or 03 or... ·(ab)?=able -()-grouping Definitions The definitions can also contain variables and other declarations used by the Code ·NAME REG EXPR generated by Lex. -digs [0-9]+ These usually go at the start of this section, marked by %at the beginning and %at the end -integer fdigs) or the line which begins with a blank or tab. -plainreal (digs)"."Idigs) -Includes usually go here. -expreal (digs)"."(digs)[Ee]+-]?(digs) -It is usually convenient to maintain a line counter -real (plainrealKexpreal) so that error messages can be keyed to the lines in which the errors are found. NAME must be a valid C identifier ·% {NAME)is replaced by prior REG_EXPR int linecount=1; ·%) 4
4 Token Definitions ( Extended Regular Expression ) • Elementary Operations – single characters • except “ \ . $ ^ [ ] - ? * + | ( ) / { } % – concatenation (put characters together) – alternation (a|b|c) • [ab] == a|b • [a-k] == a|b|c|...|i|j|k • [a-z0-9] == any letter or digit • [^a] == any character but a • Elementary Operations (cont.) – NOTE: . matches any character except the newline – * -- Kleene Closure – + -- Positive Closure • Examples: – [0-9]+"."[0-9]+ • note: without the quotes it could be any character – [ \t]+ -- is whitespace • (except CR). • There is a blank space character before the \t • Special Characters: – . -- matches any single character (except newline) – “ and \ -- quote the part as text – \t -- tab – \n -- newline – \b -- backspace – \" -- double quote – \\ -- \ – ? -- this means the preceding was optional • ab? == a|ab • (ab)? == ab|ε • Special Characters (cont.) – ^ -- means at the beginning of the line (unless it is inside of a [ ]) – $ means at the end of the line, same as /\n – [^ ] -- means anything except • \"[^\"]*\" is a double quoted string – {n,m} – m through n occurrences • a{1,3} is a or aa or aaa – {definition} – translation from definition – / -- matches only if followed by right part of / • 0/1 means the 0 of 01 but not 02 or 03 or … – ( ) -- grouping Definitions • NAME REG_EXPR – digs [0-9]+ – integer {digs} – plainreal {digs}"."{digs} – expreal {digs}"."{digs}[Ee][+-]?{digs} – real {plainreal}|{expreal} • NAME must be a valid C identifier • {NAME} is replaced by prior REG_EXPR • The definitions can also contain variables and other declarations used by the Code generated by Lex. – These usually go at the start of this section, marked by %{ at the beginning and %} at the end or the line which begins with a blank or tab . – Includes usually go here. – It is usually convenient to maintain a line counter so that error messages can be keyed to the lines in which the errors are found. • %{ • int linecount = 1; • %}
Transition Rules Tokens and Actions ERE {program statement ·Example: program -(real) return FLOAT: statement -begin retum BEGIN; A null statement:will ignore the input -(newline)linecount++; Four special options: I ECHO:REJECT:BEGIN: -(integer){ printf("I found an integerin"): The unmatched token is using a default action ·return INTEGER: that ECHO from the input to the output ·} indicates that the action for this rule is from the action for the next rule Ambiguous Source Rules Multiple States lex allows the user to explicitly declare If 2 rules match the same pattern,Lex will multiple states in Definitions section use the first rule. %s COMMENT Lex always chooses the longest matching Default states is INITIAL or 0 substring for its tokens. Transition rules can be classified into To overide the choice,use action REJECT different states,which will be match ex: she (s++;REJECT;} depend on states he (h++;REJECT;} BEGIN is used to change state In; . ECHO: "/" {BEGIN COMMENT:) . {:} / (BEGIN INITIAL:) Lex Special Variables Lex library function calls identifiers used by Lex and Yacc begin with yylex( yy default main()contains a return yylex(: -yytext--a string containing the lexeme ·yywarp0 yyleng--the length of the lexeme -called by lexical analyzer if end of the input file -yyin-the input stream pointer -default yywarp()always retum 1 ·Example: ·yyless(n) -(integer){ printf("I found an integerin"): -n characters in yytext are retained sscanf(yytext"%d".&yylval): ·yymore0 return INTEGER: the next input expression recognized is to be tacked on to the end of this input -C++Comments-∥.. 5
5 Transition Rules • ERE { program statement program statement } • A null statement ; will ignore the input • Four special options: | ECHO; REJECT; BEGIN; • The unmatched token is using a default action that ECHO from the input to the output • | indicates that the action for this rule is from the action for the next rule Tokens and Actions • Example: – {real} return FLOAT; – begin return BEGIN; – {newline} linecount++; – {integer} { • printf("I found an integer\n"); • return INTEGER; • } Ambiguous Source Rules • If 2 rules match the same pattern, Lex will use the first rule. • Lex always chooses the longest matching substring for its tokens. • To overide the choice, use action REJECT ex: she {s++; REJECT;} he {h++; REJECT;} . | \n ; Multiple States • lex allows the user to explicitly declare multiple states ( in Definitions section ) %s COMMENT • Default states is INITIAL or 0 • Transition rules can be classified into different states, which will be match depend on states • BEGIN is used to change state Lex Special Variables • identifiers used by Lex and Yacc begin with yy – yytext -- a string containing the lexeme – yyleng -- the length of the lexeme – yyin – the input stream pointer • Example: – {integer} { • printf("I found an integer\n"); • sscanf(yytext,"%d", &yylval); • return INTEGER; • } – C++ Comments -- // ..... • //.* ; Lex library function calls • yylex() – default main() contains a return yylex(); • yywarp() – called by lexical analyzer if end of the input file – default yywarp() always return 1 • yyless(n) – n characters in yytext are retained • yymore() – the next input expression recognized is to be tacked on to the end of this input
User Written Code More Example 1 int lengs[100]: %% The actions associated with any given [a-z]+lengs[yyleng]++: token are normally specified using .1 statements in C.But occasionally the In; actions are complicated enough that it is %% better to describe them with a function call. yywrap() and define the function elsewhere. int i; Definitions of this sort go in the last section printf("Length No.wordsin"); of the Lex input. for(0=0:i0) printf("%5d%10dn",i,lengs[i]);return(1); More Example 2 Using yacc with lex int charCount=0,vordCount-0,lineCount-0; yacc will call yylex(to get the token from the input so that each lex rule should end word tin)+ with: (word) {wordcount++;charcount yyleng; return(token); n {charCount;lineCount;} where the appropriate token value is {charCount++; returned. main(){ An easy way is placing the line: yylex(): printf("Characters:%d Words:%d Lines:Xdin", #include "lex.yy.c" charCount,wordCount,lineCount);) in the last section of yacc input. 6
6 User Written Code • The actions associated with any given token are normally specified using statements in C. But occasionally the actions are complicated enough that it is better to describe them with a function call, and define the function elsewhere. • Definitions of this sort go in the last section of the Lex input. More Example 1 int lengs[100]; %% [a-z]+ lengs[yyleng]++; . | \n ; %% yywrap() { int i; printf("Length No. words\n"); for(i=0; i 0) printf("%5d%10d\n",i,lengs[i]); return(1); } More Example 2 Using yacc with lex • yacc will call yylex() to get the token from the input so that each lex rule should end with: return(token); where the appropriate token value is returned. • An easy way is placing the line: #include “lex.yy.c” in the last section of yacc input