Lexical Analyzer CS308 Compiler Theory 1
Lexical Analyzer CS308 Compiler Theory 1
Lexical Analyzer Lexical Analyzer reads the source program character by character to produce tokens. Normally a lexical analyzer doesn't return a list of tokens at one shot, it returns a token when the parser asks a token from it. token source Lexical to semantic Parser program Analyzer analysis getNextToken Symbol Table CS308 Compiler Theory 2
Lexical Analyzer • Lexical Analyzer reads the source program character by character to prod k uce tokens. • Normally a lexical analyzer doesn’t return a list of tokens at one shot, it t t k h th k t k f it it returns a token when the parser asks a token from it. CS308 Compiler Theory 2
Token Token represents a set of strings described by a pattern. Identifier represents a set of strings which start with a letter continues with letters and digits The actual string(newval)is called as lexeme. -Tokens:identifier,number,addop,delimeter,... Since a token can represent more than one lexeme,additional information should be held for that specific lexeme.This additional information is called as the attribute of the token. For simplicity,a token may have a single attribute which holds the required information for that token. For identifiers,this attribute is a pointer to the symbol table,and the symbol table holds the actual attributes for that token. ·Some attributes: - where attr is pointer to the symbol table where val is the actual value of the number. Token type and its attribute uniquely identifies a lexeme. Regular expressions are widely used to specify patterns. CS308 Compiler Theory 3
Token • Token represents a set of strings described by a pattern. – Identifier represents a set of strings which sta Identifier represents a set of strings which start with a letter continues with letters and digits with a letter continues with letters and digits – The actual string (newval) is called as lexeme. – Tokens: identifier, number, addop, delimeter, … • Since a token can represent more than one lexeme additional information should be Since a token can represent more than one lexeme, additional information should be held for that specific lexeme. This additional information is called as the attribute of the token. • For simplicity a token may have a single attribute which holds the required For simplicity, a token may have a single attribute which holds the required information for that token. – For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds the actual attributes for that token. • Some attributes: – where attr is pointer to the symbol table – no attribute is needed (if there is only one assignment operator) no attribute is needed (if there is only one assignment operator) – where val is the actual value of the number. • Token type and its attribute uniquely identifies a lexeme. • Regular expressions are widely used to specify patterns 3 Regular expressions are widely used to specify patterns. CS308 Compiler Theory
Input Buffering Why a compiler needs buffers? ·Buffer Pairs ·Sentinels m*c2eof forward lexemeBegin CS308 Compiler Theory 4
Input Buffering • Why a compiler needs buffers? • Buffer Pairs • Sentinels CS308 Compiler Theory 4
Terminology of Languages Alphabet:a finite set of symbols (ASCII characters) String Finite sequence of symbols on an alphabet Sentence and word are also used in terms of string -8 is the empty string -s is the length of string s. ● Language:sets of strings over some fixed alphabet -3 the empty set is a language. -(s}the set containing empty string is a language The set of well-formed C programs is a language -The set of all possible identifiers is a language. Operators on Strings: -Concatenation:xy represents the concatenation of strings x and y.s s =s &s=s -sh =sss..s(n times)s CS308 Compiler Theory 5
Terminology of Languages • Alphabet : a finite set of symbols (ASCII characters) • String : – Finite sequence of symbols on an alphabet – Sentence and ord are also sed in terms of string Sentence and word are also used in terms of string – ε is the empty string – |s| is the length of string s. • Language: sets of strings over some fixed alphabet – ∅ the empty set is a language. – { ε}h ii i i l } t he set containing empty string is a language – The set of well-formed C programs is a language – The set of all possible identifiers is a language. • Operators on Strings: – Concatenation: xy represents the concatenation of strings x and y. s ε = s ε s = s 5 – s n = s s s .. s ( n times) s 0 = ε CS308 Compiler Theory
Operations on languages ·Concatenation: L L2={SiS2I S1E L1 and S2E L2} ·Union -L1UL2={s|s∈L1ors∈L2} ·Exponentiation: -L0={ε}L1=L L2=LL ·Kleene Closure -L-UL i=0 ·Positive Closure -=U2 CS308 Compiler Theory 6
Operations on Languages • Concatenation: – L L = { s s | s ∈ L and s ∈ L } 1L 2 { s1s2 | s1 ∈ L1 and s2 ∈ L2 } • Union – L L { | L L } 1 ∪ L 2 = { s| s ∈ L1 or s ∈ L2 } • Exponentiation: – L 0 = { ε} L1 = L L 2 = LL • Kleene Closure Kleene Closure – L* = U ∞ i = 0 i L • Positive Closure – L + = U ∞ i L 6 i =1 CS308 Compiler Theory
Example ·L1={a,b,c,d} L2={1,2} ·L1L2={al,a2,b1,b2,c1,c2,d1,d2} ·L1UL2={a,b,c,d,1,2} L3 all strings with length three (using a,b,c,d) L*=all strings using letters a,b,c,d and empty string L=doesn't include the empty string CS308 Compiler Theory 7
Example • L1 = {a,b,c,d} L2 = {1,2} • L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2} • L1 ∪ L2 = {a,b,c,d,1,2} • L13 = all strings with length three (using a,b,c,d} • L1* = all strings using letters a,b,c,d and empty string • L1+ = doesn’t include the empty string 7 1 py g CS308 Compiler Theory
Regular Expressions We use regular expressions to describe tokens of a programming language. A regular expression is built up of simpler regular expressions(using defining rules) Each regular expression denotes a language. A language denoted by a regular expression is called as a regular set. CS308 Compiler Theory 8
Regular Expressions • We use regular expressions to describe tokens of a programming language. • A regular expression is built up of simpler regular expressions (using d fi i l ) d efi n ing ru les ) • Each regular expression denotes a language. • A l d t d b l i i ll d A language deno t e d by a regu lar express ion is call e d as a regul t ar se t. CS308 Compiler Theory 8
Regular Expressions (Rules) Regular expressions over alphabet Reg.Expr Language it denotes 8 {ε} a∈∑ {a} ()|(2) L()UL(2) (1)(2) L(r1)L(2) (r) (L(r) (r) Lr) ·(r)=(r)r)* ·(r)?=()|E CS308 Compiler Theory 9
Regular Expressions (Rules) Regular expressions over alphabet Σ Reg. Expr Language it denotes ε { ε } a∈ Σ {a} (r ) | (r ) L(r ) ∪ L(r ) 1) | (r 2 ) L(r1) ∪ L(r 2 ) (r1) (r 2) L(r1) L(r 2 ) (r) * (L(r)) * (r) L(r) • (r) + = (r)(r) * • ( r )? = ( r ) | ε 9 ( ) ()| CS308 Compiler Theory
Regular Expressions (cont.) We may remove parentheses by using precedence rules. 一米 highest -concatenation next -1 lowest ·ab*lc means (a(b))(c) 。EX: -∑={0,1} -01=>{0,1} -(011)011)=>{00,01,10,11} -0*=>{ε,0,00,000,0000…} -(01)*=>all strings with 0 and 1,including the empty string CS308 Compiler Theory 10
Regular Expressions (cont.) • We may remove parentheses by using precedence rules. – * highest – concatenation next – | lowest • ab *|c means (a(b) *)|(c) • Ex: – Σ = {0,1} – 0|1 => {0,1} – (0|1)(0|1) => {00,01,10,11} – 0* – 0 => { ε ,0 00 000 0000 } 0,00,000,0000,.... } – (0|1)* => all strings with 0 and 1, including the empty string CS308 Compiler Theory 10