Syntax Analyzer CS308 Compiler Theory
Syntax Analyzer CS308 Compiler Theory 1
Syntax Analyzer Syntax Analyzer creates the syntactic structure of the given source program. This syntactic structure is mostly a parse tree. Syntax Analyzer is also known as parser. The syntax of a programming is described by a context-free grammar (CFG).We will use BNF(Backus-Naur Form)notation in the description of CFGs. 。 The syntax analyzer(parser)checks whether a given source program satisfies the rules implied by a context-free grammar or not. -If it satisfies,the parser creates the parse tree of that program. Otherwise the parser gives the error messages. A context-free grammar gives a precise syntactic specification of a programming language. -the design of the grammar is an initial phase of the design of a compiler. a grammar can be directly converted into a parser by some tools. CS308 Compiler Theory 2
Syntax Analyzer • Syntax Analyzer creates the syntactic structure of the given source program. • Thi i i l This syntactic structure is mostly a parse tree. • Syntax Analyzer is also known as parser. • The syntax of a programming is described by a The syntax of a programming is described by a context context-free grammar (CFG) free grammar (CFG). We will We will use BNF (Backus-Naur Form) notation in the description of CFGs. • The syntax analyzer (parser) checks whether a given source program satisfies the rules i li d b implied by a context-free grammar or not. – If it satisfies, the parser creates the parse tree of that program. – Otherwise the parser gives the error messages. • A context-free grammar – gives a precise syntactic specification of a programming language. – the desigg p g p n of the grammar is an initial phase of the design of a compiler. – a grammar can be directly converted into a parser by some tools. CS308 Compiler Theory 2
Parser Parser works on a stream of tokens. The smallest item is a token. source Lexical token Parser parse tree program Analyzer get next token CS308 Compiler Theory 3
Parser • Parser works on a stream of tokens. • The smallest item is a token The smallest item is a token. Lexical Parser source token parse tree Analyzer Parser program get next token p CS308 Compiler Theory 3
Parsers (cont.) We categorize the parsers into two groups: 1.Top-Down Parser the parse tree is created top to bottom,starting from the root. 2.Bottom-Up Parser - the parse is created bottom to top;starting from the leaves o Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time). Efficient top-down and bottom-up parsers can be implemented only for sub-classes of context-free grammars. -LL for top-down parsing LR for bottom-up parsing CS308 Compiler Theory 4
Parsers (cont.) • We categorize the parsers into two groups: 1. Top-Down Parser – the parse tree is created top to bottom starting from the root the parse tree is created top to bottom, starting from the root. 2. Bottom-Up Parser – the parse is created bottom to top; starting from the leaves the parse is created bottom to top; starting from the leaves • Both top Both top-down and bottom down and bottom-up parsers scan the input from left to right up parsers scan the input from left to right (one symbol at a time). • Efficient top-down and bottom-upp p y arsers can be implemented only for sub-classes of context-free grammars. – LL for top-down parsing LR f b i CS308 Compiler Theory 4 – LR for bottom-up parsing
Context-Free Grammars Inherently recursive structures of a programming language are defined by a context-free grammar. In a context-free grammar,we have: A finite set of terminals (in our case,this will be the set of tokens) -A finite set of non-terminals (syntactic-variables) A finite set of productions rules in the following form ·A→a where A is a non-terminal and a is a string of terminals and non-terminals (including the empty string) -A start symbol (one of the non-terminal symbol) ·Example: E→E+EIE-E|E*EE/E|-E E→(E) E→id CS308 Compiler Theory 5
Context-Free Grammars • Inherently recursive structures of a programming language are defined b a conte t by a conte x t -free grammar free grammar. • In a context-free grammar, we have: – A finite set of terminals (in our case, this will be the set of tokens) – A finite set of non-terminals (syntactic-variables) – A finite set of productions rules in the following form • A → α where A is a non-terminal and α is a string of terminals and non-terminals (including the empty string) – A start s y ( mbol (one of the non-terminal s y ) mbol ) • Example: E → E + E|E – E|E * E | E/E | E | E / E | - E E → ( E ) E → id CS308 Compiler Theory 5
Derivations E→E+E ·E+E derives from E we can replace E by E+E to able to do this,we have to have a production rule E->E+E in our grammar. E→E+E→id+E→id+id A sequence of replacements of non-terminal symbols is called a derivation of id+id from E. In general a derivation step is oAβ→oyβif there is a production rule A→y in our grammar where a and B are arbitrary strings of terminal and non-terminal symbols 01→02台.→0m (a derives from a or a derives a) → derives in one step 台 derives in zero or more steps derives in one or more steps CS308 Compiler Theory 6
Derivations E ⇒ E+E • E+E derives from E – we can replace E by E+E – to able to do this, we have to have a production rule E→E+E in our grammar. E ⇒ E+E ⇒ id+E ⇒ id+id • A sequence of replacements of non-terminal symbols is called a derivation of id+id from E. • In general a derivation step is αAβ ⇒ αγβ if there is a production rule A→γ in our grammar where α and β are arbitrary g strin s of terminal and non-terminal symbols α1 ⇒ α2 ⇒ ... ⇒ αn (αn derives from α1 or α1 derives αn ) ⇒ : di i t erives in one step ⇒ : derives in zero or more steps ⇒ : derives in one or more steps * + CS308 Compiler Theory 6
CFG-Terminology L(G)is the language ofG (the language generated by G)which is a set of sentences. A sentence ofL(G)is a string of terminal symbols of G. If S is the start symbol of G then o is a sentence ofL(G)iff S where o is a string of terminals of G. If G is a context-free grammar,L(G)is a context-free language. Two grammars are equivalent if they produce the same language. ·S0 If a contains non-terminals,it is called as a sentential form of G. If a does not contain non-terminals,it is called as a sentence of G. CS308 Compiler Theory 7
CFG - Terminology • L(G) is the language of G (the language generated by G) which is a set o f sentences. • A sentence of L(G) is a string of terminal symbols of G. • If S is the start symbol of G then ω is a sentence of L(G) iff S ⇒ ω where ω is a string of terminals of G. + • If G is a context-free grammar, L(G) is a context-free language. • Two grammars are Two grammars are equivalent equivalent if they produce the same language if they produce the same language. • S ⇒ α If t i t i l it i ll d i l f fG * • S ⇒ α - If α con t ains nonterminals, it is call e d as a sententia l form o f G. - If α does not contain non-terminals, it is called as a sentence of G. CS308 Compiler Theory 7
Derivation Example E→-E→-(E)→-(E+E)→-(id+E)→-(id+id) OR E→-E→-(E)→-(E+E)→-(E+id)→-(id+id) At each derivation step,we can choose any of the non-terminal in the sentential form of G for the replacement. If we always choose the left-most non-terminal in each derivation step,this derivation is called as left-most derivation. If we always choose the right-most non-terminal in each derivation step,this derivation is called as right-most derivation. CS308 Compiler Theory
Derivation Example E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id) O R E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id) • At each derivation step, we can choose any of the non-terminal in the sentential form of G for the re placement. • If we always choose the left-most non-terminal in each derivation step, this derivation i ll d s call e d as l f e ft-most di i er ivat ion. • If we always choose the right If we always choose the right -most non most non -terminal in each derivation step this terminal in each derivation step, this derivation is called as right-most derivation. CS308 Compiler Theory 8
Left-Most and Right-Most Derivations Left-Most Derivation E盒E盒(E)盒-(E+E)盒(id+E)盒(id+id) Right-Most Derivation E盒E盒-(E)a-(E+E)a(E+id)盒-(id+id) We will see that the top-down parsers try to find the left-most derivation of the given source program. We will see that the bottom-up parsers try to find the right-most derivation of the given source program in the reverse order. CS308 Compiler Theory 9
Left-Most and Right-Most Derivations Left-Most Derivation E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id) lm lm lm lm lm Right-Most Derivation E ⇒ - E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id) (id+id) W ill h h d fi d h l f di i rm rm rm rm rm • We will see t hat t he topdown parsers try to fi n d t he l e ft-most der ivat ion of the given source program. • We will see that the bottom-up parsers try to find the right-most derivation of the given source program in the reverse order CS308 Compiler Theory 9 derivation of the given source program in the reverse order
Parse Tree Inner nodes of a parse tree are non-terminal symbols. The leaves of a parse tree are terminal symbols. A parse tree can be seen as a graphical representation of a derivation. E→-E →-(E) →-(E+E) →-(id+E) →-(id+id E id id id CS308 Compiler Theory 10
Parse Tree • Inner nodes of a parse tree are non-terminal symbols. • The leaves of a parse tree are terminal symbols. • A parse tree can be seen as a graphical representation of a derivation. E ⇒ -E E - E E - E E - E ⇒ -(E) ⇒ -(E+E) E E + E ( ) E ( ) E E E E - ( ) E E - ( ) ⇒ -(id+E) ⇒ -(id+id) E id E + id E + E id CS308 Compiler Theory 10