Term Vocabulary and Postings Lists Web Search and Mining Lecture 3: The term vocabulary and postings lists
Term Vocabulary and Postings Lists 1 Lecture 3: The term vocabulary and postings lists Web Search and Mining
Term Vocabulary and Postings Lists Recap of the previous lecture Basic inverted indexes Structure: Dictionary and Postings BRUTUS 124113145173174 CAeSAR 24561657132 calpurnIA→[23154101 Key step in construction Sorting Boolean query processing Intersection by linear time"merging Simple optimizations
Term Vocabulary and Postings Lists 2 Recap of the previous lecture ▪ Basic inverted indexes: ▪ Structure: Dictionary and Postings ▪ Key step in construction: Sorting ▪ Boolean query processing ▪ Intersection by linear time “merging” ▪ Simple optimizations
Term Vocabulary and Postings Lists Plan for this lecture Elaborate basic indexing Preprocessing to form the term vocabulary Documents Tokenization What terms do we put in the index? Postings Faster merges: skip lists Positional postings and phrase queries
Term Vocabulary and Postings Lists 3 Plan for this lecture Elaborate basic indexing ▪ Preprocessing to form the term vocabulary ▪ Documents ▪ Tokenization ▪ What terms do we put in the index? ▪ Postings ▪ Faster merges: skip lists ▪ Positional postings and phrase queries
Term Vocabulary and Postings Lists Recall the basic indexing pipeline Documents to Ga?b Friends. Romans, countrymen. be indexed Tokenizer Token stream Friends Romans Countrymen Linguistic modules Modified tokens friend roman countryman Indexer friend 24 averted index roman countryman put 1316
Term Vocabulary and Postings Lists 4 Recall the basic indexing pipeline Tokenizer Token stream. Friends Romans Countrymen Linguistic modules Modified tokens. friend roman countryman Indexer Inverted index. friend roman countryman 2 4 2 13 16 1 Documents to be indexed. Friends, Romans, countrymen
Term Vocabulary and Postings Lists Document Delineation Parsing a document What format is it in? pdf/word/ excel/html? What language is it in? What character set is in use? Each of these is a classification problem which we will study later in the course But these tasks are often done heuristically
Term Vocabulary and Postings Lists 5 Parsing a document ▪ What format is it in? ▪ pdf/word/excel/html? ▪ What language is it in? ▪ What character set is in use? Each of these is a classification problem, which we will study later in the course. But these tasks are often done heuristically … Document Delineation
Term Vocabulary and Postings Lists Document Delineation Complications: Format/language Documents being indexed can include docs from many different languages a single index may have to contain terms of several languages a Sometimes a document or its components can contain multiple languages/formats French email with a German pdf attachment What is a unit document? A file? An email?( Perhaps one of many in an mbox An email with 5 attachments? A group of files(ppt or la teX as html pages)
Term Vocabulary and Postings Lists 6 Complications: Format/language ▪ Documents being indexed can include docs from many different languages ▪ A single index may have to contain terms of several languages. ▪ Sometimes a document or its components can contain multiple languages/formats ▪ French email with a German pdf attachment. ▪ What is a unit document? ▪ A file? ▪ An email? (Perhaps one of many in an mbox.) ▪ An email with 5 attachments? ▪ A group of files (PPT or LaTeX as HTML pages) Document Delineation
Term Vocabulary and Postings Lists Vocabulary of Terms TOKENS AND TERMS
Term Vocabulary and Postings Lists 7 TOKENS AND TERMS Vocabulary of Terms
Term Vocabulary and Postings Lists Vocabulary of Terms Tokenization Input: "Friends, Romans and Countrymen Output: Tokens friends Romans Countrymen a token is an instance of a sequence of characters Each such token is now a candidate for an index entry after further processing Described below But what are valid tokens to emit?
Term Vocabulary and Postings Lists 8 Tokenization ▪ Input: “Friends, Romans and Countrymen” ▪ Output: Tokens ▪ Friends ▪ Romans ▪ Countrymen ▪ A token is an instance of a sequence of characters ▪ Each such token is now a candidate for an index entry, after further processing ▪ Described below ▪ But what are valid tokens to emit? Vocabulary of Terms
Term Vocabulary and Postings Lists Vocabulary of Terms Tokenization Issues in tokenization Finland's capita/→ Finland? Fin/ands? Finland's? Hewlett-Packard -> Hewlett and packard as two tokens? state-of-the-art: break up hyphenated sequence CO-education lowercase, ower-case, lower case It can be effective to get the user to put in possible hyphens San francisco one token or two? How do you decide it is one token?
Term Vocabulary and Postings Lists 9 Tokenization ▪ Issues in tokenization: ▪ Finland’s capital → Finland? Finlands? Finland’s? ▪ Hewlett-Packard → Hewlett and Packard as two tokens? ▪ state-of-the-art: break up hyphenated sequence. ▪ co-education ▪ lowercase, lower-case, lower case ? ▪ It can be effective to get the user to put in possible hyphens ▪ San Francisco: one token or two? ▪ How do you decide it is one token? Vocabulary of Terms
Term Vocabulary and Postings Lists Vocabulary of Terms Numbers 3/20/91 Mar12,1991 20/391 55B.C B-52 My PG key is 324a3df234cb23e 800)234-2333 Often have embedded spaces Older iR systems may not index numbers But often very useful: think about things like looking up error codes/stacktraces on the web (One answer is using n-grams Lecture 2.2) Will often index"meta-data" separately Creation date, format etc
Term Vocabulary and Postings Lists 10 Numbers ▪ 3/20/91 Mar. 12, 1991 20/3/91 ▪ 55 B.C. ▪ B-52 ▪ My PGP key is 324a3df234cb23e ▪ (800) 234-2333 ▪ Often have embedded spaces ▪ Older IR systems may not index numbers ▪ But often very useful: think about things like looking up error codes/stacktraces on the web ▪ (One answer is using n-grams: Lecture 2.2) ▪ Will often index “meta-data” separately ▪ Creation date, format, etc. Vocabulary of Terms