Web Search and Mining
Lecture 8: Scoring and results assembly

Computing Scores in a Complete Search System
Recap: tf-idf weighting
▪ The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

▪ Best known weighting scheme in information retrieval
▪ Increases with the number of occurrences within a document
▪ Increases with the rarity of the term in the collection
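The weighting formula above can be sketched directly in Python (a minimal illustration; the function name and the zero-frequency guard are my own):

```python
import math

def tfidf_weight(tf, df, N):
    """tf-idf weight: (1 + log10 tf) * log10(N / df).
    tf: term frequency in the document; df: document frequency of the term;
    N: number of documents in the collection."""
    if tf == 0 or df == 0:
        return 0.0  # term absent from doc, or from the whole collection
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring 10 times in a document, appearing in 100 of 1,000,000 docs:
print(tfidf_weight(10, 100, 1_000_000))  # → 8.0  (i.e., 2 × 4)
```

Note how both slide claims show up in the code: the weight grows with tf (occurrences in the document) and with N/df (rarity in the collection).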
Recap: Queries as vectors
▪ Key idea 1: Do the same for queries: represent them as vectors in the space
▪ Key idea 2: Rank documents according to their proximity to the query in this space
▪ proximity = similarity of vectors
Recap: cosine(query, document)

  cos(q, d) = (q · d) / (|q| |d|) = (q / |q|) · (d / |d|)
            = Σ_{i=1}^{|V|} q_i d_i / ( sqrt(Σ_{i=1}^{|V|} q_i²) × sqrt(Σ_{i=1}^{|V|} d_i²) )

▪ The numerator is the dot product; dividing by the norms makes q and d unit vectors.
▪ cos(q, d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
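The cosine formula above can be sketched for sparse term-weight vectors represented as dicts (the function name and the toy term weights are illustrative):

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse term-weight vectors (dicts term -> weight)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())   # q · d
    norm_q = math.sqrt(sum(w * w for w in q.values()))   # |q|
    norm_d = math.sqrt(sum(w * w for w in d.values()))   # |d|
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

q = {"jealous": 1.0, "gossip": 1.0}
d = {"jealous": 2.0, "gossip": 1.0, "affection": 1.0}
print(round(cosine(q, d), 3))  # → 0.866  = 3 / sqrt(2 * 6)
```

Because the result is length-normalized, a long document is not favored over a short one merely for repeating the query terms more often.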
This lecture
▪ Speeding up vector space ranking
▪ Putting together a complete search system
▪ Will require learning about a number of miscellaneous topics and heuristics
Efficient Ranking

Computing cosine scores

CosineScore(q)
 1  float Scores[N] = 0
 2  float Length[N]
 3  for each query term t
 4  do calculate w_{t,q} and fetch postings list for t
 5     for each pair (d, tf_{t,d}) in postings list
 6     do Scores[d] += w_{t,d} × w_{t,q}
 7  Read the array Length
 8  for each d
 9  do Scores[d] = Scores[d] / Length[d]
10  return Top K components of Scores[]
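The CosineScore pseudocode above can be sketched in Python, assuming a toy in-memory index mapping each term to a postings list of (doc_id, tf) pairs, and a precomputed Length array of document vector norms (all names here are illustrative, and the tf-idf weighting is the one from the recap slide):

```python
import heapq
import math

def cosine_score(query, index, length, N, K):
    """Rank docs by cosine(query, doc), following the CosineScore pseudocode.
    index: term -> list of (doc_id, tf) postings (doc_ids in 0..N-1);
    length: doc_id -> document vector norm; K: number of results wanted."""
    scores = [0.0] * N
    for t in query:
        postings = index.get(t, [])
        df = len(postings)
        if df == 0:
            continue                    # term absent from the collection
        idf = math.log10(N / df)
        w_tq = idf                      # query weight (each term occurs once)
        for d, tf in postings:
            w_td = (1 + math.log10(tf)) * idf
            scores[d] += w_td * w_tq    # accumulate the dot product
    for d in range(N):                  # length-normalize
        if length[d] > 0:
            scores[d] /= length[d]
    return heapq.nlargest(K, range(N), key=lambda d: scores[d])
```

Accumulating scores one postings list at a time is "term-at-a-time" processing: each document's partial dot product lives in Scores until all query terms have been processed.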
Efficient cosine ranking
▪ Find the K docs in the collection “nearest” to the query ⇒ K largest query-doc cosines.
▪ Efficient ranking:
▪ Computing a single cosine efficiently.
▪ Choosing the K largest cosine values efficiently.
▪ Can we do this without computing all N cosines?
Efficient cosine ranking
▪ What we’re doing in effect: solving the K-nearest neighbor problem for a query vector
▪ In general, we do not know how to do this efficiently for high-dimensional spaces
▪ But it is solvable for short queries, and standard indexes support this well
Special case – unweighted queries
▪ No weighting on query terms
▪ Assume each query term occurs only once
▪ Then, for ranking, we don’t need to normalize the query vector
▪ Slight simplification of the algorithm from Lecture 7
Faster cosine: unweighted query

FastCosineScore(q)
 1  float Scores[N] = 0
 2  for each d
 3  do Initialize Length[d] to the length of doc d
 4  for each query term t
 5  do calculate w_{t,q} and fetch postings list for t
 6     for each pair (d, tf_{t,d}) in postings list
 7     do add wf_{t,d} to Scores[d]
 8  Read the array Length[d]
 9  for each d
10  do Divide Scores[d] by Length[d]
11  return Top K components of Scores[]

Figure 7.1: A faster algorithm for vector space scores.
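The unweighted-query variant can be sketched the same way (again with illustrative names and a toy index of (doc_id, tf) postings). Since each query term occurs once and normalizing the query vector scales every score by the same constant, the query-side weight can be dropped without changing the ranking; here only a tf-based document weight is accumulated:

```python
import heapq
import math

def fast_cosine_score(query, index, length, N, K):
    """Unweighted-query scoring in the spirit of Figure 7.1: no query
    normalization, only document term weights are accumulated."""
    scores = [0.0] * N
    for t in query:
        for d, tf in index.get(t, []):
            scores[d] += 1 + math.log10(tf)  # tf-only doc weight (sketch);
            # an idf factor could be folded in here as in the full version
    for d in range(N):
        if length[d] > 0:
            scores[d] /= length[d]           # length-normalize
    return heapq.nlargest(K, range(N), key=lambda d: scores[d])
```

The saving over the full algorithm is that w_{t,q} is never computed and the query vector is never normalized; only the per-document accumulation and the final division by Length remain.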