Lecture 6: Index Compression
Web Search and Mining
Last lecture – index construction
▪ Sort-based indexing
  ▪ Naïve in-memory inversion
  ▪ Blocked Sort-Based Indexing
    ▪ Merge sort is effective for disk-based sorting (avoid seeks!)
▪ Single-Pass In-Memory Indexing
  ▪ No global dictionary
    ▪ Generate separate dictionary for each block
  ▪ Don't sort postings
    ▪ Accumulate postings in postings lists as they occur (see the sketch below)
▪ Distributed indexing using MapReduce
▪ Dynamic indexing: multiple indices, logarithmic merge
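As a one-screen refresher, here is a minimal sketch of the in-memory inversion idea behind SPIMI: no global dictionary, and postings accumulated as tokens occur. The function name `invert` and the toy documents are illustrative only; real SPIMI additionally writes each block's index to disk and merges the blocks afterwards.

```python
from collections import defaultdict

def invert(docs):
    """Naive in-memory inversion: postings accumulate as tokens occur.

    docs: iterable of (doc_id, text) pairs.
    Returns: term -> sorted list of doc_ids (a non-positional index).
    Real SPIMI also flushes each block's index to disk and merges blocks.
    """
    index = defaultdict(set)  # no global dictionary; terms added on first sight
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = [(1, "Caesar was ambitious"), (2, "Brutus killed Caesar")]
print(invert(docs))
# {'caesar': [1, 2], 'was': [1], 'ambitious': [1], 'brutus': [2], 'killed': [2]}
```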
This lecture
(Figure: example postings lists for BRUTUS, CAESAR, and CALPURNIA)
▪ Collection statistics in more detail (with RCV1)
▪ How big will the dictionary and postings be?
▪ Dictionary compression
▪ Postings compression
Why compression (in general)?
▪ Use less disk space
  ▪ Saves a little money
▪ Keep more stuff in memory
  ▪ Increases speed
▪ Increase speed of data transfer from disk to memory
  ▪ [read compressed data | decompress] is faster than [read uncompressed data] (see the back-of-envelope sketch below)
  ▪ Premise: decompression algorithms are fast
    ▪ True of the decompression algorithms we use
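To make the transfer-speed point concrete, here is a back-of-envelope sketch. All the numbers (150 MB/s disk throughput, 1 GB/s decompression throughput, a 4:1 compression ratio) are illustrative assumptions, not measurements:

```python
# Back-of-envelope: is "read compressed + decompress" faster than "read raw"?
# All numbers below are illustrative assumptions, not benchmarks.
raw_mb            = 1000   # postings data, uncompressed (MB)
compression_ratio = 4      # assume compressed size is raw / 4
disk_mb_per_s     = 150    # assumed sequential disk throughput
decomp_mb_per_s   = 1000   # assumed decompression throughput (output MB/s)

t_raw = raw_mb / disk_mb_per_s                      # ~6.7 s
t_compressed = ((raw_mb / compression_ratio) / disk_mb_per_s
                + raw_mb / decomp_mb_per_s)         # ~1.7 s + 1.0 s = ~2.7 s

print(f"read uncompressed: {t_raw:.1f} s")
print(f"read + decompress: {t_compressed:.1f} s")
```

Under these assumptions the compressed path wins by more than a factor of two; the premise that decompression is fast relative to disk is what makes the comparison come out this way.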
Why compression for inverted indexes?
▪ Dictionary
  ▪ Make it small enough to keep in main memory
  ▪ Make it so small that you can keep some postings lists in main memory too
▪ Postings file(s)
  ▪ Reduce disk space needed
  ▪ Decrease time needed to read postings lists from disk
  ▪ Large search engines keep a significant part of the postings in memory
    ▪ Compression lets you keep more in memory
▪ We will devise various IR-specific compression schemes
Recall Reuters RCV1

  symbol  statistic                 value
  N       documents                 800,000
  L       avg. # tokens per doc     200
  M       terms (= word types)      ~400,000
          avg. # bytes per token    6 (incl. spaces/punct.)
          avg. # bytes per token    4.5 (without spaces/punct.)
          avg. # bytes per term     7.5
          non-positional postings   100,000,000
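A few quantities implied by this table can be derived directly from it; a minimal sketch:

```python
# Derived quantities for RCV1, taken from the table above.
N = 800_000             # documents
L = 200                 # avg. tokens per document
M = 400_000             # distinct terms (word types)
postings = 100_000_000  # non-positional postings

tokens = N * L                        # 160,000,000 tokens in the collection
print(f"total tokens: {tokens:,}")

# Postings < tokens because a term occurring k times in one document
# contributes k tokens but only 1 non-positional posting.
print(f"avg. postings per token: {postings / tokens:.3f}")    # 0.625

# Uncompressed postings at 4 bytes (32 bits) per docID:
print(f"postings @ 4 B/docID: {4 * postings / 10**6:.0f} MB") # 400 MB
```

That 400 MB figure is the baseline the postings-compression schemes later in the lecture will be measured against.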
Index parameters vs. what we index (details: IIR Table 5.1, p. 80)

                  dictionary             non-positional index     positional index
                  size (K)  ∆%  cumul %  size (K)  ∆%  cumul %    size (K)  ∆%  cumul %
  Unfiltered        484                  109,971                  197,879
  No numbers        474      -2    -2    100,680    -8    -8      179,158    -9    -9
  Case folding      392     -17   -19     96,969    -3   -12      179,158     0    -9
  30 stopwords      391      -0   -19     83,390   -14   -24      121,858   -31   -38
  150 stopwords     391      -0   -19     67,002   -30   -39       94,517   -47   -52
  Stemming          322     -17   -33     63,812    -4   -42       94,517     0   -52

Exercise: give intuitions for all the '0' entries. Why do some zero entries correspond to big deltas in other columns?
Lossless vs. lossy compression
▪ Lossless compression: all information is preserved.
  ▪ What we mostly do in IR.
▪ Lossy compression: discard some information.
  ▪ Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination (see the sketch below).
  ▪ Chap 7: prune postings entries that are unlikely to turn up in the top k list for any query.
    ▪ Almost no loss of quality for the top k list.
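A tiny illustration of preprocessing as lossy compression: the token stream shrinks, and the original text can no longer be reconstructed from it. The five-word stop list and the use of `str.lower()` as a stand-in for full case folding are simplifying assumptions:

```python
# Case folding + stop word removal as lossy "compression" of the token stream.
# The 5-word stop list is a toy assumption; real lists have 30-150+ entries.
STOP_WORDS = {"the", "a", "an", "of", "and"}

def preprocess(tokens):
    """Lowercase every token and drop stop words (lossy: not invertible)."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The Senate and the People of Rome".split()
print(preprocess(tokens))                           # ['senate', 'people', 'rome']
print(len(tokens), "->", len(preprocess(tokens)))   # 7 -> 3
```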
Vocabulary vs. collection size
▪ How big is the term vocabulary?
  ▪ That is, how many distinct words are there?
▪ Can we assume an upper bound?
  ▪ Not really: at least 70^20 ≈ 10^37 different words of length 20 (see the check below)
▪ In practice, the vocabulary will keep growing with the collection size
  ▪ Especially with Unicode ☺
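A quick sanity check on that bound, taking the slide's implicit alphabet of roughly 70 characters (letters, digits, punctuation) as given:

```python
import math

# Sanity check: strings of length 20 over a ~70-symbol alphabet.
print(f"70^20 = {70 ** 20:.2e}")                     # ~7.98e+36
print(f"log10(70^20) = {20 * math.log10(70):.1f}")   # ~36.9, i.e. ~10^37
```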
Vocabulary vs. collection size
▪ Heaps' law: M = kT^b
  ▪ M is the size of the vocabulary, T is the number of tokens in the collection
▪ Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
▪ In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about ½
  ▪ It is the simplest possible relationship between the two in log-log space
▪ An empirical finding ("empirical law")
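A minimal sketch of using Heaps' law to predict vocabulary size; the parameter values k = 44 and b = 0.49 are the ones IIR Chapter 5 fits for RCV1, quoted here as an assumption:

```python
# Heaps' law, M = k * T^b: predict vocabulary size from number of tokens.
# k = 44 and b = 0.49 are the values IIR fits for RCV1 (an assumption here).
k, b = 44, 0.49

def heaps(T):
    """Predicted number of distinct terms after seeing T tokens."""
    return k * T ** b

for T in (10**6, 10**7, 160 * 10**6):   # 160M tokens = roughly all of RCV1
    print(f"T = {T:>12,}  ->  M = {heaps(T):,.0f}")

# Taking logs: log M = log k + b * log T,
# i.e. a straight line with slope b in log-log space.
```

With these parameters the prediction for the full collection (about 160 million tokens) comes out near 460,000 terms, the same order of magnitude as the ~400,000 terms in the RCV1 statistics table earlier.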