Index Compression

Collection Statistics

Recall Reuters RCV1

▪ symbol   statistic                                       value
▪ N        documents                                       800,000
▪ L        avg. # tokens per doc                           200
▪ M        terms (= word types)                            ~400,000
▪          avg. # bytes per token (incl. spaces/punct.)    6
▪          avg. # bytes per token (without spaces/punct.)  4.5
▪          avg. # bytes per term                           7.5
▪          non-positional postings                         100,000,000
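To get a feel for the scale these figures imply, the following is a minimal sketch that derives a few back-of-the-envelope sizes from the table. The constant and function names are hypothetical, and the choice to store each docID in the smallest whole number of bits that can distinguish N documents is an assumption made for illustration, not something the slide states. Because the inputs are rounded averages, the results are rough estimates only.

```python
# Figures copied from the RCV1 collection statistics table.
N_DOCS = 800_000                 # N
AVG_TOKENS_PER_DOC = 200         # L
N_TERMS = 400_000                # M (approximate)
BYTES_PER_TOKEN_WITH_SPACES = 6  # incl. spaces/punctuation
BYTES_PER_TERM = 7.5
N_POSTINGS = 100_000_000         # non-positional postings


def derived_stats():
    """Rough size estimates derived from the rounded table values."""
    total_tokens = N_DOCS * AVG_TOKENS_PER_DOC
    raw_text_bytes = total_tokens * BYTES_PER_TOKEN_WITH_SPACES
    # Bytes needed just for the term strings, with no pointers or counts.
    term_string_bytes = int(N_TERMS * BYTES_PER_TERM)
    # Assumption: docIDs run from 0 to N-1 and use a fixed bit width.
    bits_per_doc_id = (N_DOCS - 1).bit_length()
    # Uncompressed postings if every docID takes that fixed width.
    postings_bytes = N_POSTINGS * bits_per_doc_id // 8
    return {
        "total_tokens": total_tokens,
        "raw_text_bytes": raw_text_bytes,
        "term_string_bytes": term_string_bytes,
        "bits_per_doc_id": bits_per_doc_id,
        "postings_bytes": postings_bytes,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:,}")
```

Under these assumptions the raw text comes to a little under 1 GB, the term strings to about 3 MB, and fixed-width postings to about 250 MB, which suggests why the postings lists are the main target for compression.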