Index Compression
Collection Statistics

Vocabulary vs. collection size
▪ How big is the term vocabulary?
▪ That is, how many distinct words are there?
▪ Can we assume an upper bound?
▪ Not really: there are at least 70^20 ≈ 10^37 different words of length 20
▪ In practice, the vocabulary will keep growing with the collection size
▪ Especially with Unicode ☺
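
The two claims above can be checked with a short sketch. It first reproduces the 70^20 ≈ 10^37 arithmetic using the slide's figures (70 symbols, word length 20), then counts distinct terms while reading a token stream. The function name `vocabulary_growth` and the toy token list are hypothetical, chosen only for illustration; a real collection would need a proper tokenizer.

```python
import math

# Figures taken from the slide: 70 possible characters, words of length 20.
ALPHABET_SIZE = 70
WORD_LENGTH = 20

# Number of distinct strings of exactly that length over that alphabet.
possible_words = ALPHABET_SIZE ** WORD_LENGTH
order_of_magnitude = math.log10(possible_words)


def vocabulary_growth(tokens):
    """Return the running vocabulary size after each token is read.

    Entry i is the number of distinct terms among tokens[0..i].
    """
    seen = set()
    sizes = []
    for token in tokens:
        seen.add(token)
        sizes.append(len(seen))
    return sizes


# Hypothetical toy "collection", split on whitespace (an assumption, not a real tokenizer).
toy_tokens = "to be or not to be that is the question".split()
growth = vocabulary_growth(toy_tokens)

print(f"70^20 is roughly 10^{order_of_magnitude:.1f}")
print("running vocabulary size:", growth)
```

On a real collection, the running size typically keeps rising as more tokens are read, which is the behavior the last two bullets describe.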