Information Retrieval and Data Mining, 2019/3/7 (slide 19)

Reuters-RCV1 corpus: temporary files in index construction

• N = 800,000, so $2^{20} > N > 2^{16}$ → 16 bits are too few, so a document ID needs a 32-bit field
• T = 100,000,000, so $2^{27} > T > 2^{16}$ → a term ID needs a 32-bit field
• Storing the "term ID, document ID" pairs requires T × (32 bits + 32 bits) = 0.8 GB

Symbol   Meaning                      Value
N        Total number of documents    800,000
T        Total number of tokens       100,000,000

[Figure: the (term, docID) pairs extracted from two example documents are sorted by term and then grouped into a dictionary (term, doc. freq.) with postings lists. Doc 1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me." Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:"]

We need to sort 0.8 GB of ID pairs! And real corpora are larger than RCV1.
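The size estimate and the sort step on this slide can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the course's own code: the helper names (`bits_needed`, `field_width`, `tokenize`) are hypothetical, the tokenizer is a deliberately crude lowercase-letters regex, and the term ID width is derived from T because that is the bound the slide uses.

```python
import re
from itertools import groupby

# Values taken from the slide.
N = 800_000        # total number of documents
T = 100_000_000    # total number of tokens

def bits_needed(count):
    """Minimum number of bits that gives `count` items distinct IDs."""
    return max(1, (count - 1).bit_length())

def field_width(bits, widths=(8, 16, 32, 64)):
    """Smallest standard integer width (an assumption) that holds `bits` bits."""
    return next(w for w in widths if w >= bits)

doc_id_bits = field_width(bits_needed(N))    # 20 bits round up to 32
term_id_bits = field_width(bits_needed(T))   # 27 bits round up to 32
total_bytes = T * (term_id_bits + doc_id_bits) // 8
print(f"temporary file size: {total_bytes / 1e9:.1f} GB")

# The two example documents shown in the slide's figure.
docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:",
}

def tokenize(text):
    """Crude tokenizer (assumption): lowercase runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

# Step 1: emit one (term, docID) pair per token.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

# Step 2: sort by term, then by docID. This is the expensive step at RCV1 scale.
pairs.sort()

# Step 3: group equal terms and collapse duplicate docIDs into a postings list.
index = {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    index[term] = sorted({doc_id for _, doc_id in group})

for term in ("brutus", "caesar", "killed"):
    print(term, len(index[term]), index[term])
```

On two short documents the sort is trivial and fits in memory. The slide's point is that the same step applied to 100,000,000 pairs means sorting about 0.8 GB of fixed-width records, which is why the pairs are written to temporary files.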