Term Vocabulary and Postings Lists
Vocabulary of Terms

Tokenization
▪ Issues in tokenization:
  ▪ Finland’s capital → Finland? Finlands? Finland’s?
  ▪ Hewlett-Packard → Hewlett and Packard as two tokens?
  ▪ state-of-the-art: break up hyphenated sequence.
  ▪ co-education
  ▪ lowercase, lower-case, lower case?
  ▪ It can be effective to get the user to put in possible hyphens
▪ San Francisco: one token or two?
  ▪ How do you decide it is one token?
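
The tokenization questions above can be made concrete with a small sketch. It is only an illustration, not a method taken from the slides: the function names `tokenize` and `merge_phrases`, the two policy switches, and the hand-made phrase set are all hypothetical, and the regular expression handles plain ASCII letters only.

```python
import re

def tokenize(text, split_hyphens=False, strip_possessive=False):
    """Split text into lowercase tokens under two illustrative policy switches.

    Both switches are assumptions made for this sketch; a real system would
    choose its policy per language and per collection.
    """
    text = text.replace("\u2019", "'")  # treat curly and straight apostrophes alike
    tokens = []
    # A word is a run of letters, optionally joined by internal ' or - characters.
    for raw in re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text):
        tok = raw.lower()
        if strip_possessive and tok.endswith("'s"):
            tok = tok[:-2]                 # finland's -> finland
        if split_hyphens:
            tokens.extend(tok.split("-"))  # hewlett-packard -> hewlett, packard
        else:
            tokens.append(tok)
    return tokens

def merge_phrases(tokens, phrases):
    """Join adjacent token pairs found in a hand-made phrase set (hypothetical)."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(tokenize("Finland's capital"))
print(tokenize("Finland's capital", strip_possessive=True))
print(tokenize("Hewlett-Packard", split_hyphens=True))
print(merge_phrases(tokenize("San Francisco"), {("san", "francisco")}))
```

Whichever policy is chosen, the same one has to be applied to documents and to queries, otherwise a query for "lower-case" may fail to match a document indexed as "lowercase". Deciding that San Francisco is one token needs outside knowledge such as the phrase set assumed here; the characters alone do not settle it.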