Information Retrieval and Data Mining 2019/3/31
Course Requirements (2019): Paper Reading & Discussion
Paper Reading Guidelines
• Each student reads one paper and prepares slides. One or two class sessions will be scheduled for presentations (presenters are drawn at random).
• Choose from the Best Papers or Honourable Mentions of the past 10 years at conferences related to the course, such as:
• SIGIR (Information Retrieval)
• WWW (World Wide Web)
• KDD (Knowledge Discovery and Data Mining)
• CIKM (Information and Knowledge Management)
• http://jeffhuang.com/best_paper_awards.html
• NIPS (Neural Information Processing Systems)
• https://nips.cc/
Best Paper Awards in Computer Science (since 1996)
• https://jeffhuang.com/best_paper_awards.html
• By Conference: AAAI ACL CHI CIKM CVPR FOCS FSE ICCV ICML ICSE IJCAI INFOCOM KDD MOBICOM NSDI OSDI PLDI PODS S&P SIGCOMM SIGIR SIGMETRICS SIGMOD SODA SOSP STOC UIST VLDB WWW
http://sigir.org/sigir2019/
• The 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval will take place on July 21-25, 2019 in Paris.
• ACM SIGIR stands for the Association for Computing Machinery's Special Interest Group on Information Retrieval, which also gives the conference its name. SIGIR focuses on all aspects of information storage, retrieval, and dissemination, including research strategies, output schemes, and system evaluation.
• The conference's history dates back to 1971, when Jack Minker and Sam Rosenfeld organized the ACM SIGIR symposium on information storage and retrieval.
SIGIR 2018 Best Paper
Should I Follow the Crowd? A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems
• The use of IR methodology in the evaluation of recommender systems has become common practice in recent years. IR metrics have been found however to be strongly biased towards rewarding algorithms that recommend popular ... The fundamental question remains open though whether popularity is really a bias we should avoid or not; whether it could be a useful and reliable signal in recommendation, or it may be unfairly rewarded by the experimental biases. ... We build a crowdsourced dataset devoid of the usual biases ...
[Figure 1: Typical offline experimental results for non-personalized popularity-based recommendation compared to personalized algorithms on two public datasets (MovieLens 1M, Netflix). Algorithms shown: random recommendation, popularity (nr. positive ratings), average rating, user-based kNN, item-based kNN, matrix factorization.]
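To make the "popularity (nr. positive ratings)" baseline from the figure concrete, here is a minimal sketch of a non-personalized popularity recommender. It is not the paper's code: the toy ratings, the user and item names, the function names, and the rule that a rating of 4 or more counts as positive are all assumptions made for illustration.

```python
from collections import Counter

POSITIVE_THRESHOLD = 4  # assumption: ratings >= 4 on a 1-5 scale count as positive


def popularity_ranking(ratings):
    """Rank items by their number of positive ratings (ties broken by item id)."""
    counts = Counter(item for _, item, r in ratings if r >= POSITIVE_THRESHOLD)
    return sorted(counts, key=lambda item: (-counts[item], item))


def recommend(ratings, user, k=2):
    """Return the k most popular items that the user has not rated yet."""
    seen = {item for u, item, _ in ratings if u == user}
    return [item for item in popularity_ranking(ratings) if item not in seen][:k]


# Hypothetical toy data: (user, item, rating) triples.
ratings = [
    ("alice", "A", 5), ("bob", "A", 4), ("carol", "A", 5),
    ("alice", "B", 4), ("bob", "B", 2),
    ("carol", "C", 5), ("bob", "C", 4),
    ("dave", "D", 1),
]

print(popularity_ranking(ratings))
print(recommend(ratings, "dave"))
print(recommend(ratings, "alice"))
```

Every user receives the same ranking, minus the items they have already rated, which is what makes the baseline non-personalized.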
SIGIR 2017 Best Paper
BitFunnel: Revisiting Signatures for Search
• Since the mid-90s there has been a widely-held belief that signature files are inferior to inverted files for text indexing. In recent years the Bing search engine has developed and deployed an index based on bit-sliced signatures. This index, known as BitFunnel, replaced an existing production system based on an inverted index. ...
• The BitFunnel algorithm directly addresses four fundamental limitations in bit-sliced block signatures. At the same time, our mapping of the algorithm onto a cluster offers opportunities to avoid other costs associated with signatures. We show these innovations yield a significant efficiency gain versus classic bit-sliced signatures and then compare BitFunnel with Partitioned Elias-Fano Indexes, MG4J, and Lucene.
https://dl.acm.org/citation.cfm?doid=3077136.3080789
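As background for the abstract, the following is a minimal sketch of a classic bit-sliced signature index, the baseline idea that BitFunnel revisits. It is not the BitFunnel algorithm. The signature width, the number of hash positions, the salted SHA-256 hashing, and the toy documents are all assumptions chosen for illustration.

```python
import hashlib

NUM_ROWS = 64    # assumed signature width in bits
NUM_HASHES = 3   # assumed number of bit positions set per term


def term_rows(term):
    """Map a term to NUM_HASHES row indices via salted SHA-256 (illustrative choice)."""
    rows = []
    for salt in range(NUM_HASHES):
        digest = hashlib.sha256(f"{salt}:{term}".encode("utf-8")).digest()
        rows.append(int.from_bytes(digest[:8], "big") % NUM_ROWS)
    return rows


def build_index(docs):
    """Build the bit-sliced layout: one integer per row, where bit d of
    rows[r] is set when document d sets signature bit r."""
    rows = [0] * NUM_ROWS
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            for r in term_rows(term):
                rows[r] |= 1 << doc_id
    return rows


def candidates(rows, query, num_docs):
    """AND the rows of every query term. Surviving bits mark candidate
    documents; false positives are possible, false negatives are not."""
    acc = (1 << num_docs) - 1
    for term in query.lower().split():
        for r in term_rows(term):
            acc &= rows[r]
    return [d for d in range(num_docs) if (acc >> d) & 1]


# Hypothetical toy corpus.
docs = [
    "signature files for text indexing",
    "inverted files for search engines",
    "bit sliced signatures in bing",
]
index = build_index(docs)
result = candidates(index, "signature files", len(docs))
print(result)
```

Because each row covers all documents at once, a query touches only the rows of its own terms and combines them with bitwise ANDs. The candidate list always contains document 0 here, but it may also contain false positives that a real system would need to filter out.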
SIGIR 2017 Honourable Mentions
• IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models
• Jun Wang (University College London), Lantao Yu (Shanghai Jiao Tong University), Weinan Zhang (Shanghai Jiao Tong University), Yu Gong (Alibaba Inc.), Yinghui Xu (Alibaba Inc.), Benyou Wang (Tianjin University), Peng Zhang (Tianjin University), Dell Zhang (Birkbeck, University of London)
• Evaluation metric design has long been one of the core problems in information retrieval research, and estimating a user's expected benefit and expected effort is a key component of search user behavior models. Constrained by their model frameworks, almost no existing IR evaluation metric can take both expected benefit and expected effort into account when estimating the session stopping condition. To address this, faculty and students of the Department of Computer Science, inspired by the mechanics of the popular video game "Bejeweled", designed a novel user interaction model framework that remodels the benefit and effort factors and subsumes the vast majority of existing evaluation metrics. Experiments on real user behavior data show that the framework predicts user satisfaction better than existing metrics.
SIGIR 2016 Best Paper
Understanding Information Need: an fMRI Study
• In this paper, we investigate the connection between an information need and brain activity. Using functional Magnetic Resonance Imaging (fMRI), we measured the brain activity of twenty four participants while they performed a Question Answering (Q/A) Task, where the questions were carefully selected and developed from TREC-8 and TREC 2001 Q/A Track. The results of this experiment revealed a distributed network of brain regions commonly associated with activities related to information need and retrieval and differing brain activity in processing scenarios when participants knew the answer to a given question and when they did not and needed to search.
[Figure 3: The five activation clusters from Scenario 1 are projected onto the average anatomical structure for three transverse sections (Z=24, Z=15, Z=7). Labeled regions: Inferior Frontal Gyrus, left Caudate Body, Thalamus, Posterior Cingulate, right Caudate Head. Note that the brains are in radiological format where the left side of the brain is on the right side of the image.]
SIGIR 2015 Best Paper
QuickScorer: a Fast Algorithm to Rank Documents with Additive Ensembles of Regression Trees
• Learning-to-Rank models based on additive ensembles of regression trees have proven to be very effective for ranking query results returned by Web search engines. ... Unfortunately, the computational cost of these ranking models is high. ... we present QuickScorer, a new algorithm that adopts a novel bitvector representation of the tree-based ranking model, and performs an interleaved traversal of the ensemble by means of simple logical bitwise operations. ... QuickScorer is able to achieve speedups over the best state-of-the-art baseline ranging from 2x to 6.5x.
Note: Linear regression fits a single global model to all sample points. When the data has many features with complex relationships among them, building one global model is both difficult and unwieldy. Moreover, many real-world problems are nonlinear, such as the commonly seen piecewise functions, and cannot be fitted by a global linear model. Tree regression splits the dataset into several subsets that are each easy to model, then fits each subset with linear regression.
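The note on tree regression can be illustrated with a small sketch: a single split followed by a separate linear fit on each side, compared against one global line. This is a depth-1 model tree written in plain Python for illustration, not QuickScorer and not a full regression-tree learner. The piecewise test function, the `min_leaf` parameter, and the function names are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ a*x + b; returns (a, b, sum of squared errors)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def best_split(xs, ys, min_leaf=5):
    """Try every split position on x-sorted data, fit one line per side, and
    return (threshold, total SSE) of the best two-leaf model."""
    best_threshold, best_sse = None, float("inf")
    for i in range(min_leaf, len(xs) - min_leaf + 1):
        sse = fit_line(xs[:i], ys[:i])[2] + fit_line(xs[i:], ys[i:])[2]
        if sse < best_sse:
            best_threshold, best_sse = xs[i], sse
    return best_threshold, best_sse


# Hypothetical piecewise-linear data with a break at x = 5 (xs is already sorted).
xs = [i / 10 for i in range(101)]
ys = [2 * x if x < 5 else 20 - x for x in xs]

global_sse = fit_line(xs, ys)[2]
threshold, split_sse = best_split(xs, ys)
print(f"global SSE = {global_sse:.2f}")
print(f"split at x >= {threshold}, SSE = {split_sse:.2e}")
```

On this data the single global line leaves a large residual, while the two-leaf model recovers the break point and fits both segments almost exactly. A full tree learner would apply the same split search recursively inside each leaf.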