Hashing based Answer Selection

Dong Xu and Wu-Jun Li∗
National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization
Department of Computer Science and Technology, Nanjing University, China
dc.swind@gmail.com, liwujun@nju.edu.cn

Abstract

Answer selection is an important subtask of question answering (QA), in which deep models usually achieve better performance than non-deep models. Most deep models adopt question-answer interaction mechanisms, such as attention, to get vector representations for answers. When these interaction based deep models are deployed for online prediction, the representations of all answers need to be recalculated for each question. This procedure is time-consuming for deep models with complex encoders like BERT, which usually have better accuracy than simple encoders. One possible solution is to store the matrix representation (encoder output) of each answer in memory to avoid recalculation. But this will bring large memory cost. In this paper, we propose a novel method, called hashing based answer selection (HAS), to tackle this problem. HAS adopts a hashing strategy to learn a binary matrix representation for each answer, which can dramatically reduce the memory cost for storing the matrix representations of answers. Hence, HAS can adopt complex encoders like BERT in the model, but the online prediction of HAS is still fast with a low memory cost. Experimental results on three popular answer selection datasets show that HAS can outperform existing models to achieve state-of-the-art performance.

Introduction

Question answering (QA) is an important but challenging task in the natural language processing (NLP) area. Answer selection (answer ranking), which aims to select the corresponding answer from a pool of candidate answers for a given question, is one of the key components in many kinds of QA applications. For example, in community-based question answering (CQA) tasks, all answers need to be ranked according to their quality. In frequently asked questions (FAQ) tasks, the most relevant answers need to be returned to answer the users' questions.

One main challenge of answer selection is that both questions and answers are not long enough in most cases. As a result, questions and answers usually lack background information and knowledge about the context (Deng et al. 2018).

∗Wu-Jun Li is the corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

This phenomenon limits the performance of answer selection models. Deep neural network (DNN) based models, also simply called deep models, can partly tackle this problem by using pre-trained word embeddings. Word embeddings pre-trained on a language corpus contain some common knowledge and linguistic phenomena, which are helpful for selecting answers. Deep models have achieved promising performance for answer selection in recent years (Tan et al. 2016b; Santos et al. 2016; Tay, Tuan, and Hui 2018a; Tran and Niederée 2018; Deng et al. 2018).

Most deep models for answer selection are constructed with similar frameworks which contain an encoding layer (also called encoder) and a composition layer (also called composition module). Traditional models usually adopt convolutional neural networks (CNN) (Feng et al. 2015) or recurrent neural networks (RNN) (Tan et al. 2016b; Tran and Niederée 2018) as encoders. Recently, complex pre-trained models such as BERT (Devlin et al.
2018) and GPT-2 (Radford et al. 2019) have been proposed for NLP tasks. BERT and GPT-2 adopt Transformer (Vaswani et al. 2017) as the key building block, which discards CNN and RNN entirely. BERT and GPT-2 are typically pre-trained on a large-scale language corpus, which can encode abundant common knowledge into model parameters. This common knowledge is helpful when BERT or GPT-2 is fine-tuned on other tasks.

The output of the encoder for each sentence, of either a question or an answer, is usually represented as a matrix, and each column or row of the matrix corresponds to a vector representation for a word in the sentence. Composition modules are used to generate vector representations for sentences from the corresponding matrices. Composition modules mainly include pooling and question-answer interaction mechanisms. Question-answer interaction mechanisms include attention (Tan et al. 2016b), attentive pooling (Santos et al. 2016), multihop attention (Tran and Niederée 2018) and so on. In general, question-answer interaction mechanisms have better performance than pooling. However, interaction mechanisms bring a problem: the vector representations of an answer are different with respect to different questions. When deep models with interaction mechanisms are deployed for online prediction, the representations of all answers need to be recalculated for each question.
This procedure is time-consuming for deep models with complex encoders like BERT, which usually have better accuracy than simple encoders. One possible solution is to store the matrix representation (with float or double values) of each answer in memory to avoid recalculation. But this will bring large memory cost.

In this paper, we propose a novel method, called hashing based answer selection (HAS), to tackle this problem. The main contributions of HAS are briefly outlined as follows:

• HAS adopts a hashing strategy to learn a binary matrix representation for each answer, which can dramatically reduce the memory cost for storing the matrix representations of answers. To the best of our knowledge, this is the first work to use hashing for memory reduction in answer selection.

• By storing the (binary) matrix representations of answers in memory, HAS can avoid recalculation of answer representations during online prediction. Consequently, HAS can adopt complex encoders like BERT in the model, but the online prediction of HAS is still fast with a low memory cost.

• Experimental results on three popular answer selection datasets show that HAS can outperform existing models to achieve state-of-the-art performance.

Related Work

Answer Selection

Most early models for answer selection are shallow (non-deep) models, which usually use bag-of-words (BOW) (Yih et al. 2013), term frequency (Robertson et al. 1994), manually designed rules (Téllez-Valero et al. 2011), or syntactic trees (Wang and Manning 2010; Cui et al. 2005) as features. Different upper structures are designed for modeling the similarity of questions and answers based on these features. The main drawback of shallow models is the lack of semantic information when using only surface features. Deep models can capture more semantic information through distributed representations, which leads to better results than shallow models. Early deep models use pooling (Feng et al. 2015) as the composition module to get vector representations for sentences from the encoder outputs, which are represented as matrices. Pooling cannot model the interaction between questions and answers, and has been outperformed by new composition modules with question-answer interaction mechanisms. Attention (Bahdanau, Cho, and Bengio 2015) can generate a better representation of answers (Tan et al. 2016b) than pooling, by introducing the information flow between questions and answers into models. (Santos et al. 2016) proposes attentive pooling for bidirectional attention. (Tran and Niederée 2018) proposes a strategy of multihop attention which captures the complex relations between question-answer pairs. (Wan et al. 2016) focuses on the word-by-word similarity between questions and answers. (Wang, Liu, and Zhao 2016) and (Chen et al. 2018) propose inner attention, which introduces the representation of the question to the answer encoder through gates. (Tay, Tuan, and Hui 2018a) designs a cross temporal recurrent cell to model the interaction between questions and answers.

BERT and Transfer Learning

To tackle the problem of insufficient background information and knowledge in answer selection, some methods introduce extra knowledge from other data. (Deng et al. 2018; Min, Seo, and Hajishirzi 2017; Wiese, Weissenborn, and Neves 2017) employ supervised transfer learning frameworks to pre-train a model on a source dataset. There are also some unsupervised transfer learning techniques (Yu et al. 2018; Chung, Lee, and Glass 2018).
BERT (Devlin et al. 2018) is a recently proposed model for language understanding. By training on a large language corpus, abundant common knowledge and linguistic phenomena can be encoded into its parameters. As a result, BERT can be transferred to a wide range of NLP tasks and has shown promising results.

Hashing

Hashing (Li, Wang, and Kang 2016) tries to learn binary codes for data representations. Based on the binary codes, hashing can be used to speed up retrieval and reduce memory cost. In this paper, we adopt hashing to reduce memory cost by learning binary matrix representations for answers. Many hashing techniques for learning binary representations have been proposed (Li, Wang, and Kang 2016; Cao et al. 2017; Hubara et al. 2016; Fan et al. 2019). To the best of our knowledge, no existing work uses hashing for memory reduction in answer selection.

Hashing based Answer Selection

In this section, we present the details of hashing based answer selection (HAS), which can be used to solve the problem faced by existing deep models with question-answer interaction mechanisms.

The framework of most existing deep models is shown in Figure 1(a). Compared with this framework, HAS has an additional hashing layer, which is shown in Figure 1(b). More specifically, HAS consists of an embedding layer, an encoding layer, a hashing layer, a composition layer and a similarity layer. With different choices of encoders (encoding layer) and composition modules (composition layer), several different models can be constructed within HAS. Hence, HAS provides a flexible framework for modeling.

Embedding Layer and Encoding Layer

HAS is designed for modeling the similarity of question-answer pairs. Hence, the inputs to HAS are two sequences of words, corresponding to the question text and the answer text respectively. Firstly, these sequences of words are represented by word embeddings through a word embedding layer. Suppose the dimension of word embedding is E, and the sequence length is L. The embeddings of question q and answer a are represented by matrices $Q_q \in \mathbb{R}^{E \times L}$ and $A_a \in \mathbb{R}^{E \times L}$ respectively. We use the same sequence length L for simplicity. Then, these two embedding matrices $Q_q$ and $A_a$ are fed into an encoding layer to get the contextual word representations. Different choices of embedding layers and encoders can be adopted in HAS. Here, we directly use the embedding layer and encoding layer of BERT to utilize the common knowledge and linguistic phenomena encoded in BERT. Hence, the formulation of the encoding layer is as follows:

$$U_q = \mathrm{BERT}(Q_q), \qquad V_a = \mathrm{BERT}(A_a),$$

where $U_q, V_a \in \mathbb{R}^{D \times L}$ are the contextual semantic features of words extracted by BERT for question q and answer a respectively, and D is the output dimension of BERT.
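To make this concrete, the contextual features $U_q$ and $V_a$ can be extracted as in the following sketch. This is an illustrative sketch rather than the implementation used in our experiments; it assumes a recent version of the HuggingFace transformers library and the bert-base-uncased checkpoint:

```python
import torch
from transformers import BertModel, BertTokenizer

# bert-base gives D = 768 hidden dimensions per token.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.eval()

L = 200  # fixed sequence length, as used later in the paper

def encode(text: str) -> torch.Tensor:
    """Return a D x L matrix of contextual token features."""
    inputs = tokenizer(text, return_tensors="pt",
                       padding="max_length", truncation=True, max_length=L)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # last_hidden_state has shape (1, L, D); transpose to D x L.
    return outputs.last_hidden_state[0].transpose(0, 1)

V_a = encode("An answer sentence from the candidate pool.")
print(V_a.shape)  # torch.Size([768, 200])
```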
[Figure 1: (a) Framework of traditional deep models for answer selection; (b) Framework of HAS.]

Hashing Layer

The outputs of the encoding layer for question q and answer a are $U_q$ and $V_a$, which are two real-valued (float or double) matrices. When deep models with question-answer interaction mechanisms store the output of the encoding layer ($V_a$) in memory to avoid recalculation, they will meet the high memory cost problem. For example, if we take float values for $V_a$, the memory cost for only one answer is over 600 KB when L = 200 and D = 768. Here, D = 768 is the output dimension of BERT. If the number of answers in the candidate set is large, excessive memory cost will lead to impracticability, especially on mobile or embedded devices.

In this paper, we adopt hashing to reduce memory cost by learning binary matrix representations for answers. More specifically, we take the sign function $y = \mathrm{sgn}(x)$ to binarize the output of the encoding layer. But the gradient of the sign function is zero for all nonzero inputs, which leads to a problem that the gradients cannot back-propagate correctly. $y = \tanh(x)$ is a commonly used approximation of $y = \mathrm{sgn}(x)$, which can make the training process end-to-end with back-propagation (BP). Here, we use a more flexible variant $y = \tanh(\beta x)$ with a hyper-parameter $\beta \geq 1$. The derivative of $y = \tanh(\beta x)$ is

$$\frac{\partial y}{\partial x} = \beta (1 - y^2).$$

Using this function, the formulation of the hashing layer is as follows:

$$B_a = \tanh(\beta V_a), \qquad (1)$$

where $B_a \in \mathbb{R}^{D \times L}$ is the output of the hashing layer.

To make sure that the elements in $B_a$ concentrate to binary values $\mathcal{B} = \{\pm 1\}$, we add an extra constraint for this layer as that in (Li, Wang, and Kang 2016):

$$\mathcal{J}^c(a) = \|B_a - \bar{B}_a\|_F^2, \qquad (2)$$

where $\bar{B}_a \in \mathcal{B}^{D \times L}$ is the binary matrix representation for answer a, and $\|\cdot\|_F$ is the Frobenius norm of a matrix. Here, $\bar{B}_a$ is also a parameter to learn in the HAS model.

When the learned model is deployed for online prediction, the learned binary matrices of answers are stored in memory to avoid recalculation. With binary representations, each element in the matrices costs only one bit of memory. Hence, the memory cost can be dramatically reduced.
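A minimal PyTorch sketch of the hashing layer under the definitions above (an illustration, not the exact implementation used in our experiments):

```python
import torch

def hashing_layer(V_a: torch.Tensor, beta: float = 5.0) -> torch.Tensor:
    """Soft binarization used during training: B_a = tanh(beta * V_a), Eq. (1)."""
    return torch.tanh(beta * V_a)

def binarize(B_a: torch.Tensor) -> torch.Tensor:
    """Hard binarization for storage and prediction: elements in {-1, +1}."""
    return torch.where(B_a >= 0, torch.ones_like(B_a), -torch.ones_like(B_a))

def binary_constraint(B_a: torch.Tensor, B_bar: torch.Tensor) -> torch.Tensor:
    """J^c(a) = ||B_a - B_bar||_F^2, Eq. (2), pulling B_a toward binary values."""
    return torch.sum((B_a - B_bar) ** 2)

# Memory arithmetic from the text: with L = 200 and D = 768, a float matrix
# costs 200 * 768 * 4 = 614,400 bytes (the "over 600 KB" above), while the
# binary matrix costs 200 * 768 / 8 = 19,200 bytes, a 32x reduction.
V_a = torch.randn(768, 200)     # D x L encoder output (random stand-in for BERT)
B_a = hashing_layer(V_a)
B_bar = binarize(B_a.detach())  # the alternating update sets B_bar = sgn(B_a)
loss_c = binary_constraint(B_a, B_bar)
```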
Composition Layer

The outputs of the encoding layer and the hashing layer are matrices of size D × L. Composition layers are used to compose these matrix representations into vectors. Pooling, attention (Tan et al. 2016b), attentive pooling (Santos et al. 2016) and other interaction mechanisms (Tran and Niederée 2018; Wan et al. 2016) can be adopted in HAS. Interaction based modules usually have better performance than pooling based modules, which have no question-answer interaction. Here, we take attention as an example to illustrate the advantage of HAS. More specifically, we adopt pooling for composing matrix representations of questions into question vectors, and adopt attention for composing matrix representations of answers into answer vectors. The formulation of the composition layer is as follows:

$$u_q = \mathrm{max\text{-}pooling}(U_q),$$
$$v_a^{(q)} = \mathrm{attention}(B_a, u_q) = \sum_{i=1}^{L} \alpha_i \cdot b_i^{(a)},$$
$$\alpha_i \propto \exp\big(m^\top \cdot \tanh(W_1 \cdot b_i^{(a)} + W_2 \cdot u_q)\big),$$

where $u_q, v_a^{(q)} \in \mathbb{R}^D$ are the composed vectors of the question and the answer respectively, $b_i^{(a)}$ is the i-th word representation in $B_a = [b_1^{(a)}, \dots, b_L^{(a)}]$, $\alpha_i$ is the attention weight for the i-th word, which is calculated by a softmax function, and $W_1, W_2 \in \mathbb{R}^{M \times D}$, $m \in \mathbb{R}^M$ are attention parameters with M being the hidden size of the attention.

The above formulation is for training. During the test procedure, we just need to replace $B_a$ by $\bar{B}_a$.
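The pooling and attention steps can be sketched as follows (again an illustrative sketch; the randomly initialized W1, W2 and m stand in for learned attention parameters):

```python
import torch

D, L, M = 768, 200, 768  # feature dim, sequence length, attention hidden size

W1 = torch.randn(M, D) * 0.01  # attention parameters (randomly initialized here)
W2 = torch.randn(M, D) * 0.01
m = torch.randn(M) * 0.01

def compose_question(U_q: torch.Tensor) -> torch.Tensor:
    """u_q = max-pooling over the L word positions of U_q (D x L) -> (D,)."""
    return U_q.max(dim=1).values

def compose_answer(B_a: torch.Tensor, u_q: torch.Tensor) -> torch.Tensor:
    """v_a^(q) = sum_i alpha_i * b_i^(a), with softmax attention weights."""
    # scores_i = m^T tanh(W1 b_i + W2 u_q), computed for all i at once.
    scores = m @ torch.tanh(W1 @ B_a + (W2 @ u_q).unsqueeze(1))  # shape (L,)
    alpha = torch.softmax(scores, dim=0)
    return B_a @ alpha  # (D,)

U_q = torch.randn(D, L)
B_a = torch.randn(D, L).sign()  # binary answer representation, as at test time
v_aq = compose_answer(B_a, compose_question(U_q))
```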
Similarity Layer and Loss Function

The similarity layer measures the similarity between question-answer pairs based on their vector representations $u_q$ and $v_a^{(q)}$. Here, we choose the cosine function as the similarity function, which is usually adopted in answer selection tasks:

$$s(q, a) = \cos(u_q, v_a^{(q)}),$$

where $s(q, a) \in \mathbb{R}$ is the similarity between question q and answer a.

Based on the similarity between questions and answers, we can define the loss function. The most commonly used loss function for ranking is the triplet-based hinge loss (Tan et al. 2016b; Tran and Niederée 2018). By combining the hinge loss and the binary constraint in hashing together, we get the following optimization problem:

$$\min_{\theta, \bar{B}_\star} \mathcal{J} = \sum_{(q,p,n)} \left[ \mathcal{J}^m(q,p,n) + \delta \cdot \mathcal{J}^c(p) + \delta \cdot \mathcal{J}^c(n) \right]$$
$$= \sum_{(q,p,n)} \left[ \max(0,\; 0.1 - s(q,p) + s(q,n)) + \delta \cdot \|B_p - \bar{B}_p\|_F^2 + \delta \cdot \|B_n - \bar{B}_n\|_F^2 \right],$$

where $\mathcal{J}^m(q,p,n) = \max(0,\; 0.1 - s(q,p) + s(q,n))$ is the hinge loss for a triplet (q, p, n) from the training set, p is a positive answer corresponding to q, n is a randomly selected negative answer, and $\delta$ is the coefficient of the binary constraints $\mathcal{J}^c(p)$ and $\mathcal{J}^c(n)$ for the positive answer p and the negative answer n respectively. $\bar{B}_\star$ denotes the set of binary matrix representations of all answers, and $\theta$ denotes the parameters of HAS except $\bar{B}_\star$.

These two sets of parameters $\theta$ and $\bar{B}_\star$ can be optimized alternately (Li, Wang, and Kang 2016). More specifically, $\bar{B}_a \in \bar{B}_\star$ corresponding to answer a can be optimized as follows when $\theta$ is fixed:

$$\bar{B}_a = \mathrm{sgn}(B_a).$$

And $\theta$ can be updated by back propagation (BP) when $\bar{B}_\star$ is fixed.
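One alternating training step under this objective can be sketched as follows. This is illustrative only; encode_question, encode_answer and compose are assumed stand-ins for the BERT-plus-hashing encoder and the attention composition described earlier:

```python
import torch
import torch.nn.functional as F

margin, delta = 0.1, 1e-6  # delta in the stable range found in the sensitivity analysis

def train_step(model, optimizer, q, p, n):
    """One alternating step: first B_bar = sgn(B) with theta fixed, then BP on theta."""
    u_q = model.encode_question(q)   # (D,) pooled question vector
    B_p = model.encode_answer(p)     # (D, L) soft codes tanh(beta * V_p)
    B_n = model.encode_answer(n)

    # Step 1: with theta fixed, the optimal binary codes are the signs.
    B_bar_p = B_p.detach().sign()
    B_bar_n = B_n.detach().sign()

    # Step 2: with B_bar fixed, update theta by back-propagation.
    s_pos = F.cosine_similarity(u_q, model.compose(B_p, u_q), dim=0)
    s_neg = F.cosine_similarity(u_q, model.compose(B_n, u_q), dim=0)
    hinge = torch.clamp(margin - s_pos + s_neg, min=0.0)
    loss = hinge + delta * ((B_p - B_bar_p) ** 2).sum() \
                 + delta * ((B_n - B_bar_n) ** 2).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```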
Experiment

Datasets

We evaluate HAS on three popular answer selection datasets. The statistics of the datasets are presented in Table 1.

insuranceQA (Feng et al. 2015) is a FAQ dataset from the insurance domain. We use the first version of this dataset, which has been widely used in existing works (Tan et al. 2016b; Wang, Liu, and Zhao 2016; Tan et al. 2016a; Deng et al. 2018; Tran and Niederée 2018). This dataset has already been partitioned into four subsets: Train, Dev, Test1 and Test2. The total number of candidate answers is 24,981. To reduce the complexity, the dataset provides a candidate set of 500 answers for each question, including positive and negative answers. There is more than one positive answer to some questions. As in existing works (Feng et al. 2015; Tran and Niederée 2018; Deng et al. 2018), we adopt Precision@1 (P@1) as the evaluation metric.

Table 1: Statistics of the datasets. "#questions" and "#C.A." denote the number of questions and candidate answers respectively.

                      insuranceQA   yahooQA   wikiQA
#questions (Train)    12887         50112     873
#questions (Dev)      1000          6289      126
#questions (Test1)    1800          6283      243
#questions (Test2)    1800          —         —
#C.A. per question    500           5         9

yahooQA¹ is a large CQA corpus collected from Yahoo! Answers. We adopt the same dataset splits as those in (Tay et al. 2017; Tay, Tuan, and Hui 2018a; Deng et al. 2018) for fair comparison. Questions and answers are filtered by their length, and only sentences with length in the range of 5-50 are preserved. The number of candidate answers for each question is five, in which only one answer is positive. The other four negative answers are sampled from the top 1000 hits using Lucene search for each question. As in existing works (Tay et al. 2017; Tay, Tuan, and Hui 2018a; Deng et al. 2018), P@1 and Mean Reciprocal Rank (MRR) are adopted as evaluation metrics.

wikiQA (Yang, Yih, and Meek 2015) is a benchmark for open-domain answer selection. The questions of wikiQA are factual questions collected from Bing search logs. Each question is linked to a Wikipedia page, and the sentences in the summary section are collected as the candidate answers. The size of the candidate answer set differs across questions, and there may be more than one positive answer to some questions. We filter out the questions which have no positive answers, as in previous works (Yang, Yih, and Meek 2015; Deng et al. 2018; Wang, Liu, and Zhao 2016). Mean Average Precision (MAP) and MRR are adopted as evaluation metrics as in existing works.

Hyperparameters and Baselines

We use base BERT as the encoder in our experiments. Large BERT may have better performance, but the encoding layer is not the focus of this paper. More specifically, the embedding size E and output dimension D of BERT are 768. The dropout probability is 0.1. The weight decay coefficient is 0.01. The batch size is 64 for yahooQA, and 32 for insuranceQA and wikiQA. The attention hidden size M is 768 for insuranceQA, and 128 for yahooQA and wikiQA. The learning rate is 5e-6 for all models. The numbers of training epochs are 60 for insuranceQA, 18 for wikiQA and 9 for yahooQA. More epochs cannot bring apparent performance gain on the validation set. We evaluate all models on the validation set after each epoch and choose the parameters which achieve the best results on the validation set for the final test. All reported results are the average of five runs.

There are also two other important hyper-parameters, β in tanh(βx) and the coefficient δ of the binary constraint. β is tuned among {1, 2, 5, 10, 20}, and δ is tuned among {0, 1e-7, 1e-6, 1e-5, 1e-4}.

¹ https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&guccounter=1
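For reference, the setup above can be collected into a single configuration sketch (the dictionary layout is illustrative; all values are from the preceding paragraphs):

```python
# Training configuration as reported above, grouped per dataset.
CONFIG = {
    "encoder": "bert-base (E = D = 768)",
    "dropout": 0.1,
    "weight_decay": 0.01,
    "learning_rate": 5e-6,
    "batch_size": {"yahooQA": 64, "insuranceQA": 32, "wikiQA": 32},
    "attention_hidden_size_M": {"insuranceQA": 768, "yahooQA": 128, "wikiQA": 128},
    "epochs": {"insuranceQA": 60, "wikiQA": 18, "yahooQA": 9},
    "beta_grid": [1, 2, 5, 10, 20],
    "delta_grid": [0, 1e-7, 1e-6, 1e-5, 1e-4],
}
```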
The state-of-the-art baselines on the three datasets are different. Hence, we adopt different baselines for comparison on different datasets according to previous works. Baselines using a single model without extra knowledge include: CNN, CNN with GESD (Feng et al. 2015), QA-LSTM (Tan et al. 2016b), AP-LSTM (Tran and Niederée 2018), Multihop-Sequential-LSTM (Tran and Niederée 2018), IARNN-GATE (Wang, Liu, and Zhao 2016), NTN-LSTM, HD-LSTM (Tay et al. 2017), HyperQA (Tay, Tuan, and Hui 2018b), AP-CNN (Santos et al. 2016), AP-BiLSTM (Santos et al. 2016), CTRN (Tay, Tuan, and Hui 2018a), CA-RNN (Chen et al. 2018), RNN-POA (Chen et al. 2017), MULT (Wang and Jiang 2017), and MV-FNN (Sha et al. 2018). Single models with external knowledge include: KAN (Deng et al. 2018). Ensemble models include: LRXNET (Narayan et al. 2018) and SUM_{BASE,PTK} (Tymoshenko and Moschitti 2018).

Because HAS adopts BERT as the encoder, we also construct two BERT-based baselines for comparison. BERT-pooling is a model in which both questions and answers are composed into vectors by pooling. BERT-attention is a model which adopts attention as the composition module. Both BERT-pooling and BERT-attention use BERT as the encoder, and hashing is not adopted in them.

Experimental Results

Results on insuranceQA

We compare HAS with baselines on the insuranceQA dataset. The results are shown in Table 2. MULT (Wang and Jiang 2017) and KAN (Deng et al. 2018) are two strong baselines which represent the state-of-the-art results on this dataset. Here, KAN adopts external knowledge for performance improvement. KAN (Tgt-Only) denotes the KAN variant without external knowledge. We can find that HAS outperforms all the baselines, which proves the effectiveness of HAS.

Table 2: Results on insuranceQA. The results of models marked with ★ are reported from (Tran and Niederée 2018). Other results, marked with ◇, are reported from their original papers. P@1 is adopted as the evaluation metric by following previous works. "our impl." denotes our implementation.

Model                          P@1 (Test1)   P@1 (Test2)
CNN ★                          62.80         59.20
CNN with GESD ★                65.30         61.00
QA-LSTM (our impl.)            66.08         62.63
AP-LSTM ★                      69.00         64.80
IARNN-GATE ★                   70.10         62.80
Multihop-Sequential-LSTM ★     70.50         66.90
AP-CNN ◇                       69.80         66.30
AP-BiLSTM ◇                    71.70         66.40
MULT ◇                         75.20         73.40
KAN (Tgt-Only) ◇               71.50         68.80
KAN ◇                          75.20         72.50
HAS                            76.38         73.71

Table 3: Results on yahooQA. The results of models marked with ★ are reported from (Tay, Tuan, and Hui 2018a). Other results, marked with ◇, are reported from their original papers. P@1 and MRR are adopted as evaluation metrics by following previous works.

Model              P@1     MRR
Random Guess       20.00   45.86
NTN-LSTM ★         54.50   73.10
HD-LSTM ★          55.70   73.50
AP-CNN ★           56.00   72.60
AP-BiLSTM ★        56.80   73.10
CTRN ★             60.10   75.50
HyperQA ◇          68.30   80.10
KAN (Tgt-Only) ◇   67.20   80.30
KAN ◇              74.40   84.00
HAS                73.89   82.10

Table 4: Results on wikiQA. The results marked with ◇ are reported from their original papers. MAP and MRR are adopted as evaluation metrics by following previous works.

Model                          MAP     MRR
AP-CNN ◇                       68.86   69.57
AP-BiLSTM ◇                    67.05   68.42
RNN-POA ◇                      72.12   73.12
Multihop-Sequential-LSTM ◇     72.20   73.80
IARNN-GATE ◇                   72.58   73.94
CA-RNN ◇                       73.58   74.50
MULT ◇                         74.33   75.45
MV-FNN ◇                       74.62   75.76
SUM_{BASE,PTK} ◇               75.59   77.00
LRXNET ◇                       76.57   75.10
HAS                            81.01   82.22

Results on yahooQA

We also evaluate HAS and the baselines on yahooQA. Table 3 shows the results. KAN (Deng et al. 2018), which utilizes external knowledge, is the state-of-the-art model on this dataset. HAS outperforms all baselines except KAN.
The performance gain of KAN mainly owes to the external knowledge obtained by pre-training on a source QA dataset, SQuAD-T. Please note that HAS does not adopt an external QA dataset for pre-training. HAS can outperform the target-only version of KAN, denoted as KAN (Tgt-Only), which is only trained on yahooQA without SQuAD-T. Once again, the result on yahooQA verifies the effectiveness of HAS.

Results on wikiQA

Table 4 shows the results on the wikiQA dataset. SUM_{BASE,PTK} (Tymoshenko and Moschitti 2018) and LRXNET (Narayan et al. 2018) are two ensemble models which represent the state-of-the-art results on this dataset. HAS outperforms all the baselines again, which further proves the effectiveness of HAS.

Comparison with BERT-based Models

We compare HAS with BERT-pooling and BERT-attention on the three datasets. As shown in Table 5, BERT-attention and HAS outperform BERT-pooling on all three datasets, which verifies that question-answer interaction mechanisms have better performance than pooling. Furthermore, we can find that
HAS can achieve comparable accuracy as BERT-attention. But BERT-attention has either a speed (time cost) problem or a memory cost problem, which will be shown in the following subsection.

Table 5: Comparison with BERT-based models.

                   insuranceQA                yahooQA          wikiQA
Model              P@1 (Test1)  P@1 (Test2)   P@1     MRR      MAP     MRR
BERT-pooling       74.52        71.97         73.49   81.93    77.22   78.27
BERT-attention     76.12        74.12         74.78   82.68    80.65   81.83
HAS                76.38        73.71         73.89   82.10    81.01   82.22

We also find that HAS can improve the results of BERT-attention on insuranceQA and wikiQA. One reason might be that hashing can act as a regularization (the representations are constrained to be binary) for feature representation learning, and hence reduce the model complexity and increase the generalization ability when the model already has enough capacity. The wikiQA dataset is a relatively small dataset on which deep models are easy to overfit. HAS outperforms BERT-attention on wikiQA in terms of both MAP and MRR, which is consistent with our view about generalization.

Time Cost and Memory Cost

To further prove the effectiveness of HAS, we compare HAS with baselines on insuranceQA in terms of time cost and memory cost when the model is deployed for prediction. The results are shown in Table 6. All experiments are run on a TitanXP GPU.

Table 6: Comparison of accuracy, time cost and memory cost on insuranceQA. Each question has 500 candidate answers. "Memory Cost" is the memory cost for storing representations of answers.

Model                       P@1 (Test1)  P@1 (Test2)  Time Cost per Question  Memory Cost
BERT-pooling                74.52        71.97        0.03s                   0.07 GB
BERT-attention (recal.)     76.12        74.12        4.19s                   0.02 GB
BERT-attention (store)      76.12        74.12        0.28s                   14.29 GB
HAS                         76.38        73.71        0.28s                   0.45 GB
Multihop-Sequential-LSTM    70.50        66.90        0.13s                   5.25 GB
AP-CNN                      69.80        66.30        —                       7.44 GB
AP-BiLSTM                   71.70        66.40        —                       5.25 GB

BERT-pooling can directly store the vector representations of answers with a low memory cost, so it has neither the time cost problem nor the memory cost problem. But the accuracy of BERT-pooling is much lower than that of BERT-attention and HAS. BERT-attention (recal.) denotes a BERT-attention variant which recalculates the matrix representations of answers for each question, and BERT-attention (store) denotes a BERT-attention variant which stores the matrix representations of answers in memory. BERT-attention (recal.) does not need to store the matrix representations of answers in memory, and BERT-attention (store) does not need recalculation. The time cost of BERT-attention (recal.) is 4.19 seconds per question, which is about 15 times slower than HAS. Although BERT-attention (store) has a time cost as low as that of HAS, its memory cost is 14.29 GB, which is about 32 times larger than that of HAS.

We also compare HAS with other baselines from existing works. KAN and MULT perform question-answer interaction before or during the encoding layer, and the outputs of the encoding layer for an answer are different for different questions. Thus, these two models cannot store representations for reuse. We compare HAS with Multihop-Sequential-LSTM, AP-CNN, and AP-BiLSTM. The memory costs of these three models are 5.25 GB, 7.44 GB, and 5.25 GB respectively, which are 11.75, 16.67, and 11.75 times larger than that of HAS. Other baselines are not adopted for comparison, but almost all baselines with question-answer interaction mechanisms have either the time cost problem or the memory cost problem, as in BERT-attention.
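As a sanity check (our own arithmetic, not stated elsewhere in the paper), the memory figures in Table 6 follow from the 24,981 candidate answers of insuranceQA with L = 200, D = 768, and $2^{30}$ bytes per GB:

$$24981 \times 200 \times 768 \times 4~\text{bytes} = 15{,}348{,}326{,}400~\text{bytes} \approx 14.29~\text{GB (float)},$$
$$24981 \times 200 \times 768 \times \tfrac{1}{8}~\text{bytes} = 479{,}635{,}200~\text{bytes} \approx 0.45~\text{GB (binary)},$$

which reproduces the 32-fold gap between BERT-attention (store) and HAS.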
We can find that HAS is fast with a low memory cost, which also gives HAS promising potential for embedded or mobile applications.

Sensitivity Analysis of δ and β

In this section, we study the sensitivity of the two important hyper-parameters in HAS: the coefficient δ of J_c(a) and the value of β in tanh(βx). We carry out the sensitivity study on insuranceQA and wikiQA. As shown in Figure 2(a) and Figure 3(a), the performance improves as β increases up to 5, and HAS is not sensitive to β in the range [5, 10]. When β is fixed to 5, the performance for different choices of δ is shown in Figure 2(b) and Figure 3(b); HAS is not sensitive to δ in the range [1e-7, 1e-5].

[Figure 2: Sensitivity analysis on insuranceQA. (a) Precision@1 on Test1 and Test2 versus β (0 to 20) with δ = 1e-6; (b) Precision@1 versus δ (0 to 1e-4) with β = 5.]

[Figure 3: Sensitivity analysis on wikiQA. (a) MAP and MRR versus β (0 to 20) with δ = 1e-6; (b) MAP and MRR versus δ (0 to 1e-4) with β = 5.]
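For intuition about these two hyper-parameters, the minimal sketch below shows one plausible way such a hashing layer behaves: tanh(βx) is a differentiable surrogate for sign(x) that hardens as β grows, and δ weights a binarization penalty added to the training loss. The exact form of J_c(a) here is our own assumption for illustration, not necessarily the paper's definition, and the layer shapes are hypothetical.

    import numpy as np

    def hash_layer(H, beta):
        # Soft binarization of an encoder output matrix H.
        # As beta grows, tanh(beta * x) approaches sign(x); at prediction
        # time the codes would be quantized to exactly {-1, +1} via np.sign.
        return np.tanh(beta * H)

    def quantization_penalty(B):
        # One plausible form of a binarization penalty J_c(a): it is zero
        # exactly when every entry of B is +1 or -1 (an assumption).
        return np.mean((np.abs(B) - 1.0) ** 2)

    rng = np.random.default_rng(0)
    H = rng.normal(size=(40, 768))  # hypothetical encoder output: 40 tokens x 768 dims

    for beta in (1, 5, 10, 20):
        B = hash_layer(H, beta)
        print("beta=%2d  penalty=%.4f" % (beta, quantization_penalty(B)))

    # The penalty shrinks rapidly once beta reaches about 5, matching the
    # observation that HAS is insensitive to beta in [5, 10]. During training
    # this penalty would enter the objective with weight delta, e.g.
    # total_loss = ranking_loss + delta * quantization_penalty(B).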
Conclusion

In this paper, we propose a novel answer selection method called hashing based answer selection (HAS). HAS adopts hashing to learn binary matrix representations for answers, which can dramatically reduce the memory cost for storing the matrix outputs of encoders in answer selection. When deployed for prediction, HAS is fast with a low memory cost. This is particularly meaningful when the model needs to be deployed on embedded or mobile systems. Experimental results on three popular datasets show that HAS can outperform existing methods to achieve state-of-the-art performance.

HAS is flexible to integrate other encoders and question-answer interaction mechanisms. Furthermore, the idea of adopting hashing for binary representation learning in HAS can also be applied to other NLP tasks. All these possible extensions will be pursued in our future work.

Acknowledgments

This work is supported by the NSFC-NRF Joint Research Project (No. 61861146001).

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.

Cao, Z.; Long, M.; Wang, J.; and Yu, P. S. 2017. HashNet: Deep learning to hash by continuation. In Proceedings of the IEEE International Conference on Computer Vision, 5609–5618.

Chen, Q.; Hu, Q.; Huang, J. X.; He, L.; and An, W. 2017. Enhancing recurrent neural networks with positional attention for question answering. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 993–996.

Chen, Q.; Hu, Q.; Huang, J. X.; and He, L. 2018. CA-RNN: Using context-aligned recurrent neural networks for modeling sentence similarity. In Proceedings of the AAAI Conference on Artificial Intelligence, 265–273.

Chung, Y.; Lee, H.; and Glass, J. R. 2018. Supervised and unsupervised transfer learning for question answering. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1585–1594.

Cui, H.; Sun, R.; Li, K.; Kan, M.; and Chua, T. 2005. Question answering passage retrieval using dependency relations. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 400–407.

Deng, Y.; Shen, Y.; Yang, M.; Li, Y.; Du, N.; Fan, W.; and Lei, K. 2018. Knowledge as a bridge: Improving cross-domain answer selection with external knowledge. In Proceedings of the International Conference on Computational Linguistics, 3295–3305.

Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Fan, L.; Jiang, Q.-Y.; Yu, Y.-Q.; and Li, W.-J. 2019. Deep hashing for speaker identification and retrieval. In Proceedings of the Annual Conference of the International Speech Communication Association, 2908–2912.

Feng, M.; Xiang, B.; Glass, M. R.; Wang, L.; and Zhou, B. 2015. Applying deep learning to answer selection: A study and an open task. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 813–820.

Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2016. Binarized neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, 4107–4115.

Li, W.-J.; Wang, S.; and Kang, W.-C. 2016. Feature learning based deep supervised hashing with pairwise labels. In Proceedings of the International Joint Conference on Artificial Intelligence, 1711–1717.

Min, S.; Seo, M. J.; and Hajishirzi, H. 2017. Question answering through transfer learning from large fine-grained supervision data. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 510–517.

Narayan, S.; Cardenas, R.; Papasarantopoulos, N.; Cohen, S. B.; Lapata, M.; Yu, J.; and Chang, Y. 2018. Document modeling with external attention for sentence extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2020–2030.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. Computing Research Repository.

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M.; and Gatford, M. 1994. Okapi at TREC-3. In Proceedings of the Text REtrieval Conference, 109–126.

Santos, C. d.; Tan, M.; Xiang, B.; and Zhou, B. 2016. Attentive pooling networks. arXiv preprint arXiv:1602.03609.

Sha, L.; Zhang, X.; Qian, F.; Chang, B.; and Sui, Z. 2018. A multi-view fusion neural network for answer selection. In Proceedings of the AAAI Conference on Artificial Intelligence, 5422–5429.

Tan, M.; dos Santos, C. N.; Xiang, B.; and Zhou, B. 2016a. Improved representation learning for question answer matching. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 464–473.

Tan, M.; Santos, C. d.; Xiang, B.; and Zhou, B. 2016b. LSTM-based deep learning models for non-factoid answer selection. In Proceedings of the International Conference on Learning Representations.

Tay, Y.; Phan, M. C.; Luu, A. T.; and Hui, S. C. 2017. Learning to rank question answer pairs with holographic dual LSTM architecture. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 695–704.

Tay, Y.; Tuan, L. A.; and Hui, S. C. 2018a. Cross temporal recurrent networks for ranking question answer pairs. In Proceedings of the AAAI Conference on Artificial Intelligence, 5512–5519.

Tay, Y.; Tuan, L. A.; and Hui, S. C. 2018b. Hyperbolic representation learning for fast and efficient neural question answering. In Proceedings of the ACM International Conference on Web Search and Data Mining, 583–591.

Téllez-Valero, A.; Montes-y-Gómez, M.; Pineda, L. V.; and Padilla, A. P. 2011. Learning to select the correct answer in multi-stream question answering. Information Processing and Management 47(6):856–869.

Tran, N. K., and Niederée, C. 2018. Multihop attention networks for question answer matching. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 325–334.

Tymoshenko, K., and Moschitti, A. 2018. Cross-pair text representations for answer sentence selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2162–2173.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems, 6000–6010.

Wan, S.; Lan, Y.; Guo, J.; Xu, J.; Pang, L.; and Cheng, X. 2016. A deep architecture for semantic matching with multiple positional sentence representations. In Proceedings of the AAAI Conference on Artificial Intelligence, 2835–2841.

Wang, S., and Jiang, J. 2017. A compare-aggregate model for matching text sequences. In Proceedings of the International Conference on Learning Representations.

Wang, M., and Manning, C. D. 2010. Probabilistic tree-edit models with structured latent variables for textual entailment and question answering. In Proceedings of the International Conference on Computational Linguistics, 1164–1172.

Wang, B.; Liu, K.; and Zhao, J. 2016. Inner attention based recurrent neural networks for answer selection. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1288–1297.

Wiese, G.; Weissenborn, D.; and Neves, M. L. 2017. Neural domain adaptation for biomedical question answering. In Proceedings of the Conference on Computational Natural Language Learning, 281–289.

Yang, Y.; Yih, W.; and Meek, C. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013–2018.

Yih, W.; Chang, M.; Meek, C.; and Pastusiak, A. 2013. Question answering using enhanced lexical semantic models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1744–1753.

Yu, J.; Qiu, M.; Jiang, J.; Huang, J.; Song, S.; Chu, W.; and Chen, H. 2018. Modelling domain relationships for transfer learning on retrieval-based question answering systems in E-commerce. In Proceedings of the ACM International Conference on Web Search and Data Mining, 682–690.