Table 2: Results on insuranceQA. The results of models marked with † are reported from (Tran and Niederée 2018); other results, marked with ‡, are reported from their original papers. P@1 is adopted as the evaluation metric, following previous works. 'our impl.' denotes our implementation.

  Model                         P@1 (Test1)   P@1 (Test2)
  CNN†                          62.80         59.20
  CNN with GESD†                65.30         61.00
  QA-LSTM (our impl.)           66.08         62.63
  AP-LSTM†                      69.00         64.80
  IARNN-GATE†                   70.10         62.80
  Multihop-Sequential-LSTM†     70.50         66.90
  AP-CNN‡                       69.80         66.30
  AP-BiLSTM‡                    71.70         66.40
  MULT‡                         75.20         73.40
  KAN (Tgt-Only)‡               71.50         68.80
  KAN‡                          75.20         72.50
  HAS                           76.38         73.71

Table 3: Results on yahooQA. The results of models marked with † are reported from (Tay, Tuan, and Hui 2018a); other results, marked with ‡, are reported from their original papers. P@1 and MRR are adopted as evaluation metrics, following previous works.

  Model                P@1     MRR
  Random Guess         20.00   45.86
  NTN-LSTM†            54.50   73.10
  HD-LSTM†             55.70   73.50
  AP-CNN†              56.00   72.60
  AP-BiLSTM†           56.80   73.10
  CTRN†                60.10   75.50
  HyperQA‡             68.30   80.10
  KAN (Tgt-Only)‡      67.20   80.30
  KAN‡                 74.40   84.00
  HAS                  73.89   82.10

Table 4: Results on wikiQA. The results marked with ‡ are reported from their original papers. MAP and MRR are adopted as evaluation metrics, following previous works.

  Model                         MAP     MRR
  AP-CNN‡                       68.86   69.57
  AP-BiLSTM‡                    67.05   68.42
  RNN-POA‡                      72.12   73.12
  Multihop-Sequential-LSTM‡     72.20   73.80
  IARNN-GATE‡                   72.58   73.94
  CA-RNN‡                       73.58   74.50
  MULT‡                         74.33   75.45
  MV-FNN‡                       74.62   75.76
  SUMBASE,PTK‡                  75.59   77.00
  LRXNET‡                       76.57   75.10
  HAS                           81.01   82.22

The state-of-the-art baselines on the three datasets are different. Hence, we adopt different baselines for comparison on different datasets according to previous works. Baselines using a single model without extra knowledge include: CNN, CNN with GESD (Feng et al. 2015), QA-LSTM (Tan et al. 2016b), AP-LSTM (Tran and Niederée 2018), Multihop-Sequential-LSTM (Tran and Niederée 2018), IARNN-GATE (Wang, Liu, and Zhao 2016), NTN-LSTM, HD-LSTM (Tay et al. 2017), HyperQA (Tay, Tuan, and Hui 2018b), AP-CNN (Santos et al. 2016), AP-BiLSTM (Santos et al. 2016), CTRN (Tay, Tuan, and Hui 2018a), CA-RNN (Chen et al. 2018), RNN-POA (Chen et al. 2017), MULT (Wang and Jiang 2017), MV-FNN (Sha et al. 2018). Single models with external knowledge include: KAN (Deng et al. 2018). Ensemble models include: LRXNET (Narayan et al. 2018), SUMBASE,PTK (Tymoshenko and Moschitti 2018).

Because HAS adopts BERT as the encoder, we also construct two BERT-based baselines for comparison. BERT-pooling is a model in which both questions and answers are composed into vectors by pooling. BERT-attention is a model which adopts attention as the composition module. Both BERT-pooling and BERT-attention use BERT as the encoder, and hashing is not adopted in them.
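The paper does not spell out the exact composition details of these two baselines, so the following is only a minimal sketch of what the two strategies could look like, using PyTorch and the Hugging Face transformers library; the masked mean pooling and the dot-product cross-attention are our assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class BertPooling(nn.Module):
    """BERT-pooling sketch: compose a text into one vector by pooling.
    Mean pooling is an assumption; the paper only says 'pooling'."""

    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)

    def forward(self, ids, mask):
        # token embeddings: (batch, seq_len, hidden)
        h = self.encoder(input_ids=ids, attention_mask=mask).last_hidden_state
        m = mask.unsqueeze(-1).float()
        return (h * m).sum(1) / m.sum(1)  # average over non-padding tokens


class BertAttention(nn.Module):
    """BERT-attention sketch: compose the answer with attention over
    question tokens, i.e., a question-answer interaction mechanism."""

    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)

    def forward(self, q_ids, q_mask, a_ids, a_mask):
        q = self.encoder(input_ids=q_ids, attention_mask=q_mask).last_hidden_state
        a = self.encoder(input_ids=a_ids, attention_mask=a_mask).last_hidden_state
        # each answer token attends to all question tokens
        scores = torch.bmm(a, q.transpose(1, 2))               # (B, La, Lq)
        scores = scores.masked_fill(q_mask.unsqueeze(1) == 0, -1e9)
        ctx = torch.bmm(scores.softmax(dim=-1), q)             # question-aware answer tokens
        m = a_mask.unsqueeze(-1).float()
        return ((a + ctx) * m).sum(1) / m.sum(1)
```

In both variants the resulting question and answer vectors would then be scored against each other, e.g., with cosine similarity, as is common in answer selection; neither variant uses the hashing layer of HAS.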
Experimental Results

Results on insuranceQA. We compare HAS with baselines on the insuranceQA dataset. The results are shown in Table 2. MULT (Wang and Jiang 2017) and KAN (Deng et al. 2018) are two strong baselines which represent the state-of-the-art results on this dataset. Here, KAN adopts external knowledge for performance improvement. KAN (Tgt-Only) denotes the KAN variant without external knowledge. We can find that HAS outperforms all the baselines, which proves the effectiveness of HAS.
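For reference, the ranking metrics used in Tables 2-4 are standard: P@1 is the fraction of questions whose top-ranked candidate is a correct answer, MRR is the mean reciprocal rank of the first correct answer, and MAP is the mean of per-question average precision. Below is a minimal sketch of these metrics (our illustration, not the authors' evaluation code), assuming each question's binary relevance labels are already sorted by descending model score.

```python
# ranked_labels: one list per question of binary relevance labels
# (1 = correct answer), sorted by descending model score.

def p_at_1(ranked_labels):
    # fraction of questions whose top-ranked candidate is correct
    return sum(labels[0] for labels in ranked_labels) / len(ranked_labels)

def mrr(ranked_labels):
    # mean reciprocal rank of the first correct answer
    total = 0.0
    for labels in ranked_labels:
        for rank, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_labels)

def mean_average_precision(ranked_labels):
    # mean over questions of average precision at each correct answer
    total = 0.0
    for labels in ranked_labels:
        hits, ap = 0, 0.0
        for rank, rel in enumerate(labels, start=1):
            if rel:
                hits += 1
                ap += hits / rank
        total += ap / max(hits, 1)
    return total / len(ranked_labels)

# e.g., two questions: one ranks the correct answer 2nd, the other 1st
print(p_at_1([[0, 1, 0], [1, 0]]))  # 0.5
print(mrr([[0, 1, 0], [1, 0]]))     # (1/2 + 1) / 2 = 0.75
```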
Results on yahooQA. We also evaluate HAS and the baselines on yahooQA. Table 3 shows the results. KAN (Deng et al. 2018), which utilizes external knowledge, is the state-of-the-art model on this dataset. HAS outperforms all baselines except KAN. The performance gain of KAN is mainly owed to external knowledge, obtained by pre-training on a source QA dataset, SQuAD-T. Please note that HAS does not adopt an external QA dataset for pre-training. HAS can outperform the target-only version of KAN, denoted as KAN (Tgt-Only), which is trained only on yahooQA without SQuAD-T. Once again, the result on yahooQA verifies the effectiveness of HAS.

Results on wikiQA. Table 4 shows the results on the wikiQA dataset. SUMBASE,PTK (Tymoshenko and Moschitti 2018) and LRXNET (Narayan et al. 2018) are two ensemble models which represent the state-of-the-art results on this dataset. HAS outperforms all the baselines again, which further proves the effectiveness of our HAS.

Comparison with BERT-based Models. We compare HAS with BERT-pooling and BERT-attention on the three datasets. As shown in Table 5, BERT-attention and HAS outperform BERT-pooling on all three datasets, which verifies that question-answer interaction mechanisms achieve better performance than pooling. Furthermore, we can find that