Table 3: Comparisons with other approaches on the CSL dataset under the signer-independent test (Split I) and the unseen-sentences test (Split II). A dash denotes a score not reported.

Model                | Split I: ROUGE  BLEU-1  BLEU-2  BLEU-3  BLEU-4 | Split II: ROUGE  BLEU-1  BLEU-2  BLEU-3  BLEU-4
S2VT [39]            |          0.904  0.902   0.886   0.879   0.874  |           0.461  0.466   0.258   0.135   -
S2VT (3-layer) [39]  |          0.911  0.911   0.896   0.889   0.884  |           0.465  0.475   0.265   0.145   -
HLSTM [15]           |          0.944  0.942   0.932   0.927   0.922  |           0.481  0.487   0.315   0.195   -
HLSTM-attn [15]      |          0.951  0.948   0.938   0.933   0.928  |           0.503  0.508   0.330   0.207   -
HRF-Fusion [14]      |          0.994  0.993   0.992   0.991   0.990  |           0.449  0.450   0.238   0.127   -
SANet                |          0.996  0.994   0.993   0.992   0.990  |           0.681  0.697   0.411   0.268   0.181

Table 4: Comparison with other approaches on the RWTH-PHOENIX-Weather 2014T dataset. A dash denotes a score not reported.

Model                          | PHOENIX14T DEV: ROUGE  BLEU-1  BLEU-2  BLEU-3  BLEU-4 | PHOENIX14T TEST: ROUGE  BLEU-1  BLEU-2  BLEU-3  BLEU-4
TSPNet [22]                    |                 -      -       -       -       -      |                  0.349  0.361   0.231   0.169   0.134
H+M+P [6]                      |                 0.459  -       -       -       0.195  |                  0.436  -       -       -       0.183
Sign2Gloss→Gloss2Text [5]      |                 0.438  0.411   0.291   0.221   0.179  |                  0.435  0.415   0.295   0.222   0.178
Sign2Gloss2Text [5]            |                 0.441  0.429   0.303   0.230   0.184  |                  0.438  0.433   0.304   0.228   0.181
(Gloss)Sign2Text [7]           |                 -      0.455   0.326   0.253   0.207  |                  -      0.453   0.323   0.248   0.202
(Gloss)Sign2(Gloss+Text) [7]   |                 -      0.473   0.344   0.271   0.224  |                  -      0.466   0.337   0.262   0.213
DeepHand [25]                  |                 -      -       -       -       -      |                  0.381  0.385   0.256   0.186   0.146
SANet                          |                 0.542  0.566   0.415   0.312   0.235  |                  0.548  0.573   0.424   0.322   0.248

previous encoder-decoder architecture, but it is trained jointly. (5) (Gloss)Sign2Text [7] utilizes a transformer-based architecture for SLT. (6) (Gloss)Sign2(Gloss+Text) [7] also utilizes a transformer-based architecture, but jointly learns CLSR and SLT at the same time. (7) DeepHand [25] adopts an encoder-decoder architecture and introduces a pretrained hand shape recognizer for SLT.

As shown in Table 4, we report the performance of each approach on both the validation set (i.e., 'DEV') and the test set (i.e., 'TEST'). On both the 'DEV' and 'TEST' sets, the existing approaches achieve relatively low performance: the best ROUGE and BLEU-4 scores achieved by any existing approach are below 46% and 23%, respectively. This may be due to the high diversity, large vocabulary, and limited number of training samples in the PHOENIX14T dataset. In contrast, our proposed SANet achieves the best performance, e.g., a ROUGE score of 54.2% and a BLEU-4 score of 23.5% on the 'DEV' set, and a ROUGE score of 54.8% and a BLEU-4 score of 24.8% on the 'TEST' set, outperforming all existing approaches.
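As a side note on how such scores are obtained, the sketch below illustrates sentence-level BLEU-1 to BLEU-4 (via NLTK) and an LCS-based ROUGE-L F1 for one hypothetical German reference/hypothesis pair. It is not the evaluation script used in this paper; the example sentences and the smoothing choice are assumptions made purely for illustration.

```python
# Minimal sketch: sentence-level BLEU-1..4 (NLTK) and a hand-rolled ROUGE-L F1
# for one hypothetical reference/hypothesis pair (not the paper's evaluation script).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F1 based on the longest common subsequence of two token lists."""
    m, n = len(reference), len(hypothesis)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            lcs[i + 1][j + 1] = (lcs[i][j] + 1 if reference[i] == hypothesis[j]
                                 else max(lcs[i][j + 1], lcs[i + 1][j]))
    l = lcs[m][n]
    if l == 0:
        return 0.0
    precision, recall = l / n, l / m
    return 2 * precision * recall / (precision + recall)

# Hypothetical German sentences, only for illustration.
reference = "morgen regnet es im norden".split()
hypothesis = "morgen regnet es im sueden".split()

smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
for k in range(1, 5):
    weights = tuple([1.0 / k] * k)    # uniform n-gram weights up to order k
    score = sentence_bleu([reference], hypothesis, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{k}: {score:.3f}")
print(f"ROUGE-L F1: {rouge_l_f1(reference, hypothesis):.3f}")
```

Note that tables such as Table 4 report corpus-level scores computed over the whole test set, not a single sentence pair as in this toy example.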
5 DISCUSSION AND FUTURE WORK

Keypoints of skeleton: In this paper, we extract 14 keypoints to represent the skeleton, while ignoring the fine-grained keypoints of the fingers, mouth, etc. The main reason is that it is difficult to accurately extract fine-grained keypoints from the limited-resolution frames in SLT datasets (e.g., each frame in the PHOENIX14T dataset is only 210×260 pixels). However, we believe that accurately extracting more fine-grained keypoints can further advance SLT performance. In the future, we will conduct further research on skeleton extraction.
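To make the keypoint discussion concrete, here is a minimal, hypothetical sketch of decoding 14 keypoints from per-joint confidence heatmaps predicted for a single low-resolution frame. It is not the paper's self-contained skeleton branch; the joint names, heatmap shape, and confidence threshold are illustrative assumptions.

```python
# Illustrative sketch (not the paper's skeleton branch): decode 14 keypoint
# coordinates from per-joint confidence heatmaps for one low-resolution frame.
import numpy as np

KEYPOINT_NAMES = [  # assumed 14-joint upper-body set; the paper's exact set may differ
    "nose", "neck", "r_shoulder", "r_elbow", "r_wrist",
    "l_shoulder", "l_elbow", "l_wrist", "r_hip", "l_hip",
    "r_eye", "l_eye", "r_ear", "l_ear",
]

def decode_keypoints(heatmaps, conf_thresh=0.3):
    """heatmaps: (14, H, W) array of per-joint confidence maps for one frame.
    Returns a list of (name, x, y, confidence); unreliable joints get x = y = -1."""
    keypoints = []
    for name, hmap in zip(KEYPOINT_NAMES, heatmaps):
        y, x = np.unravel_index(np.argmax(hmap), hmap.shape)  # peak location
        conf = float(hmap[y, x])
        if conf < conf_thresh:  # joint occluded or too blurry at low resolution
            x, y = -1, -1
        keypoints.append((name, int(x), int(y), conf))
    return keypoints

# Toy usage with random heatmaps, assuming height 260 x width 210 frames.
fake_heatmaps = np.random.rand(14, 260, 210).astype(np.float32)
for name, x, y, conf in decode_keypoints(fake_heatmaps):
    print(f"{name:12s} x={x:4d} y={y:4d} conf={conf:.2f}")
```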
More comparisons from different perspectives: In the experiments, we compare the proposed SANet with the existing approaches in terms of ROUGE and BLEU-1,2,3,4, which are used to evaluate translation quality. For a fair comparison, the reported results on these metrics are taken from the original papers. Since complete original code is not released, we do not make further comparisons (e.g., of inference time) with these approaches. However, in future work, we will explore more comprehensive comparisons with the existing approaches from different perspectives.

Particularity of CSL dataset: The CSL dataset is a special dataset in which the word order of sign language is temporally consistent with that of the spoken language, so it can be used for the SLR task [18, 45] as well as the SLT task [14, 15]. In this paper, the CSL dataset is addressed from the perspective of SLT, and only SLT methods are considered for comparison.

6 CONCLUSION

In this paper, we propose a Skeleton-Aware Neural Network (SANet) for end-to-end SLT. Specifically, we first use a self-contained branch to extract the skeleton from each frame, and then enhance the feature representation of a clip by adding the skeleton channel and scaling (i.e., weighting the importance of) the feature vector with a designed skeleton-based GCN. Besides, we design a joint optimization strategy for training. The experimental results on two large-scale datasets demonstrate the effectiveness of SANet.
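As an illustration of the two ideas recapped above, the following PyTorch-style sketch appends a rendered skeleton channel to a clip and scales a clip-level feature vector with a weight produced by a simplified one-layer skeleton GCN. The tensor shapes, the identity adjacency placeholder, and the layer sizes are assumptions; the actual SANet layers differ in details not shown here.

```python
# Minimal sketch of (i) adding a skeleton channel to a clip and (ii) scaling the
# clip feature vector with a skeleton-derived weight. Shapes and layer sizes are
# assumptions for illustration, not the exact SANet architecture.
import torch
import torch.nn as nn

class SkeletonScaler(nn.Module):
    def __init__(self, num_joints=14, joint_dim=2, feat_dim=512):
        super().__init__()
        # Normalized adjacency of the skeleton graph; identity is a placeholder here.
        self.register_buffer("adj", torch.eye(num_joints))
        self.gcn = nn.Linear(joint_dim, 32)           # one graph-convolution weight
        self.to_scale = nn.Linear(num_joints * 32, feat_dim)

    def forward(self, clip_feat, joints):
        # clip_feat: (B, feat_dim) clip feature; joints: (B, num_joints, joint_dim)
        h = torch.relu(self.adj @ self.gcn(joints))   # aggregate neighbors, then project
        scale = torch.sigmoid(self.to_scale(h.flatten(1)))  # per-dimension importance
        return clip_feat * scale                      # weighted clip feature

def add_skeleton_channel(frames, skeleton_maps):
    # frames: (B, T, 3, H, W) RGB clip; skeleton_maps: (B, T, 1, H, W) rendered skeletons.
    return torch.cat([frames, skeleton_maps], dim=2)  # -> (B, T, 4, H, W)

# Toy usage with random tensors.
frames = torch.rand(2, 8, 3, 260, 210)
skel_maps = torch.rand(2, 8, 1, 260, 210)
clip_feat = torch.rand(2, 512)
joints = torch.rand(2, 14, 2)
print(add_skeleton_channel(frames, skel_maps).shape)  # torch.Size([2, 8, 4, 260, 210])
print(SkeletonScaler()(clip_feat, joints).shape)      # torch.Size([2, 512])
```

The sigmoid gate plays the role of the "weighting the importance" step summarized above: feature dimensions supported by the skeleton evidence keep their magnitude, while others are suppressed.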
7 ACKNOWLEDGMENTS

This work is supported by the National Key R&D Program of China under Grant No. 2018AAA0102302; the National Natural Science Foundation of China under Grant Nos. 61802169, 61872174, 61832008, 61906085, 61902175; and the Jiangsu Natural Science Foundation under Grant Nos. BK20180325, BK20190293. This work is partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.