[Figure 8 compares the ground-truth sentence (Label) with the translations produced by SANet and by its ablated variants (w/o all-ske, w/o ske-gph, w/o ske-chl, w/o MC, w/o img-fea, w/o clip-fea), each annotated with its ROUGE score.]

Figure 8: A qualitative analysis of different components used for SLT on the example from the PHOENIX14T test set. (Note: the word order in sign language and that in spoken language may not be temporally consistent.)

dataset respectively. It indicates that the designed MemoryCell can efficiently change the dimensions of hidden states between BiLSTM layers and provide the appropriate input for the LSTM Decoder, thus contributing to a certain performance improvement for SLT. Therefore, our proposed skeleton-related designs make a clear contribution to SLT, while the overall framework of SANet and the other designed components also benefit SLT.

Inference Time Analysis: To evaluate the time efficiency of the designed components, Table 1 and Table 2 also report the inference time of SLT when one type of designed component is removed. Here, inference time means the duration of translating the sign language in a video into the spoken language. In the experiment, we measure the inference time of a sign language video with 300 frames on the CSL dataset and a sign language video with 200 frames on the PHOENIX14T dataset on a single GPU, and average the time of 100 runs as the reported inference time. When keeping all components of SANet, the average inference time is 0.499 seconds on the CSL dataset and 0.383 seconds on the PHOENIX14T dataset. When removing any designed component, the inference time decreases. According to Table 1 and Table 2, the clip-level feature extraction introduces the largest share of the time cost, since 3D convolutions are computationally expensive, whereas the skeleton-related components and the MemoryCell have only a small effect on the overall inference time.
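To make the measurement protocol concrete, the sketch below shows one way such per-video inference times could be obtained in PyTorch, averaging 100 timed runs on a single GPU. The model wrapper, input shape and frame counts are illustrative assumptions rather than the authors' actual benchmarking code.

```python
import time
import torch

def average_inference_time(model, video, n_runs=100):
    """Average SLT inference time over n_runs on one video
    (a minimal sketch, not the paper's benchmarking script)."""
    model.eval()
    device = next(model.parameters()).device
    video = video.to(device)

    with torch.no_grad():
        model(video)                  # warm-up run (kernel launch, cuDNN autotuning)
        if device.type == "cuda":
            torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(n_runs):
            model(video)
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
        elapsed = time.perf_counter() - start

    return elapsed / n_runs

# Hypothetical usage: a 300-frame CSL video or a 200-frame PHOENIX14T video,
# assuming the model accepts a (1, T, 3, H, W) frame tensor.
# video = torch.randn(1, 300, 3, 224, 224)
# print(f"average inference time: {average_inference_time(sanet, video):.3f} s")
```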
Qualitative Analysis: In this experiment, we show an example of SLT on a sign language video from the PHOENIX14T test set to provide a qualitative analysis. As shown in Figure 8, the last row is the ground truth of SLT, the second row from the bottom is the SLT result of the proposed SANet, and the other rows are the SLT results obtained by removing the designed components of SANet. If the i-th word in an SLT result differs from the i-th ground-truth word, it is counted as an error and marked in orange. According to Figure 8, the proposed SANet achieves the best performance, while removing any designed component leads to more errors. This demonstrates that the proposed framework and all the designed components contribute to a higher SLT performance.

4.4 Comparisons

Evaluation on CSL dataset: We compare SANet with existing approaches under two settings. (a) Split I - signer-independent test: we select the sign language videos generated by 40 signers as the training set and those generated by the other 10 signers as the test set. The sentences of the training set and the test set are the same, but the signers have no overlap. (b) Split II - unseen-sentences test: we select the sign language videos corresponding to 94 sentences as the training set and the videos corresponding to the remaining 6 sentences as the test set. The signers and vocabulary of the training set and the test set are the same, while the sentences have no overlap. In Table 3, we show the SLT performance on Split I and Split II, and also compare our SANet with the following existing approaches: (1) S2VT [39] is a standard two-layer stacked LSTM architecture used to translate video to text. (2) S2VT (3-layer) extends S2VT from a two-layer LSTM to a three-layer LSTM. (3) HLSTM [15] is a hierarchical LSTM-based encoder-decoder model for SLT. (4) HLSTM-attn [15] adds attention mechanisms on top of HLSTM. (5) HRF-Fusion [14] is a hierarchical adaptive recurrent network that mines variable-length key clips and applies attention mechanisms, using both RGB videos and skeleton data from a Kinect sensor.

As shown in Table 3, the proposed SANet achieves the best performance on Split I under every metric. Specifically, SANet achieves 99.6%, 99.4%, 99.3%, 99.2% and 99.0% on ROUGE, BLEU-1, BLEU-2, BLEU-3 and BLEU-4, respectively, outperforming the existing approaches. Compared with S2VT, SANet increases ROUGE, BLEU-1, BLEU-2, BLEU-3 and BLEU-4 by 9.2%, 9.2%, 10.7%, 11.3% and 11.6%, respectively. When moving to Split II, the performance drops considerably; for our SANet, for example, the ROUGE score drops by 31.5%. This is because translating unseen sentences, i.e., sentences whose words appear in the training set but which themselves do not occur in the training set, is more challenging for SLT. Nevertheless, our SANet still outperforms the state-of-the-art approaches and achieves 68.1%, 69.7%, 41.1%, 26.8% and 18.1% on ROUGE, BLEU-1, BLEU-2, BLEU-3 and BLEU-4, respectively. Compared with the existing approach HRF-Fusion [14], our SANet increases ROUGE by 23.2%.
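As a side note on reproducing the reported metrics, the sketch below computes sentence-level BLEU-1 to BLEU-4 and ROUGE-L with the nltk and rouge-score packages. These are common off-the-shelf implementations used here purely for illustration; the paper does not state its exact evaluation scripts, so scores may differ slightly in detail (e.g., tokenization and smoothing).

```python
# A minimal metric sketch, assuming whitespace-tokenized sentences; it relies on
# nltk and rouge-score as stand-ins for the paper's (unspecified) evaluation code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def evaluate_translation(reference: str, hypothesis: str) -> dict:
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    smooth = SmoothingFunction().method1   # avoids zero scores on short sentences

    scores = {}
    for n in range(1, 5):
        weights = tuple(1.0 / n for _ in range(n))   # uniform weights up to order n
        scores[f"BLEU-{n}"] = sentence_bleu(
            [ref_tokens], hyp_tokens, weights=weights, smoothing_function=smooth
        )

    rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, hypothesis)
    scores["ROUGE-L"] = rouge["rougeL"].fmeasure
    return scores

# Hypothetical usage with a ground-truth sentence and a model output:
# print(evaluate_translation("heavy snowfall and freezing rain in the morning",
#                            "heavy showers and freezing rain in the morning"))
```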
Evaluation on PHOENIX14T dataset: We compare SANet with the following existing approaches: (1) TSPNet [22] introduces multiple segments of the sign language video at different granularities and uses a Transformer decoder for SLT. (2) H+M+P [6] uses a multi-channel Transformer architecture, fusing hand-area videos, mouth-area videos and poses extracted from the videos. (3) Sign2Gloss→Gloss2Text [5] adopts a classic encoder-decoder architecture and is trained in two stages (i.e., a gloss stage and a sentence stage) independently. (4) Sign2Gloss2Text [5] adopts the