Skeleton-Aware Neural Sign Language Translation

Shiwei Gan, Yafeng Yin*, Zhiwei Jiang, Lei Xie, Sanglu Lu
State Key Laboratory for Novel Software Technology, Nanjing University, China
gsw@smail.nju.edu.cn, yafeng@nju.edu.cn, jzw@nju.edu.cn, lxie@nju.edu.cn, sanglu@nju.edu.cn

ABSTRACT
As an essential communication way for deaf-mutes, sign languages are expressed by human actions. To distinguish human actions for sign language understanding, the skeleton which contains position information of human pose can provide an important cue, since different actions usually correspond to different poses/skeletons. However, skeleton has not been fully studied for Sign Language Translation (SLT), especially for end-to-end SLT. Therefore, in this paper, we propose a novel end-to-end Skeleton-Aware neural Network (SANet) for video-based SLT. Specifically, to achieve end-to-end SLT, we design a self-contained branch for skeleton extraction. To efficiently guide the feature extraction from video with skeletons, we concatenate the skeleton channel and RGB channels of each frame for feature extraction. To distinguish the importance of clips, we construct a skeleton-based Graph Convolutional Network (GCN) for feature scaling, i.e., giving importance weight for each clip. The scaled features of each clip are then sent to a decoder module to generate spoken language. In our SANet, a joint training strategy is designed to optimize skeleton extraction and sign language translation jointly. Experimental results on two large scale SLT datasets demonstrate the effectiveness of our approach, which outperforms the state-of-the-art methods. Our code is available at https://github.com/SignLanguageCode/SANet.

CCS CONCEPTS
• Computing methodologies → Activity recognition and understanding.

KEYWORDS
Sign Language Translation; Skeleton; Neural Network

ACM Reference Format:
Shiwei Gan, Yafeng Yin, Zhiwei Jiang, Lei Xie, Sanglu Lu. 2021. Skeleton-Aware Neural Sign Language Translation. In Proceedings of the 29th ACM International Conference on Multimedia (MM '21), October 20-24, 2021, Virtual Event, China. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3474085.3475577

*Yafeng Yin is the corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. MM '21, October 20-24, 2021, Virtual Event, China. © 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8651-7/21/10. https://doi.org/10.1145/3474085.3475577

Figure 1: In sign languages, the same hand gesture in different positions can have different meanings. The blue points and lines represent the skeleton. (a) Sign: doing. (b) Sign: liver.

1 INTRODUCTION
Sign language has been widely adopted as a communication way for deaf-mutes. To build the bridge between deaf-mutes and hearing people, the research work on sign language understanding emerged and the existing work can mainly be categorized as Sign Language Recognition (SLR) [21, 29, 45] and Sign Language Translation (SLT) [5, 15, 25].
Earlier, the sign language-related work usually focused on SLR, which aims at recognizing an isolated sign as a word or expression [12, 24, 37], or recognizing continuous signs as the corresponding word sequence [4, 10, 13, 21]. However, the SLR work neglected the difference between sign language and spoken language in grammatical rules, i.e., the recognized word sequence may not be grammatically correct, thus hindering the understanding of sign language. Recently, due to the advancement of annotated datasets and deep learning technology, SLT has attracted people's attention. SLT is a more challenging task and its objective is to translate sign language to spoken language, while requiring that the translation results conform to the grammatical rules and linguistic characteristics of the target spoken language.

In regard to SLT, the prior work tended to decompose SLT into two stages, i.e., recognizing continuous signs as a word sequence, and then utilizing language models to construct sentences with the words [3, 9]. However, the two-stage methods usually required gloss¹ annotation, which was a labor-intensive task and needed specialists. Recently, due to the development of deep learning technology, Camgoz et al. approached SLT as a neural machine translation task [5], and introduced the encoder-decoder network and attention mechanism for end-to-end SLT for the first time. After that, Camgoz et al. introduced transformer networks for end-to-end SLT from videos [7].

¹Here, 'gloss' means a gesture with its closest meaning in natural languages [10].
When considering the modality difference between video and language in SLT, feature representation was adopted, i.e., the video is represented as features which are later translated to language. However, in the existing neural-based methods, the feature representation of video mainly consisted of full-frame [5, 15, 22] or local-area [6] features, while the skeleton information, which reflects the important spatial structure of human pose in sign languages, has not been fully studied. In fact, the skeleton can be used to distinguish signs with different human poses (i.e., different relative positions of hands, arms, etc.), especially for the signs which use the same hand gesture in different positions to represent different meanings, as shown in Figure 1. Therefore, it is meaningful to introduce skeleton information into SLT.

To utilize skeletons for SLT, some work emerged recently [6, 14], which advanced the research of skeleton-assisted SLT. However, the existing research often had the following problems. First, obtaining the skeletons often required an external device [14] or extra offline preprocessing [6], which hindered end-to-end SLT from videos. Second, the videos and skeletons were used as two independent data sources for feature extraction, i.e., not fused at the initial stage of feature extraction, thus the video-based feature extraction may not be efficiently guided/enhanced with the skeleton information. Third, each clip (i.e., a short segmented video) was usually treated equally, while neglecting the different importance of meaningful (e.g., sign-related) clips and unmeaningful (e.g., end state) clips. Among the problems, the third one exists not only in skeleton-assisted SLT, but also in much of the existing SLT work.

To address the above three problems, we propose a Skeleton-Aware neural Network (SANet). Firstly, to achieve end-to-end SLT, SANet designs a self-contained branch for skeleton extraction. Secondly, to guide the video-based feature extraction with skeletons, SANet concatenates the skeleton channel and RGB channels for each frame, thus the features extracted from images/videos will be affected by the skeleton. Thirdly, to distinguish the importance of clips, SANet constructs a skeleton-based Graph Convolutional Network (GCN) for feature scaling, i.e., giving an importance weight to each clip. Specifically, SANet consists of four components, i.e., FrmSke, ClipRep, ClipScl and LangGen. At first, FrmSke is used to extract the skeleton from each frame and frame-level features for a clip by convolutions and deconvolutions. Then, ClipRep is used to enhance the clip representation by adding a skeleton channel. After that, ClipScl is used to scale the clip representation by a skeleton-based Graph Convolutional Network (GCN). Finally, with the scaled features of clips, LangGen is used to generate spoken language with sequence to sequence learning. In addition, we design a joint optimization strategy for model training and achieve end-to-end SLT.

We make the following contributions in this paper.
• We propose a Skeleton-Aware neural Network (SANet) for end-to-end SLT, where a self-contained branch is designed for skeleton extraction and a joint training strategy is designed to optimize skeleton extraction and sign language translation jointly.
• We concatenate the extracted skeleton channel and RGB channels at the source data level, which can highlight human pose-related features and enhance the clip representation.
• We construct skeleton-based graphs and use a graph convolutional network to scale the clip representation, i.e., weighting the importance of each clip, which can highlight meaningful clips while weakening unmeaningful clips.
• We conduct extensive experiments on two large-scale public SLT datasets. The experimental results demonstrate that our SANet outperforms the state-of-the-art methods.
2 RELATED WORK
The existing research work on sign languages can be mainly categorized into SLR and SLT, where SLR can be further classified into isolated SLR and continuous SLR. In this section, we review the related work on isolated SLR, continuous SLR, and SLT.

Isolated Sign Language Recognition (ISLR): Isolated SLR aims at recognizing one sign as a word or expression [2, 12, 24], which is similar to gesture recognition [27, 42] and action recognition [32, 40]. The early methods tended to select features from videos manually, and introduced the Hidden Markov Model (HMM) [12, 16] to analyze the gesture sequence of a sign (i.e., human action) for recognition. However, the manually-selected features may limit the recognition performance. Therefore, in recent years, deep learning-based approaches were introduced for isolated SLR. These approaches utilized neural networks to automatically extract features from videos [17], Kinect's sensor data [36], or moving trajectories of skeleton joints [24] for isolated SLR, and often achieved a better performance.

Continuous Sign Language Recognition (CSLR): Continuous SLR aims at recognizing a sequence of signs as the corresponding word sequence [21, 28, 45], thus continuous SLR is more challenging than isolated SLR. To realize CSLR, traditional methods like DTW-HMM [44] and CNN-HMM [21] introduced temporal segmentation and the Hidden Markov Model (HMM) to transform continuous SLR into isolated SLR. Considering the possible errors and annotation burden in temporal segmentation, the recent deep learning based methods [18] applied sequence to sequence learning for continuous SLR. They learned the correspondence between two sequences from weakly annotated data in an end-to-end manner. However, many approaches tended to adopt the Connectionist Temporal Classification (CTC) loss [13, 20, 43, 45], which requires that source and target sequences have the same order. In fact, the sign sequence in sign language and the word sequence in spoken language can be ordered differently [5], thus the approaches for continuous SLR are not suitable for SLT.

Sign Language Translation (SLT): SLT aims to translate sign languages into spoken languages. Traditional methods [3, 9] usually decomposed SLT into two stages, i.e., continuous SLR and text-to-text translation. The two-stage methods had both gloss annotations and sentence annotations, thus they can be optimized in two stages for a better performance [5]. However, annotating glosses requires specialists and is a labor-intensive task. Recently, due to the advancement of public datasets with sentence-level annotations [5, 11, 18] and deep learning technology, a few end-to-end SLT approaches emerged. Camgoz et al. [5] introduced the encoder-decoder framework to realize end-to-end SLT. Guo et al. [14, 15] proposed the hierarchical-LSTM model for end-to-end SLT. Camgoz et al. utilized transformer networks [38] to jointly solve SLR and SLT [7]. Li et al. developed a temporal semantic pyramid encoder and a transformer decoder for SLT [22]. These neural-based approaches often adopted an encoder and a decoder for SLT.

To represent the sign languages, the existing neural-based SLT methods mainly focused on extracting full-frame [5, 15, 22] or local-area features [6, 46] from the video. There was only a little work paying attention to skeleton information (i.e., human pose) for SLT. Specifically, HRF [14] collected skeletons with a depth camera,
Figure 2: The SANet consists of FrmSke, ClipRep, ClipScl and LangGen, which are used for extracting skeletons and frame-level features, enhancing clip representation, scaling features and generating sentences.

while Camgoz et al. [6] extracted skeletons from video with an existing tool in an offline stage. Then, they input RGB videos and skeletons in parallel to the neural network for feature extraction. The external device or offline preprocessing of skeleton extraction hindered end-to-end SLT from videos. Besides, the existing approaches tended to fuse videos and skeletons at the level of extracted features and to give the same importance to each clip, which may limit the performance of feature representation. Differently, we extract the skeleton with a self-contained branch to achieve end-to-end SLT. Besides, we fuse videos and skeletons in the source data by concatenating the skeleton channel and RGB channels, and construct a skeleton-based GCN to weight the importance of each clip.

3 PROPOSED APPROACH
In SLT, when given a sign language video $X = (f_1, f_2, \ldots, f_u)$ with $u$ frames, our objective is to learn the conditional probability $p(Y|X)$ of generating a spoken language sentence $Y = (w_1, w_2, \ldots, w_v)$ with $v$ words. The sentence with the highest probability $p(Y|X)$ is chosen as the translated spoken language.

To realize end-to-end SLT, we propose a Skeleton-Aware neural Network (SANet), which consists of FrmSke, ClipRep, ClipScl and LangGen. As shown in Figure 2, a sign language video is segmented into consecutive equal-length clips with 50% overlap. For a clip, we first use FrmSke to extract the skeleton from each frame and frame-level clip features. Then, we concatenate the skeleton channel and RGB channels of each frame in the clip, and adopt ClipRep to extract clip-level features. The frame-level features and clip-level features are added to get the fused features of a clip. Meanwhile, we utilize the skeletons in a clip to construct a spatial-temporal graph and adopt ClipScl to calculate the scale factor, which will be multiplied with the fused features to get the scaled feature vector of a clip. Finally, the scaled features of all clips are sent to LangGen for generating the spoken language.
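To make the segmentation step concrete, the following is a minimal PyTorch sketch (not the authors' code) of splitting a video tensor of shape (T, C, H, W) into overlapping clips. The 16-frame window and 8-frame stride follow the sliding-window setting reported in Section 4.2 and give the 50% overlap mentioned above; padding the last clip by repeating the final frame is our own assumption.

```python
import torch

def segment_into_clips(video: torch.Tensor, window: int = 16, stride: int = 8) -> torch.Tensor:
    """Split a video of shape (T, C, H, W) into overlapping clips of shape (m, window, C, H, W).

    A 50% overlap corresponds to stride = window // 2. The last clip is padded by
    repeating the final frame (an assumption; the paper does not state the padding rule).
    """
    t = video.shape[0]
    # Pad so that every frame is covered by at least one full-length clip.
    if t < window:
        pad = window - t
    else:
        remainder = (t - window) % stride
        pad = 0 if remainder == 0 else stride - remainder
    if pad > 0:
        video = torch.cat([video, video[-1:].repeat(pad, 1, 1, 1)], dim=0)
    # unfold creates overlapping windows along the time axis.
    clips = video.unfold(0, window, stride)            # (m, C, H, W, window)
    clips = clips.permute(0, 4, 1, 2, 3).contiguous()  # (m, window, C, H, W)
    return clips

# Example: a 75-frame RGB video of 200x200 frames yields 9 clips after padding to 80 frames.
clips = segment_into_clips(torch.randn(75, 3, 200, 200))
print(clips.shape)  # torch.Size([9, 16, 3, 200, 200])
```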
3.1 Frame-Level Skeleton Extraction
A frame can capture the specific gesture in sign language, thus containing the spatial structure of human pose and detailed information in the face, hands, fingers, etc. Therefore, we split the clip into frames, and propose the FrmSke module to extract the skeleton and frame-level features. As shown in Figure 3, we select a compressed variant of the VGG model [33] as the backbone network for FrmSke. In the compressed VGG model, the number of channels in the convolutional layers is reduced to one fourth of the original one, to reduce the memory requirement and make the model work on our platform.

Figure 3: FrmSke extracts the skeleton from each image and frame-level features of each clip.

Skeleton extraction: As shown in Figure 3, to extract the skeleton map from a frame, two parallel deconvolutional networks are used to upsample high-to-low resolution representations [34] after the Conv3 and Conv4 layers. Specifically, the Deconv1 layer adopts one 3×3 deconvolution with stride 2 for 2× upsampling, while the Deconv2 layer adopts two consecutive 3×3 deconvolutions with stride 2 for 4× upsampling. Then, a pointwise convolutional layer and an element-sum operation are added after the deconvolutional layers to generate $K$ heatmaps, where each heatmap $M^H_k$, $k \in [1, K]$, contains one keypoint (with the highest heat value) of the skeleton. After that, we generate the skeleton (i.e., a 2D matrix) $M^S$ by adding the corresponding elements in the $K$ heatmaps. Here, $K$ is set to 14 and means the number of keypoints from the nose, neck, both eyes, both ears, both shoulders, both elbows, both wrists and both hips.

Frame-level clip representation: As shown in Figure 3, the convolutions Conv1 to Conv5 are first used to extract feature maps from each frame of a clip. Then, the feature maps are concatenated, flattened and sent to a fully connected layer to get a feature vector $F_m$ with $N_m = 4096$ elements for the clip.
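The following is a simplified PyTorch sketch of the skeleton branch described above, not the authors' exact FrmSke implementation: feature maps taken after Conv3-like and Conv4-like stages are upsampled by 2× and 4× deconvolutions, projected to K = 14 heatmaps with pointwise convolutions, summed element-wise, and collapsed into a single skeleton map. The channel widths and the input feature-map shapes are assumptions.

```python
import torch
import torch.nn as nn

class SkeletonHead(nn.Module):
    """Sketch of the FrmSke skeleton branch.

    Two deconvolution paths upsample intermediate feature maps (2x and 4x), pointwise
    convolutions project each path to K = 14 keypoint heatmaps, the paths are summed
    element-wise, and the heatmaps are added into one skeleton map. Channel widths
    (c3, c4) are illustrative assumptions, not the paper's exact values.
    """

    def __init__(self, c3: int = 64, c4: int = 128, k: int = 14):
        super().__init__()
        # Deconv1: one 3x3 deconvolution with stride 2 (2x upsampling of the Conv3-like map).
        self.deconv1 = nn.ConvTranspose2d(c3, c3, 3, stride=2, padding=1, output_padding=1)
        # Deconv2: two consecutive 3x3 deconvolutions with stride 2 (4x upsampling of the Conv4-like map).
        self.deconv2 = nn.Sequential(
            nn.ConvTranspose2d(c4, c3, 3, stride=2, padding=1, output_padding=1),
            nn.ConvTranspose2d(c3, c3, 3, stride=2, padding=1, output_padding=1),
        )
        # Pointwise (1x1) convolutions map each path to K heatmaps before the element-wise sum.
        self.point1 = nn.Conv2d(c3, k, 1)
        self.point2 = nn.Conv2d(c3, k, 1)

    def forward(self, feat3: torch.Tensor, feat4: torch.Tensor):
        # feat3: (B, c3, H/4, W/4), feat4: (B, c4, H/8, W/8); both paths end at (B, K, H/2, W/2).
        heatmaps = self.point1(self.deconv1(feat3)) + self.point2(self.deconv2(feat4))
        # The skeleton map M^S is the element-wise sum of the K keypoint heatmaps (one 2D channel).
        skeleton = heatmaps.sum(dim=1, keepdim=True)
        return heatmaps, skeleton

# Toy usage with assumed feature-map shapes for a 200x200 frame (50x50 and 25x25 maps).
head = SkeletonHead()
hm, sk = head(torch.randn(2, 64, 50, 50), torch.randn(2, 128, 25, 25))
print(hm.shape, sk.shape)  # torch.Size([2, 14, 100, 100]) torch.Size([2, 1, 100, 100])
```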
3.2 Channel Extended Clip Representation
A video clip with several consecutive frames can capture the short action (i.e., continuous/dynamic gestures) in sign languages. Therefore, we propose the ClipRep module to track the dynamic changes of human pose and extract the clip representation. As shown in Figure 4, we first extend the channels of each frame by concatenating the extracted skeleton channel and the original RGB channels, and then adopt Pseudo 3D Residual Networks (P3D) [30] to extract clip-level features.

Figure 4: ClipRep extracts clip-level features, where each frame of the clip is extended to four channels by concatenating the skeleton channel and RGB channels.

Channel extension with skeleton: We use the skeleton map (i.e., a 2D matrix) as the fourth channel, and concatenate it with the original RGB channels of each frame, to get the RGBS frame with four channels, as shown in Figure 4. After that, the clip with RGBS frames will be used for clip-level feature extraction.

Enhanced clip representation: Based on the RGBS frame sequence of a clip, we introduce the P3D block [30] to extract the features of the clip, where P3D is adopted for SLT for the first time. The P3D block consists of one (2D) spatial filter (1×3×3), one (1D) temporal filter (3×1×1), and two pointwise filters (1×1×1). Combining the filters in different ways gives different modules (i.e., P3D-A, P3D-B, P3D-C) for P3D. In ClipRep (shown in Figure 4), after the 3D-convolution and P3D blocks, the residual units, average pooling and a fully connected layer are used to get the feature vector $F_c$ with $N_c = 4096$ elements for the clip.

To verify whether the added skeleton channel can enhance feature representation, we visualize the intermediate feature maps (i.e., after the first P3D block) in ClipRep without and with the skeleton channel in Figure 5(a) and Figure 5(b), respectively. The areas with brighter colors in Figure 5(b) indicate that the added skeleton channel can highlight the features related to sign language, e.g., gesture changes and human pose, thus enhancing the clip representation.

Figure 5: Intermediate feature maps after the first P3D block without or with using the skeleton channel. In each case, we show 4 examples selected from 16 frames in a clip.
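As an illustration, a minimal sketch of the channel extension and a P3D-A-style factorized block follows; it is not the authors' ClipRep configuration. The channel widths, the bottleneck layout, and the assumption that the skeleton map has already been resized to the frame resolution are ours.

```python
import torch
import torch.nn as nn

def to_rgbs_clip(rgb_clip: torch.Tensor, skeleton: torch.Tensor) -> torch.Tensor:
    """Concatenate the skeleton channel with the RGB channels of every frame in a clip.

    rgb_clip: (B, T, 3, H, W), skeleton: (B, T, 1, H, W) -> RGBS clip of shape (B, T, 4, H, W).
    The skeleton map is assumed to have been resized to the frame resolution beforehand.
    """
    return torch.cat([rgb_clip, skeleton], dim=2)

class P3DABlock(nn.Module):
    """P3D-A-style residual block: pointwise reduce -> (2D) 1x3x3 spatial filter ->
    (1D) 3x1x1 temporal filter -> pointwise expand, with a residual connection.
    Channel widths are illustrative assumptions."""

    def __init__(self, channels: int = 4, hidden: int = 16):
        super().__init__()
        self.reduce = nn.Conv3d(channels, hidden, kernel_size=1)
        self.spatial = nn.Conv3d(hidden, hidden, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(hidden, hidden, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.expand = nn.Conv3d(hidden, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W); Conv3d expects the temporal axis before the spatial axes.
        out = self.relu(self.reduce(x))
        out = self.relu(self.spatial(out))   # spatial filtering within each frame
        out = self.relu(self.temporal(out))  # temporal filtering across frames
        out = self.expand(out)
        return self.relu(out + x)            # residual connection

# Toy usage: a clip of 16 RGBS frames at 200x200.
rgbs = to_rgbs_clip(torch.randn(1, 16, 3, 200, 200), torch.randn(1, 16, 1, 200, 200))
x = rgbs.permute(0, 2, 1, 3, 4)              # (B, 4, 16, 200, 200) for 3D convolutions
print(P3DABlock()(x).shape)                  # torch.Size([1, 4, 16, 200, 200])
```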
3.3 Skeleton-Aware Clip Scaling
In a short-time clip, the human action can correspond to a meaningful sign, a less important transition action, an unmeaningful end state, etc. Thus the importance of each clip for SLT can be different. To track the human action in a clip and weight the importance of each clip, we propose the ClipScl module, which first constructs a skeleton-based graph and applies a Graph Convolutional Network (GCN) [41] to generate a scale factor, and then scales the feature vector of each clip with the scale factor.

Figure 6: ClipScl constructs a skeleton-based graph and uses a GCN to calculate the scale factor, which is used for weighting the importance of each clip.

Skeleton-based GCN: To track the dynamic changes of human action in a short clip, we construct the skeleton-based graph, which can describe the moving trajectories of keypoints in the skeleton [31, 41]. Specifically, for a clip with $c$ frames, we first construct a skeleton-based graph $G = (V, E)$ with the node set $V$ and the edge set $E$. Suppose the keypoints of the $i$th skeleton in a clip are $V_i = (v_{i_1}, v_{i_2}, \ldots, v_{i_K})$, $i \in [1, c]$. Here, $v_{i_j}$, $j \in [1, K]$, means the $j$th keypoint/node in the $i$th skeleton, while $K = 14$ means the number of keypoints in a skeleton. Then we can get the node set $V = \{v_{i_j}, i \in [1, c], j \in [1, K]\}$. In regard to the edge set, it includes the intra-skeleton edge set $E_a = \{v_{i_p} v_{i_q} \mid (p, q) \in S\}$, where $S$ means the set of naturally connected body joints in a skeleton, and the inter-skeleton edge set $E_e = \{v_{i_p} v_{j_p} \mid i, j \in [1, c], |i - j| = 1\}$ (i.e., edges between the corresponding nodes of two adjacent skeletons), as shown in Figure 6. For each node in the constructed skeleton-based graph, its coordinate vector $(x, y)$ in the frame is used as its initial feature vector $v^f$.
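The graph construction can be transcribed almost directly into code. The sketch below builds the node features and the edge sets $E_a$ and $E_e$ for a clip; since the paper does not enumerate the connection set $S$, the BODY_EDGES list is an assumed chain over the 14 keypoints.

```python
import torch

# Assumed set S of naturally connected body joints over 14 keypoints, indexed as
# 0 nose, 1 neck, 2-3 eyes, 4-5 ears, 6-7 shoulders, 8-9 elbows, 10-11 wrists, 12-13 hips.
BODY_EDGES = [(0, 1), (0, 2), (0, 3), (2, 4), (3, 5),           # head
              (1, 6), (1, 7), (6, 8), (7, 9), (8, 10), (9, 11),  # arms
              (1, 12), (1, 13)]                                  # hips

def build_skeleton_graph(keypoints: torch.Tensor):
    """Build the spatial-temporal graph G = (V, E) of a clip.

    keypoints: (c, K, 2) tensor of (x, y) coordinates, one skeleton per frame.
    Returns node features of shape (c*K, 2) and an edge list, where node i*K + j is the
    jth keypoint of the ith skeleton.
    """
    c, k, _ = keypoints.shape
    node_features = keypoints.reshape(c * k, 2)  # initial feature = (x, y) coordinate
    edges = []
    for i in range(c):
        # Intra-skeleton edges E_a: naturally connected joints within one frame.
        edges += [(i * k + p, i * k + q) for p, q in BODY_EDGES]
        # Inter-skeleton edges E_e: the same keypoint in two adjacent frames (|i - j| = 1).
        if i + 1 < c:
            edges += [(i * k + j, (i + 1) * k + j) for j in range(k)]
    return node_features, edges

# Toy usage: a clip of 16 frames with 14 keypoints each.
feats, edges = build_skeleton_graph(torch.rand(16, 14, 2))
print(feats.shape, len(edges))  # torch.Size([224, 2]) and 16*13 + 15*14 = 418 edges
```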
With the skeleton-based graph, we then adopt a Graph Convolutional Network (GCN) to calculate the scale factor (i.e., importance weight) of a clip. Specifically, we design ClipScl, which consists of 7 layers of spatial-temporal graph convolution (ST-GCN) units [41], while decreasing the channel number of ST-GCN by a factor of 0.25 to reduce the memory requirement of the model, as shown in Figure 6. Then, we use max pooling and a fully connected layer to get the feature vector, which is passed to a sigmoid function to get the scale factor $sf$, i.e., a value belonging to $[0, 1]$.

$sf = \text{Sigmoid}(\text{ST-GCN}^{+}(V^f, E))$    (1)

Here, $V^f$ is the feature vector set of the node set $V$, $\text{ST-GCN}^{+}(\cdot)$ denotes the combination of 7 layers of ST-GCN units, max pooling and a fully connected layer, and $\text{Sigmoid}(\cdot)$ denotes the sigmoid function.

Fused feature scaling: For each clip, we get the frame-level feature vector $F_m$, the clip-level feature vector $F_c$ and the scale factor $sf$. First, we fuse $F_m$ and $F_c$ with element-wise addition $\oplus$ to get the fused feature vector. Then, we scale the fused feature vector with the multiplication operation $\otimes$ to get the scaled feature vector $F_f$, as shown below.

$F_f = (F_m \oplus F_c) \otimes sf$    (2)

Figure 7: Visualization of the scale factor for each clip. Meaningful clips with signs have larger scale factors, while unmeaningful clips have lower scale factors.

In Figure 7, we show the calculated scale factor $sf$ for each clip, where clips with meaningful signs (i.e., clips in the green rectangle) have larger $sf$. It means that the designed ClipScl module can efficiently track the dynamic changes of human pose with skeletons and distinguish the importance of different clips, i.e., ClipScl can highlight meaningful clips while weakening unmeaningful clips.

3.4 Spoken Language Generation
After getting the scaled feature vector of each clip, we propose LangGen, which adopts the encoder-decoder framework [35] and the attention mechanism [1] to generate the spoken language, as shown in Figure 2.

BiLSTM Encoder with MemoryCell: We use three-layered BiLSTMs and propose a novel MemoryCell to connect adjacent BiLSTM layers for encoding. Specifically, given a sequence of scaled clip feature vectors $z_{1:n}$, we first get the hidden states $H^l = (h^l_1, h^l_2, \ldots, h^l_n)$ after the $l$th BiLSTM layer. Then, we design the MemoryCell to change the dimensions of hidden states and provide the appropriate input for the following layer, as shown below.

$h^{l+1}_0 = \tanh(W \cdot h^l_n + b)$    (3)

where $W$ and $b$ are the weight and bias of the fully connected layer, $h^l_n$ is the final output hidden state of the $l$th layer, and $h^{l+1}_0$ is input as the initial hidden state of the $(l+1)$th layer.

LSTM Decoder: We use one LSTM layer as the decoder to decode the words step by step. Specifically, the decoder utilizes LSTM cells, a fully connected layer and a softmax layer to output the prediction probability $p_{t,j}$, i.e., the probability that the predicted word $\hat{y}_t$ at the $t$th time step is the $j$th word in the vocabulary. At the beginning of decoding, $\hat{y}_0$ is initialized with the start symbol "[SOS]". At the $t$th time step, the decoder predicts the word $\hat{y}_t$. The decoder stops decoding once the symbol "[EOS]" occurs.
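The following is a sketch of the MemoryCell idea in Equation (3): the final states of one BiLSTM layer are mapped through a fully connected layer with tanh and used as the initial states of the next layer. The 4096-dimensional clip features and 1024-dimensional hidden states follow Section 4.2; applying the same mapping to the cell state and to each direction of the BiLSTM is our assumption, and the attention-based decoder is omitted.

```python
import torch
import torch.nn as nn

class MemoryCell(nn.Module):
    """Sketch of Equation (3): h^{l+1}_0 = tanh(W * h^l_n + b).

    The final states of the lth BiLSTM layer are mapped through a fully connected layer
    and tanh, then used as the initial states of the (l+1)th layer. Treating the cell
    state and both directions the same way is our assumption."""

    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.fc_h = nn.Linear(hidden, hidden)
        self.fc_c = nn.Linear(hidden, hidden)

    def forward(self, h_n: torch.Tensor, c_n: torch.Tensor):
        # h_n, c_n: (num_directions, B, hidden) final states of the previous BiLSTM layer.
        return torch.tanh(self.fc_h(h_n)), torch.tanh(self.fc_c(c_n))

class MemoryCellEncoder(nn.Module):
    """Three BiLSTM layers connected by MemoryCells (a sketch of the LangGen encoder)."""

    def __init__(self, feat_dim: int = 4096, hidden: int = 1024, layers: int = 3):
        super().__init__()
        dims = [feat_dim] + [2 * hidden] * (layers - 1)  # each BiLSTM outputs 2*hidden features
        self.lstms = nn.ModuleList(
            [nn.LSTM(d, hidden, batch_first=True, bidirectional=True) for d in dims])
        self.cells = nn.ModuleList([MemoryCell(hidden) for _ in range(layers - 1)])

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, n, feat_dim) sequence of scaled clip feature vectors.
        out, (h, c) = self.lstms[0](z)
        for lstm, cell in zip(self.lstms[1:], self.cells):
            out, (h, c) = lstm(out, cell(h, c))  # MemoryCell provides the initial states
        return out                               # (B, n, 2*hidden) encoder outputs for attention

# Toy usage: 8 scaled clip features of dimension 4096.
enc = MemoryCellEncoder()
print(enc(torch.randn(1, 8, 4096)).shape)  # torch.Size([1, 8, 2048])
```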
3.5 Joint Loss Optimization
To optimize skeleton extraction and sign language translation jointly, we design a joint loss $L$, which consists of the skeleton extraction loss $L_{ske}$ and the SLT loss $L_{slt}(y, \hat{y})$.

The skeleton extraction loss $L_{ske}$ is calculated as follows, where $M^H, M^G \in \mathbb{R}^{K \times h \times w}$ denote the predicted heatmaps and the ground-truth heatmaps, respectively. Here, $K$, $h$, $w$ are the number of keypoints, the height and the width of a heatmap. Each ground-truth heatmap contains a heating area, which is generated by applying a 2D Gaussian function with 1-pixel standard deviation [34] on the keypoint estimated by OpenPose [8].

$L_{ske} = \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{h} \sum_{j=1}^{w} (M^H_{k,i,j} - M^G_{k,i,j})^2$    (4)

The SLT loss $L_{slt}(y, \hat{y})$ is the cross entropy loss function, where $\hat{y}$ is the predicted word sequence and $y$ is the ground-truth word sequence (i.e., labels). The calculation of $L_{slt}(y, \hat{y})$ is shown below, where $T$ is the maximum number of time steps in the decoder and $V$ is the number of words in the vocabulary. $y_{t,j}$ is an indicator: when the ground-truth word at the $t$th time step is the $j$th word in the vocabulary, $y_{t,j} = 1$; otherwise, $y_{t,j} = 0$. $p_{t,j}$ is the probability that the predicted word $\hat{y}_t$ at the $t$th time step is the $j$th word in the vocabulary.

$L_{slt}(y, \hat{y}) = -\sum_{t=1}^{T} \sum_{j=1}^{V} y_{t,j} \log(p_{t,j})$    (5)

Based on $L_{ske}$ and $L_{slt}(y, \hat{y})$, we calculate the joint loss $L$ as follows, where $\alpha$ is a hyper-parameter used to balance the ratio of $L_{ske}$ and $L_{slt}$. We set $\alpha$ to 1 at the beginning of training and change it to 0.5 in the middle of training.

$L = \alpha L_{ske} + L_{slt}(y, \hat{y})$    (6)
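Equations (4)-(6) can be transcribed directly; the sketch below assumes heatmaps of shape (K, h, w), per-step word probabilities of shape (T, V) and ground-truth word indices, and is not the authors' implementation.

```python
import torch

def skeleton_loss(pred_heatmaps: torch.Tensor, gt_heatmaps: torch.Tensor) -> torch.Tensor:
    """Equation (4): squared error summed over each heatmap and averaged over the K keypoints.
    Both inputs have shape (K, h, w)."""
    return ((pred_heatmaps - gt_heatmaps) ** 2).sum(dim=(1, 2)).mean()

def slt_loss(word_probs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Equation (5): cross entropy between predicted word probabilities (T, V) and
    ground-truth word indices (T,); y_{t,j} is the one-hot indicator of the target word."""
    return -torch.log(word_probs[torch.arange(targets.shape[0]), targets]).sum()

def joint_loss(pred_hm, gt_hm, word_probs, targets, alpha: float = 1.0) -> torch.Tensor:
    """Equation (6): L = alpha * L_ske + L_slt. The paper sets alpha = 1 at the start of
    training and 0.5 from the middle of training onward."""
    return alpha * skeleton_loss(pred_hm, gt_hm) + slt_loss(word_probs, targets)

# Toy usage: 14 heatmaps of 100x100 and a 5-word target sentence over a 2887-word vocabulary.
probs = torch.softmax(torch.randn(5, 2887), dim=-1)
loss = joint_loss(torch.rand(14, 100, 100), torch.rand(14, 100, 100),
                  probs, torch.randint(0, 2887, (5,)))
print(loss.item())
```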
Table 1: Ablation study on the CSL dataset. Note: without: 'w/o', skeleton: 'ske', channel: 'chl', graph: 'gph', frame: 'frm', MemoryCell: 'MC', feature: 'fea'. 'all-ske' refers to all skeleton-related components, including the skeleton extraction branch, the skeleton channel and the skeleton-based GCN.

Model          Time(s)   ROUGE   BLEU-1   BLEU-2   BLEU-3   BLEU-4
w/o all-ske    0.467     0.951   0.953    0.939    0.927    0.916
w/o ske-chl    0.489     0.956   0.958    0.946    0.935    0.924
w/o ske-gph    0.472     0.966   0.967    0.957    0.947    0.939
w/o frm-fea    0.480     0.952   0.954    0.941    0.928    0.916
w/o clip-fea   0.242     0.928   0.921    0.898    0.882    0.879
w/o MC         0.489     0.960   0.962    0.950    0.939    0.929
SANet          0.499     0.996   0.995    0.994    0.992    0.990

Table 2: Ablation study on the PHOENIX14T dataset. Note: without: 'w/o', skeleton: 'ske', channel: 'chl', graph: 'gph', frame: 'frm', MemoryCell: 'MC', feature: 'fea'. 'all-ske' refers to all skeleton-related components, including the skeleton extraction branch, the skeleton channel and the skeleton-based GCN.

Model          Time(s)   ROUGE   BLEU-1   BLEU-2   BLEU-3   BLEU-4
w/o all-ske    0.358     0.513   0.542    0.391    0.294    0.225
w/o ske-chl    0.370     0.520   0.549    0.403    0.304    0.233
w/o ske-gph    0.371     0.515   0.545    0.394    0.295    0.226
w/o frm-fea    0.376     0.518   0.547    0.396    0.299    0.230
w/o clip-fea   0.214     0.516   0.541    0.390    0.291    0.220
w/o MC         0.369     0.538   0.563    0.418    0.316    0.243
SANet          0.383     0.548   0.573    0.424    0.322    0.248

The PHOENIX14T corpus has two-stage annotations: sign gloss annotations with a vocabulary of 1066 different signs for continuous SLR, and German translation annotations with a vocabulary of 2877 different words for SLT. The CSL dataset is split into 17K, 2K and 6K samples for training, validation and testing, respectively; this split has been widely adopted in existing work [14, 15, 18]. PHOENIX14T has been officially split into 7096, 519 and 642 samples for training, validation and testing, respectively.

4.2 Experimental Setting
In this subsection, we describe the detailed settings of SANet, including data preprocessing, module parameters, model training and model implementation. For data preprocessing, given a sign language video, each frame is resized to 200×200 pixels and Gaussian noise is added for data augmentation. Clips are segmented with a sliding window, where the window size c is set to 16 frames and the stride s is set to 8 frames. Considering that sign language videos have variable lengths, the maximum/default video length is set to 300 frames for the CSL dataset and 200 frames for the PHOENIX14T dataset. For the LangGen module, the dimensions of the BiLSTM and LSTM layers are both 1024, and dropout with rate 0.5 is applied after the embedding in the decoder LSTM layer. For model training, the batch size is set to 64 and the Adam optimizer [19] is used with an initial learning rate of 0.001; the learning rate is decreased by a factor of 0.5 every S steps, where S is set to the number of training samples. For model implementation, SANet is implemented with PyTorch 1.6 and trained for 100 epochs on 4 NVIDIA Tesla V100 GPUs.
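To make the preprocessing and clip segmentation concrete, a minimal sketch is given below. The tensor layout, the noise magnitude (noise_std) and the helper names are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def preprocess_frames(frames, size=200, noise_std=0.01):
    # frames: float tensor of shape (T, 3, H, W) with values in [0, 1].
    frames = F.interpolate(frames, size=(size, size), mode="bilinear",
                           align_corners=False)           # resize each frame to 200x200
    return frames + noise_std * torch.randn_like(frames)  # Gaussian noise for augmentation

def segment_clips(frames, window=16, stride=8, max_len=300):
    # Truncate to the dataset-specific maximum length, then cut the video
    # into overlapping clips with a sliding window (c = 16, s = 8).
    frames = frames[:max_len]
    starts = range(0, frames.shape[0] - window + 1, stride)
    return torch.stack([frames[s:s + window] for s in starts])  # (num_clips, 16, 3, H, W)

# A 300-frame CSL video yields floor((300 - 16) / 8) + 1 = 36 clips.
video = torch.rand(300, 3, 260, 210)
clips = segment_clips(preprocess_frames(video))
print(clips.shape)  # torch.Size([36, 16, 3, 200, 200])
```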
In regard to performance metrics, we adopt the ROUGE-L F1-score [23] and BLEU-1,2,3,4 [26], which are widely used to measure translation quality in machine translation and are also used in existing SLT work [5–7]. For a fair comparison, we use the evaluation code for ROUGE-L and BLEU-1,2,3,4 provided with the RWTH-PHOENIX-Weather 2014T dataset [5].

4.3 Model Performance
To verify the effectiveness of the proposed Skeleton-Aware neural Network (SANet), we perform an ablation study, an inference time analysis and a qualitative analysis.

Ablation Study: To estimate the contributions of our designed components to SLT, we perform an ablation study on both datasets. Specifically, the components include the skeleton-related components (i.e., all skeletons, the skeleton channel and the skeleton-based GCN), the feature-related components (i.e., frame-level features and clip-level features) and the encoding-related component (i.e., MemoryCell). Here, 'all skeletons' means the combination of all components related to skeletons, including the branch generating skeletons and the parts using skeletons (i.e., the skeleton channel and the skeleton-based GCN). In the experiment, we remove only one type of component at a time and list the corresponding performance in Table 1 and Table 2.

According to Table 1 and Table 2, each designed component makes a positive contribution to the overall performance. For the skeleton-related components, removing all skeletons, the skeleton channel or the skeleton-based GCN drops the ROUGE score by 4.5%, 4.0% and 3.0% on the CSL dataset and by 3.5%, 2.8% and 3.3% on the PHOENIX14T dataset, respectively. This indicates that our skeleton-related designs are very helpful in improving performance: the proposed self-contained branch can efficiently extract the skeleton, while the designed skeleton channel and skeleton-based GCN can efficiently enhance the feature representation of each clip.

The feature-related components are an important source for encoding and play an important role in SLT. When frame-level features and clip-level features are removed respectively, the ROUGE score drops by 4.4% and 6.8% on the CSL dataset and by 3.0% and 3.2% on the PHOENIX14T dataset, respectively. This indicates that frame-level and clip-level features also have a great impact on SLT performance, since they contain rich information extracted from frames or clips; extracting features from frames or clips is a common approach for feature representation in existing work. However, instead of only extracting features from frames or clips individually, this paper also contributes a design that fuses the frame-level features and clip-level features of a clip. Besides, our work makes a further step beyond the existing SLT work by introducing skeletons to enhance the feature representation of clips, thus further improving the SLT performance.
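Note that the reported drops are absolute percentage points relative to the full SANet scores, not relative reductions. A minimal check of this arithmetic for the skeleton- and feature-related ablations, using values copied from Table 1 and Table 2, is sketched below.

```python
# ROUGE of the full SANet and of ablated variants, copied from Table 1 (CSL)
# and Table 2 (PHOENIX14T); the reported drops are absolute percentage points.
full = {"CSL": 0.996, "PHOENIX14T": 0.548}
ablated = {
    "CSL": {"w/o all-ske": 0.951, "w/o ske-chl": 0.956, "w/o ske-gph": 0.966,
            "w/o frm-fea": 0.952, "w/o clip-fea": 0.928},
    "PHOENIX14T": {"w/o all-ske": 0.513, "w/o ske-chl": 0.520, "w/o ske-gph": 0.515,
                   "w/o frm-fea": 0.518, "w/o clip-fea": 0.516},
}

for dataset, rows in ablated.items():
    for variant, rouge in rows.items():
        drop = full[dataset] - rouge
        print(f"{dataset}: {variant} drops ROUGE by {100 * drop:.1f} points")
# Matches the drops quoted in the text: 4.5/4.0/3.0/4.4/6.8 points on CSL
# and 3.5/2.8/3.3/3.0/3.2 points on PHOENIX14T.
```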
In regard to the encoding-related component (i.e., MemoryCell), when MemoryCell is removed, the ROUGE score drops by 3.6% on the CSL dataset and 1.0% on the PHOENIX14T dataset, respectively. This indicates that the designed MemoryCell can efficiently adapt the dimensions of the hidden states between the BiLSTM layers and provide appropriate input for the LSTM decoder, thus contributing a certain performance improvement for SLT. Therefore, our proposed skeleton-related designs bring a good contribution to SLT, while the overall framework of SANet and the other designed components also benefit SLT.

Figure 8: A qualitative analysis of the different components used for SLT on an example from the PHOENIX14T test set. (Note: the word order in sign language and that in spoken language may not be temporally consistent.) For this example, the sentence-level ROUGE scores are 55.0% (w/o all-ske), 50.0% (w/o ske-chl), 65.0% (w/o ske-gph), 60.0% (w/o img-fea), 60.0% (w/o clip-fea), 65.0% (w/o MC) and 80.0% (SANet), with the ground truth "vom nordmeer zieht ein kräftiges tief heran und bringt uns ab den morgenstunden heftige schneefälle zum teil auch gefrierenden regen" and the SANet translation "am atlantik zieht ein kräftiges tief heran und bringt uns zum den morgenstunden heftige schneefälle bringt teil auch gefrierenden regen".

Inference Time Analysis: To evaluate the time efficiency of the designed components, Table 1 and Table 2 also report the inference time of SLT when one type of designed component is removed. Here, inference time means the duration of translating the sign language in a video into the spoken language. In the experiment, we measure the inference time of a 300-frame sign language video on the CSL dataset and a 200-frame sign language video on the PHOENIX14T dataset on a single GPU, and average the time of 100 runs as the reported inference time. When keeping all components of SANet, the average inference time is 0.499 seconds on the CSL dataset and 0.383 seconds on the PHOENIX14T dataset. When removing any designed component, the inference time decreases. According to Table 1 and Table 2, the clip-level feature extraction introduces the largest time cost, since 3D convolutions are computationally expensive, while the skeleton-related components and MemoryCell have only a small effect on the overall inference time.

Qualitative Analysis: In this experiment, we show an example of SLT on a sign language video from the PHOENIX14T test set to provide a qualitative analysis. As shown in Figure 8, the last row is the ground truth, the second row from the bottom is the SLT result of the proposed SANet, and the other rows are the SLT results obtained by removing the designed components of SANet. If the i-th word in an SLT result differs from the i-th ground-truth word, it is counted as an error and marked in orange. According to Figure 8, the proposed SANet achieves the best performance, while removing any designed component leads to more errors. This demonstrates that the proposed framework and all the designed components contribute to a higher SLT performance.
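A minimal sketch of this position-wise error check is given below, using the ground-truth sentence and the SANet translation from Figure 8; the function name and the whitespace tokenization are illustrative assumptions rather than the exact script used to produce the figure.

```python
def mark_position_errors(hypothesis, reference):
    # Compare the i-th predicted word with the i-th ground-truth word;
    # a mismatch (or a missing word) at position i is counted as an error.
    hyp, ref = hypothesis.split(), reference.split()
    errors = []
    for i in range(max(len(hyp), len(ref))):
        h = hyp[i] if i < len(hyp) else "<missing>"
        r = ref[i] if i < len(ref) else "<missing>"
        if h != r:
            errors.append((i, h, r))
    return errors

reference = ("vom nordmeer zieht ein kräftiges tief heran und bringt uns "
             "ab den morgenstunden heftige schneefälle zum teil auch gefrierenden regen")
hypothesis = ("am atlantik zieht ein kräftiges tief heran und bringt uns "
              "zum den morgenstunden heftige schneefälle bringt teil auch gefrierenden regen")
for i, h, r in mark_position_errors(hypothesis, reference):
    print(f"position {i}: predicted '{h}' vs. ground truth '{r}'")  # 4 of 20 positions differ
```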
4.4 Comparisons
Evaluation on CSL dataset: We compare SANet with existing approaches under two settings. (a) Split I - signer-independent test: the sign language videos performed by 40 signers are used as the training set and the videos performed by the other 10 signers as the test set. The sentences of the training set and the test set are the same, but the signers have no overlap. (b) Split II - unseen-sentences test: the videos corresponding to 94 sentences are used as the training set and the videos corresponding to the remaining 6 sentences as the test set. The signers and the vocabulary of the training set and the test set are the same, while the sentences have no overlap. In Table 3, we show the SLT performance on Split I and Split II, and compare our SANet with the following existing approaches: (1) S2VT [39] is a standard two-layer stacked LSTM architecture for translating video to text. (2) S2VT(3-layer) extends S2VT from a two-layer LSTM to a three-layer LSTM. (3) HLSTM [15] is a hierarchical LSTM-based encoder-decoder model for SLT. (4) HLSTM-attn [15] adds attention mechanisms to HLSTM. (5) HRF-Fusion [14] is a hierarchical adaptive recurrent network that mines variable-length key clips and applies attention mechanisms, using both RGB videos and skeleton data from Kinect.

As shown in Table 3, on Split I, the proposed SANet achieves the best performance on every metric. Specifically, SANet achieves 99.6%, 99.4%, 99.3%, 99.2% and 99.0% on ROUGE, BLEU-1, BLEU-2, BLEU-3 and BLEU-4, respectively, outperforming all existing approaches. Compared with S2VT, SANet increases ROUGE, BLEU-1, BLEU-2, BLEU-3 and BLEU-4 by 9.2%, 9.2%, 10.7%, 11.3% and 11.6%, respectively. When moving to Split II, the performance drops considerably; for example, the ROUGE score of SANet on Split II drops by 31.5%. This is because translating unseen sentences, i.e., sentences whose words appear in the training set but which themselves do not occur in the training set, is much more challenging for SLT. However, our SANet still outperforms the state-of-the-art approaches and achieves 68.1%, 69.7%, 41.1%, 26.8% and 18.1% on ROUGE, BLEU-1, BLEU-2, BLEU-3 and BLEU-4, respectively. Compared with the existing approach HRF-Fusion [14], SANet increases ROUGE by 23.2%.

Evaluation on PHOENIX14T dataset: We compare SANet with the following existing approaches: (1) TSPNet [22] introduces multiple segments of sign language video at different granularities and uses a Transformer decoder for SLT. (2) H+M+P [6] uses a multi-channel transformer architecture that fuses hand-area videos, mouth-area videos and poses extracted from videos. (3) Sign2Gloss→Gloss2Text [5] adopts a classic encoder-decoder architecture and is trained in two stages (i.e., a gloss stage and a sentence stage) independently. (4) Sign2Gloss2Text [5] adopts the
previous encoder-decoder architecture, but it is trained jointly. (5) (Gloss)Sign2Text [7] utilizes a transformer-based architecture for SLT. (6) (Gloss)Sign2(Gloss+Text) [7] also utilizes a transformer-based architecture, but jointly learns continuous SLR and SLT at the same time. (7) DeepHand [25] adopts the encoder-decoder architecture and introduces a pretrained hand shape recognizer for SLT.

Table 3: Comparisons with other approaches on the CSL dataset under the signer-independent test (Split I) and the unseen-sentences test (Split II).

Split I:
Model               ROUGE   BLEU-1   BLEU-2   BLEU-3   BLEU-4
S2VT [39]           0.904   0.902    0.886    0.879    0.874
S2VT(3-layer) [39]  0.911   0.911    0.896    0.889    0.884
HLSTM [15]          0.944   0.942    0.932    0.927    0.922
HLSTM-attn [15]     0.951   0.948    0.938    0.933    0.928
HRF-Fusion [14]     0.994   0.993    0.992    0.991    0.990
SANet               0.996   0.994    0.993    0.992    0.990

Split II:
Model               ROUGE   BLEU-1   BLEU-2   BLEU-3   BLEU-4
S2VT [39]           0.461   0.466    0.258    0.135    -
S2VT(3-layer) [39]  0.465   0.475    0.265    0.145    -
HLSTM [15]          0.481   0.487    0.315    0.195    -
HLSTM-attn [15]     0.503   0.508    0.330    0.207    -
HRF-Fusion [14]     0.449   0.450    0.238    0.127    -
SANet               0.681   0.697    0.411    0.268    0.181

Table 4: Comparison with other approaches on the RWTH-PHOENIX-Weather 2014T dataset.

PHOENIX14T DEV:
Model                          ROUGE   BLEU-1   BLEU-2   BLEU-3   BLEU-4
TSPNet [22]                    -       -        -        -        -
H+M+P [6]                      0.459   -        -        -        0.195
Sign2Gloss→Gloss2Text [5]      0.438   0.411    0.291    0.221    0.179
Sign2Gloss2Text [5]            0.441   0.429    0.303    0.230    0.184
(Gloss)Sign2Text [7]           -       0.455    0.326    0.253    0.207
(Gloss)Sign2(Gloss+Text) [7]   -       0.473    0.344    0.271    0.224
DeepHand [25]                  -       -        -        -        -
SANet                          0.542   0.566    0.415    0.312    0.235

PHOENIX14T TEST:
Model                          ROUGE   BLEU-1   BLEU-2   BLEU-3   BLEU-4
TSPNet [22]                    0.349   0.361    0.231    0.169    0.134
H+M+P [6]                      0.436   -        -        -        0.183
Sign2Gloss→Gloss2Text [5]      0.435   0.415    0.295    0.222    0.178
Sign2Gloss2Text [5]            0.438   0.433    0.304    0.228    0.181
(Gloss)Sign2Text [7]           -       0.453    0.323    0.248    0.202
(Gloss)Sign2(Gloss+Text) [7]   -       0.466    0.337    0.262    0.213
DeepHand [25]                  0.381   0.385    0.256    0.186    0.146
SANet                          0.548   0.573    0.424    0.322    0.248

As shown in Table 4, we provide the performance of each approach on both the validation set (i.e., 'DEV') and the test set (i.e., 'TEST'). On both the 'DEV' set and the 'TEST' set, the existing approaches achieve relatively low performance; for example, the best ROUGE and BLEU-4 scores achieved by the existing approaches are below 46% and 23%, respectively. This may be due to the high diversity, large vocabulary and limited number of training samples of the PHOENIX14T dataset. Our proposed SANet further improves the SLT performance and achieves the best results, e.g., 54.2% ROUGE and 23.5% BLEU-4 on the 'DEV' set and 54.8% ROUGE and 24.8% BLEU-4 on the 'TEST' set, outperforming all the existing approaches.

5 DISCUSSION AND FUTURE WORK
Keypoints of skeleton: In this paper, we extract 14 keypoints to represent the skeleton, while ignoring the fine-grained keypoints of the fingers, mouth, etc. The main reason is that it is difficult to accurately extract fine-grained keypoints from the limited-resolution frames in SLT datasets (e.g., each frame in the PHOENIX14T dataset is only 210×260 pixels). However, we consider that accurately extracting more fine-grained keypoints can further advance SLT performance. In the future, we will conduct further research on skeleton extraction.

More comparisons from different perspectives: In the experiments, we compare the proposed SANet with the existing approaches in terms of ROUGE and BLEU-1,2,3,4, which are used to evaluate translation quality. For a fair comparison, the reported results on these metrics are taken from the original papers. Without complete original/released code, we do not make further comparisons (e.g., of inference time) with these approaches.
However, in future work, we will try to explore more comprehensive comparisons with the existing approaches from different perspectives.

Particularity of the CSL dataset: The CSL dataset is special in that the word order of the sign language is temporally consistent with that of the spoken language, so it can be used for the SLR task [18, 45] as well as the SLT task [14, 15]. In this paper, the CSL dataset is treated from the SLT perspective and only SLT methods are considered for comparison.

6 CONCLUSION
In this paper, we propose a Skeleton-Aware neural Network (SANet) for end-to-end SLT. Specifically, we first use a self-contained branch to extract the skeleton from each frame, and then enhance the feature representation of each clip by adding the skeleton channel and scaling (i.e., weighting the importance of) the feature vector with a designed skeleton-based GCN. Besides, we design a joint optimization strategy for training. The experimental results on two large-scale datasets demonstrate the effectiveness of SANet.

7 ACKNOWLEDGMENTS
This work is supported by the National Key R&D Program of China under Grant No. 2018AAA0102302, the National Natural Science Foundation of China under Grant Nos. 61802169, 61872174, 61832008, 61906085 and 61902175, and the Jiangsu Natural Science Foundation under Grant Nos. BK20180325 and BK20190293. This work is partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.
REFERENCES [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014). [2] Kshitij Bantupalli and Ying Xie. 2018. American sign language recognition using deep learning and computer vision. In 2018 IEEE International Conference on Big Data (Big Data). IEEE, 4896–4899. [3] Jan Bungeroth and Hermann Ney. 2004. Statistical sign language translation. In Workshop on representation and processing of sign languages, LREC, Vol. 4. Citeseer, 105–108. [4] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, and Richard Bowden. 2017. Subunets: End-to-end hand shape and continuous sign language recognition. In ICCV. IEEE, 3075–3084. [5] N. C. Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden. 2018. Neural Sign Language Translation. In CVPR. 7784–7793. https://doi.org/10.1109/CVPR.2018. 00812 [6] Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. 2020. Multi-channel Transformers for Multi-articulatory Sign Language Translation. arXiv preprint arXiv:2009.00299 (2020). [7] Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. 2020. Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation. In CVPR. 10023–10033. [8] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multiperson 2d pose estimation using part affinity fields. In CVPR. 7291–7299. [9] Xiujuan Chai, Guang Li, Yushun Lin, Zhihao Xu, Yili Tang, Xilin Chen, and Ming Zhou. 2013. Sign language recognition and translation with kinect. In IEEE Conf. on AFGR, Vol. 655. 4. [10] Runpeng Cui, Hu Liu, and Changshui Zhang. 2019. A deep neural framework for continuous sign language recognition by iterative training. IEEE Transactions on Multimedia 21, 7 (2019), 1880–1891. [11] Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, and Xavier Giro-i Nieto. 2021. How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language. In Conference on Computer Vision and Pattern Recognition (CVPR). [12] K. Grobel and M. Assan. 1997. Isolated sign language recognition using hidden Markov models. In SMC, Vol. 1. 162–167 vol.1. https://doi.org/10.1109/ICSMC. 1997.625742 [13] Dan Guo, Shuo Wang, Qi Tian, and Meng Wang. 2019. Dense Temporal Convolution Network for Sign Language Translation.. In IJCAI. 744–750. [14] Dan Guo, Wengang Zhou, Anyang Li, Houqiang Li, and Meng Wang. 2019. Hierarchical recurrent deep fusion using adaptive clip summarization for sign language translation. TIP 29 (2019), 1575–1590. [15] Dan Guo, Wengang Zhou, Houqiang Li, and Meng Wang. 2018. Hierarchical lstm for sign language translation. In AAAI, Vol. 32. [16] Dan Guo, Wengang Zhou, Meng Wang, and Houqiang Li. 2016. Sign language recognition based on adaptive hmms with data augmentation. In 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2876–2880. [17] Jie Huang, Wengang Zhou, Houqiang Li, and Weiping Li. 2018. Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology 29, 9 (2018), 2822–2832. [18] Jie Huang, Wengang Zhou, Qilin Zhang, Houqiang Li, and Weiping Li. 2018. Video-based sign language recognition without temporal segmentation. In AAAI. [19] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). [20] Oscar Koller, Cihan Camgoz, Hermann Ney, and Richard Bowden. 
2019. Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos. TPAMI (2019). [21] Oscar Koller, Sepehr Zargaran, and Hermann Ney. 2017. Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In CVPR. 4297– 4305. [22] Dongxu Li, Chenchen Xu, Xin Yu, Kaihao Zhang, Benjamin Swift, Hanna Suominen, and Hongdong Li. 2020. TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation. In Advances in Neural Information Processing Systems, Vol. 33. [23] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74–81. [24] Tao Liu, Wengang Zhou, and Houqiang Li. 2016. Sign language recognition with long short-term memory. In ICIP. IEEE, 2871–2875. [25] Alptekin Orbay and Lale Akarun. 2020. Neural sign language translation by learning tokenization. arXiv preprint arXiv:2002.00479 (2020). [26] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL. 311–318. [27] Lionel Pigou, Mieke Van Herreweghe, and Joni Dambre. 2017. Gesture and sign language recognition with temporal residual networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 3086–3093. [28] Junfu Pu, Wengang Zhou, Hezhen Hu, and Houqiang Li. 2020. Boosting Continuous Sign Language Recognition via Cross Modality Augmentation. In Proceedings of the 28th ACM International Conference on Multimedia. 1497–1505. [29] Junfu Pu, Wengang Zhou, and Houqiang Li. 2019. Iterative alignment network for continuous sign language recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4165–4174. [30] Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3d residual networks. In CVPR. 5533–5541. [31] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. 2019. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7912–7921. [32] Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199 (2014). [33] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014). [34] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. 2019. Deep high-resolution representation learning for human pose estimation. In CVPR. 5693–5703. [35] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In NIPS. 3104–3112. [36] Ao Tang, Ke Lu, Yufei Wang, Jie Huang, and Houqiang Li. 2015. A real-time hand posture recognition system using deep neural networks. ACM Transactions on Intelligent Systems and Technology (TIST) 6, 2 (2015), 1–23. [37] Dominique Uebersax, Juergen Gall, Michael Van den Bergh, and Luc Van Gool. 2011. Real-time sign language letter and word recognition from depth data. In 2011 IEEE international conference on computer vision workshops (ICCV Workshops). IEEE, 383–390. [38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS. 5998–6008. [39] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence-video to text. 
In ICCV. 4534–4542. [40] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision. Springer, 20–36. [41] Sijie Yan, Yuanjun Xiong, and Dahua Lin. 2018. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In AAAI. [42] Siyuan Yang, Jun Liu, Shijian Lu, Meng Hwa Er, and Alex C Kot. 2020. Collaborative learning of gesture recognition and 3D hand pose estimation with multi-order feature analysis. In European Conference on Computer Vision. Springer, 769–786. [43] Zhaoyang Yang, Zhenmei Shi, Xiaoyong Shen, and Yu-Wing Tai. 2019. SF-Net: Structured Feature Network for Continuous Sign Language Recognition. arXiv preprint arXiv:1908.01341 (2019). [44] Jihai Zhang, Wengang Zhou, and Houqiang Li. 2014. A threshold-based hmm-dtw approach for continuous sign language recognition. In ICIMCS. 237–240. [45] Hao Zhou, Wengang Zhou, Yun Zhou, and Houqiang Li. 2020. Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition. In AAAI. 13009–13016. [46] Hao Zhou, Wengang Zhou, Yun Zhou, and Houqiang Li. 2021. Spatial-temporal multi-cue network for sign language recognition and translation. IEEE Transactions on Multimedia (2021).