pose in sign languages has not been fully studied. In fact, the skeleton can be used to distinguish signs with different human poses (i.e., different relative positions of the hands, arms, etc.), especially for signs which use the same hand gesture in different positions to represent different meanings, as shown in Figure 1. Therefore, it is meaningful to introduce skeleton information into SLT.

To utilize skeletons for SLT, some work emerged recently [6, 14], which advanced the research of skeleton-assisted SLT. However, the existing research often had the following problems. First, obtaining the skeletons often required an external device [14] or extra offline preprocessing [6], which hindered end-to-end SLT from videos. Second, the videos and skeletons were used as two independent data sources for feature extraction, i.e., they were not fused at the initial stage of feature extraction; thus the video-based feature extraction may not be efficiently guided or enhanced by the skeleton information. Third, each clip (i.e., a short segmented video) was usually treated equally, neglecting the different importance of meaningful (e.g., sign-related) clips and unmeaningful (e.g., end-state) clips. Among these problems, the third one exists not only in skeleton-assisted SLT, but also in much other SLT work.

To address the above three problems, we propose a Skeleton-Aware neural Network (SANet). Firstly, to achieve end-to-end SLT, SANet designs a self-contained branch for skeleton extraction. Secondly, to guide the video-based feature extraction with skeletons, SANet concatenates the skeleton channel and the RGB channels for each frame, so the features extracted from images/videos are affected by the skeleton. Thirdly, to distinguish the importance of clips, SANet constructs a skeleton-based Graph Convolutional Network (GCN) for feature scaling, i.e., assigning an importance weight to each clip. Specifically, SANet consists of four components, i.e., FrmSke, ClipRep, ClipScl, and LangGen. First, FrmSke extracts the skeleton from each frame and frame-level features for each clip by convolutions and deconvolutions. Then, ClipRep enhances the clip representation by adding the skeleton channel. After that, ClipScl scales the clip representation with a skeleton-based Graph Convolutional Network (GCN). Finally, with the scaled clip features, LangGen generates the spoken language with sequence-to-sequence learning. In addition, we design a joint optimization strategy for model training to achieve end-to-end SLT.
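As a minimal illustration of this pipeline, the following PyTorch-style sketch shows how the four components could be wired together. The component names follow the description above, but all layer sizes, tensor shapes, the single-channel heatmap encoding of the skeleton, the joint adjacency matrix, and the GRU-based generator are simplified placeholder assumptions, not the actual SANet implementation.

    # A minimal, illustrative sketch of the SANet pipeline (assumed shapes
    # and layer sizes; not the actual implementation).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FrmSke(nn.Module):
        # Predicts a one-channel skeleton heatmap for each RGB frame via
        # convolutions (encoder) and deconvolutions (decoder).
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))

        def forward(self, frames):                       # frames: (B*T, 3, H, W)
            return torch.sigmoid(self.decoder(self.encoder(frames)))  # (B*T, 1, H, W)

    class ClipRep(nn.Module):
        # Fuses skeleton and RGB at the source-data level: the one-channel
        # skeleton heatmap is concatenated with the three RGB channels.
        def __init__(self, dim=256):
            super().__init__()
            self.conv3d = nn.Conv3d(4, dim, kernel_size=3, padding=1)
            self.pool = nn.AdaptiveAvgPool3d(1)

        def forward(self, rgb, skel):    # rgb: (B, 3, T, H, W), skel: (B, 1, T, H, W)
            x = torch.cat([rgb, skel], dim=1)            # (B, 4, T, H, W)
            x = F.relu(self.conv3d(x))
            return self.pool(x).flatten(1)               # one feature vector per clip

    class ClipScl(nn.Module):
        # A tiny graph convolution over skeleton joints that outputs a scalar
        # importance weight for a clip (adj is the joint adjacency matrix).
        def __init__(self, in_dim=2, hidden=32):
            super().__init__()
            self.w1 = nn.Linear(in_dim, hidden)
            self.w2 = nn.Linear(hidden, 1)

        def forward(self, joints, adj):                  # joints: (B, J, 2), adj: (J, J)
            h = F.relu(self.w1(adj @ joints))            # one graph-convolution step
            return torch.sigmoid(self.w2(adj @ h).mean(dim=1))  # (B, 1) clip weight

    class LangGen(nn.Module):
        # Sequence-to-sequence generation of the spoken sentence from the
        # sequence of scaled clip features (teacher forcing for simplicity).
        def __init__(self, dim=256, vocab=1000):
            super().__init__()
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.embed = nn.Embedding(vocab, dim)
            self.out = nn.Linear(dim, vocab)

        def forward(self, clip_feats, tgt):              # (B, N, dim), (B, L)
            _, h = self.encoder(clip_feats)
            dec, _ = self.decoder(self.embed(tgt), h)
            return self.out(dec)                         # (B, L, vocab) word logits

    # Toy forward pass with random data (all sizes are placeholders).
    B, T, H, W, J = 2, 8, 64, 64, 18
    rgb_clip = torch.rand(B, 3, T, H, W)
    heatmaps = FrmSke()(rgb_clip.transpose(1, 2).reshape(B * T, 3, H, W))
    skel_clip = heatmaps.reshape(B, T, 1, H, W).transpose(1, 2)      # (B, 1, T, H, W)
    clip_feat = ClipRep()(rgb_clip, skel_clip)                       # (B, 256)
    weight = ClipScl()(torch.rand(B, J, 2), torch.eye(J))            # (B, 1)
    logits = LangGen()((clip_feat * weight).unsqueeze(1),
                       torch.zeros(B, 5, dtype=torch.long))          # (B, 5, 1000)

In the full model, skeleton extraction and translation are optimized jointly with the strategy mentioned above, and the per-clip weights scale a whole sequence of clips rather than the single clip used in this toy example.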
We make the following contributions in this paper.
• We propose a Skeleton-Aware neural Network (SANet) for end-to-end SLT, where a self-contained branch is designed for skeleton extraction and a joint training strategy is designed to optimize skeleton extraction and sign language translation jointly.
• We concatenate the extracted skeleton channel and the RGB channels at the source-data level, which can highlight human pose-related features and enhance the clip representation.
• We construct skeleton-based graphs and use a graph convolutional network to scale the clip representation, i.e., weight the importance of each clip, which can highlight meaningful clips while weakening unmeaningful clips.
• We conduct extensive experiments on two large-scale public SLT datasets. The experimental results demonstrate that our SANet outperforms the state-of-the-art methods.

2 RELATED WORK
The existing research on sign languages can be mainly categorized into SLR and SLT, where SLR can be further classified into isolated SLR and continuous SLR. In this section, we review the related work on isolated SLR, continuous SLR, and SLT.

Isolated Sign Language Recognition (ISLR): Isolated SLR aims at recognizing one sign as a word or expression [2, 12, 24], which is similar to gesture recognition [27, 42] and action recognition [32, 40]. Early methods tended to select features from videos manually and introduced the Hidden Markov Model (HMM) [12, 16] to analyze the gesture sequence of a sign (i.e., a human action) for recognition. However, the manually selected features may limit the recognition performance. Therefore, in recent years, deep learning-based approaches were introduced for isolated SLR. These approaches utilized neural networks to automatically extract features from videos [17], Kinect sensor data [36], or moving trajectories of skeleton joints [24] for isolated SLR, and often achieved better performance.

Continuous Sign Language Recognition (CSLR): Continuous SLR aims at recognizing a sequence of signs and mapping it to the corresponding word sequence [21, 28, 45]; thus continuous SLR is more challenging than isolated SLR. To realize CSLR, traditional methods like DTW-HMM [44] and CNN-HMM [21] introduced temporal segmentation and the Hidden Markov Model (HMM) to transform continuous SLR into isolated SLR. Considering the possible errors and annotation burden of temporal segmentation, recent deep learning-based methods [18] applied sequence-to-sequence learning for continuous SLR, learning the correspondence between two sequences from weakly annotated data in an end-to-end manner. However, many approaches tended to adopt the Connectionist Temporal Classification (CTC) loss [13, 20, 43, 45], which requires that the source and target sequences have the same order. In fact, the sign sequence in a sign language and the word sequence in the spoken language can be ordered differently [5]; thus the approaches for continuous SLR are not suitable for SLT.

Sign Language Translation (SLT): SLT aims to translate sign languages into spoken languages. Traditional methods [3, 9] usually decomposed SLT into two stages, i.e., continuous SLR and text-to-text translation. These two-stage methods relied on both gloss annotations and sentence annotations, and thus could be optimized in two stages for better performance [5]. However, annotating glosses requires specialists and is a labor-intensive task. Recently, due to the advancement of public datasets with sentence-level annotations [5, 11, 18] and deep learning technology, a few end-to-end SLT approaches have emerged. Camgoz et al. [5] introduced the encoder-decoder framework to realize end-to-end SLT. Guo et al. [14, 15] proposed hierarchical-LSTM models for end-to-end SLT. Camgoz et al. utilized transformer networks [38] to jointly solve SLR and SLT [7]. Li et al. developed a temporal semantic pyramid encoder and a transformer decoder for SLT [22]. These neural-based approaches often adopted an encoder-decoder structure for SLT.

To represent the sign languages, the existing neural-based SLT methods mainly focused on extracting full-frame [5, 15, 22] or local-area features [6, 46] from the video. Only a little work has paid attention to skeleton information (i.e., human pose) for SLT. Specifically, HRF [14] collected skeletons with a depth camera,