[Figure 7 (image): scale factors over a clip sequence labeled initial state, transition clip, meaningful clips with signs, transition clip, end state, and padded clips.]

Figure 7: Visualization of the scale factor for each clip. Meaningful clips with signs have larger scale factors, while meaningless clips have smaller scale factors.

to get the scale factor sf, i.e., a value belonging to [0, 1].

sf = \mathrm{Sigmoid}\left(\text{ST-GCN}^{+}(V_f, E)\right)    (1)

Here, V_f is the feature vector set of the node set V, \text{ST-GCN}^{+}(\cdot) denotes the combination of 7 ST-GCN layers, max pooling and a fully connected layer, and \mathrm{Sigmoid}(\cdot) denotes the sigmoid function.

Fused feature scaling: For each clip, we obtain the frame-level feature vector F_m, the clip-level feature vector F_c and the scale factor sf. First, we fuse F_m and F_c with element-wise addition \oplus to get the fused feature vector. Then, we scale the fused feature vector with the multiplication operation \otimes to get the scaled feature vector F_f, as shown below.

F_f = (F_m \oplus F_c) \otimes sf    (2)

In Figure 7, we show the calculated scale factor sf for each clip, where clips with meaningful signs (i.e., clips in the green rectangle) have larger sf. This means that the designed ClipScl module can efficiently track the dynamic changes of human pose with skeletons and distinguish the importance of different clips, i.e., ClipScl can highlight meaningful clips while weakening meaningless clips.
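To make the clip scaling concrete, here is a minimal PyTorch-style sketch of Eqs. (1)-(2). It assumes the ST-GCN+ stack is available as a module that maps a clip's skeleton features to one logit per clip; the names ClipScaler, st_gcn_plus, frame_feat, clip_feat and skeleton_feat are illustrative and do not come from the authors' implementation.

    # Hypothetical sketch of the ClipScl scaling step (Eqs. 1-2); names are illustrative.
    import torch
    import torch.nn as nn

    class ClipScaler(nn.Module):
        def __init__(self, st_gcn_plus: nn.Module):
            super().__init__()
            # st_gcn_plus stands in for the paper's ST-GCN+ (7 ST-GCN layers, max pooling
            # and a fully connected layer); it is assumed to map per-clip skeleton features
            # to one logit per clip, shape (batch, num_clips, 1).
            self.st_gcn_plus = st_gcn_plus

        def forward(self, frame_feat, clip_feat, skeleton_feat):
            # frame_feat, clip_feat: (batch, num_clips, dim) -- F_m and F_c
            sf = torch.sigmoid(self.st_gcn_plus(skeleton_feat))  # Eq. (1): sf in [0, 1]
            fused = frame_feat + clip_feat                       # element-wise addition, F_m ⊕ F_c
            return fused * sf                                    # scaling by sf: F_f, Eq. (2)

Broadcasting the per-clip scalar sf over the feature dimension reproduces the highlighting/weakening behaviour visualized in Figure 7.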
3.4 Spoken Language Generation

After getting the scaled feature vector of each clip, we propose LangGen, which adopts the encoder-decoder framework [35] and the attention mechanism [1] to generate the spoken language, as shown in Figure 2.

BiLSTM Encoder with MemoryCell: We use three-layered BiLSTMs and propose a novel MemoryCell to connect adjacent BiLSTM layers for encoding. Specifically, given a sequence of scaled clip feature vectors z_{1:n}, we first get the hidden states H^l = (h^l_1, h^l_2, \ldots, h^l_n) after the l-th BiLSTM layer. Then, we design the MemoryCell to change the dimensions of the hidden states and provide the appropriate input for the following layer, as shown below.

h_0^{l+1} = \tanh(W \cdot h_n^{l} + b)    (3)

where W and b are the weight and bias of a fully connected layer, h_n^{l} is the final output hidden state of the l-th layer, and h_0^{l+1} is fed as the initial hidden state of the (l+1)-th layer.

LSTM Decoder: We use one LSTM layer as the decoder to decode the words step by step. Specifically, the decoder utilizes LSTM cells, a fully connected layer and a softmax layer to output the prediction probability p_{t,j}, i.e., the probability that the predicted word ŷ_t at the t-th time step is the j-th word in the vocabulary. At the beginning of decoding, ŷ_0 is initialized with the start symbol "[SOS]". At the t-th time step, the decoder predicts the word ŷ_t. The decoder stops decoding once the end symbol "[EOS]" is generated.
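As a concrete reading of how the MemoryCell of Eq. (3) bridges adjacent BiLSTM layers, the following PyTorch-style sketch stacks three BiLSTM layers and maps each layer's final hidden state into the next layer's initial hidden state through a tanh-activated fully connected layer. The class names (MemoryCell, StackedBiLSTMEncoder) and the exact dimension handling are assumptions made for illustration, not the authors' released code.

    # Hypothetical sketch of the BiLSTM encoder with MemoryCell (Eq. 3); names and dimensions are illustrative.
    import torch
    import torch.nn as nn

    class MemoryCell(nn.Module):
        """Maps the final hidden state of BiLSTM layer l to the initial hidden state of layer l+1 (Eq. 3)."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.fc = nn.Linear(in_dim, out_dim)

        def forward(self, h_last):                 # h_last: (batch, in_dim)
            return torch.tanh(self.fc(h_last))     # h_0^{l+1} = tanh(W · h_n^l + b)

    class StackedBiLSTMEncoder(nn.Module):
        def __init__(self, feat_dim, hidden_dim, num_layers=3):
            super().__init__()
            self.layers = nn.ModuleList()
            self.cells = nn.ModuleList()
            in_dim = feat_dim
            for _ in range(num_layers - 1):
                self.layers.append(nn.LSTM(in_dim, hidden_dim, bidirectional=True, batch_first=True))
                # The cell produces one hidden vector per direction for the next BiLSTM layer.
                self.cells.append(MemoryCell(2 * hidden_dim, 2 * hidden_dim))
                in_dim = 2 * hidden_dim
            self.layers.append(nn.LSTM(in_dim, hidden_dim, bidirectional=True, batch_first=True))

        def forward(self, z):                      # z: (batch, n, feat_dim), the scaled clip features F_f
            batch = z.size(0)
            h0 = None
            for i, lstm in enumerate(self.layers):
                if h0 is None:
                    out, _ = lstm(z)
                else:
                    out, _ = lstm(z, (h0, torch.zeros_like(h0)))
                if i < len(self.cells):
                    h_last = out[:, -1, :]          # last-step hidden state of layer l, (batch, 2*hidden_dim)
                    h_next = self.cells[i](h_last)  # (batch, 2*hidden_dim)
                    # Reshape to (num_directions, batch, hidden_dim), as nn.LSTM expects for h_0.
                    h0 = h_next.view(batch, 2, -1).transpose(0, 1).contiguous()
                z = out
            return z                               # encoder outputs, consumed by the attention-based LSTM decoder

In this sketch the fully connected layer maps the concatenated bidirectional state of layer l to the per-direction initial states of layer l+1, which is one way to realize the "change the dimensions of hidden states" role the paper ascribes to the MemoryCell.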
3.5 Joint Loss Optimization

To optimize skeleton extraction and sign language translation jointly, we design a joint loss L, which consists of the skeleton extraction loss L_{ske} and the SLT loss L_{slt}(y, ŷ).

The skeleton extraction loss L_{ske} is calculated as follows, where M^H, M^G \in \mathbb{R}^3 denote the predicted heatmap and the ground-truth heatmap, respectively. Here, K, h, w are the number of keypoints, the height and the width of a heatmap. Each ground-truth heatmap contains a heating area, which is generated by applying a 2D Gaussian function with a 1-pixel standard deviation [34] on the keypoint estimated by OpenPose [8].

L_{ske} = \frac{1}{K} \sum_{k}^{K} \sum_{i}^{h} \sum_{j}^{w} \left( M^{H}_{k,i,j} - M^{G}_{k,i,j} \right)^{2}    (4)

The SLT loss L_{slt}(y, ŷ) is the cross-entropy loss, where ŷ is the predicted word sequence and y is the ground-truth word sequence (i.e., the labels). The calculation of L_{slt}(y, ŷ) is shown below, where T is the maximum number of time steps in the decoder and V is the number of words in the vocabulary. y_{t,j} is an indicator: if the ground-truth word at the t-th time step is the j-th word in the vocabulary, y_{t,j} = 1; otherwise, y_{t,j} = 0. p_{t,j} is the probability that the predicted word ŷ_t at the t-th time step is the j-th word in the vocabulary.

L_{slt}(y, \hat{y}) = -\sum_{t=1}^{T} \sum_{j}^{V} y_{t,j} \log(p_{t,j})    (5)

Based on L_{ske} and L_{slt}(y, ŷ), we calculate the joint loss L as follows, where α is a hyper-parameter used to balance the ratio of L_{ske} to L_{slt}. We set α to 1 at the beginning of training and change it to 0.5 in the middle of training.

L = \alpha L_{ske} + L_{slt}(y, \hat{y})    (6)

4 EXPERIMENT

4.1 Datasets

Two public SLT datasets are commonly used. One is the CSL dataset [18], which contains 25K labeled videos covering 100 Chinese sentences filmed by 50 signers; the other is a German sign language dataset, RWTH-PHOENIX-Weather 2014T [5], which contains 8257 weather forecast samples from 9 signers. The