Dynamic Speed Warping: Similarity-Based One-shot Learning for Device-free Gesture Signals

Xun Wang, Ke Sun, Ting Zhao, Wei Wang, and Qing Gu
State Key Laboratory for Novel Software Technology, Nanjing University
{xunwang,kesun,tingzhao}@smail.nju.edu.cn, kesun@eng.ucsd.edu, {ww,guq}@nju.edu.cn

Abstract—In this paper, we propose a Dynamic Speed Warping (DSW) algorithm to enable one-shot learning for device-free gesture signals performed by different users. The design of DSW is based on the observation that the gesture type is determined by the trajectory of hand components rather than the movement speed. By dynamically scaling the speed distribution and tracking the movement distance along the trajectory, DSW can effectively match gesture signals from different domains that have a ten-fold difference in speeds. Our experimental results show that DSW can achieve a recognition accuracy of 97% for gestures performed by unknown users, while using only one training sample of each gesture type from four training users.

I. INTRODUCTION

Device-free gesture recognition systems use Radio Frequency (RF) [1]–[9] or sound signals [10]–[15] to detect and recognize human movements. By analyzing the signal reflection of the hand, device-free sensing allows users to interact with their devices freely without wearing any sensor. Such a natural and unconstrained interaction paradigm would become a vital component of next-generation Human-Computer Interaction (HCI) solutions.

One of the key challenges for device-free sensing is to robustly recognize gesture signals for different users and in different environments. Traditional machine learning methods use large datasets and intensive training processes to extract domain-independent features from gesture signals. For example, one can collect gesture samples in different domains and use Generative Adversarial Networks (GANs) to reduce the impact of domain-specific features [16], [17]. However, due to the insufficient understanding of the machine-generated models, the performance of these domain-independent models under a new environment cannot be guaranteed. Fine-tuning the model in a new domain may require a large number of new samples to be collected and labeled by the end-user in the new environment. Even if virtual samples can be produced via geometric models using a small number of gestures in the target domain [18], the retraining process still incurs formidable costs for mobile systems.

In this paper, rather than extracting domain-independent features, we use Dynamic Speed Warping (DSW) to derive a similarity measure between device-free gesture signals. As users may perform the gesture at different speeds and the Doppler shift largely depends on the environment [19], speed variations lead to severe robustness issues in the widely used speed-based gesture features [1]–[3]. By removing the speed variation, the DSW similarity enables domain-independent one-shot learning that learns information about object categories from one, or only a few, training samples. Thus, it reduces both the data collection and the training cost. The design of the DSW algorithm is based on the critical observation that the gesture type is determined by the trajectory of hand components, e.g., the fingers and the palm, rather than the movement speed. We show that the similarity in trajectory leads to similarity in the total movement distances and the scaled speed distributions.
The total movement distances are similar because, for a specific trajectory of a gesture, e.g., a click, the starting and the ending postures of the hand remain the same, no matter how fast the user performs the gesture. The scaled speed distributions are similar because, when the user changes the movement speed, the speeds of different parts of the hand, such as the fingers and the palm, change proportionally. Therefore, the speed distribution of different components of a fast gesture movement can be matched to the distribution of a slow gesture movement of the same type, when we scale down the speeds of all components by the same factor. Based on this observation, we design a dynamic programming algorithm, which is inspired by Dynamic Time Warping (DTW) [20], to calculate the similarity of gesture signals in terms of the total movement distance and the scaled speed distribution.

The DSW similarity measure leads to new ways to explore the gesture recognition problem. First, the robust gesture matching algorithm can be combined with kNN to serve as a similarity-based one-shot learning scheme that only requires a small number of training samples. As the DSW algorithm can adapt to different gesture speeds, it dramatically reduces the data collection/labeling cost and can incrementally tune the system without retraining. Second, the DSW similarity measure can serve as the basis for unsupervised or semi-supervised learning systems. The DSW algorithm can automatically derive the type of gestures of unlabeled samples by clustering them using the speed-independent measure.

We perform extensive evaluations of DSW using ultrasound-based gesture signals. Our experimental results show that DSW can achieve a recognition accuracy of 97% for gestures performed by unknown users, while using only one training sample of each gesture type from four training users. DSW also outperforms the DTW algorithm in all three external indicators for clustering performance. Therefore, DSW similarity can serve as a powerful tool for both supervised and unsupervised learning tasks.
The main contributions of our work are as follows:
• We propose a new similarity measure that can adapt to the speed variations in gesture signals of different domains.
• We formally prove the properties of the speed-adaptive signal matching scheme and show that the result of DSW is a valid similarity measure.
• Using real-world ultrasound gesture signals, we show that the DSW algorithm can serve as a solution for both one-shot learning in supervised gesture recognition and unsupervised gesture clustering tasks.

II. RELATED WORKS

Existing works that are closely related to our approach can be categorized into three areas: domain-independent feature extraction, cross-domain adaptation, and DTW-based schemes.

Domain-independent feature extraction. Early device-free gesture recognition systems use statistical values (mean, variance, etc.) of the signals [21]–[23] or Doppler speeds [24], [25] as the gesture features. However, it is well known that these features depend on the user, the location of the devices, and the multi-path conditions introduced by the environment. There are two major approaches to extract domain-independent features for device-free gesture signals. The first approach is to use an adversarial network as a domain discriminator to help the feature-extracting network in generating domain-independent features [16]. However, the training process requires huge datasets from multiple domains, which leads to high data collection costs. The second approach is to use geometric models to recombine signals measured through multiple links into a domain-independent body-coordinate velocity profile [19]. However, this domain-adaptation method uses multiple devices and assumes that accurate user locations are known.

Cross-domain adaptation. Instead of using domain-independent features, we can also transfer a domain-specific gesture recognition model into the target domain. One approach is to use transfer learning schemes to retrain the model using a small number of samples in the target domain [26]–[28]. Another way is to use neural networks or geometric models to transfer the samples in the source domain to the target domain in order to boost the number of training samples in the target domain [18], [29]. Compared to these approaches, which need samples in the target domain for bootstrapping, the DSW scheme can evaluate the similarity of gestures from unknown target domains.

Dynamic Time Warping schemes. The DTW algorithm was originally designed for matching speech signals that have different lengths in time [20]. As human activities also have variable durations, DTW has been adopted for various types of activity recognition systems [30], [31]. In device-free gesture recognition, DTW has been applied for matching either the raw gesture signals [32], [33] or the extracted features [21], [34]. However, these DTW applications only consider the scaling in time rather than the scaling of the speed distribution and the consistency of the movement distance.

III. DEVICE-FREE GESTURE MATCHING

In this section, we first summarize the state-of-the-art gesture matching methods and their limitations. We then describe our insight on the characteristics of device-free gesture signals. Finally, we demonstrate the benefits of using such speed-adaptive characteristics for gesture similarity calculation.

A. Gesture Matching Methods

Device-free gesture recognition systems collect radio/sound signals reflected by the hand to perform gesture recognition. We call these radio/sound signals gesture signals.
The most widely used gesture signals are complex-valued baseband signals that have Doppler frequencies corresponding to the hand movement speeds [1], [3], [25]. For instance, Figure 1(a) and 1(e) show the ultrasonic baseband signals of two samples of the writing "W" gesture, where the user writes the letter "W" in the air by moving the hand back and forth twice. The I-component and the Q-component are the real and imaginary parts of the gesture signal. Meanwhile, Figure 1(b) and 1(f) show the corresponding spectrograms calculated through the Short-time Fourier transform (STFT). We can observe that the gesture signal has a negative Doppler frequency when the hand is moving away and a positive frequency when the hand is moving back. Thus, the gesture signals show specific patterns that can be matched with gesture movements and actions.

At the early stages of device-free gesture recognition, researchers collected statistical parameters from gesture signals as features, such as the mean, variance, and standard deviation. Such statistical features cannot adapt to speed variations and often lead to models that are not robust enough for real-world applications. Some other unsuccessful efforts indicate that linear transformation is inherently insufficient for handling the complicated nonlinear fluctuations in signal patterns [25].

The DTW algorithm is a pattern matching algorithm with a nonlinear time-normalization effect. While DTW has been directly applied to gesture waveforms in device-free keystroke matching, it only works when gestures start from a fixed point and is not robust enough for daily gesture recognition tasks [32]. As an example, consider the gesture samples in Figure 1(a) and 1(e), which have durations of 1.5 and 3 seconds, respectively. Besides the differences in frequencies caused by different movement speeds, the two waveforms also have different initial phases and different low-frequency components. The initial phase and low-frequency components of the gesture signal depend on the starting position and small body movements [3], which are noise for gesture recognition. However, these noisy factors dominate the similarity calculated by DTW, so that DTW could not find a suitable matching for these two samples in the time domain.

Fortunately, the STFT and other time-frequency analyses allow us to focus on the frequency domain without being distracted by phases and low-frequency components. The spectrograms in Figure 1(b) and Figure 1(f) clearly show how the distribution of Doppler frequency changes over time, which is directly connected to changes in the movement speed.
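To make this representation concrete, the short Python sketch below builds a synthetic complex baseband signal whose Doppler frequency is negative while the hand moves away and positive while it moves back, and then computes its STFT magnitude with SciPy. The sampling rate, frame length, and Doppler values are illustrative assumptions rather than parameters taken from the paper.

    import numpy as np
    from scipy.signal import stft

    fs = 1000                      # baseband sampling rate (Hz), assumed for illustration
    t_seg = np.arange(0, 0.75, 1 / fs)

    # Hand moving away: negative Doppler shift; moving back: positive Doppler shift.
    f_away, f_back = -30.0, 30.0   # Doppler frequencies in Hz (illustrative)
    s_away = np.exp(1j * 2 * np.pi * f_away * t_seg)
    s_back = np.exp(1j * 2 * np.pi * f_back * t_seg)
    s = np.concatenate([s_away, s_back])   # complex baseband gesture signal (I + jQ)

    # STFT over 40 ms frames; return_onesided=False keeps the negative frequencies,
    # which are needed to distinguish "moving away" from "moving back".
    f, tt, Z = stft(s, fs=fs, nperseg=int(0.04 * fs), return_onesided=False)
    spectrogram = np.abs(Z)        # one column per time frame, analogous to Figure 1(b)/(f)

    print(spectrogram.shape)       # (frequency bins, time frames)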
Figure 1. Two samples of writing "W" and the corresponding matching results with different methods: (a) sample A (writing "W", 1.5 seconds); (b) STFT result for sample A; (c) DTW over STFT for sample A; (d) DSW over STFT for sample A; (e) sample B (writing "W", 3 seconds); (f) STFT result for sample B; (g) DTW over STFT for sample B; (h) DSW over STFT for sample B.

However, matching spectrograms is challenging, since gesture speed changes introduce both frequency shifts and gesture duration changes. For example, the slower gesture sample B in Figure 1(f) has a smaller frequency variation that lasts for a longer time in the spectrogram. It is challenging to find the right scaling factor in both the time and the frequency domain to match the spectrogram of sample A with sample B. Figure 1(c) and 1(g) show the stretched results when we directly apply DTW on the STFT spectrogram. We observe that the DTW algorithm does not scale and match the right stages for samples A and B. Furthermore, the speed distributions in each stage of the two results are still quite different.

Neural networks, such as CNNs, can be used to classify spectrograms with different scaling factors when trained with a large number of samples [7], [35]. However, the CNN does not consider the underlying physical models of the gesture spectrogram, so the training samples should exhaustively cover all speed and duration combinations of the same gesture. This incurs formidable costs in the data collection and training process.

In summary, we need to find a new nonlinear pattern matching algorithm that can compare gesture signals with different durations without the labelling information. Additionally, the algorithm needs to accommodate different speed distributions while keeping track of the movement distance.

B. Gesture speed adaptation

Before we formally define and prove the properties for gesture speed adaptation, we first use an example to show the intuition of our matching algorithm design. Our key insight is that the type of the gesture is determined by the movement trajectory of the hand rather than the movement speed. Given a certain movement stage on the trajectory, the user may move the hand at different speeds, but the different parts of the hand speed up or slow down proportionally. For example, the thumb and the index finger move at different speeds towards each other in the click gesture. However, when the user clicks slowly, both fingers slow down by the same factor. Therefore, we can scale the speed distribution of a slow gesture to match with a fast gesture at the same stage. Furthermore, the stages of the gesture movement are determined by the position of the hand on the trajectory. Therefore, we can track the movement stages of gestures with different speeds by accumulating the total movement distance along the trajectory. In this way, we can stretch and match the movement stages of gestures with different speeds, as shown in Figure 1(d) and Figure 1(h).
Definitions:
Let s(t) = Σ_{p∈P} s_p(t) be the baseband gesture signal, where P is the set of signal propagation paths and s_p(t) is the complex-valued signal along path p. We define the function D(f, t), which is the square root of the power spectral density of the gesture signal at time t:

    D(f, t) = | ∫_t^{t+Δt} s(τ) e^{−jωτ} dτ |,    (1)

where Δt is the length of a time frame. We define the trajectory that the hand moves through within Δt as a micro-unit. Consider two gesture signal samples s_A(t) and s_B(t) of the same gesture type. We are looking for a mapping function δ: R × R₀⁺ → R × R₀⁺ that maps the time and frequency of the gesture samples, so that map(D_A(f, t), δ)/D_B(f, t) = α(t), where α(t) is a scaling factor that depends only on time.

Assumptions:
• We use the generalized coordinates r to denote the locations of different parts of the hand, e.g., r is the coordinates of each finger and the palm concatenated into a single vector. Thus, different parts of the hand may go along different trajectories in the same gesture. In this section, we first consider the ideal case where the trajectories of the same gesture are exactly the same, so that the trace of each part is fixed. This assumption is relaxed in Section IV.
• We assume that Δt is small, so that the hand moves only a short distance during one micro-unit. We further assume that the signal amplitude and the movement speed do not change within one micro-unit. In practice, we set Δt to 40 milliseconds, so that the hand moves less than 4 cm in one micro-unit. Thus, this assumption is reasonable for real-world applications.

All of the gesture signals are treated as continuous signals during the following problem description and proofs.
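A minimal numerical reading of Eq. (1), assuming a discretely sampled baseband signal: each Δt = 40 ms frame is transformed with a windowed FFT, and the magnitude of the result gives D(f, t) for that micro-unit. The sampling rate and the test signal below are illustrative assumptions.

    import numpy as np

    def frame_spectrum(s, fs, frame_len=0.04):
        """Return D[f, t]: magnitude of the windowed Fourier transform of each
        Delta-t = 40 ms frame of the complex baseband signal s (cf. Eq. (1))."""
        n = int(frame_len * fs)                  # samples per micro-unit
        n_frames = len(s) // n
        frames = s[:n_frames * n].reshape(n_frames, n)
        # FFT over each frame; multiply by dt = 1/fs to approximate the integral.
        spec = np.abs(np.fft.fft(frames, axis=1)) / fs
        freqs = np.fft.fftfreq(n, d=1 / fs)      # includes negative Doppler bins
        return freqs, spec.T                     # rows: frequency, columns: time frames

    # Example: a 0.4 s signal with a constant -25 Hz Doppler shift (hand moving away).
    fs = 1000
    t = np.arange(0, 0.4, 1 / fs)
    s = np.exp(-1j * 2 * np.pi * 25 * t)
    freqs, D = frame_spectrum(s, fs)
    print(D.shape)   # (40 frequency bins, 10 micro-units)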
Speed Adaptation Properties:

Property 1. For two signal samples s_A(t) and s_B(t) of the same gesture type, consider a single micro-unit from location r_s to r_t, where the movement period for sample A is [t_A, t_A + T_A] and for sample B is [t_B, t_B + T_B]. Without loss of generality, we assume T_A ≤ T_B. Then, the two spectrograms D_A(f, t) and D_B(f, t) satisfy the following relationship:

    D_A(f, t_A) = α D_B(αf, t_B),    (2)

where α (α = T_A/T_B ≤ 1) is called the scaling ratio of the current speed distribution.

Proof. We first consider a single part h of the hand, e.g., the index finger, that is moved by a small distance d_h from location r_s to r_t. By our assumption that the movement speed is constant for the short time duration, the movement speed of h is v_{h,A} = d_h/T_A in sample A and v_{h,B} = d_h/T_B in sample B. Therefore, we have v_{h,A} = v_{h,B}/α, where α = T_A/T_B.

In the gesture signal, suppose there is a path p corresponding to the reflection from part h of the hand. (If more than one path corresponds to a single hand component, we can treat each path as a separate component with a different path speed, and the result still holds.) Using the signal models in [3], we have:

    s_p(t) = A_p e^{−j(ω v_h t / c + θ_p)},    (3)

where ω is the carrier frequency of the passband signal. As sample A and sample B start from the same location, we can use the same A_p and θ_p in Eq. (3).

Now consider the Fourier transform of the gesture signal in sample A, S_{p,A}(f, t) = F{s_{p,A}(t)}:

    S_{p,A}(f, t_A) = F{ A_p e^{−j(ω v_{h,A}(t_1 − t_A)/c + θ_p)} }
                    = F{ A_p e^{−j(ω v_{h,B}((t_2 − t_B)/α)/c + θ_p)} }
                    = α S_{p,B}(αf, t_B),    (4)

where t_1 ∈ [t_A, t_A + T_A] and t_2 ∈ [t_B, t_B + T_B]. In the last step, we use the time-scaling property of the Fourier transform, F{x(kt)} = (1/|k|) X(f/k), where the capitalized function X(f) is the Fourier transform of the non-capitalized function x(t).

As Eq. (4) holds for all different parts of the hand, we can use the linearity of the continuous-time Fourier transform, F{a x(t) + b y(t)} = a X(f, t) + b Y(f, t), to get:

    D_A(f, t_A) = | F{ Σ_{p∈P} s_{p,A}(t_A) } | = | Σ_{p∈P} S_{p,A}(f, t_A) | = | Σ_{p∈P} α S_{p,B}(αf, t_B) | = α D_B(αf, t_B).    (5)

Property 2. Consider two gesture signal samples A and B of the same gesture type. We can divide the entire trajectory into micro-units u_1, u_2, .... Assume that the times at which sample A passes through these micro-units are, in turn, t_{A1}, t_{A2}, ..., while the times for sample B are t_{B1}, t_{B2}, .... There exists a mapping function δ: R × R₀⁺ → R × R₀⁺ with map(D_A(f, t_A), δ) = D_A(f/α(t_A), t_A), where α(t) is the scaling factor function. Then the mapping function satisfies the following relationship:

    ∀ i ∈ N*, if t_{A(i−1)} ≤ t ≤ t_{Ai}, ∃ α(t) = (t_{Ai} − t_{A(i−1)}) / (t_{Bi} − t_{B(i−1)}),
    map(D_A(f, t), δ) = D_A(f/α(t), t) = α(t) D_B(f, t′),    (6)

where t′ is in (t_{B(i−1)}, t_{Bi}]. (Note that α(t) can be greater than 1 here, which means that in the current micro-unit, sample B is mapped to sample A at the scaling ratio of 1/α(t).)

Proof. Consider the ith micro-unit u_i of the trajectory. According to Property 1, there exists α_0 = (t_{Ai} − t_{A(i−1)})/(t_{Bi} − t_{B(i−1)}) such that we can scale the distribution of sample A by α_0 to get the same distribution as sample B. So,

    D_A(f/α_0, t) = α_0 D_B(f, t′).    (7)

Note that every micro-unit is small enough that the distribution in the frequency domain is unchanged during this interval. From the above, we can get the complete expression of the scaling ratio, which is

    α(t) = α_0 = (t_{Ai} − t_{A(i−1)}) / (t_{Bi} − t_{B(i−1)}),  t ∈ (t_{A(i−1)}, t_{Ai}],  i ∈ N*.    (8)

Properties 1 and 2 show that we can achieve speed alignment between different samples by dynamically scaling the frequencies while ensuring that the aligned segments represent the same trajectory. Note that we can find the exact match between frequency distributions when the hand precisely follows the trajectory r(t). In reality, the hand may deviate from the trajectory r(t), so our goal is to minimize the difference between the matched frequency distributions.
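The scaling relation of Property 1 can be checked numerically. The sketch below builds two complex tones that represent the same micro-unit traversed at speeds differing by a factor of 1/α, approximates their continuous-time Fourier magnitudes with zero-padded FFTs, and verifies that D_A(f) ≈ α D_B(αf). All signal parameters here are illustrative assumptions.

    import numpy as np

    fs = 4000                             # sampling rate (Hz), illustrative
    alpha = 0.5                           # T_A / T_B: sample A is twice as fast as B
    T_B, f_B = 0.40, 20.0                 # slow sample: duration (s) and Doppler (Hz)
    T_A, f_A = alpha * T_B, f_B / alpha   # fast sample: shorter duration, higher Doppler

    def spectrum(duration, doppler, n_fft=1 << 14):
        """Magnitude of the (approximate) continuous-time Fourier transform of a
        finite-duration complex tone, cf. Eq. (3) with amplitude 1."""
        t = np.arange(0, duration, 1 / fs)
        x = np.exp(-1j * 2 * np.pi * doppler * t)
        X = np.fft.fftshift(np.fft.fft(x, n_fft)) / fs   # scale by dt to mimic the integral
        f = np.fft.fftshift(np.fft.fftfreq(n_fft, 1 / fs))
        return f, np.abs(X)

    f, D_A = spectrum(T_A, f_A)
    _, D_B = spectrum(T_B, f_B)

    # Compare D_A(f) with alpha * D_B(alpha * f) on the same frequency grid.
    D_B_scaled = alpha * np.interp(alpha * f, f, D_B)
    err = np.max(np.abs(D_A - D_B_scaled)) / np.max(D_A)
    print(f"max relative mismatch: {err:.3f}")   # small, up to discretization error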
IV. DYNAMIC SPEED WARPING

In this section, we design an optimization algorithm for measuring the similarity of gesture signals based on the speed adaptation properties derived in Section III. Our optimization problem considers small variations in gesture trajectories, so our objective is to minimize the difference between gesture spectrograms instead of finding the exact mapping functions as in Section III. We then present a dynamic programming algorithm, which is similar to the DTW algorithm, to find the optimal solution that satisfies both the speed and the cumulative movement distance constraints. Finally, we show the micro-benchmark of this similarity measure on gesture signals and discuss the impact of the parameters of the algorithm.

A. Problem Formulation

In reality, the gesture spectrogram is discretized in both the time and the frequency domain. Given two spectrograms X[n], n = 1, ..., N and Y[m], m = 1, ..., M, where X[n] is the spectrum of time frame n, consider a discrete warping function F[k] = c(k), k = 1, ..., K, where c(k) = (i(k), j(k)), that maps the spectrum of X[i(k)] onto that of Y[j(k)] at the kth warping step. Here, i(k) and j(k) represent the frame indices of X and Y, respectively. The warping function should be:
monotonic, i.e., i(k−1) ≤ i(k), j(k−1) ≤ j(k); continuous, i.e., i(k) − i(k−1) ≤ 1, j(k) − j(k−1) ≤ 1; and meet the boundary conditions, i.e., i(1) = 1, j(1) = 1, i(K) = N, j(K) = M, as in traditional DTW [20].

We define the scaling operation α(k) based on Property 1 to perform speed adaptation. For example, performing an α(k)-times scaling on X[i(k)] means linearly stretching its spectrum in the frequency domain by 1/α(k) times while shortening the duration of this frame to α(k) in the time domain. In the following discussion, we always have α(k) ≤ 1 and only scale one of X or Y. We define the indicator function I(k) to represent the object of the scaling operation:

    I(k) = { 0, scale X[i(k)];  1, scale Y[j(k)] }.    (9)

Then, we define the distance between the two spectral vectors X[i(k)] and Y[j(k)] after scaling as:

    d(I, c, α, k) = I(k) {1 − cos⟨X[i(k)], Y_{α(k)}[j(k)]⟩} + (1 − I(k)) {1 − cos⟨X_{α(k)}[i(k)], Y[j(k)]⟩},    (10)

where cos⟨X, Y⟩ = (X · Y)/(‖X‖ ‖Y‖) represents the cosine similarity between vectors X and Y. The objective of DSW is to find the optimal F, α, and I that minimize the average distance after the warping operation:

    DSW(X, Y) = min_{F, α, I}  [ Σ_{k=1}^{K} d(I, c, α, k) w(k) ] / [ Σ_{k=1}^{K} w(k) ].    (*)

The weight w(k) is a normalizer for different time durations, where we use the symmetric form w(k) = (i(k) − i(k−1)) + (j(k) − j(k−1)), so that Σ_{k=1}^{K} w(k) = N + M.

Our optimization needs to satisfy a timing constraint so that the gesture stages can be aligned. Since the scaling operation not only changes the frequency distribution but also changes the duration of frames, we define the frame times of X[i(k)] and Y[j(k)] after the kth warping:

    T_{i(k)} = { (1 − I(k)) · α(k), if I(k) = 0, w(k) = 2;
                 1,                 if I(k) = 1, w(k) = 2;
                 (1 − I(k)) · α(k), if I(k) = 0, w(k) = 1;
                 0,                 if I(k) = 1, w(k) = 1. }    (11)

When we choose to scale X[i(k)] by α(k), whatever c(k−1) is, the time duration of X[i(k)] is scaled to α(k). However, if we do not scale X[i(k)], there are two cases. If X[i(k)] has already been matched by Y[j(k−1)], which is equivalent to i(k) = i(k−1), then we set T_{i(k)} = 0. Otherwise, we set T_{i(k)} = 1. The above piecewise function of T_{i(k)} can be rewritten as:

    T_{i(k)} = I(k) · (w(k) − 1) + (1 − I(k)) · α(k).    (12)

By symmetry, we have T_{j(k)} as shown in Eq. (13):

    T_{j(k)} = (1 − I(k)) · (w(k) − 1) + I(k) · α(k).    (13)

Therefore, the timing constraint can be expressed as:

    | Σ_{k=1}^{K} T_{i(k)} − Σ_{k=1}^{K} T_{j(k)} | ≤ Q,    (**)

where Σ_{k=1}^{K} T_{i(k)} represents the sum of the scaled durations from X[i(1)] to X[i(K)], and Q is the threshold on duration differences that determines the suitable candidate warping functions.

Algorithm 1: Basic Dynamic Speed Warping.
  Input: two spectrograms X[n], n = 1, ..., N and Y[m], m = 1, ..., M; scaling ratio list α[v], v = 1, ..., V.
  Output: min(d_DSW[N, M, v, µ]), v = 1, ..., V, µ = 0, 1.
  // d_DSW[n, m, v, µ]: 4-d array recording the distance between the two spectrograms at each intermediate state.
  // ls[n, m, v, µ]: 4-d array recording the difference between the scaled durations of X and Y from the initial state to the current state.
  /* Initialization */
  for n = 1, ..., N, m = 1, ..., M, v = 1, ..., V do
      d_DSW[n, m, v, µ] ← ∞;  d_DSW[0, 0, v, µ] ← 0;  ls[n, m, v, µ] ← 0
  end
  for n = 1, ..., N, m = 1, ..., M, v = 1, ..., V do
      /* Scale X[n] */
      if µ = 0 then
          // q1, q2: candidate index sets; d1, d2: distances of the alternative paths
          dist ← 1 − cos⟨X_{α[v]}[n], Y[m]⟩
          /* Case 1: (n−1, m) → (n, m) */
          q1 ← {(n−1, m, v′, 0) | |ls[n−1, m, v′, 0]| ≤ Q}
          d1 ← min({d_DSW[n−1, m, v′, 0] | (n−1, m, v′, 0) ∈ q1})
          /* Case 2: (n−1, m−1) → (n, m) */
          q2 ← {(n−1, m−1, v′, µ′) | |ls[n−1, m−1, v′, µ′]| ≤ Q}
          d2 ← min({d_DSW[n−1, m−1, v′, µ′] | (n−1, m−1, v′, µ′) ∈ q2})
          d_DSW[n, m, v, 0] ← min(d1 + dist, d2 + 2·dist)
          update ls according to Eq. (16)
      /* Scale Y[m] */
      else
          d_DSW[n, m, v, 1] is calculated in a similar way as d_DSW[n, m, v, 0]
      end
  end
  for v = 1, ..., V and µ = 0, 1 do
      d_DSW[N, M, v, µ] ← d_DSW[N, M, v, µ]/(N + M)
  end
  return min(d_DSW[N, M, v, µ]), v = 1, ..., V, µ = 0, 1.

B. Basic Dynamic Speed Warping

In this section, we first present a dynamic programming algorithm to solve the above optimization, which serves as a basic version of our final solution. We use a four-dimensional array d_DSW to maintain the smallest dissimilarity between the matched parts of X and Y at each intermediate state. In the array d_DSW[n, m, v, µ], the first and the second indices, n and m, are the current time-frame indices for X and Y. The third index v indicates the scaling factor used for the matching, and the fourth index µ = 0, 1 indicates whether X or Y is scaled. Thus, d_DSW[n, m, v, 0] is the shortest distance between the matched parts of the two spectrograms when X[n] is scaled by α[v] times and then matched with Y[m].
We use a searchable scaling ratio array α[v], v = 1, ..., V for convenience. We also maintain another four-dimensional array, ls, which tracks the difference between the scaled durations of the matched parts of X and Y, in order to meet the constraint in Eq. (**).
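The per-step cost in Algorithm 1 needs only two ingredients: the frequency-axis rescaling of a spectral frame by α (the interpolation that is performed in the preprocessing stage) and the cosine term of Eq. (10). The Python sketch below is one possible reading of these operations; the frequency grid, the interpolation method, and the Gaussian test frames are assumptions, and the stretch direction follows the interpretation that scaling a frame by α ≤ 1 stretches its spectrum outward by 1/α.

    import numpy as np

    def scale_frame(frame, freqs, alpha):
        """Alpha-times scaling of one spectral frame: the scaled frame at frequency f
        takes the value of the original frame at alpha * f, i.e., the spectrum is
        stretched along the frequency axis by 1/alpha (interpretation of Sec. IV-A)."""
        order = np.argsort(freqs)                        # np.interp needs increasing x
        return np.interp(alpha * freqs, freqs[order], frame[order], left=0.0, right=0.0)

    def frame_distance(x_frame, y_frame, freqs, alpha, scale_x=True):
        """1 - cosine similarity between a scaled and an unscaled frame, cf. Eq. (10)."""
        a = scale_frame(x_frame, freqs, alpha) if scale_x else x_frame
        b = y_frame if scale_x else scale_frame(y_frame, freqs, alpha)
        denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
        return 1.0 - float(np.dot(a, b)) / denom

    # Toy check: a half-speed frame (Doppler energy near -15 Hz, half as spread out)
    # scaled by alpha = 0.5 lines up with a full-speed frame (energy near -30 Hz).
    freqs = np.linspace(-50.0, 50.0, 201)
    slow = np.exp(-0.5 * ((freqs + 15.0) / 1.0) ** 2)
    fast = np.exp(-0.5 * ((freqs + 30.0) / 2.0) ** 2)
    print(frame_distance(slow, fast, freqs, alpha=0.5))  # approximately 0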
The basic DSW algorithm is shown in Algorithm 1. Suppose the kth warping of the warping function F is c(k) = (n, m). The monotonicity and continuity of F determine that the previous match c(k−1) can only come from the following three cases: (n−1, m), (n, m−1), or (n−1, m−1). After considering the scaling operation, the transfer process from c(k−1) to c(k) is related to α(k), too. Take scaling X[n] as an example, which is equivalent to µ = 0. There are only two cases for c(k−1):

    c(k−1) = { (n−1, m),     µ′ = 0;
               (n−1, m−1),   µ′ = 0 or 1, }    (14)

where µ′ indicates whether X or Y is scaled at the (k−1)th warping. Note that c(k−1) cannot be (n, m−1), because each frame can be scaled at most once during the whole warping progress. If we scale X[n] at the (k−1)th warping, then X[n] is matched with both Y[m−1] and Y[m]. It is impossible to match a shortened X[n] to two frames. If we choose to scale Y[m−1], it is equivalent to scaling X[n] at two different ratios, 1/α[k−1] and α[k], which is also impossible.

During the warping process, we need to track whether the two cumulative durations before the kth warping are close enough after a series of scaling operations. So, we use q1 and q2 to limit the difference between the two cumulative durations, and then select the minimum distances d1 and d2 among the sets of indexes that satisfy the timing constraint. We can update d_DSW[n, m, v, 0] as:

    d_DSW[n, m, v, 0] = min { min{ d_DSW[n−1, m, v′, 0] + dist | (n−1, m, v′, 0) ∈ q1 },
                              min{ d_DSW[n−1, m−1, v′, µ′] + 2·dist | (n−1, m−1, v′, µ′) ∈ q2 } },    (15)

where we use the symmetric form of the weight w(k), i.e., w(k) is 1 for Case 1 and 2 for Case 2. Then, ls is updated as:

    ls[n, m, v, 0] = { ls[n−1, m, v′, 0] + α[v′],         if updated with (n−1, m, v′, 0);
                       ls[n−1, m−1, v′, µ′] + α[v′] − 1,   if updated with (n−1, m−1, v′, µ′). }    (16)

Symmetrically, assume that Y[j] is scaled by α[k] times. We use a similar method to divide the transfer process into two cases to calculate d_DSW[n, m, v, 1]. Finally, the distance between the two spectrograms is determined by the minimum over all possible final states.

The DSW algorithm has the optimal substructure and meets the requirement of the dynamic programming algorithm. An optimal warping function F uniquely determines a transfer route of the four-dimensional array d_DSW from s = (0, 0, 0, 0) to t = (N, M, v_t, µ_t). Let u = (n, m, v, µ) be an intermediate state of F; then the transfer route p from s to u must also be the shortest. This can be proved by contradiction. Suppose there is a new transfer route p1 such that d_{DSW;p1}[n, m, v, µ] < d_{DSW;p}[n, m, v, µ], and let p2 be the transfer route from u to t. If the transfer route s →p1 u →p2 t can meet the timing constraint during the whole progress, then d_{DSW;p1;p2}(N, M, v_t, µ_t) < d_{DSW;p;p2}(N, M, v_t, µ_t), which is in contradiction with s →p u →p2 t being the optimal transfer route. Otherwise, u cannot be an intermediate state of F, which is also contrary to the assumptions.

In this way, the DSW algorithm provides a new measure of dissimilarity that satisfies the following properties:

    0 = DSW(X, X) < DSW(X, Y) = DSW(Y, X).    (17)

It is easy to verify that the DSW distance is non-negative and DSW(X, X) = 0. In addition, the symmetry of DSW can be proved by induction, which we will not elaborate here.
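For illustration, the following Python sketch implements a simplified version of the recurrence in Eqs. (14)–(16). It keeps the four-dimensional d_DSW and ls arrays and the |ls| ≤ Q pruning, but simplifies the duration bookkeeping to "add the current step's scaled duration on the scaled axis and 0 or 1 on the other", which is one reading of Eqs. (12)–(16); the toy spectrograms, the scaling-ratio list, and Q = 1 are assumptions, not values from the paper.

    import numpy as np

    def cosine_dist(a, b):
        """1 - cosine similarity between two spectral frames, cf. Eq. (10)."""
        return 1.0 - float(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    def scale_frame(frame, freqs, alpha):
        """Alpha-times scaling: the scaled frame at frequency f takes the value of
        the original frame at alpha * f (frequency axis stretched by 1/alpha)."""
        return np.interp(alpha * freqs, freqs, frame, left=0.0, right=0.0)

    def basic_dsw(X, Y, freqs, alphas, Q=1.0):
        """Simplified sketch of Algorithm 1 (basic DSW). X: (N, F), Y: (M, F)
        spectrograms with sorted frequency bins freqs; alphas: ratios <= 1."""
        N, M, V = len(X), len(Y), len(alphas)
        d = np.full((N + 1, M + 1, V, 2), np.inf)     # d_DSW[n, m, v, mu]
        ls = np.zeros((N + 1, M + 1, V, 2))           # cumulative duration difference
        d[0, 0, :, :] = 0.0
        for n in range(1, N + 1):
            for m in range(1, M + 1):
                for v, a in enumerate(alphas):
                    for mu in (0, 1):
                        if mu == 0:            # scale X[n] by a, compare with Y[m]
                            dist = cosine_dist(scale_frame(X[n - 1], freqs, a), Y[m - 1])
                            step1, step2 = a, a - 1.0        # ls increments (case 1 / 2)
                            p1 = (n - 1, m)                  # case 1: only X advances
                        else:                  # scale Y[m] by a, compare with X[n]
                            dist = cosine_dist(X[n - 1], scale_frame(Y[m - 1], freqs, a))
                            step1, step2 = -a, 1.0 - a
                            p1 = (n, m - 1)                  # case 1: only Y advances
                        best, best_ls = np.inf, 0.0
                        # Case 1: previous state must scale the same signal (Eq. (14)).
                        for vp in range(V):
                            prev = d[p1[0], p1[1], vp, mu]
                            if abs(ls[p1[0], p1[1], vp, mu]) <= Q and prev + dist < best:
                                best = prev + dist
                                best_ls = ls[p1[0], p1[1], vp, mu] + step1
                        # Case 2: diagonal move, any previous scaling target allowed.
                        for vp in range(V):
                            for mup in (0, 1):
                                prev = d[n - 1, m - 1, vp, mup]
                                if abs(ls[n - 1, m - 1, vp, mup]) <= Q and prev + 2 * dist < best:
                                    best = prev + 2 * dist
                                    best_ls = ls[n - 1, m - 1, vp, mup] + step2
                        d[n, m, v, mu], ls[n, m, v, mu] = best, best_ls
        return float(np.min(d[N, M]) / (N + M))       # normalization as in Algorithm 1

    # Toy check: a 4-frame "fast" spectrogram vs. an 8-frame half-speed version.
    freqs = np.linspace(-50.0, 50.0, 101)
    g = lambda c, w: np.exp(-0.5 * ((freqs - c) / w) ** 2)
    fast = np.stack([g(-30, 2), g(-30, 2), g(30, 2), g(30, 2)])
    slow = np.stack([g(-15, 1)] * 4 + [g(15, 1)] * 4)
    alphas = [0.25, 0.5, 1.0]
    print(basic_dsw(fast, slow, freqs, alphas))   # near 0: the speeds differ by 2x
    print(basic_dsw(fast, fast, freqs, alphas))   # self-distance is essentially 0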
The time complexity of the basic DSW algorithm is O(V N M N_f), where N and M are the numbers of frames in the two spectrograms, N_f is the number of samples in the frequency domain, and V is the size of the scaling ratio list. As the number of candidate scaling ratios V is a constant, the time complexity of DSW is within a constant factor of the complexity of the DTW algorithm over spectrograms, which is O(N M N_f). Note that the interpolation operation for scaling the frequency distribution has been completed in the data preprocessing stage, so the time cost of interpolation is not considered here.

C. Refined Dynamic Speed Warping

The basic DSW algorithm may match an excessive number of frames to a scaled frame. Consider the situation shown in Figure 2, where two different spectrograms X and Y of the same gesture are matched using the basic DSW algorithm and the timing constraint Q is set to the duration of one frame. With the basic DSW, a single frame of the slower gesture, X[n], can be matched to five scaled frames of Y, given that $|\sum_{j=m}^{m+5} T_j - T_n|$ does not exceed the threshold Q. However, this leads to distortions in the warping function: the movement distance along the trajectory of the frame X[n] is closer to that of frames Y[m] to Y[m+3], and the remaining frames should be matched to the next frame X[n+1]. To eliminate the distortion caused by such repeated matching, we impose a new constraint on the timing of frames.

We define the matching set S_X[i] for each X[i] under a specific warping function F as:

$$S_{X[i]} = \{\, Y[j] \mid \exists\, k \le K,\ k \in \mathbb{N}^{+},\ c(k) = (X[i], Y[j]) \text{ and } I(k) = 1 \,\}. \quad (18)$$

This set contains all frames Y[j] that the warping function F matches to X[i]. Symmetrically, S_Y[j] can be defined as in Eq. (18). The tighter movement distance constraint is formalized as:

$$\forall z \in X \cup Y \text{ with } |S_z| \ge 0, \quad \sum_{z' \in S_z} T_{z'} - 1 \le \xi. \quad (***)$$
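To see the effect of constraint (***), one can plug in the frame durations labeled in Figure 2 (X[n] lasts 1, each scaled Y frame lasts 0.3, with Q = 1 and ξ = 0.2); the arithmetic below is only a worked reading of that figure:

$$\Bigl|\sum_{j=m}^{m+5} T_j - T_n\Bigr| = |6 \times 0.3 - 1| = 0.8 \le Q = 1,$$

so the basic timing constraint still accepts the long run of matches in Figure 2(a). Under the tighter constraint, however,

$$\sum_{j=m}^{m+3} T_j - 1 = 1.2 - 1 = 0.2 \le \xi = 0.2, \qquad \sum_{j=m}^{m+4} T_j - 1 = 0.5 > \xi,$$

so the refined matching stops after Y[m+3], as shown in Figure 2(b).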
The above equation shows that the total movement distance on the frames z' in S_z should be close to that of the matched z, which we consider as a micro-unit in Property 1.

With the tighter movement distance constraint, the original time constraint in Eq. (**) can also be redefined. Define an ordered set U = {(x, S_x) | x ∈ X, |S_x| ≥ 1} ∪ {(y, S_y) | y ∈ Y, |S_y| ≥ 1} that represents the frames matched on the same micro-unit. Then, the time difference of the lth micro-unit U[l] is

$$\xi[l] = \begin{cases} \sum_{y \in S_x} T_y - 1, & U[l] = (x, S_x) \\ 1 - \sum_{x \in S_y} T_x, & U[l] = (y, S_y). \end{cases} \quad (19)$$

Based on the above definition, the timing constraint Eq. (**) can be re-written as follows:

$$\forall\, l \in \mathbb{N}^{*} \text{ with } l \in [1, L], \quad \sum_{v=1}^{l} \xi[v] \le Q, \quad (****)$$

where Q represents the maximum cumulative timing error of the first l micro-units after scaling.

In the refined DSW algorithm, we use a new four-dimensional array l_c to record the total duration of the elements in a matching set. Consider the kth warping state where X_{α[v]}[n] and Y[m] are matched. If c(k-1) = (n-1, m), both X[n-1] and X[n] are in the matching set of Y[m]. In this case, we need to modify q_1 as {(n-1, m, v', 0) | l_c[n-1, m, v', 0] < 1}. If c(k-1) = (n-1, m-1), the matching starts a new micro-unit since c(k) = (n, m), so no change is required. The refined algorithm avoids the distortion in the warping and eliminates redundant calculations for the timing constraint. In the following discussions, the DSW algorithm refers to the refined DSW algorithm unless stated otherwise.

D. Discussion and Parameter Selection

Figure 3. Gesture recognition capability. (a) Confusion matrix baseline of DSW (normalized). (b) Confusion matrix baseline of DTW (normalized).

Figure 4. Capability of speed adaptation. (a) Confusion matrix of DSW on different speeds (normalized). (b) Confusion matrix of DTW on different speeds (normalized).

Figure 5. Selecting hyperparameters and normalization measurements. (a) Speed distribution of gestures. (b) Different normalization methods.

We first use a small set of ultrasound gesture signals collected by smartphones to demonstrate the performance of DSW. The data set is 225 gesture samples of nine gestures performed by a single user at a steady speed. The distance results of DSW are shown in Figure 3(a). It can be seen that, as a similarity measure, the DSW algorithm not only accurately recognizes similar gestures but also distinguishes different types of gestures well. In contrast, the DTW-based similarity measure cannot guarantee the synchronization of cumulative movement distances. Therefore, the similarity of complex gestures, such as gestures F to I, could be significantly underestimated by DTW without tracking the movement progress, as shown in Figure 3(b).

To further understand how DSW adapts to different speeds, we use a special gesture set that has three gesture types (Pushing near, Pushing away, and Circling clockwise), with 25 gesture samples performed at three different speeds (fast, medium, and slow). The distance results of DSW and DTW are shown in Figure 4. We observe from Figure 4(a) that DSW accurately classifies the gestures regardless of their difference in speeds. However, DTW does not handle the change of speeds well, and only gestures with similar speeds can be recognized.

To fully explore the capability of DSW, we need to carefully design the hyperparameters of the DSW algorithm, including the minimum speed scaling ratio α_min, the length of the scaling ratio list V, and the time constraint threshold Q.
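For reference, one simple way to realize the candidate scaling ratio list α is geometric spacing between α_min and 1; the spacing rule is an assumption, since the text below only fixes α_min = 0.1 and V = 6.

```python
import numpy as np

# Hypothetical construction of the scaling-ratio list alpha: V candidate ratios
# spaced geometrically between alpha_min and 1. The spacing is an assumption;
# only alpha_min = 0.1 and V = 6 are stated in the text.
alpha_min, V = 0.1, 6
ratios = np.geomspace(alpha_min, 1.0, num=V)
print(np.round(ratios, 3))  # [0.1   0.158 0.251 0.398 0.631 1.   ]
```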
The minimum speed scaling ratio is selected based on how fast or slow users will perform the gesture. We first collect gesture samples from ten volunteers, who repeat these gestures at the speed that they feel comfortable with. We then determine the peak speed of each gesture sample and derive the speed distribution of different gesture types, as shown in Figure 5(a). We observe that the highest gesture speed can reach 75 cm/s, and the speed resolution of gestures is 6 cm/s. Thus, in order to satisfy all possible speed adaptation requirements, we set α_min = 0.1 so that a suitable warping function can be found for two samples with a maximum speed difference of ten times.

Our experimental results show that different lengths of the scaling ratio list V and different thresholds of the time constraint Q have little effect on the distance calculation. However, the complexity of DSW is significantly reduced with a shorter list of α. At the same time, a tighter threshold can effectively prune warping paths, which also speeds up the execution of the algorithm. In the following discussions, we set V = 6 and Q = 1 so that the theoretical execution time of DSW is 2V = 12 times that of the DTW algorithm. Our experiments show that DTW takes 0.075 seconds to compare two gesture signals with a length of one second. After applying
Table I. LIST OF GESTURES

  Gesture                   Label      Gesture               Label
  Click                     A          Pushing Near          B
  Pushing Away              C          Circling Clockwise    D
  Circling anti-clockwise   E          Palm drawing N        F
  Palm drawing inverted N   G          Writing W             H
  Writing M                 I

a pruning operation for DSW, the average execution time for DSW is 0.55 seconds, which is 7.3 times that of DTW. Therefore, DSW can effectively compare gestures in real time.

Another important design consideration is the choice of the data-normalization measure for spectral vectors in DSW. We compare the performance of two common normalization methods, min-max normalization (Nor_1) and scaling to unit length (Nor_2), in Figure 5(b). We observe that scaling to unit length leads to better separation between different gestures, i.e., less overlap between gestures. Therefore, we choose the second data-normalization method in DSW.

V. EXPERIMENTAL RESULTS

In this section, we evaluate the performance of DSW using the gesture dataset collected through commercial mobile phones. We first quantitatively evaluate the performance of DSW for the one-shot learning scenario and compare it to traditional methods such as DTW, SVM, and CNN. We then explore the efficiency of DSW in the unsupervised learning scenario for gesture clustering tasks.

A. Dataset Description

We implement the ultrasound-based gesture sensing system on the Android platform. The gesture signals are collected using a Samsung Galaxy S7 mobile phone in a lab environment. We first emit sinusoid signals with a frequency of 18 kHz using the speaker on the smartphone. We then use the microphone on the smartphone to record the hand reflection signal with a sampling rate of 48 kHz. The recorded signal is mixed with the transmitted sinusoids in a digital down-converter. The output of the down-converter is a complex-valued baseband signal. The sampling rate of the baseband signal is decimated by 8 times, from 48 kHz down to 6 kHz.

We collect 2,250 gesture samples for nine types of gestures from ten participants. The categories of gestures are shown in Table I. Each participant performs each class of gestures 25 times, at any speed they feel comfortable with, in the same region relative to the phone. The preprocessing for gesture samples includes the following steps. We first perform STFT on the baseband signals with a 1024-point FFT window that moves 256 points for each step. Under this configuration, the frequency resolution of the STFT is around 5.86 Hz, so that it can distinguish movements with speed differences of around 6 cm/s. We then perform column-wise normalization on each STFT frame X, so that the modulus of each spectral vector |X[n]| = 1. Note that we also save the spectral vectors that have been scaled by α to reduce the computational cost in the later scaling steps. We keep the spectral vectors in the frequency range of [-117.19 Hz, 117.19 Hz], which corresponds to the speed range of [-1.2 m/s, 1.2 m/s].

B. Effect on One-shot Learning Tasks

Experimental results show that the Nearest Neighbour (1NN) classifier with the DSW algorithm achieves a recognition accuracy of 99.47% when both the training set and the testing set are from the same domain. We randomly select 1,125 samples from our gesture samples as the testing set, ensuring that it contains different samples from the same participants. Then, we randomly select just one sample of each gesture type for each of the ten participants to serve as the training set. Thus, the training set contains 90 gesture samples.
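As a sketch of how such a one-shot classifier can be assembled (the helper names are illustrative, and dsw_distance refers to the simplified sketch given earlier rather than the full refined algorithm):

```python
import numpy as np

def dsw_1nn_predict(test_sample, train_samples, train_labels, dist_fn):
    """One-shot 1NN classification with a DSW-style distance.
    `dist_fn(a, b)` is any pairwise dissimilarity, e.g. the dsw_distance
    sketch above; samples are spectrograms of shape (frames, bins)."""
    dists = [dist_fn(test_sample, t) for t in train_samples]
    return train_labels[int(np.argmin(dists))]

# Hypothetical usage: one template per gesture type per training user.
# train_samples, train_labels = load_templates(...)   # not provided here
# label = dsw_1nn_predict(x_test, train_samples, train_labels, dsw_distance)
```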
With 200 rounds of Monte Carlo cross-validation, we find that the classification error rate of DSW-1NN is always less than 0.53%. This coincides with the common understanding that DTW-like algorithms perform well even with the simple 1NN.

We then compare the generalization performance in a more challenging scenario where the testing samples are from a different domain. In this experiment, we select gesture data from one to four users as the training set, and the data from the remaining six users constitutes the testing set. We first compare the performance of two memory-based classifiers, kNN based on the DSW algorithm (DSW-kNN) and kNN based on the DTW algorithm (DTW-kNN). As shown in Figure 6(a), when the training data contains only one user, the average accuracy of DSW-1NN on the testing set for other users increases from 91.38% to 96.58% as we use one to ten samples of each gesture type for training. At the same time, as the number of trainers increases to four, the accuracy rises to a maximum of 99.36%. However, with the same amount of training data, DTW-1NN has a maximum accuracy of 89.52%. The result shows that DSW-kNN can achieve an accuracy of no less than 97% with a minimum of one gesture sample per user when applied to different domains. Figure 6(b) shows the accuracy of kNN based on the two distance measures as k increases. It can be seen that the accuracy of DSW-kNN does not change with the value of k, which means that the distance density of the same type for DSW does not vary with the number of samples. However, the accuracy of DTW-kNN increases with the value of k, indicating that the similar samples obtained by DTW are sparse when the size of the dataset is small.

We further compare DSW-1NN with feature-based classifiers, such as CNNs and Support Vector Machines (SVMs), in Figure 6(c) and Figure 6(d). Here, we choose the classic LeNet structure [36] for the CNN and a linear kernel for the SVM. We observe that CNN and SVM rely on a large amount of training data and the diversity of participants. When the training set is small, they may learn too many irrelevant features and overfit the dataset. Figure 6(c) shows that with only one user's training data, DSW-1NN has a strong generalization capability and achieves an accuracy of 97.19%, while the other three methods have an accuracy of no more than 86%. In addition, in the case that the training set contains four participants, DSW-1NN only needs 36 template samples to achieve 97.58% generalization accuracy, which is higher than that of the CNN model trained on 900 samples. When using 900 samples, the maximum
generalization accuracy of DSW-1NN, DTW-1NN, CNN and SVM is 99.33%, 91.11%, 96.61%, and 95.70%, respectively.

Figure 6. Generalization capabilities of different supervised learning methods. (a) DSW-1NN vs DTW-1NN. (b) DSW-kNN vs DTW-kNN. (c) Methods on one trainer's data. (d) Methods on four trainers' data.

Figure 7. Robustness and fine-tuning of DSW-1NN. (a) Different angles. (b) Different distances.

Table II. VALIDITY INDEX OF CLUSTERING

  Index    DSW       DTW
  JC       0.9772    0.4380
  FMI      0.9884    0.6094
  RI       0.9974    0.9110

DSW-1NN is also robust to gestures performed at different angles and different distances. To evaluate the influence of different angles on the model, we ask a volunteer to perform gestures at three angles (0, 45, and 90 degrees with respect to the center of the phone) while maintaining an equal distance to the phone. We use the 0-degree gesture samples as the training set, which has five samples for each gesture type. Then, we evaluate the performance on test datasets at the three different angles, each containing 225 gesture samples. After 50 rounds of Monte Carlo cross-validation, we find that the accuracy of DSW-1NN at the different angles is 100%, 98.85%, and 99.89%, respectively. We also ask the volunteer to perform gestures at three different distances (10~20 cm, 20~30 cm and 30~40 cm) from the mobile phone. We use the 45 samples in the 20~30 cm region as the source domain, and DSW-1NN achieves an accuracy of 99.86% and 91.64% on the testing sets of 10~20 cm and 30~40 cm, respectively. This means that the DSW-1NN algorithm is robust to distance changes within 30 cm. When the distance changes by more than 30 cm, the accuracy of the DSW-1NN algorithm is reduced to 91% due to signal attenuation.

However, we can fine-tune the model by adding a small number of samples in the target domain without retraining. As shown in Figure 7(b), after adding five target-domain samples for each gesture type, the accuracy in the 30~40 cm region is increased to 98.61%. In the same situation, CNN and SVM can only reach 79.8% and 73.57% before fine-tuning, and their model parameters need to be retrained for the target domain.

C. Evaluation for Clustering Tasks

Figure 8. Clustering results based on DSW or DTW. (a) DSW-based AP clustering. (b) DTW-based AP clustering.

Clustering is another application scenario for similarity measures like DSW and DTW. Clustering algorithms that rely on geometric relationships to find cluster centers are not applicable to these two distances, as they do not satisfy the triangle inequality. Thus, we use the Affinity Propagation (AP) clustering algorithm [37]. We adjust the damping coefficient and the preference factor to achieve aggregation into different numbers of clusters.
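A minimal sketch of this clustering step, assuming scikit-learn's AffinityPropagation with a precomputed (negated) DSW distance matrix; the parameter values below are placeholders rather than the ones used in the evaluation:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_with_dsw(dist_matrix, damping=0.9, preference=None):
    """Affinity Propagation over a precomputed DSW dissimilarity matrix.
    AP expects similarities, so the distances are negated; `damping` and
    `preference` are the knobs adjusted to reach the desired cluster count."""
    similarity = -np.asarray(dist_matrix)
    ap = AffinityPropagation(affinity="precomputed", damping=damping,
                             preference=preference, random_state=0)
    return ap.fit_predict(similarity)

# Hypothetical usage: D[i, j] holds the DSW distance between samples i and j.
# labels = cluster_with_dsw(D, damping=0.9, preference=np.median(-D))
```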
As we have nine gesture types, we choose to converge at nine clusters. As shown in Figure 8(a), the distance defined by DSW correctly clusters gestures of the same type into the same category. However, the DTW-based clustering does not give the correct clustering results. We further use three common external indicators to measure cluster validity, namely the Jaccard Coefficient (JC), the Fowlkes-Mallows Index (FMI), and the Rand Index (RI), where a larger indicator means better clustering performance. As shown in Table II, DSW-based clustering has higher scores than DTW for all three indicators. This shows that the similarity measure defined by DSW can be used for clustering large-scale unlabeled data and can even guide the design of gestures.

VI. CONCLUSION

In this paper, we introduce DSW, a dynamic speed-adaptive matching scheme for gesture signals. DSW rescales the speed distributions of gesture samples while keeping track of the movement distance at different stages. Our experimental results show that DSW provides a robust similarity measure between gesture samples and can work for both one-shot learning in supervised gesture recognition and unsupervised learning tasks. While we mainly focus on sound-based signals and STFT features, we believe that DSW can be readily applied to Wi-Fi based gesture recognition systems and other time-frequency analysis schemes, such as DWT or HHT.

VII. ACKNOWLEDGEMENT

We would like to thank our anonymous shepherd and reviewers for their valuable comments. This work is partially supported by the National Natural Science Foundation of China under Grants 61872173 and 61972192, and the Collaborative Innovation Center of Novel Software Technology.
REFERENCES

[1] Q. Pu, S. Gupta, S. Gollakota, and S. Patel, "Whole-home gesture recognition using wireless signals," in Proceedings of ACM MobiCom, 2013.
[2] F. Adib and D. Katabi, "See through walls with WiFi!," in Proceedings of ACM SIGCOMM, pp. 75–86, 2013.
[3] W. Wang, A. X. Liu, M. Shahzad, K. Ling, and S. Lu, "Understanding and modeling of WiFi signal based human activity recognition," in Proceedings of ACM MobiCom, 2015.
[4] J. Lien, N. Gillian, M. E. Karagozler, P. Amihood, C. Schwesig, E. Olson, H. Raja, and I. Poupyrev, "Soli: ubiquitous gesture sensing with millimeter wave radar," ACM Transactions on Graphics, vol. 35, no. 4, p. 142, 2016.
[5] T. Wei and X. Zhang, "mTrack: High-precision passive tracking using millimeter wave radios," in Proceedings of ACM MobiCom, 2015.
[6] F. Adib, Z. Kabelac, and D. Katabi, "Multi-person localization via RF body reflections," in Proceedings of USENIX NSDI, 2015.
[7] M. Zhao, Y. Tian, H. Zhao, M. A. Alsheikh, T. Li, R. Hristov, Z. Kabelac, D. Katabi, and A. Torralba, "RF-based 3D skeletons," in Proceedings of ACM SIGCOMM, 2018.
[8] J. Wang, H. Jiang, J. Xiong, K. Jamieson, X. Chen, D. Fang, and B. Xie, "LiFS: low human-effort, device-free localization with fine-grained subcarrier information," in Proceedings of ACM MobiCom, 2016.
[9] Y. Ma, G. Zhou, S. Wang, H. Zhao, and W. Jung, "SignFi: Sign language recognition using WiFi," in Proceedings of ACM UbiComp, 2018.
[10] W. Wang, A. X. Liu, and K. Sun, "Device-free gesture tracking using acoustic signals," in Proceedings of ACM MobiCom, 2016.
[11] R. Nandakumar, V. Iyer, D. Tan, and S. Gollakota, "FingerIO: Using active sonar for fine-grained finger tracking," in Proceedings of ACM CHI, 2016.
[12] S. Yun, Y.-C. Chen, H. Zheng, L. Qiu, and W. Mao, "Strata: Fine-grained acoustic-based device-free tracking," in Proceedings of ACM MobiSys, pp. 15–28, ACM, 2017.
[13] K. Ling, H. Dai, Y. Liu, and A. X. Liu, "Ultragesture: Fine-grained gesture sensing and recognition," in Proceedings of IEEE SECON, 2018.
[14] T. Wang, D. Zhang, Y. Zheng, T. Gu, X. Zhou, and B. Dorizzi, "C-FMCW based contactless respiration detection using acoustic signal," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 4, p. 170, 2018.
[15] K. Sun, W. Wang, A. X. Liu, and H. Dai, "Depth aware finger tapping on virtual displays," in Proceedings of ACM MobiSys, pp. 283–295, ACM, 2018.
[16] W. Jiang, C. Miao, F. Ma, S. Yao, Y. Wang, Y. Yuan, H. Xue, C. Song, X. Ma, D. Koutsonikolas, et al., "Towards environment independent device free human activity recognition," in Proceedings of ACM MobiCom, pp. 289–304, 2018.
[17] H. Du, P. Li, H. Zhou, W. Gong, G. Luo, and P. Yang, "Wordrecorder: Accurate acoustic-based handwriting recognition using deep learning," in Proceedings of IEEE INFOCOM, pp. 1448–1456, IEEE, 2018.
[18] A. Virmani and M. Shahzad, "Position and orientation agnostic gesture recognition using WiFi," in Proceedings of ACM MobiSys, 2017.
[19] Y. Zheng, Y. Zhang, K. Qian, G. Zhang, Y. Liu, C. Wu, and Z. Yang, "Zero-effort cross-domain gesture recognition with Wi-Fi," in Proceedings of ACM MobiSys, 2019.
[20] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.
[21] Y. Wang, J. Liu, Y. Chen, M. Gruteser, J. Yang, and H. Liu, "E-eyes: In-home device-free activity identification using fine-grained WiFi signatures," in Proceedings of ACM MobiCom, 2014.
[22] C. Han, K. Wu, Y. Wang, and L. M. Ni, "WiFall: Device-free fall detection by wireless networks," in Proceedings of IEEE INFOCOM, pp. 271–279, 2014.
[23] H. Abdelnasser, M. Youssef, and K. A. Harras, "WiGest: A ubiquitous WiFi-based gesture recognition system," in Proceedings of IEEE INFOCOM, 2015.
[24] B. Kellogg, V. Talla, and S. Gollakota, "Bringing gesture recognition to all devices," in Proceedings of USENIX NSDI, 2014.
[25] W. Ruan, Q. Z. Sheng, L. Yang, T. Gu, P. Xu, and L. Shangguan, "AudioGest: enabling fine-grained hand gesture detection by decoding echo signal," in Proceedings of ACM UbiComp, 2016.
[26] J. Zhang, Z. Tang, M. Li, D. Fang, P. Nurmi, and Z. Wang, "CrossSense: towards cross-site and large-scale WiFi sensing," in Proceedings of ACM MobiCom, 2018.
[27] J. Wang, Y. Chen, L. Hu, X. Peng, and S. Y. Philip, "Stratified transfer learning for cross-domain activity recognition," in Proceedings of IEEE PerCom, pp. 1–10, IEEE, 2018.
[28] L. Zhang, Z. Wang, and L. Yang, "Commercial Wi-Fi based fall detection with environment influence mitigation," in Proceedings of IEEE SECON, 2019.
[29] K. Chen, L. Yao, D. Zhang, X. Chang, G. Long, and S. Wang, "Distributionally robust semi-supervised learning for people-centric sensing," in Proceedings of AAAI, 2018.
[30] F. Despinoy, D. Bouget, G. Forestier, C. Penet, N. Zemiti, P. Poignet, and P. Jannin, "Unsupervised trajectory segmentation for surgical gesture recognition in robotic training," IEEE Transactions on Biomedical Engineering, vol. 63, no. 6, pp. 1280–1291, 2015.
[31] N. Gillian, B. Knapp, and S. O'Modhrain, "Recognition of multivariate temporal musical gestures using n-dimensional dynamic time warping," in Proceedings of NIME, 2011.
[32] K. Ali, A. X. Liu, W. Wang, and M. Shahzad, "Keystroke recognition using WiFi signals," in Proceedings of ACM MobiCom, 2015.
[33] H. Li, W. Yang, J. Wang, Y. Xu, and L. Huang, "WiFinger: talk to your smart devices with finger-grained gesture," in Proceedings of ACM UbiComp, 2016.
[34] G. Wang, Y. Zou, Z. Zhou, K. Wu, and L. M. Ni, "We can hear you with Wi-Fi!," in Proceedings of ACM MobiCom, 2014.
[35] X. Xu, J. Yu, Y. Chen, Y. Zhu, L. Kong, and M. Li, "BreathListener: Fine-grained breathing monitoring in driving environments utilizing acoustic signals," in Proceedings of ACM MobiSys, pp. 54–66, ACM, 2019.
[36] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[37] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.