Device-Free Gesture Tracking Using Acoustic Signals Wei Wang† Alex X. Liu†‡ Ke Sun† †State Key Laboratory for Novel Software Technology, Nanjing University, China ‡Dept. of Computer Science and Engineering, Michigan State University, USA ww@nju.edu.cn, alexliu@cse.msu.edu, samsonsunke@gmail.com ABSTRACT Device-free gesture tracking is an enabling HCI mechanism for small wearable devices because fingers are too big to control the GUI elements on such small screens, and it is also an important HCI mechanism for medium-to-large size mobile devices because it allows users to provide input without blocking screen view. In this paper, we propose LLAP, a device-free gesture tracking scheme that can be deployed on existing mobile devices as software, without any hardware modification. We use speakers and microphones that already exist on most mobile devices to perform device-free tracking of a hand/finger. The key idea is to use acoustic phase to get fine-grained movement direction and movement distance measurements. LLAP first extracts the sound signal reflected by the moving hand/finger after removing the background sound signals that are relatively consistent over time. LLAP then measures the phase changes of the sound signals caused by hand/finger movements and then converts the phase changes into the distance of the movement. We implemented and evaluated LLAP using commercial-off-the-shelf mobile phones. For 1-D hand movement and 2-D drawing in the air, LLAP has a tracking accuracy of 3.5 mm and 4.6 mm, respectively. Using gesture traces tracked by LLAP, we can recognize the characters and short words drawn in the air with an accuracy of 92.3% and 91.2%, respectively. CCS Concepts •Human-centered computing → Gestural input; Keywords Gesture Tracking; Ultrasound; Device-free 1. INTRODUCTION 1.1 Motivation Gestures are natural and user-friendly Human Computer Interaction (HCI) mechanisms for users to control their devices. 
Gesture tracking allows devices to get fine-grained user input by quantitatively measuring the movement of the user's hands/fingers in the air. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. MobiCom'16, October 03-07, 2016, New York City, NY, USA © 2016 ACM. ISBN 978-1-4503-4226-1/16/10...$15.00 DOI: http://dx.doi.org/10.1145/2973750.2973764 Device-free gesture tracking means that the user's hands/fingers are not attached to any device. Imagine that a smart watch has device-free gesture tracking capability: the user can then adjust the time in a touch-less manner as shown in Figure 1, where the clock hand follows the movement of the finger. Device-free gesture tracking is an enabling HCI mechanism for small wearable devices (such as smart watches) because fingers are too big to control the GUI elements on such small screens. In contrast, device-free gesture tracking allows users to provide input by performing gestures near a device rather than on a device. Device-free gesture tracking is also an important HCI mechanism for medium-to-large size mobile devices (such as smartphones and tablets) complementing touch screens because it allows users to provide inputs without blocking the screen view, which gives users a better visual experience. Furthermore, device-free gesture tracking can work in scenarios where touch screens cannot, e.g., when users wear gloves or when the device is in the pocket.
Figure 1: Device-free gesture tracking Practical device-free gesture tracking systems need to satisfy three requirements. First, such systems need to have high accuracy so that they can capture delicate movements of a hand/finger. Due to the small operational space around the mobile device, e.g., within tens of centimeters (cm) to the device, we need millimeter (mm) level tracking accuracy to fully exploit the control capability of human hands. Second, such systems need to have low latency (i.e., respond quickly), within tens of milliseconds, to hand/finger movement without user feeling lagging responsiveness. Third, they need to have low computational cost so that they can be implemented on resource constrained mobile devices. 1.2 Limitations of Prior Art Most existing device-free gesture tracking solutions use customized hardware [1–4]. Based on the fact that wireless signal changes as a hand/finger moves, Google made a customized chip in their Soli system that uses 60 GHz wireless signal with mmlevel wavelength to track small movement of a hand/finger [1], and Teng et al. made customized directional 60 GHz transceivers
in their mTrack system to track the movement of a pen or a finger using steerable directional beams [2]. Based on the fact that light reflection strength changes as a hand/finger moves, Zhang et al. made customized LED/light sensors in their Okuli system to use visible light to track hand/finger movement [3]. Based on vision processing algorithms, Leap Motion made customized infrared cameras to track hand/finger movements [4]. Recently, Nandakumar et al. explored the feasibility of using commercial mobile devices to track fingers/hands within a short distance. They proposed fingerIO, which uses OFDM-modulated sound to locate the fingers with an accuracy of 8 mm [5].

1.3 Proposed Approach

In this paper, we propose a device-free gesture tracking scheme, called Low-Latency Acoustic Phase (LLAP), that can be deployed on existing mobile devices as software (such as an app) without any hardware modification. We use speakers and microphones that already exist on most mobile devices to perform device-free tracking of a hand/finger. Commercial-Off-The-Shelf (COTS) mobile devices can emit and record sound waves with frequencies higher than 17 kHz, which are inaudible to most people [6]. The wavelength of sound waves in this frequency range is less than 2 cm. Therefore, a small movement of a few millimeters will significantly change the phase of the received sound wave. Our key idea is to use the acoustic phase to get fine-grained movement direction and movement distance measurements. LLAP first extracts the sound signal reflected by the moving hand/finger after removing the background sound signals that are relatively consistent over time. Second, LLAP measures the phase changes of the sound signals caused by hand/finger movements and then converts the phase changes into the distance of the movement. LLAP achieves a tracking accuracy of 3.5 mm and a latency of 15 ms on COTS mobile phones with limited computing power.
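The extraction step described above (downconvert the received tone to a complex baseband signal, then remove the static background reflections) can be sketched as follows. This is a minimal illustration under assumed parameters (a single 20 kHz continuous wave sampled at 48 kHz); the function names, the moving-average low-pass filter, and the one-pole background tracker are illustrative choices, not LLAP's actual implementation.

```python
import numpy as np

def baseband(received, fc=20000.0, fs=48000.0, win=512):
    """Coherently downconvert a received CW tone to complex baseband.

    Multiplying by exp(-j*2*pi*fc*t) shifts the fc component to DC;
    a moving-average low-pass filter then suppresses the
    double-frequency term, leaving a complex-valued baseband signal
    whose phase tracks the length of each sound propagation path.
    """
    t = np.arange(len(received)) / fs
    mixed = received * np.exp(-2j * np.pi * fc * t)
    kernel = np.ones(win) / win  # crude low-pass filter (assumed)
    return np.convolve(mixed, kernel, mode="same")

def remove_static(bb, alpha=0.999):
    """Subtract the slowly varying component (LOS path and static
    reflections), keeping only the part reflected by the moving hand.

    A one-pole IIR tracker estimates the quasi-static background,
    which is 'relatively consistent over time' as the text puts it.
    """
    dynamic = np.empty_like(bb)
    est = bb[0]
    for i, x in enumerate(bb):
        est = alpha * est + (1 - alpha) * x  # track slow background
        dynamic[i] = x - est
    return dynamic
```

For a perfectly static scene the output of `remove_static` decays toward zero, so any residual energy can be attributed to a moving reflector such as a hand.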
For mobile devices with two or more microphones, LLAP is capable of 2-D gesture tracking that allows users to draw in the air with their hands/fingers.

1.4 Technical Challenges and Solutions

The first challenge is to achieve mm-level accuracy for the measurement of hand/finger movement distance. Existing sound based ranging systems either use the Time-Of-Arrival/Time-Difference-Of-Arrival (TOA/TDOA) measurements [7, 8] or the Doppler shift measurements [9, 10]. Traditional TOA/TDOA based systems require the device to emit bursty sound signals, such as pulses or chirps, which are often audible to humans as these signals change abruptly [7, 8]. Furthermore, their distance measurement accuracy is often on the scale of cm, except for the recent OFDM phase based approach [5]. Doppler shift based device-free systems do not have tracking capability and can only recognize predefined gestures because Doppler shift can only provide a coarse-grained measurement of the speed or direction of hand/finger movements due to the limited frequency measurement precision [9, 11, 12]. In contrast, to achieve mm-level hand/finger tracking accuracy, we leverage the fact that the sound reflected by a human hand is coherent to the sound emitted by the mobile device. Two signals are coherent if they have a constant phase difference and the same frequency. This coherency allows us to use a coherent detector to convert the received sound signal into a complex-valued baseband signal. Our approach is to first measure the phase change of the reflected signal, rather than using the noise-prone integration of the Doppler shift as AAMouse [13] did, and then convert the phase change to the movement distance of a hand/finger. Compared with traditional TOA/TDOA, our approach has two advantages: (1) human inaudibility, and (2) mm-level tracking accuracy.
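As a hedged sketch of the phase-to-distance relation (not the paper's code): when the hand moves a distance d along the reflection direction, the speaker-hand-microphone path length changes by roughly 2d, so the baseband phase changes by Δφ = 2π(2d)/λ; inverting gives d = Δφ·λ/(4π). The speed of sound and carrier frequency below are assumed values for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumed)

def phase_to_distance(phase, freq=20000.0):
    """Convert a baseband phase series (radians) into the relative
    movement distance of the reflecting hand (meters).

    A movement of d along the reflection direction changes the
    speaker->hand->mic path by roughly 2*d, so the phase changes by
    2*pi*(2*d)/lambda; hence d = delta_phi * lambda / (4*pi).
    """
    wavelength = SPEED_OF_SOUND / freq          # ~1.7 cm at 20 kHz
    unwrapped = np.unwrap(phase)                # undo 2*pi wrapping
    return (unwrapped - unwrapped[0]) * wavelength / (4 * np.pi)
```

With a 20 kHz tone (λ ≈ 1.7 cm), one full phase cycle corresponds to only about 8.6 mm of hand movement, which is why millimeter-scale movements produce easily measurable phase changes.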
Compared with Doppler shift, our approach has three advantages: (1) tracking capability, (2) low latency, and (3) the ability to track slow or small movements of a hand/finger. We have lower latency than Doppler shift based systems because Doppler shift requires a Fast Fourier Transform (FFT), which needs to accumulate at least 2048 samples (translated to 42.7 ms) to process, whereas we only need to accumulate 16 samples (translated to 0.3 ms). In other words, Doppler shift based systems only respond to hand/finger movement every 42.7 ms whereas our LLAP system can respond to hand/finger movement every 0.3 ms. Note that in practice, we may need to accumulate more samples due to the hardware limitations of mobile devices, such as 512 samples (translated to 10.7 ms) on smartphones. We can deal with slow hand/finger movement because LLAP can precisely measure the slow phase changes accumulated over time. We can deal with small hand/finger movement because LLAP can precisely measure small phase changes that are less than a full phase cycle. In contrast, Doppler-based approaches cannot detect slow or small movements due to their limited frequency resolution, as we show in Section 3.

The second challenge is to achieve two-dimensional gesture tracking. Although LLAP can precisely measure the relative movement distance of a hand, it cannot directly measure the absolute distance between the hand and the speaker/microphones, and therefore it is hard to determine the initial hand location that is essential for two-dimensional tracking. To address this challenge, we use multiple Continuous Waves (CW) with linearly spaced frequencies to measure the path length. We observe that sound waves with different frequencies have different wavelengths, which leads to different phase shifts even if they travel through the same path.
To determine the path length of the reflected sound wave, we first isolate the phase changes caused by hand/finger movement and then apply an Inverse Discrete Fourier Transform (IDFT) on the phases of different sound frequencies to get the TOA of the path. By identifying the TOA that has the strongest energy in the IDFT result, we can determine the path length for the sound reflected by the moving hand/finger. Thus, our approach can serve as a coarse-grained initial position estimation. Combining the fine-grained relative distance measurement and the coarse-grained initial position estimation, we can achieve relatively accurate 2-D hand/finger tracking.

1.5 Summary of Experimental Results

We implemented and evaluated LLAP using commercial mobile phones without any hardware modification. Under normal indoor noise levels, for 1-D hand movement and 2-D drawing in the air, LLAP has a tracking accuracy of 3.5 mm and 4.57 mm, respectively. Under loud indoor noise, such as playing music, for 1-D hand movement and 2-D drawing in the air, LLAP has a tracking accuracy of 5.81 mm and 4.89 mm, respectively. Experimental results also show that LLAP can detect small hand/finger movements. For example, for a small single-finger movement of 5 mm, LLAP has a detection accuracy of 94% within a distance of 30 cm. Using gesture traces tracked by LLAP, we can recognize the characters and short words drawn in the air with an accuracy of 92.3% and 91.2%, respectively.

2. RELATED WORK

Sound Based Localization and Tracking: TOA and TDOA ranging systems using sound waves have a good ranging accuracy of a few centimeters because of the slower propagation speed of sound compared to radio waves [7, 8, 14-16]. However, such systems often either require specially designed ultrasound transceivers [14] or emit audible probing sounds, such as short bursty sound pulses or chirps [7, 8, 15]. Furthermore, most existing sound based tracking systems are not device-free as they can only track a device that
transmits or receives sound signals [7.8.10.13-15.17].For ex- 3.MEASURE 1-D RELATIVE DISTANCE ample,AAMouse measures the Doppler shifts of the sound waves In this section,we present our approach to measuring the one transmitted by a smart phone to track the phone itself with an accur- dimensional relative movement distance of a hand/finger,which acy of 1.4 cm [13].In comparison,our approach is device-free as consists of three steps.First,we use a coherent detector to down we use the sound signals reflected by a hand/finger.The problems comert the received sound signal into a complex-valued baseband that we face are more challenging because the signal reflected by signal.Second,we measure the path length change based on the the object has much weaker energy compared to the signal travelled phase changes of the baseband signal.Third,we combine the phase through the Line-Of-Sight (LOS)path changes at different frequencies to mitigate the multipath effect Sound Based Device-Free Gesture Recognition:Most sound Before we introduce these three steps,we analyze the limitations of based device-free gesture recognition systems use the Doppler ef- the Doppler shift based approach,which is used by most existing fect of the sound reflected by hands [9.11.12].Such systems do sound-based gesture recognition systems [8,9,11-13]and present not have tracking capability and can only recognize predefined ges- the advantages of our phase based approach over the Doppler shift tures because Doppler shift can only provide the coarse-grained based approach measurement of the speed or direction of hand/finger movements due to the limited frequency measurement precision [9,11,12].An- 3.1 Limitations of Doppler Shift Based Dis- other system,ApenaApp,uses chirp signals to detect the changes tance Measurement in reflected sound that is caused by human breaths [18].ApenaApp As a moving object changes the frequency of the sound waves applies FFT over the sound signals of a long duration to 
achieve reflected by it,by measuring the frequency changes in the re- better distance resolution at the cost of reducing the time resolu- ceived sound signal,which is called Doppler shift,we can calcu- tion.Thus,ApenaApp's approach can only be used for long term late the movement speed of the object.The traditional Doppler shift monitoring for periodical movements(such as human breaths)that measurement approach,which uses Short-Time Fourier Transform have frequency lower than 1 Hz.There are keystroke recognition (STFT)to get the Doppler shift,is not suitable for device-free ges systems that use the sound emitted by gestures,such as typing on a ture recognition due to its low resolution and highly noisy results. keyboard or tapping on a table,to recognize keystrokes [19-21]or First,the resolution of STFT is limited by the fundamental con- handwriting [22].Compared with such systems,we use inaudible. straints of time-frequency analysis [36].The STFT approach first rather than audible,sound reflected by hands/fingers. divides the received sound data into data segments,where each In recent pioneer work parallel with us,Nandakumar et al.pro- segment has equal number(say 2.048)of signal samples,and then posed an OFDM based finger tracking system,called fingerlO [5]. 
FingerIO achieves a finger location accuracy of 8 mm and also performs Fast Fourier Transform(FFT)on each segment to get the spectrum of the given data segment.With a small segment size,the allows 2-D drawing in the air using COTS mobile devices.The frequency resolution is very low.For example,when the segment key difference between LLAP and fingerIO is that LLAP uses CW size is 2,048 samples and the sampling rate is 48 kHz,the frequency signals rather than OFDM pulses.The phase measured by CW resolution of STFT is 23.4 Hz.This corresponds to a movement signals is less noisy due to the narrower bandwidth compared to OFDM pulses.This allows LLAP to achieve better tracking ac- speed of 0.2 meters per second(m/s)when the sound wave has a frequency of 20 kHz.In other words,the hand must move at a speed curacy.Furthermore,the complex valued baseband signal extracted of at least 20 cm per second to be detectable by the STFT approach. by LLAP can potentially give more information about hand/finger Note that improving the frequency resolution is always at the cost movements than the TOA measurements from fingerIO.However, of reducing the time resolution [36].For example,if we use a larger the CW signal approach used by LLAP is more susceptible to the segment size with 48,000 samples to get the frequency resolution of interference of background movements than the OFDM approach. 
1 Hz,this will inevitably reduce the time resolution of STFT to one RF Based Gesture Recognition:Radio Frequency (RF)signals, second as it takes one second to collect 48,000 samples when the such as Wi-Fi signals,reflected by human bodies can be used for sampling rate is 48 kHz.Distance measuring schemes with such a human gesture and activity recognition [23-28].However,as the low time resolution are unacceptable for interactive inputs because propagation speed of light is almost one million times faster than they can only measure the moving distances of a hand/finger at a the speed of sound,it is very difficult to achieve fine-grained dis- one-second time interval.Note that the resolution for STFT cannot tance measurements through RF signals.Therefore,existing Wi- be improved by padding short data segments with zeros and per Fi signal based gesture recognition systems cannot perform fine- form FFT with a larger size,as done in [13],because zero padding grained quantification of gesture movement.Instead,they recog- is equivalent to convolution with a sinc function in the frequency nize predefined gestures,such as punch,push,or sweep [27,29.30] domain.Figure 2 shows the STFT result for a hand that first moves When using narrow band RF signals lower than 5 GHz,the state toward and then moves away from the microphone,where each of-the-art tracking systems have a measurement accuracy of sev- sample segment contains 2,048 samples and is padded with zeros eral cm [31,32].To the best of our knowledge,the only RF based to perform FFT with size of 48,000.Although the frequency resol- gesture recognition systems that achieve mm-level tracking accur- ution seems to be improved to 1 Hz when we perform FFT with a acy are mTrack [2]and Soli [1],which uses 60 GHz RF signals larger size,the high energy band in the frequency domain(red part The key advantage of our system over mTrack and Soli is that we in the spectrogram)still spans about 80 Hz range,instead of being use speakers 
and microphones that already exist on most mobile around I Hz.Most of the small frequency variations are buried in devices to perform device-free tracking of a hand/finger. this wide band and we can only roughly recognize a positive fre- Vision Based Gesture Recognition:Vision based gesture re- quency shift from 4 to 5.2 seconds and a negative frequency shift cognition systems use cameras or light sensors to capture fine- from 6 to 7.5 seconds. grained gesture movements [3,4,33-35].For example,Okuli Second,Doppler shift measurements are subject to high noises achieves a localization accuracy of 7 mm using LED and light as shown in Figure 2.In device-based tracking systems,such as sensors [3].However,such systems have a limited viewing angle AAMouse [13],where the sound source or sound receiver is mov- and are susceptible to lighting condition changes [3].In contrast, ing,it is possible to use the frequency that has the maximal energy LLAP can operate while the device is within the pocket. to determine the Doppler shift.In device-free tracking systems
transmits or receives sound signals [7, 8, 10, 13–15, 17]. For example, AAMouse measures the Doppler shifts of the sound waves transmitted by a smartphone to track the phone itself with an accuracy of 1.4 cm [13]. In comparison, our approach is device-free as we use the sound signals reflected by a hand/finger. The problems that we face are more challenging because the signal reflected by the object has much weaker energy than the signal traveling through the Line-Of-Sight (LOS) path. Sound Based Device-Free Gesture Recognition: Most sound based device-free gesture recognition systems use the Doppler effect of the sound reflected by hands [9, 11, 12]. Such systems do not have tracking capability and can only recognize predefined gestures, because the Doppler shift provides only coarse-grained measurements of the speed or direction of hand/finger movements due to the limited frequency measurement precision [9, 11, 12]. Another system, ApneaApp, uses chirp signals to detect the changes in reflected sound caused by human breathing [18]. ApneaApp applies FFT over sound signals of a long duration to achieve better distance resolution at the cost of reduced time resolution. Thus, ApneaApp's approach can only be used for long-term monitoring of periodic movements (such as human breathing) with frequencies lower than 1 Hz. There are keystroke recognition systems that use the sound emitted by gestures, such as typing on a keyboard or tapping on a table, to recognize keystrokes [19–21] or handwriting [22]. Compared with such systems, we use inaudible, rather than audible, sound reflected by hands/fingers. In recent pioneering work parallel to ours, Nandakumar et al. proposed an OFDM based finger tracking system called fingerIO [5]. FingerIO achieves a finger location accuracy of 8 mm and also allows 2-D drawing in the air using COTS mobile devices. The key difference between LLAP and fingerIO is that LLAP uses CW signals rather than OFDM pulses.
The phase measured by CW signals is less noisy due to the narrower bandwidth compared to OFDM pulses. This allows LLAP to achieve better tracking accuracy. Furthermore, the complex valued baseband signal extracted by LLAP can potentially give more information about hand/finger movements than the TOA measurements from fingerIO. However, the CW signal approach used by LLAP is more susceptible to interference from background movements than the OFDM approach. RF Based Gesture Recognition: Radio Frequency (RF) signals, such as Wi-Fi signals, reflected by human bodies can be used for human gesture and activity recognition [23–28]. However, as the propagation speed of light is almost one million times faster than the speed of sound, it is very difficult to achieve fine-grained distance measurements through RF signals. Therefore, existing Wi-Fi signal based gesture recognition systems cannot perform fine-grained quantification of gesture movement. Instead, they recognize predefined gestures, such as punch, push, or sweep [27, 29, 30]. When using narrow band RF signals lower than 5 GHz, the state-of-the-art tracking systems have a measurement accuracy of several cm [31, 32]. To the best of our knowledge, the only RF based gesture recognition systems that achieve mm-level tracking accuracy are mTrack [2] and Soli [1], which use 60 GHz RF signals. The key advantage of our system over mTrack and Soli is that we use speakers and microphones that already exist on most mobile devices to perform device-free tracking of a hand/finger. Vision Based Gesture Recognition: Vision based gesture recognition systems use cameras or light sensors to capture fine-grained gesture movements [3, 4, 33–35]. For example, Okuli achieves a localization accuracy of 7 mm using LED and light sensors [3]. However, such systems have a limited viewing angle and are susceptible to lighting condition changes [3]. In contrast, LLAP can operate while the device is within the pocket. 3.
MEASURE 1-D RELATIVE DISTANCE

In this section, we present our approach to measuring the one-dimensional relative movement distance of a hand/finger, which consists of three steps. First, we use a coherent detector to down convert the received sound signal into a complex-valued baseband signal. Second, we measure the path length change based on the phase changes of the baseband signal. Third, we combine the phase changes at different frequencies to mitigate the multipath effect. Before we introduce these three steps, we analyze the limitations of the Doppler shift based approach, which is used by most existing sound-based gesture recognition systems [8, 9, 11–13], and present the advantages of our phase based approach over the Doppler shift based approach.

3.1 Limitations of Doppler Shift Based Distance Measurement

As a moving object changes the frequency of the sound waves reflected by it, by measuring the frequency change in the received sound signal, which is called the Doppler shift, we can calculate the movement speed of the object. The traditional Doppler shift measurement approach, which uses the Short-Time Fourier Transform (STFT) to get the Doppler shift, is not suitable for device-free gesture recognition due to its low resolution and highly noisy results. First, the resolution of STFT is limited by the fundamental constraints of time-frequency analysis [36]. The STFT approach first divides the received sound data into segments, where each segment has an equal number of signal samples (say 2,048), and then performs a Fast Fourier Transform (FFT) on each segment to get its spectrum. With a small segment size, the frequency resolution is very low. For example, when the segment size is 2,048 samples and the sampling rate is 48 kHz, the frequency resolution of STFT is 23.4 Hz. This corresponds to a movement speed of 0.2 meters per second (m/s) when the sound wave has a frequency of 20 kHz.
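The 23.4 Hz and 0.2 m/s figures follow directly from the STFT parameters. A quick sketch of the arithmetic (using the sampling rate, segment size, and carrier frequency from the text):

```python
# STFT bin width and the corresponding Doppler speed resolution
# for a reflected (two-way) sound path.
fs = 48_000   # sampling rate (Hz)
N = 2_048     # STFT segment size (samples)
f = 20_000    # sound carrier frequency (Hz)
c = 343       # speed of sound (m/s)

delta_f = fs / N              # width of one FFT bin
# A reflecting target produces a Doppler shift of 2*v*f/c, so one
# bin corresponds to a radial speed of delta_f * c / (2 * f).
v_min = delta_f * c / (2 * f)

print(f"frequency resolution: {delta_f:.1f} Hz")  # 23.4 Hz
print(f"speed resolution:     {v_min:.2f} m/s")   # 0.20 m/s

# 1 Hz bins require fs samples per segment, i.e., a full second of
# data, which destroys the time resolution.
seg_1hz = fs
print(f"segment duration for 1 Hz bins: {seg_1hz / fs:.0f} s")  # 1 s
```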
In other words, the hand must move at a speed of at least 20 cm per second to be detectable by the STFT approach. Note that improving the frequency resolution always comes at the cost of reducing the time resolution [36]. For example, if we use a larger segment size of 48,000 samples to get a frequency resolution of 1 Hz, this will inevitably reduce the time resolution of STFT to one second, as it takes one second to collect 48,000 samples when the sampling rate is 48 kHz. Distance measuring schemes with such a low time resolution are unacceptable for interactive inputs because they can only measure the moving distance of a hand/finger at one-second intervals. Note that the resolution of STFT cannot be improved by padding short data segments with zeros and performing FFT with a larger size, as done in [13], because zero padding is equivalent to convolution with a sinc function in the frequency domain. Figure 2 shows the STFT result for a hand that first moves toward and then moves away from the microphone, where each sample segment contains 2,048 samples and is padded with zeros to perform an FFT with a size of 48,000. Although the frequency resolution seems to be improved to 1 Hz when we perform an FFT with a larger size, the high energy band in the frequency domain (the red part in the spectrogram) still spans a range of about 80 Hz, instead of being around 1 Hz. Most of the small frequency variations are buried in this wide band, and we can only roughly recognize a positive frequency shift from 4 to 5.2 seconds and a negative frequency shift from 6 to 7.5 seconds. Second, Doppler shift measurements are subject to high noise, as shown in Figure 2. In device-based tracking systems, such as AAMouse [13], where the sound source or sound receiver is moving, it is possible to use the frequency that has the maximal energy to determine the Doppler shift. In device-free tracking systems,
however, the frequency with the highest energy, which is plotted as the white line around 18 kHz in Figure 2, does not closely follow the hand movement, because the sound waves reflected by the moving hand are mixed with the sound waves traveling through the Line-Of-Sight (LOS) path as well as those reflected by static objects. Furthermore, there are impulses in the Doppler shift measurements due to frequency selective fading caused by the hand movement, i.e., the sound waves traveling along different paths may cancel each other at the target frequency when the hand is at certain positions.

Figure 2: Doppler shift of hand movements

3.2 Phase Based Distance Measurement

Because of the above limitations of Doppler shift based distance measurement, we propose a phase based distance measurement approach for sound signals. As the Doppler shift in the reflected signal is caused by the increase/decrease in the phase of the signal when the hand moves closer/farther away, the idea is to treat the reflected signal as a phase modulated signal whose phase changes with the movement distance. Except for fingerIO, which uses OFDM phase [5], no prior work has used phase changes of sound signals to measure movement distance, although the phase of the RF baseband signal has been used for measuring the movement distance of objects [2, 23]. Compared to the Doppler shift, the phase change of the baseband signal can be easily measured in the time domain. Figure 3 shows the In-phase (I) and Quadrature (Q) components of the baseband signal obtained from the same sound recording that produces the spectrogram in Figure 2. From Figure 3(a), we observe that the I/Q waveforms remain static when the hand is not moving and vary like sinusoids when the hand moves. Combining the in-phase (as the real part) and quadrature (as the imaginary part) components into a complex signal, we can clearly observe patterns caused by hand movement.
Figure 3(b) shows how the complex signal changes during a short time period from 4.04 to 4.64 seconds while the hand moves towards the microphone. We observe that the traces of the complex signal are close to circles on the complex plane. In essence, the complex signal is a combination of two vectors in the complex plane, which we call a static vector and a dynamic vector. The static vector corresponds to the sound wave traveling through the LOS path or reflected by static objects, such as walls and tables. This vector remains quasi-static during this short time period. The dynamic vector corresponds to the reflection caused by the moving hand. When the hand moves towards the microphone, we observe an increase in the phase of the dynamic vector, which is caused by the decrease in the length of the reflected path. As the phase of the signal increases by 2π when the path length decreases by one wavelength of the sound wave, we can calculate the distance that the hand moves via the phase change of the dynamic vector.

Figure 3: Baseband signal of sound waves. (a) I/Q waveforms; (b) Complex I/Q traces

Assuming that the speed of sound is c = 343 m/s, the wavelength of sound signals with frequency f = 18 kHz is 1.9 cm. We observe that the complex signal moves by about 4.25 circles, which corresponds to an 8.5π increase in phase values in Figure 3(b). Thus, the path length changes by 1.9 × 4.25 = 8.08 cm during the 0.6 second shown in Figure 3(b). This is equivalent to a hand movement distance of 4.04 cm considering the two-way path length change. Furthermore, we can determine whether the hand is moving toward or away from the microphone by the sign of the phase changes.
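The circle counting above can be reproduced numerically: unwrap the phase of the dynamic vector and convert the phase change back into a path length change. A minimal sketch on a synthetic dynamic vector (the movement profile and baseband rate are our own illustrative choices):

```python
import numpy as np

c = 343.0        # speed of sound (m/s)
f = 18_000.0     # carrier frequency (Hz)
lam = c / f      # wavelength, about 1.9 cm

# Synthetic dynamic vector: the hand moves 4.04 cm toward the
# microphone over 0.6 s, shortening the two-way reflected path by
# 8.08 cm (about 4.25 wavelengths), as in Figure 3(b).
fs_b = 3_000                          # baseband sampling rate (Hz)
t = np.arange(0, 0.6, 1 / fs_b)
path = 0.5 - 2 * (0.0404 / 0.6) * t   # two-way path length d(t), meters
baseband = np.exp(-1j * 2 * np.pi * f * path / c)   # dynamic vector

# Unwrap the phase and convert the phase change back to distance.
phase = np.unwrap(np.angle(baseband))
path_change = -(phase - phase[0]) / (2 * np.pi) * lam   # d(t) - d(0)
hand_move = -path_change[-1] / 2      # one-way hand movement (m)

print(f"recovered hand movement: {hand_move * 100:.2f} cm")  # ~4.04 cm
```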
Note that it is important to use both the I and Q components because the movement direction information is lost when we only use a single component or the magnitude [23]. This phase based distance measurement approach has three advantages over the Doppler shift based approach. First, the accuracy is much higher because by directly measuring the phase changes, we eliminate the noise-prone steps of first measuring the Doppler shift and then integrating the Doppler shift to get the distance changes. Second, the latency is much lower because the phase measurement can be conducted on a short data segment with only hundreds of samples. Third, the speed resolution is much higher because the phase measurement can track small phase changes and slow phase shifts. For example, phase based measurement can easily achieve 2.4 mm distance resolution, which corresponds to a phase change of π/4 when the wavelength is 1.9 cm. Furthermore, the information is much richer because phase measurements provide more information than what we get from STFT. For example, the phase difference at different frequencies can be used for localizing the hand as discussed in Section 4. 3.3 LLAP Overview We now give an overview of LLAP when operating on a single sound frequency. Without loss of generality, we assume that the sampling frequency of the device is 48 kHz. We have tested our implementation under other sampling frequencies, e.g., 44.1 kHz
and obtained similar results as in 48 kHz. LLAP uses a Continuous Wave (CW) signal A cos 2πf t, where A is the amplitude and f is the frequency of the sound, which is in the range of 17 ∼ 23 kHz. CW sound signals in this range can be generated by many COTS devices without introducing audible noise [6]. We use the microphones on the same device to record the sound wave using the same sampling rate of 48 kHz. As the received sound waves are transmitted by the same device, there is no Carrier Frequency Offset (CFO) between the sender and receiver. Therefore, we can use the traditional coherent detector structure shown in Figure 4 to down convert the received sound signal to a baseband signal [37]. The received signal is first split into two identical copies and multiplied with the transmitted signal cos 2πf t and its phase shifted version − sin 2πf t. We then use a Cascaded Integrator Comb (CIC) filter to remove high frequency components and decimate the signal to get the corresponding In-phase and Quadrature signals.

Figure 4: System structure

3.4 Sound Signal Down Conversion

Our CIC filter is a three-section filter with a decimation ratio of 16 and a differential delay of 17. Figure 5 shows the frequency response of the CIC filter. We select the parameters so that the first and second zeros of the filter appear at 175 Hz and 350 Hz. The pass-band of the CIC filter is 0 ∼ 100 Hz, which corresponds to movements with a speed lower than 0.95 m/s when the wavelength is 1.9 cm. The second zero of the filter appears at 350 Hz so that the signals at (f ± 350) Hz will be attenuated by more than 120 dB. Thus, to minimize the interference from adjacent frequencies, we use a frequency interval of 350 Hz when the speaker transmits multiple frequencies simultaneously. To achieve better computational efficiency, we do not use a frequency compensation FIR filter after the CIC.
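The null placement can be checked numerically: before decimation, a CIC filter with N sections, decimation ratio R, and differential delay M is equivalent to N cascaded boxcar averages of length R·M, with nulls at multiples of fs/(R·M) = 48000/272 ≈ 176 Hz. A sketch of that equivalent response (the evaluation method is our own and ignores decimation):

```python
import numpy as np

fs = 48_000           # sampling rate (Hz)
R, M, N = 16, 17, 3   # decimation ratio, differential delay, sections

# Equivalent FIR: N cascaded boxcars of length R*M, DC gain normalized.
box = np.ones(R * M)
h = box
for _ in range(N - 1):
    h = np.convolve(h, box)
h = h / h.sum()

def mag_db(freq_hz):
    """Magnitude response in dB at one frequency."""
    n = np.arange(len(h))
    H = np.sum(h * np.exp(-2j * np.pi * freq_hz / fs * n))
    return 20 * np.log10(np.abs(H) + 1e-300)

first_null = fs / (R * M)
print(f"first null: {first_null:.1f} Hz")        # ~176.5 Hz
print(f"|H| at 100 Hz: {mag_db(100):.1f} dB")    # pass-band edge (note the droop)
print(f"|H| at 350 Hz: {mag_db(350):.1f} dB")    # below -120 dB, matching the text
```

The pass-band droop visible at 100 Hz is the behavior a compensation FIR would correct, which the text skips for efficiency.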
Figure 5: Frequency response of CIC filter

The CIC filter incurs low computational overhead, as it involves only additions and subtractions. Therefore, we only need two multiplications per sample point for the down conversion, i.e., multiplying cos 2πf t and − sin 2πf t with each received sample. For a 48 kHz sampling rate, this only involves 96,000 multiplications per second and can easily be carried out by mobile devices. After the down conversion, the sampling rate is decreased to 3 kHz to make subsequent signal processing more efficient.

To understand the digital down conversion process, we consider the sound signal that travels through a path p with time-varying path length d_p(t). The received sound signal from path p can be represented as R_p(t) = 2A′_p cos(2πf t − 2πf d_p(t)/c − θ_p), where 2A′_p is the amplitude of the received signal, the term 2πf d_p(t)/c comes from the phase lag caused by the propagation delay τ_p = d_p(t)/c, and c is the speed of sound. There is also an initial phase θ_p, which is caused by the hardware delay and the phase inversion due to reflection. Based on the system structure shown in Figure 4, when we multiply this received signal with cos(2πf t), we have:

2A′_p cos(2πf t − 2πf d_p(t)/c − θ_p) × cos(2πf t) = A′_p ( cos(−2πf d_p(t)/c − θ_p) + cos(4πf t − 2πf d_p(t)/c − θ_p) ).

Note that the second term has a high frequency of 2f and will be removed by the low-pass CIC filter. Therefore, we have the I-component of the baseband as I_p(t) = A′_p cos(−2πf d_p(t)/c − θ_p). Similarly, we get the Q-component as Q_p(t) = A′_p sin(−2πf d_p(t)/c − θ_p). Combining these two components as the real and imaginary parts of a complex signal, we have the complex baseband as follows, where j² = −1:

B_p(t) = A′_p e^(−j(2πf d_p(t)/c + θ_p)).   (1)

Note that the phase for path p is φ_p(t) = −(2πf d_p(t)/c + θ_p), which changes by 2π when d_p(t) changes by the amount of the sound wavelength λ = c/f.
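The derivation above can be checked end to end on a synthetic received signal. In this sketch, a simple moving average stands in for the CIC filter, and the path, amplitude, and phase values are illustrative assumptions:

```python
import numpy as np

fs = 48_000       # sampling rate (Hz)
f = 18_000.0      # carrier frequency (Hz)
c = 343.0         # speed of sound (m/s)
theta = 0.7       # initial phase (hardware delay + reflection), assumed
A = 1.0           # path amplitude A'_p, assumed

t = np.arange(0, 0.1, 1 / fs)
d = 0.50 - 0.2 * t          # path length d_p(t), shrinking at 0.2 m/s

# Received signal R_p(t) = 2 A'_p cos(2*pi*f*t - 2*pi*f*d(t)/c - theta).
rx = 2 * A * np.cos(2 * np.pi * f * t - 2 * np.pi * f * d / c - theta)

# Mix with cos(2*pi*f*t) and -sin(2*pi*f*t), then low-pass to remove
# the 2f image (a 10 ms moving average stands in for the CIC filter).
k = np.ones(480) / 480
I = np.convolve(rx * np.cos(2 * np.pi * f * t), k, mode="same")
Q = np.convolve(rx * (-np.sin(2 * np.pi * f * t)), k, mode="same")
B = I + 1j * Q              # complex baseband, Eq. (1)

# The baseband phase should equal -(2*pi*f*d(t)/c + theta) mod 2*pi.
mid = len(t) // 2
expect = -(2 * np.pi * f * d[mid] / c + theta)
err = np.angle(B[mid] * np.exp(-1j * expect))
print(f"phase error at t = 0.05 s: {abs(err):.4f} rad")  # close to 0
```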
3.5 Phase Based Path Length Measurement

As the received signal is a combination of the signals traveling through many paths, we first need to extract the baseband signal component that corresponds to the one reflected by the moving hand, so that we can infer the movement distance from the phase change of that component, as we will show next. Thus, we need to decompose the baseband signal into the static and dynamic vectors. Recall that the static vector comes from sound waves traveling through the LOS path or reflected by static surrounding objects, which could be much stronger than the sound waves reflected by the hand. In practice, this static vector may also vary slowly with the movement of the hand. Such changes in the static vector are caused by the blocking of other objects by the moving hand or by slow movements of the arm. It is therefore challenging to separate the slowly changing static vector from the dynamic vector caused by a slow hand movement. Existing work in 60 GHz technology uses two methods, Dual-Differential Background Removal (DDBR) and Phase Counting and Reconstruction (PCR), to remove the static vector [2]. However, the DDBR algorithm is susceptible to noise and cannot reliably detect slow movements, while PCR has long latency and requires strong periodicity in the baseband signal. Thus, neither of these algorithms is suitable for our purpose. We use a heuristic algorithm called Local Extreme Value Detection (LEVD) to estimate the static vector. This algorithm operates on the I and Q components separately to estimate the real and imaginary parts of the static vector. The basic idea of LEVD is inspired by the well-known Empirical Mode Decomposition (EMD) algorithm [38]. We first find alternating local maximum and minimum points that differ by more than an empirical threshold Thr, which is set to three times the standard deviation of the baseband signal in a static environment.
These large variations in the waveform indicate the movements of surrounding objects. We then use the average of two nearby local maxima and minima as the estimated value of the static vector. Since the dynamic vector has a trace similar to circles, the average of two extremes is close to the center. Figure 6 shows the LEVD result for a short piece of the waveform in Figure 3(a). The LEVD pseudocode is given in Algorithm 1.
Algorithm 1: Local Extreme Value Detection (LEVD)
Input: One baseband signal component X(t) = I(t) or Q(t), t = 0 . . . T
Output: Real or imaginary part of the estimated static vector S(t), t = 0 . . . T
1  Initialize n: number of extrema, S(0): initial estimation
2  E(n): extrema list
3  for t = 1 to T do
4      /* Find extreme points that meet our requirements */
5      if X(t) is a local maximum or minimum then
6          Compare X(t) with the last extreme point E(n) in the list;
7          if both X(t) and E(n) are local maxima/minima, and the value of X(t) is larger/smaller than E(n) then
8              E(n) ← X(t);
9          end
10         if one of X(t) and E(n) is a maximum and the other is a minimum, and |X(t) − E(n)| > Thr then
11             n ← n + 1;
12             E(n) ← X(t);
13         end
14     end
15     /* Update the static component estimation using an exponential moving average */
16     S(t) ← 0.9 × S(t − 1) + 0.1 × (E(n − 1) + E(n))/2
17 end
18 return S(t)

Figure 6: Local extrema based static vector estimation

The advantage of LEVD lies in its robustness to movement speed changes. On one hand, by following the averages of the extreme points, it can quickly trace static vector changes caused by arm movements when the hand moves fast. On the other hand, the estimated value of the static vector remains constant when there are no movements or when the movements are slow. For example, during the time period of 5.5 to 6 seconds in Figure 6, the normalized value of the in-phase component is around −100, which is far from the actual real part of the static vector. If we used a long-term averaging algorithm to estimate the static vector, the estimated real part of the static vector would slowly drift towards −100. In contrast, the static vector estimate of LEVD remains stable, as there are no valid extreme points during this period.
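Algorithm 1 translates almost line for line into code. The sketch below follows the pseudocode (alternating extrema gated by Thr, then an exponential moving average of the mid-point of the last maximum/minimum pair); the three-point extremum test and the toy signal are our own simplifications:

```python
import numpy as np

def levd(x, thr, s0=0.0):
    """Local Extreme Value Detection (Algorithm 1): estimate one
    component (I or Q) of the static vector from baseband samples x."""
    n_samples = len(x)
    s = np.empty(n_samples)
    s[0] = s0
    ext_val = []   # extrema list E(n)
    ext_kind = []  # +1 for a local maximum, -1 for a local minimum
    for t in range(1, n_samples):
        if 0 < t < n_samples - 1:
            # Simple three-point local-extremum test (our simplification).
            kind = 0
            if x[t] >= x[t - 1] and x[t] >= x[t + 1]:
                kind = +1
            elif x[t] <= x[t - 1] and x[t] <= x[t + 1]:
                kind = -1
            if kind != 0:
                if ext_val and ext_kind[-1] == kind:
                    # Same kind as the last extremum: keep the larger
                    # maximum / smaller minimum (Algorithm 1, lines 7-8).
                    if kind * x[t] > kind * ext_val[-1]:
                        ext_val[-1] = x[t]
                elif not ext_val or abs(x[t] - ext_val[-1]) > thr:
                    # Opposite kind with a large enough swing: new extremum.
                    ext_val.append(x[t])
                    ext_kind.append(kind)
        # Exponential moving average of the mid-point of the last
        # maximum/minimum pair (Algorithm 1, line 16).
        if len(ext_val) >= 2:
            s[t] = 0.9 * s[t - 1] + 0.1 * (ext_val[-2] + ext_val[-1]) / 2
        else:
            s[t] = s[t - 1]
    return s

# Toy check: a sinusoidal dynamic vector around a constant static level.
t = np.arange(0, 1, 1 / 500)
x = 120.0 + 50.0 * np.sin(2 * np.pi * 8 * t)
est = levd(x, thr=30.0, s0=x[0])
print(f"estimated static level: {est[-1]:.1f}")  # near the true level, 120.0
```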
After finding the static vector using LEVD,we subtract it from the baseband signal to get the dynamic vector.We then use the path length phase od(t)of the dynamic vector to determine the path length 20 change 2a change.We first unwrap the phase oa(t)and the path length change during the time period 0~t is given by: d()-do)=-a(④-pa(0) ×入 (2) 2n path length where d(t)is the path length from the speaker reflected through the 500 80089nal2814o0 160 hand to the microphone,and A =c/f is the sound wavelength When the hand and the microphone/speaker are on the same line, (a)I/Q trace under multipath (b)Impact of the hand size the movement distance of the hand is (d(t)-d(0))/2 when it moves towards the speaker,as shown in Figure 4.Note that the Figure 7:Multipath effect distance calculation can be made on a small data segment,e.g.. segments with only hundreds of samples.This allows us to respond to hand movements with very low latency,such as 15 ms. obtained from different frequencies to mitigate the multipath ef- fect.To get the baseband signal at different frequencies,we trans- 3.6 Multipath Effect Mitigation mit sounds at multiple frequencies at the same time.The coherent Although LEVD can mitigate the effect of static multipaths by detection structure can be applied on each frequency to obtain one subtracting the static vector,there are dynamic multipaths when the complex baseband signal for each frequency.We remove the in- hand moves.A path that the sound wave travels is called static if its terference between adjacent frequencies by carefully selecting the length does not change as the hand moves and dynamic if its length parameters of the CIC filter and the frequency interval.Thus,each changes as the hand moves.An example dynamic path is from the frequency can be measured independently.After getting the phase speaker to the hand,and then to a nearby table,and finally to the of dynamic vectors at different frequencies,we can obtain the dis- 
microphone.Therefore,sometimes there are multiple dynamic vec- tance change curve over time using the wavelength corresponding tors and these dynamic vectors may have different phases.This will to each frequency.We combine the results of different frequencies result in complex signal trajectories,as shown in Figure 7(a).Be- using linear regression.Our approach is based on two observations. cause of dynamic multipaths,it is difficult to determine the actual First,the measured distance change should be the same for all fre- phase change from superimposed dynamic vectors. quencies when there is no multipath effect.Second,the distance We use frequency diversity to mitigate the multipath effect.The should change linearly during a short time period,e.g.,10 ms,as wavelengths of different sound frequencies are different.Thus,the the movement speed is almost constant during that short period. phases of the same multipath component are different under dif- Therefore,we use linear regression to find the best line that fits ferent frequencies,and the phase changes under different frequen- all distance change curves obtained from different frequencies.For cies are also different.The dynamic vectors at different frequencies those frequencies that have abnormal distance estimation results are combinations of the same set of dynamic paths under different due to multipath effects,the regression error will be large.We then phase offsets.As the multipath components are combined differ- remove frequencies with large regression errors to achieve a better ently in different frequencies,we can combine the measurements linear regression result using the rest of frequencies
Figure 6: Local extrema based static vector estimation (normalized in-phase component vs. time, showing the raw I trace, the LEVD and long-term-average static estimates, the detected local extrema, and the extrema exceeding Thr).

The advantage of LEVD lies in its robustness to movement speed changes. On one hand, by following the averages of the extreme points, it can quickly trace static vector changes caused by arm movements when the hand moves fast. On the other hand, the estimated value of the static vector remains constant when there are no movements or the movements are slow. For example, during the time period of 5.5 to 6 seconds in Figure 6, the normalized value of the in-phase component is around -100, which is far away from the actual real part of the static vector. If we used a long-term averaging algorithm to estimate the static vector, the estimated real part of the static vector would slowly drift towards -100. In contrast, the static vector estimation of LEVD remains stable, as there are no valid extreme points during this period.

After finding the static vector using LEVD, we subtract it from the baseband signal to get the dynamic vector. We then use the phase $\phi_d(t)$ of the dynamic vector to determine the path length change. We first unwrap the phase $\phi_d(t)$, and the path length change during the time period $0 \sim t$ is given by:

$$d(t) - d(0) = -\frac{\phi_d(t) - \phi_d(0)}{2\pi} \times \lambda \quad (2)$$

where $d(t)$ is the path length from the speaker reflected through the hand to the microphone, and $\lambda = c/f$ is the sound wavelength. When the hand and the microphone/speaker are on the same line, the movement distance of the hand is $(d(t) - d(0))/2$ when it moves towards the speaker, as shown in Figure 4. Note that the distance calculation can be made on a small data segment, e.g., segments with only hundreds of samples. This allows us to respond to hand movements with very low latency, such as 15 ms.
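A minimal sketch of this phase-based distance computation (illustrative only, not the authors' implementation), assuming the dynamic vector is already available as a complex baseband array:

```python
import numpy as np

def path_length_change(dynamic, f, c=343.0):
    """Path length change d(t) - d(0) from the unwrapped phase of the
    dynamic vector (Equation 2). `dynamic` is a complex baseband array,
    `f` the carrier frequency in Hz, `c` the speed of sound in m/s."""
    phi = np.unwrap(np.angle(dynamic))          # unwrapped phase phi_d(t)
    lam = c / f                                 # wavelength lambda = c/f
    return -(phi - phi[0]) / (2 * np.pi) * lam  # Equation (2)

# A reflector moving away increases the path length: simulate a path
# that grows by 5 mm over the segment and recover it from the phase.
f, c = 20000.0, 343.0
d = np.linspace(0.30, 0.305, 200)               # path length in meters
dynamic = np.exp(-1j * 2 * np.pi * f * d / c)   # ideal dynamic vector
change = path_length_change(dynamic, f, c)
```

Note that the hand movement distance toward the speaker is half the recovered path length change, since the reflected path covers the extra distance twice.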
3.6 Multipath Effect Mitigation

Although LEVD can mitigate the effect of static multipaths by subtracting the static vector, there are dynamic multipaths when the hand moves. A path that the sound wave travels is called static if its length does not change as the hand moves, and dynamic if its length changes as the hand moves. An example dynamic path is from the speaker to the hand, then to a nearby table, and finally to the microphone. Therefore, sometimes there are multiple dynamic vectors, and these dynamic vectors may have different phases. This will result in complex signal trajectories, as shown in Figure 7(a). Because of dynamic multipaths, it is difficult to determine the actual phase change from superimposed dynamic vectors.

We use frequency diversity to mitigate the multipath effect. The wavelengths of different sound frequencies are different. Thus, the phases of the same multipath component are different under different frequencies, and the phase changes under different frequencies are also different. The dynamic vectors at different frequencies are combinations of the same set of dynamic paths under different phase offsets. As the multipath components are combined differently in different frequencies, we can combine the measurements obtained from different frequencies to mitigate the multipath effect.

Algorithm 1: Local Extreme Value Detection (LEVD)
Input: one baseband signal component X(t) = I(t) or Q(t), t = 0...T
Output: real or imaginary part of the estimated static vector S(t), t = 0...T
 1  Initialize n (number of extrema), S(0) (initial estimation),
 2  and E(n) (extrema list)
 3  for t = 1 to T do
 4      /* Find extreme points that meet our requirements */
 5      if X(t) is a local maximum or minimum then
 6          Compare X(t) with the last extreme point E(n) in the list;
 7          if both X(t) and E(n) are maxima (or both minima), and X(t) is larger (smaller) than E(n) then
 8              E(n) ← X(t);
 9          end
10          if one of X(t) and E(n) is a maximum and the other a minimum, and |X(t) − E(n)| > Thr then
11              n ← n + 1;
12              E(n) ← X(t);
13          end
14      end
15      /* Update the static component estimation using an exponential moving average */
16      S(t) ← 0.9 × S(t − 1) + 0.1 × (E(n − 1) + E(n))/2;
17  end
18  return S(t)

Figure 7: Multipath effect. (a) I/Q trace under multipath; (b) impact of the hand size — when the hand moves by a, the path reflected by the hand center changes by 2a, while paths reflected by other parts of the hand change by less than 2a.

To get the baseband signal at different frequencies, we transmit sounds at multiple frequencies at the same time. The coherent detection structure can be applied on each frequency to obtain one complex baseband signal for each frequency. We remove the interference between adjacent frequencies by carefully selecting the parameters of the CIC filter and the frequency interval. Thus, each frequency can be measured independently. After getting the phase of dynamic vectors at different frequencies, we can obtain the distance change curve over time using the wavelength corresponding to each frequency. We combine the results of different frequencies using linear regression. Our approach is based on two observations. First, the measured distance change should be the same for all frequencies when there is no multipath effect. Second, the distance should change linearly during a short time period, e.g., 10 ms, as the movement speed is almost constant during that short period.
Therefore, we use linear regression to find the best line that fits all distance change curves obtained from different frequencies. For those frequencies that have abnormal distance estimation results due to multipath effects, the regression error will be large. We then remove frequencies with large regression errors to achieve a better linear regression result using the rest of the frequencies.
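This fit-prune-refit step can be sketched as follows (a toy sketch; the function name and the median-based outlier rule are our own illustrative choices, not the paper's exact criterion):

```python
import numpy as np

def fuse_frequencies(dist, t, resid_thresh=2.0, tol=1e-6):
    """Sketch of the frequency-fusion step: fit one line to all
    per-frequency distance-change curves over a short window, then
    drop frequencies whose squared regression error is abnormally
    large and refit on the remaining ones.
    dist: array of shape (num_freqs, num_samples)."""
    keep = np.ones(dist.shape[0], dtype=bool)
    for _ in range(2):                        # fit, prune outliers, refit
        y = dist[keep].ravel()
        x = np.tile(t, int(keep.sum()))
        slope, intercept = np.polyfit(x, y, 1)
        line = slope * t + intercept
        err = np.mean((dist - line) ** 2, axis=1)  # per-frequency error
        keep = err < resid_thresh * (np.median(err) + tol)
    return line, keep
```

A frequency corrupted by a dynamic multipath produces a distance curve offset from the common line, so its mean squared error stands out and it is excluded from the second fit.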
3.7 The Impact of Hand Size

The size of the moving object, i.e., the human hand, cannot be ignored when it is close to the speakers and microphones. Human hands have an average length of 15 cm [39]. Thus, different parts of the hand have significant differences in path lengths when we aim at mm-level measurement accuracy. As shown by Figure 7(b), when the hand moves by a distance of a, the path reflected by the center of the hand has a path length change of 2a. However, paths reflected by the top of the hand will have smaller path length changes, especially when the hand is close to the microphone. As the dynamic vector in the received signal is a mixture of all paths reflected by the hand, the measured path length change will be smaller than the expected value. In our experiments, this type of error increases when the hand is closer to the microphone. As shown by our experiments in Section 6.2, when the hand is 20 cm away from the microphone, the distance measurement error is 3.5 mm; when this distance reduces to 5 cm, the measurement error increases to 6.8 mm. Errors are mostly caused by the impact of the hand size, as we consistently underestimate the movement distance. Note that such a small error can be compensated for by the user when we provide real-time feedback. Therefore, we do not use a special algorithm to compensate for the underestimation.

4. MEASURE 2-D ABSOLUTE DISTANCE

In this section, we present our 2-D tracking algorithm using sound signals. We first use a delay profile based method to determine the path length so that we can obtain a coarse-grained hand position. We then combine the coarse-grained hand position with the fine-grained path length change to enable 2-D tracking.

4.1 Delay Profile Based Path Measurement

The phase based algorithm in Section 3 only measures the path length change, which is not sufficient for 2-D tracking for two reasons.
First, we cannot determine the movement direction only using the path length change, due to the lack of the initial position. The path length change is determined by both the movement distance and the movement direction with respect to the speaker and microphone. Movements that are perpendicular to the line connecting the speaker and the microphone incur different changes in path length than movements that are parallel to the line, even if the object moves the same distance. Second, the measurement errors in the path length change accumulate over time. Thus, even if we have the initial hand position, the path length estimation will drift away after tracking for a long time.

In this paper, we propose a delay profile based method to obtain a coarse-grained path length estimation. Our method uses unmodulated CW sound signals to avoid audible noises, such as bursty pulses, introduced by traditional ranging signals. Although the accuracy of the coarse-grained measurement is low, which is around 4 cm as shown by our experiments, it serves well for the purpose of providing an initial position, as the realtime tracking is carried out by fine-grained path length change measurements with accuracy at mm-level once the initial position is given.

To measure the path length, we transmit sound signals at $N$ different frequencies $f_k = f_0 + k\Delta f$, $k = 0, \ldots, N-1$, which are separated by a constant frequency interval of $\Delta f$. Thus, the baseband signal for any path $p$ at frequency $f_k$ is:

$$B_p(k, t) = A'_{p,k}\, e^{-j(2\pi(f_0 + k\Delta f)d_p(t)/c + \theta_{p,k})} \quad (3)$$

We observe that for a given path length of $d_p(t)$, the phases of the baseband signals at different frequencies decrease as a linear function of $\Delta f$, i.e., $-2\pi k\Delta f\, d_p(t)/c$. Therefore, $B_p(k, t)$ at a given time $t$ will have a constant phase change along the frequency axis, i.e., when changing the value of $k$.
If we perform the Inverse Discrete Fourier Transform (IDFT) on $B_p(k, t)$, we have the IDFT result as follows:

$$b_p(n, t) = \frac{1}{N}\sum_{k=0}^{N-1} B_p(k, t)\, e^{j2\pi kn/N}, \quad n = 0, \ldots, N-1.$$

Suppose we ignore the changes in $A'_{p,k}$ and $\theta_{p,k}$ for the moment, by setting $A'_{p,k} = A'_p$ and $\theta_{p,k} = 0$. In the case that $d_p(t) = \hat{n}c/(N\Delta f)$ for an integer $\hat{n} \in [0, N-1]$, we derive that $b_p(n, t) = A'_p e^{-j2\pi f_0 d_p(t)/c} \times \delta(n - \hat{n}, t)$, where $\delta(n, t)$ is the unit impulse function with $\delta(n, t) = 1$ when $n = 0$. For other cases, we have $\delta(n, t) = 0$.

The IDFT of $B_p(k, t)$, denoted as $b_p(n, t)$, is actually a time-delay profile for path $p$. It has a single peak at $\hat{n} = N d_p(t)\Delta f/c$. Therefore, the $\hat{n}$ that maximizes the magnitude of $b_p(n, t)$ indicates the time-delay of path $p$. Note that both the digital down conversion process and the IDFT operation are linear operations. Therefore, as the received signal is a linear combination of sound waves traveling from different paths, the resulting IDFT is also a linear combination of the delay profiles of all paths. As the static vector has been removed by our LEVD algorithm, the IDFT of the dynamic vector contains only the time-delay profile of the moving objects. We identify the peaks in $b_p(n, t)$, and each peak corresponds to one path caused by one moving object. Measuring the delay $\hat{n}$ of the peak gives the path length of the corresponding object. Figure 8 shows the IDFT result $b_p(n, t)$ for a moving hand with $N = 16$ sound frequencies. The "hot" positions indicate the delay profile of high energy sound reflections. There is only one "hot" curve in Figure 8, which corresponds to the dominating reflection path of the hand. We can also measure how the path length changes with time in Figure 8. We observe that the hand starts close to the phone, where the path has a length of about 15 cm. As the hand moves away, the corresponding path length increases.
We observe that the reflection becomes weak when the hand is about 45 cm away, where the path length increases to 90 cm, between 0.7∼1.5 seconds. We also observe that the hand then moves close to the phone twice, at 2.9 and 6 seconds.

Figure 8: Delay profile $b_p(n, t)$ for a moving hand (path length, 0∼90 cm, versus time, 0∼6 seconds).

4.2 Parameter Setting

The time-delay profile measurement has two parameters that need to be carefully chosen: the frequency interval $\Delta f$ and the number of frequencies $N$. For $\Delta f$, on one hand, $\Delta f$ should be large enough so that we can separate high speed movements at adjacent carrier frequencies. For example, a movement with a speed of 1 m/s leads to frequency components around 100 Hz in the baseband signal. Thus, adjacent frequencies should be separated by at least 200 Hz. On the other hand, $\Delta f$ should be small enough so that we can avoid time-domain aliasing. Note that $\Delta f$ determines the
time domain aliasing range. The estimated peak position $\hat{n}$ is given as an integer value modulo $N$, which is in the range of $0 \sim N-1$. Therefore, a reflector with path length of $d$ will have the same time-delay profile as those with path length of $d + mc/\Delta f$, where $m$ is an integer. For example, when $\Delta f$ is 350 Hz, paths with length of 0 cm will have the same delay profile as paths with length of $c/\Delta f = 98$ cm. Such time domain aliasing can be observed in Figure 8, where the high energy curve wraps back to around 0 cm when the path length is larger than 98 cm, between 4.5∼5.1 seconds. As we aim at an operational range of less than 50 cm, we let $\Delta f$ be 350 Hz. For the number of frequencies $N$, on one hand, a larger $N$ gives us a better distance resolution because a larger $N$ leads to a smaller path length difference $c/(N\Delta f)$ between two adjacent points $\hat{n}$ and $\hat{n}+1$. On the other hand, a larger $N$ requires higher bandwidth and reduces the energy that we can transmit in a single frequency. As the total bandwidth for $N$ frequencies is $(N-1)\Delta f$, we can only fit a limited number of frequencies into the available frequency range, e.g., 17∼23 kHz. Furthermore, the more frequencies we use, the less energy we can transmit in each frequency, because the total energy that can be transmitted by the speaker is limited for mobile devices. When the transmission energy in each frequency is reduced, the Signal-to-Noise Ratio (SNR) is also reduced and the phase measurement becomes less reliable. In this paper, we let $N = 16$, which implies that the bandwidth is 5.25 kHz and the path length resolution is 6.16 cm. The actual path length measurement error is smaller than 4 cm when the target is within 30 cm of the phone, as shown by our experimental results in Section 6.2.
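The delay profile computation and its aliasing behavior can be sketched as follows (illustrative only, using the paper's parameters N = 16 and ∆f = 350 Hz, a single ideal path, and dropping the common f0 phase term):

```python
import numpy as np

N, df, c = 16, 350.0, 343.0  # number of frequencies, spacing (Hz), sound speed (m/s)

def delay_profile_peak(d):
    """Per-frequency baseband phasors for a single path of length d (m),
    followed by the IDFT; returns the peak bin n_hat of |b_p(n)|."""
    k = np.arange(N)
    B = np.exp(-1j * 2 * np.pi * k * df * d / c)  # phase linear in k (Eq. 3)
    b = np.fft.ifft(B)                            # time-delay profile b_p(n)
    return int(np.argmax(np.abs(b)))

# Bin resolution is c/(N*df) ≈ 6.1 cm, and bins wrap modulo c/df ≈ 98 cm:
# a 30 cm path and a (30 + 98) cm path alias to the same bin.
```

Since np.fft.ifft computes (1/N)·Σ a_k e^{j2πkn/N}, it matches the IDFT defined above term by term.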
4.3 System Calibration

The initial phase offset $\theta_{p,k}$ comes from two sources: one is the phase inversion caused by reflection, which is the same for all frequencies, and the other is the delay in the audio playing and recording process caused by the hardware limitation of the mobile device, which is different for different device models. Because of the delay, the time that we transmit the CW to the speaker is misaligned with the reference $\cos(2\pi f t)$ signal that is used for multiplication in the coherent detector. Thus, there is a random offset of $\Delta t$ between the emitted and received signal. Consequently, there will be a time offset of $\Delta t$ in $b_p(n, t)$ after the IDFT. This time offset, whose value depends on the audio initialization process, will remain constant after the system starts emitting and receiving continuous signals.

We perform the time offset calibration after the system starts emitting sound signals. As the hardware/operating system introduced time offset $\Delta t$ is the same for all paths, we use the LOS path as the reference path in our calibration process. As we know the exact distance between the speaker and microphone for a given mobile device model, we can calculate the expected $\hat{n}_{LOS}$ for the LOS path. As the static vector is dominated by the LOS path when there are no large reflectors around, if we perform the IDFT on the static vector of different frequencies, we expect the highest peak to appear at $\hat{n}_{LOS}$ if $\Delta t = 0$. If we observe that the peak is not at $\hat{n}_{LOS}$, we apply a delay $\Delta t'$ to the reference $\cos(2\pi f t)$ signal and iteratively adjust the value of $\Delta t'$ until the peak appears at the expected position. In our implementation, the average time used for the calibration process is 0.82 seconds with a standard deviation of 0.16 seconds.

4.4 Combining Fine-grained Phase and Coarse-grained Delay Measurements

Our 2-D tracking requires both the fine-grained phase measurement and the coarse-grained delay profile measurement.
The phase measurement provides accurate and real-time distance changes so that the system can respond to user actions with high accuracy and low latency. The delay profile measurement gives an estimate of the absolute path length so that errors in the phase measurements do not accumulate over time. We combine the fine-grained and coarse-grained measurements to achieve both low latency and stability in measurements. From Figure 8, we observe that the delay profile gives consistent estimations when the energy of the reflected sound is high, e.g., between 2.1∼2.5 seconds. Therefore, we use the delay-profile-based path length estimation only when there is a dominating peak in bp(n, t) whose normalized energy is higher than a given threshold. In such cases, we augment the path length estimation obtained through the delay profile with the path length traced through the phase measurements using an Exponential Moving Average (EMA) algorithm. If the hand reflection is weak and there is no dominating peak in bp(n, t), we only use the phase change to update the path length, as the delay profile is unreliable.

4.5 2-D Gesture Tracking

The position of the hand is determined through multiple path length measurements obtained from different speaker/microphone pairs on the mobile device. Figure 9(a) shows the positions of the speakers and microphones on a typical mobile phone, the Samsung Galaxy S5. To measure the path lengths for multiple speakers/microphones, we use the stereo playback and recording capability that is available on many mobile devices. For example, we can record the sound at two microphones that are located at different positions to get two path measurements at the same time. When there are multiple speakers, we can separate the signals from different speakers by assigning different frequencies to each speaker.

(a) Layout of Samsung S5 (front speaker, rear speaker, Mic 1, Mic 2, Regions A and B)
(b) Geometric abstraction
Figure 9: Two dimensional tracking

To simplify our discussion, let us consider a mobile phone with one speaker and two microphones, as shown in Figure 9(b). Consider the case where the speaker is placed at the origin, while the two microphones have coordinates of (0, L1) and (0, −L2), respectively. Suppose that the path lengths from the speaker through the hand to the two microphones are d1 and d2, respectively. The coordinates (x, y) of the hand then lie on the ellipses defined by:

\[ \frac{4x^2}{d_1^2 - L_1^2} + \frac{4(y - L_1/2)^2}{d_1^2} = 1 \quad (4) \]

\[ \frac{4x^2}{d_2^2 - L_2^2} + \frac{4(y + L_2/2)^2}{d_2^2} = 1 \quad (5) \]

Solving these equations, we have:
\[ x = \frac{\sqrt{(d_1^2 - L_1^2)(d_2^2 - L_2^2)\bigl((L_1 + L_2)^2 - (d_1 - d_2)^2\bigr)}}{2(d_1 L_2 + d_2 L_1)}, \qquad y = \frac{d_2 L_1^2 - d_1 L_2^2 - d_1^2 d_2 + d_2^2 d_1}{2(d_1 L_2 + d_2 L_1)} \quad (6) \]

As the distances L1 and L2 between the speaker and the microphones are fixed for a given device, we can directly calculate the position of the hand from the path lengths d1 and d2.

The pseudocode of our 2-D tracking algorithm is given in Algorithm 2. This algorithm uses the path length estimations of the two microphones to track the hand. Note that it is possible to use sophisticated tracking algorithms, such as Kalman filters, to further improve the tracking performance. We choose not to use them in our implementation because they incur high computational cost. However, for mobile devices with enough computational power, we recommend using them.

Algorithm 2: Two Dimensional Tracking Algorithm
Input: Data segment of baseband signal for two microphones on N frequencies
Output: Updated hand position
1 foreach microphone do
2     foreach frequency do
3         Estimate the static vector using LEVD;
4         Obtain the dynamic vector by subtracting the static vector from the baseband signal;
5         Calculate the path length change based on the phase change of the dynamic vector;
6     end
7     Use linear regression to combine the path length change estimations at different frequencies;
8     Update the path length using the path length change estimation;
9     Take the IDFT of the dynamic vectors of different frequencies to get bp(n, t);
10    if the peak value in bp(n, t) is larger than the threshold then
11        Estimate the coarse-grained path length using nˆ;
12        Use EMA to augment the coarse-grained estimation;
13    end
14 end
15 Use the path lengths of the two microphones to update the hand position;

5. IMPLEMENTATION

We implemented LLAP on both the Android and iOS platforms. On the Android platform, we implement most signal processing algorithms as C functions using the Android NDK to achieve better efficiency.
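The closed form in equation (6) above can be sanity-checked with a round trip: place a hypothetical hand, compute the two path lengths from the geometry of Figure 9(b), and verify that equation (6) recovers the position. A minimal sketch, with assumed speaker-to-microphone spacings:

```python
import math

def hand_position(d1, d2, L1, L2):
    """Hand coordinates from equation (6), given path lengths d1, d2 and mic offsets L1, L2."""
    denom = 2.0 * (d1 * L2 + d2 * L1)
    x = math.sqrt((d1 ** 2 - L1 ** 2) * (d2 ** 2 - L2 ** 2)
                  * ((L1 + L2) ** 2 - (d1 - d2) ** 2)) / denom
    y = (d2 * L1 ** 2 - d1 * L2 ** 2 - d1 ** 2 * d2 + d2 ** 2 * d1) / denom
    return x, y

L1, L2 = 0.07, 0.05                    # speaker-to-microphone distances in m (hypothetical)
hx, hy = 0.12, 0.04                    # true hand position in m (hypothetical)
r = math.hypot(hx, hy)                 # speaker-to-hand distance
d1 = r + math.hypot(hx, hy - L1)       # path via microphone 1 at (0, L1)
d2 = r + math.hypot(hx, hy + L2)       # path via microphone 2 at (0, -L2)
x, y = hand_position(d1, d2, L1, L2)   # recovers (0.12, 0.04)
```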
Our implementation works as an app that can draw the 2-D hand traces in real time on recent Android phones, e.g., a Samsung Galaxy S5 with Android 5.0. On the iOS platform, we use the vDSP Accelerate framework, which achieves much better computational efficiency than the Android platform. However, the iOS platform only supports single-channel recording, so we only implement 1-D hand tracking on the iOS system. Note that we need to reconfigure the system for certain mobile phones so that the hardware echo cancellation can be bypassed. There are some limitations in the hardware and operating systems of existing mobile phones. First, the placement of the microphones and speakers is not optimized for gesture tracking. For example, the microphones of the Samsung S5 point in opposite directions, as shown in Figure 9(a). When the hand is in Region A shown in Figure 9(a), the reflected signal obtained by microphone 1 is strong while microphone 2 only gets weak signals. Therefore, to achieve strong signals for both microphones, our 2-D tracking experiments are performed in front of or behind the phone when using the front or rear speaker, rather than in Region A or B. Second, the latency of our system is constrained by the operating system. Although LLAP can operate on short data segments, the Android system only returns sound data in 10∼20 ms intervals, depending on the phone model. Therefore, we choose a data segment size of 512 samples in our implementation, which has a duration of 10.7 ms at a sampling rate of 48 kHz. The iOS system provides better sound APIs, which can operate at data segment sizes as small as 32 samples. However, the iOS system only supports recording from a single microphone, so we did not implement 2-D tracking on the iOS platform. Even with these hardware and software limitations, LLAP achieves good accuracy and latency on existing mobile phones.
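The buffer-size arithmetic above sets a lower bound on the response latency; a minimal sketch of the calculation:

```python
SAMPLE_RATE = 48_000  # audio sampling rate in Hz

def segment_duration_ms(samples):
    """Duration of one audio data segment, a lower bound on the response latency."""
    return samples / SAMPLE_RATE * 1000.0

android_ms = segment_duration_ms(512)  # ~10.7 ms, matching Android's 10-20 ms callbacks
ios_ms = segment_duration_ms(32)       # ~0.67 ms with iOS's smaller buffers
```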
We believe that if mobile phones were designed with hardware/software optimizations for sound-based gesture tracking, such as placing the speaker and microphones on one side of the phone, the performance of LLAP could be even better.

6. EVALUATION

6.1 Evaluation Setup

We conducted experiments on a Samsung Galaxy S5 using its rear speaker, top microphone, and bottom microphone in normal office and home environments, with the phone on a table as shown in Figure 10. Experiments were conducted with five human users. The users interacted with the phone using their bare hands without wearing any accessory.

Figure 10: Experimental setup

For 1-D tracking, we evaluated LLAP using three metrics: (1) Movement distance error: the difference between the LLAP-reported movement distance and the ground truth movement distance measured by a ruler placed along the movement path. (2) Absolute path length error: the difference between the LLAP-reported path length and the ground truth measured by a ruler. (3) Micro movement detection accuracy: the probability that LLAP correctly detects a small single-finger movement and reports the correct movement direction, i.e., moving towards or away from the phone. For 2-D tracking, we evaluated LLAP using two metrics: (4) Tracking error: the distance between the LLAP-reported trace and the standard drawing template. Because the 2-D tracking error is defined differently from the 1-D tracking error, the results for these two metrics are not directly comparable. (5) Character recognition accuracy: the probability that the tracking trace reported by LLAP, based on the character drawn by a user, can be correctly recognized by MyScript, a handwriting recognition tool [40]. For efficiency, we evaluated LLAP using two metrics: (6) Response latency: the time used by LLAP to accumulate and process the sound data before it responds to the hand movement. (7) Power consumption: the energy consumption of LLAP on mobile phones.

Figure 11: 1-D movement distance errors: (a) CDF for measurement error; (b) different algorithms; (c) different objects; (d) different environments. (Confidence intervals for (b), (c), and (d) are 95%.)

Figure 12: Micro benchmarks: (a) movement error for different speeds; (b) absolute path length error; (c) micro movement detection accuracy. (Confidence interval for (b) is 95%.)

6.2 Experimental Results

LLAP achieves an average movement distance error of 3.5 mm when the hand moves for 10 cm at a distance of 20 cm. We moved the hand in "Region A" in Figure 9(a) and measured the movement distance using the top microphone and the rear speaker. The initial hand position was 20 cm away from the microphone, and the hand moved away from the microphone for a distance of 10 cm. Figure 11(a) shows the Cumulative Distribution Function (CDF) of the distance measurement error for 200 movements. The 90th percentile measurement error is 7.3 mm and the average error is 3.5 mm, as shown in Figure 11(a).
LLAP achieves an average movement distance error of less than 8.7 mm when the hand moves for 10 cm at a distance of less than 35 cm. Figure 11(b) shows the average movement distance error when the hand is at different distances from the microphone, in a side-by-side comparison with DDBR, the movement distance measurement algorithm proposed in [2]. The results show that our LEVD algorithm outperforms the DDBR algorithm, as DDBR is susceptible to noise. The results also show that LEVD with signals of multiple frequencies outperforms LEVD with a signal of a single frequency in distance measurement accuracy by 21% on average. We observe that for LEVD, the movement distance error increases when the hand is too close to or too far from the microphone. When the hand is too close to the microphone, the impact of the hand size increases, which leads to larger movement distance errors. To verify the impact of hand size, we conducted the same set of experiments with different types of moving objects, including a hand, two fingers, and a flat plastic reflector with an area of 12×4 cm. As shown in Figure 11(c), smaller objects, such as the two fingers and the small reflector, result in better accuracies of 3.76 mm and 2.68 mm, respectively, when the object is very close to the microphone (within a distance of 5 cm). Due to the better reflection ability of the reflector, the measurement error for the reflector at a distance of 40 cm is 5.32 mm, which is much smaller than the errors for the hand and two fingers. This is because when the hand is too far from the microphone, the sound signal reflected from the hand is too weak and the SNR is too low, which leads to larger movement distance errors. When the hand is more than 40 cm away from the microphone, the error increases to more than 14 mm. Other small variations in accuracy in Figure 11(c) are mostly caused by the different multi-path conditions at different distances. LLAP can also operate while the device is inside a pocket.
Figure 11(c) shows that the measurement error of LLAP only increases slightly, by 1.4 mm on average, when the device is inside a bag made of cloth. LLAP is robust to background noise and achieves an average movement distance error of 5.81 mm under noise interference. Figure 11(d) shows the measurement error under four different environments: the "normal" environment is a typical silent indoor one, the "music" environment is an indoor environment with pop music being played at normal volume, the "speech" environment is a room with people talking at the same time, and the "speaker" environment plays music from the speaker on the same device. The sound pressure levels measured in these four environments are 45 dB, 70 dB, 65 dB, and 65 dB, respectively. We observe that LLAP has slightly larger movement distance errors under noise interference. Compared to the "normal" environment, the movement distance errors increase by 2.45 mm and 1.66 mm (averaged over different distances) for the "music" and "speech" environments, respectively. Because LLAP only uses the narrow baseband signal around each transmitted frequency, the robustness of LLAP under audible sound noise is sufficient for practical usage. For the challenging scenario where the smartphone plays music from the same speaker that is used for sending the CW signal, LLAP still achieves a distance accuracy of 7.5 mm when the hand is within 25