Computer Science and Technology (Reference Literature): VSkin - Sensing Touch Gestures on Surfaces of Mobile Devices Using Acoustic Signals



VSkin: Sensing Touch Gestures on Surfaces of Mobile Devices Using Acoustic Signals

Ke Sun, State Key Laboratory for Novel Software Technology, Nanjing University, China, kesun@smail.nju.edu.cn
Ting Zhao, State Key Laboratory for Novel Software Technology, Nanjing University, zhaoting@smail.nju.edu.cn
Wei Wang, State Key Laboratory for Novel Software Technology, Nanjing University, China, ww@nju.edu.cn
Lei Xie, State Key Laboratory for Novel Software Technology, Nanjing University, China, lxie@nju.edu.cn

ABSTRACT
Enabling touch gesture sensing on all surfaces of the mobile device, not limited to the touchscreen area, leads to new user interaction experiences. In this paper, we propose VSkin, a system that supports fine-grained gesture-sensing on the back of mobile devices based on acoustic signals. VSkin utilizes both the structure-borne sounds, i.e., sounds propagating through the structure of the device, and the air-borne sounds, i.e., sounds propagating through the air, to sense finger tapping and movements. By measuring both the amplitude and the phase of each path of sound signals, VSkin detects tapping events with an accuracy of 99.65% and captures finger movements with an accuracy of 3.59 mm.

CCS CONCEPTS
• Human-centered computing → Interface design prototyping; Gestural input;

KEYWORDS
Touch gestures; Ultrasound

ACM Reference Format:
Ke Sun, Ting Zhao, Wei Wang, and Lei Xie. 2018. VSkin: Sensing Touch Gestures on Surfaces of Mobile Devices Using Acoustic Signals. In MobiCom ’18: 24th Annual International Conference on Mobile Computing and Networking, October 29–November 2, 2018, New Delhi, India. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3241539.3241568

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MobiCom ’18, October 29–November 2, 2018, New Delhi, India
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5903-0/18/10...$15.00
https://doi.org/10.1145/3241539.3241568

Figure 1: Back-of-Device interactions. (a) Back-Swiping (b) Back-Tapping (c) Back-Scrolling

1 INTRODUCTION
Touch gestures are one of the most important ways for users to interact with mobile devices. With the wide deployment of touchscreens, a set of user-friendly touch gestures, such as swiping, tapping, and scrolling, has become the de facto standard user interface for mobile devices. However, due to the high cost of touchscreen hardware, gesture-sensing is usually limited to the front surface of the device. Furthermore, touchscreens combine the function of gesture-sensing with the function of displaying. This leads to the occlusion problem [30], i.e., user fingers often block the content displayed on the screen during the interaction process.

Enabling gesture-sensing on all surfaces of the mobile device, not limited to the touchscreen area, leads to new user interaction experiences. First, new touch gestures solve the occlusion problem of the touchscreen. For example, Back-of-Device (BoD) gestures use tapping or swiping on the back of a smartphone as a supplementary input interface [22, 35]. As shown in Figure 1, the screen is no longer blocked when the back-scrolling gesture is used for scrolling the content. BoD gestures also enrich the user experience of mobile games by allowing players to use the back surface as a touchpad. Second, defining new touch gestures on different surfaces helps the system better understand user intentions. On traditional touchscreens, touching a webpage on the screen could mean that the user wishes to click a hyperlink or that the user just wants to scroll down the page. Existing touchscreen schemes


often confuse these two intentions, due to the overloaded actions on gestures that are similar to each other. With the new types of touch gestures performed on different surfaces of the device, these actions can be assigned to distinct gestures, e.g., selecting an item should be performed on the screen while scrolling or switching should be performed on the back or the side of the device. Third, touch sensing on the side of the phone enables virtual side-buttons that could replace physical buttons and improve the waterproof performance of the device. Compared to in-air gestures that also enrich the gesture semantics, touch gestures have a better user experience, due to their accurate touch detection (for confirmation) coupled with useful haptic feedback.

Fine-grained gesture movement distance/speed measurements are vital for enabling touch gestures that users are already familiar with, including scrolling and swiping. However, existing accelerometer or structural vibration based touch sensing schemes only recognize coarse-grained activities, such as tapping events [5, 35]. Extra information on the tapping position or the tapping force levels usually requires intensive training and calibration processes [12, 13, 25] or additional hardware, such as a mirror on the back of the smartphone [31].

In this paper, we propose VSkin, a system that supports fine-grained gesture-sensing on the surfaces of mobile devices based on acoustic signals. Similar to a layer of skin on the surfaces of the mobile device, VSkin can sense both the finger tapping and finger movement distance/direction on the surface of the device. Without modifying the hardware, VSkin utilizes the built-in speakers and microphones to send and receive sound signals for touch-sensing. More specifically, VSkin captures both the structure-borne sounds, i.e., sounds propagating through the structure of the device, and the air-borne sounds, i.e., sounds propagating through the air. As touching the surface can significantly change the structural vibration pattern of the device, the characteristics of structure-borne sounds are reliable features for touch detection, i.e., whether the finger contacts the surface or not [12, 13, 25]. While it is difficult to use the structure-borne sounds to sense finger movements, air-borne sounds can measure the movement with mm-level accuracy [14, 28, 34]. Therefore, by analyzing both the structure-borne and the air-borne sounds, it is possible to reliably recognize a rich set of touch gestures as if there were another touchscreen on the back of the phone. Moreover, VSkin does not require intensive training, as it uses the physical properties of the sound propagation to detect touch and measure finger movements.

The key challenge faced by VSkin is to measure both the structure-borne and the air-borne signals with high fidelity while the hand is very close to the mobile device. Given the small form factor of mobile devices, sounds traveling through different mediums and paths arrive at the microphone within a short time interval of 0.13∼0.34 ms, which is just 6∼16 sample points at a sampling rate of 48 kHz. With the limited inaudible sound bandwidth (around 6 kHz) available on commercial mobile devices, it is challenging to separate these paths. Moreover, to achieve accurate movement measurement and location-independent touch detection, we need to measure both the phase and the magnitude of each path.

To address this challenge, we design a system that uses the Zadoff-Chu (ZC) sequence to measure different sound paths. With the near-optimal auto-correlation function of the ZC sequence, which has a peak width of 6 samples, we can separate the structure-borne and the air-borne signals when the distance between the speaker and microphone is just 12 cm. Furthermore, we develop a new algorithm that measures the phase of each sound path at a rate of 3,000 samples per second.
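
The timing figures quoted above can be checked with a short calculation. The sketch below converts the 0.13∼0.34 ms arrival window into sample counts at 48 kHz and estimates the gap between the structure-borne and air-borne arrivals over a 12 cm speaker-to-microphone distance. It is only an illustration: the propagation speeds are the ones quoted in Section 3 (343 m/s in air and the lower bound of 2,000 m/s in solids), the function names are hypothetical, and treating both paths as having the same length is a simplifying assumption.

```python
FS_HZ = 48_000             # recording sampling rate quoted in the text
SPEED_AIR_MPS = 343.0      # speed of sound in air quoted in Section 3
SPEED_SOLID_MPS = 2_000.0  # lower bound for solids quoted in Section 3


def ms_to_samples(interval_ms, fs_hz=FS_HZ):
    """Convert a time interval in milliseconds to a fractional sample count."""
    return interval_ms * 1e-3 * fs_hz


def arrival_gap_samples(distance_m, fs_hz=FS_HZ):
    """Gap between the air-borne and structure-borne arrivals, in samples.

    Assumption (not from the paper): both paths have the same length.
    """
    gap_s = distance_m / SPEED_AIR_MPS - distance_m / SPEED_SOLID_MPS
    return gap_s * fs_hz


short_gap = ms_to_samples(0.13)        # lower end of the arrival window
long_gap = ms_to_samples(0.34)         # upper end of the arrival window
gap_12cm = arrival_gap_samples(0.12)   # speaker and microphone 12 cm apart

print(f"0.13 ms is {short_gap:.2f} samples")
print(f"0.34 ms is {long_gap:.2f} samples")
print(f"structure/air gap at 12 cm is {gap_12cm:.2f} samples")
```

Under these assumptions the gap at 12 cm is about 13.9 samples, which is wider than the 6-sample correlation peak, so the two arrivals show up as distinct peaks. Because 2,000 m/s is a lower bound, a faster structure path would only widen the gap.
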
Compared to traditional impulsive signal systems that measure sound paths frame by frame (with frame rate <170 Hz [14, 34]), the higher sampling rate helps VSkin capture fast swiping and tapping events.

We implement VSkin on commercial smartphones as real-time Android applications. Experimental results show that VSkin achieves a touch detection accuracy of 99.65% and an accuracy of 3.59 mm for finger movement distances. Our user study shows that VSkin only slightly increases the movement time used for interaction tasks, e.g., scrolling and swiping, by 34% and 10%, respectively, when compared to touchscreens.

We made the following contributions in this work:
• We introduce a new approach for touch-sensing on mobile devices by separating the structure-borne and the air-borne sound signals.
• We design an algorithm that performs the phase and magnitude measurement of multiple sound paths at a high sampling rate of 3 kHz.
• We implement our system on the Android platform and perform real-world user studies to verify our design.

2 RELATED WORK
We categorize research related to VSkin into three classes: Back-of-Device interactions, tapping and force sensing, and sound-based gesture sensing.

Back-of-Device Interactions: Back-of-Device interaction is a popular way to extend the user interface of mobile devices [5, 11, 31, 32, 35]. Gestures performed on the back of the device can be detected by the built-in camera [31, 32] or sensors [5, 35] on the mobile device. LensGesture [32] uses the rear camera to detect finger movements that are performed just above the camera. Back-Mirror [31] uses an additional mirror attached to the rear camera to capture BoD gestures in a larger region. However, due to the limited viewing angle of cameras, these approaches either have limited sensing area or need extra hardware for extending sensing


range. BackTap [35] and βTap [5] use built-in sensors, such as the accelerometer, to sense coarse-grained gestures. However, sensor readings only provide limited information about the gesture, and they cannot quantify the movement speed and distance. Furthermore, accelerometers are sensitive to vibrations caused by hand movements while the user is holding the device. Compared to camera-based and sensor-based schemes, VSkin incurs no additional hardware costs and can perform fine-grained gesture measurements.

Tapping and Force Sensing: Tapping and force applied to the surface can be sensed by different types of sensors [4, 7, 9, 10, 12, 13, 15, 19, 25]. TapSense [7] leverages the tapping sound to recognize whether the user touches the screen with a fingertip or a fist. ForceTap [9] measures the tapping force using the built-in accelerometer. VibWrite [13] and VibSense [12] use the vibration signal instead of the sound signal to sense the tapping position so that the interference in air-borne propagation can be avoided. However, they require pre-trained vibration profiles for tapping localization. ForcePhone [25] uses linear chirp sounds to sense force and touch based on changes in the magnitude of the structure-borne signal. However, fine-grained phase information cannot be measured through chirps, and chirps only capture the magnitude of the structure-borne signal at a low sampling rate. In comparison, our system measures both the phase and the magnitude of multiple sound paths with a high sampling rate of 3 kHz so that we can perform robust tap sensing without intensive training.

Sound-based Gesture Sensing: Several sound-based gesture recognition systems have been proposed to recognize in-air gestures [1, 3, 6, 16, 17, 21, 23, 33, 37]. Soundwave [6], Multiwave [17], and AudioGest [21] use the Doppler effect to recognize predefined gestures. However, the Doppler effect only gives coarse-grained movement speeds. Thus, these schemes only recognize a small set of gestures that have distinctive speed characteristics. Recently, three state-of-the-art schemes (i.e., FingerIO [14], LLAP [28], and Strata [34]) use ultrasound to track fine-grained finger gestures. FingerIO [14] transmits OFDM modulated sound frames and locates the moving finger based on the change of the echo profiles of two consecutive frames. LLAP [28] uses a Continuous Wave (CW) signal to track the moving target based on the phase information, which is susceptible to the dynamic multipath caused by other moving objects. Strata [34] combines the frame-based approach and the phase-based approach. Using the 26-bit GSM training sequence that has nice autocorrelation properties, Strata can track phase changes at different time delays so that objects that are more than 8.5 cm apart can be resolved. However, these schemes mainly focus on tracking in-air gestures that are performed at more than 20 cm away from the mobile device [14, 23, 28, 34]. In comparison, our system uses both the structure-borne and the air-borne sound signals to sense gestures performed on the surface of the mobile devices, which are very close (e.g., less than 12 cm) to both the speakers and the microphones. As the sound reflections at a short distance are often submerged by the Line-of-Sight (LOS) signals, sensing gestures with SNR ≈ 2 dB at 5 cm is considerably harder than sensing in-air gestures with SNR ≈ 12 dB at 30 cm.

Figure 2: Sound propagation paths on a smartphone. The diagram shows the back surface of the phone with the rear speaker, the top microphone (Mic 2), and the bottom microphone (Mic 1); Paths 1 to 6 are drawn as structure paths, LOS air paths, or reflection air paths.

3 SYSTEM OVERVIEW
VSkin uses both the structure-borne and the air-borne sound signals to capture gestures performed on the surface of the mobile device. We transmit and record inaudible sounds using the built-in speakers and microphones on commodity mobile devices. As illustrated in Figure 2, sound signals transmitted by the rear speaker travel through multiple paths on the back of the phone to the top and bottom microphones. On both microphones, the structure-borne sound that travels through the body structure of the smartphone arrives first. This is because sound waves propagate much faster in solids (>2,000 m/s) than in the air (around 343 m/s) [24]. There might be multiple copies of air-borne sounds arriving within a short interval following the structure-borne sound. The air-borne sounds include the LOS sound and the reflection sounds of surrounding objects, e.g., the finger or the table. All these sound signals are mixed at the recording microphones.

VSkin performs gesture-sensing based on the mixture of sound signals recorded by the microphones. The design of VSkin consists of the following four components:

Transmission signal design: We choose to use the Zadoff-Chu (ZC) sequence modulated by a sinusoid carrier as our transmitted sound signal. This transmission signal design meets three key design goals. First, the auto-correlation of the ZC sequence has a narrow peak width of 6 samples so that we can separate sound paths that arrive with a small time difference by locating the peaks corresponding to their different delays, see Figure 3. Second, we use interpolation schemes to reduce the bandwidth of the ZC sequence to less than 6 kHz so that it can fit into the narrow inaudible range of 17∼23 kHz

provided by commodity speakers and microphones. Third, we choose to modulate the ZC sequence so that we can extract the phase information, which cannot be measured by traditional chirp-like sequences such as FMCW sequences.

Figure 3: IR estimation of dual microphones. (a) Bottom microphone (Mic 1) (b) Top microphone (Mic 2). The horizontal axis of each panel is in samples.

Sound path separation and measurement: To separate different sound paths at the receiving end, we first use cross-correlation to estimate the Impulse Response (IR) of the mixed sound. Second, we locate the candidate sound paths using the amplitude of the IR estimation. Third, we identify the structure-borne path, the LOS path, and the reflection path by aligning candidate paths on different microphones based on the known microphone positions. Finally, we use an efficient algorithm to calculate the phase and amplitude of each sound path at a high sampling rate of 48 kHz.

Finger movement measurement: The finger movement measurement is based on the phase of the air-borne path reflected by the finger. To detect the weak reflections of the finger, we first calculate the differential IR estimations so that changes caused by finger movements are amplified. Second, we use an adaptive algorithm to determine the delay of the reflection path so that the phase and amplitude can be measured with high SNR. Third, we use an Extended Kalman Filter to further amplify the sound signal based on the finger movement model. Finally, the finger movement distance is calculated by measuring the phase change of the corresponding reflection path.

Touch measurement: We use the structure-borne path to detect touch events, since the structure-borne path is mainly determined by whether the user's finger is pressing on the surface or not. To detect touch events, we first calculate the differential IR estimations of the structure-borne path. We then use a threshold-based scheme to detect the touch and release events. To locate the touch position, we found that the delay of the changes in structure-borne sound is closely related to the distance from the touch position to the speaker. Using this observation, we classify the touch event into three different regions with an accuracy of 87.8%.

Note that finger movement measurement and touch measurement can use the signal captured by the top microphone, the bottom microphone, or both. How these measurements are used in specific gestures, such as scrolling and swiping, depends on both the type of the gestures and the placement of microphones of the given device, see Section 6.5.

4 TRANSMISSION SIGNAL DESIGN
4.1 Baseband Sequence Selection
Sound signals propagating through the structure path, the LOS path, and the reflection path arrive within a very small time interval of less than 0.34 ms, due to the small size of a smartphone (20 cm). One way to separate these paths is to transmit short impulses of sounds so that the reflected impulses do not overlap with each other. However, impulses with short time durations have very low energy so that the received signals, especially those reflected by the finger, are too weak to be reliably measured.

In VSkin, we choose to transmit a periodical high-energy signal and rely on the auto-correlation properties of the signal to separate the sound paths. A continuous periodical signal has higher energy than impulses so that the weak reflections can be reliably measured. The cyclic auto-correlation function of the signal $s[n]$ is defined as $R(\tau) = \frac{1}{N}\sum_{n=1}^{N} s[n]\, s^{*}[(n-\tau) \bmod N]$, where $N$ is the length of the signal, $\tau$ is the delay, and $s^{*}[n]$ is the conjugation of the signal. The cyclic auto-correlation function is maximized around $\tau = 0$ and we define the peak at $\tau = 0$ as the main lobe of the auto-correlation function, see Figure 5(b). When the cyclic auto-correlation function has a single narrow peak, i.e., $R(\tau) \approx 0$ for $\tau \neq 0$, we can separate multiple copies of $s[n]$ that arrive at different delays $\tau$ by performing cross-correlation of the mixed signal with the cyclically shifted $s[n]$. For the cross-correlation results shown in Figure 3, each delayed copy of $s[n]$ in the mixed signal leads to a peak at its corresponding delay value of $\tau$.

The transmitted sound signal needs to satisfy the following extra requirements to ensure both the resolution and signal-to-noise ratio of the path estimation:
• Narrow autocorrelation main lobe width: The width of the main lobe is the number of points on each side of the lobe where the power has fallen to half (−3 dB) of its maximum value. A narrow main lobe leads to better time resolution in sound propagation paths.
• Low baseband crest factor: Baseband crest factor is the ratio of peak values to the effective value of the baseband signal. A signal with a low crest factor has higher energy than a high crest factor signal with the same peak power [2]. Therefore, it produces cross-correlation results with higher signal-to-noise ratio while the peak power is still below the audible power threshold

provided by commodity speakers and microphones. Third, we choose to modulate the ZC sequence so that we can extract the phase information, which cannot be measured by traditional chirp-like sequences such as FMCW sequences.

[Figure 3: IR estimation of dual microphones. (a) Bottom microphone (Mic 1): combined peak of Path 1 and Path 3 at (301, 2.25 × 10^6), followed by Path 5. (b) Top microphone (Mic 2): Path 2 at (301, 1.19 × 10^5), Path 4 at (313, 1.17 × 10^5), followed by Path 6. Axes: samples vs. absolute value.]

Sound path separation and measurement: To separate different sound paths at the receiving end, we first use cross-correlation to estimate the Impulse Response (IR) of the mixed sound. Second, we locate the candidate sound paths using the amplitude of the IR estimation. Third, we identify the structure-borne path, the LOS path, and the reflection path by aligning candidate paths on different microphones based on the known microphone positions. Finally, we use an efficient algorithm to calculate the phase and amplitude of each sound path at a high sampling rate of 48 kHz.

Finger movement measurement: The finger movement measurement is based on the phase of the air-borne path reflected by the finger. To detect the weak reflections of the finger, we first calculate the differential IR estimations so that changes caused by finger movements are amplified. Second, we use an adaptive algorithm to determine the delay of the reflection path so that the phase and amplitude can be measured with high SNR. Third, we use an Extended Kalman Filter to further amplify the sound signal based on the finger movement model. Finally, the finger movement distance is calculated by measuring the phase change of the corresponding reflection path.

Touch measurement: We use the structure-borne path to detect touch events, since the structure-borne path is mainly determined by whether the user's finger is pressing on the surface or not. To detect touch events, we first calculate the differential IR estimations of the structure-borne path. We then use a threshold-based scheme to detect the touch and release events. To locate the touch position, we found that the delay of the changes in structure-borne sound is closely related to the distance from the touch position to the speaker. Using this observation, we classify the touch event into three different regions with an accuracy of 87.8%.

Note that finger movement measurement and touch measurement can use signals captured by the top microphone, the bottom microphone, or both. How these measurements are used in specific gestures, such as scrolling and swiping, depends on both the type of the gesture and the placement of the microphones on the given device; see Section 6.5.

4 TRANSMISSION SIGNAL DESIGN

4.1 Baseband Sequence Selection

Sound signals propagating through the structure path, the LOS path, and the reflection path arrive within a very small time interval of less than 0.34 ms, due to the small size of a smartphone (< 20 cm). One way to separate these paths is to transmit short impulses of sound so that the reflected impulses do not overlap with each other. However, impulses with short time durations have very low energy, so the received signals, especially those reflected by the finger, are too weak to be reliably measured.

In VSkin, we choose to transmit a periodic high-energy signal and rely on the auto-correlation properties of the signal to separate the sound paths. A continuous periodic signal has higher energy than impulses, so the weak reflections can be reliably measured. The cyclic auto-correlation function of the signal $s[n]$ is defined as $R(\tau) = \frac{1}{N}\sum_{n=1}^{N} s[n]\, s^{*}[(n-\tau) \bmod N]$, where $N$ is the length of the signal, $\tau$ is the delay, and $s^{*}[n]$ is the complex conjugate of the signal. The cyclic auto-correlation function is maximized around $\tau = 0$, and we define the peak at $\tau = 0$ as the main lobe of the auto-correlation function; see Figure 5(b). When the cyclic auto-correlation function has a single narrow peak, i.e., $R(\tau) \approx 0$ for $\tau \neq 0$, we can separate multiple copies of $s[n]$ arriving at different delays $\tau$ by cross-correlating the mixed signal with the cyclically shifted $s[n]$. In the cross-correlation results shown in Figure 3, each delayed copy of $s[n]$ in the mixed signal leads to a peak at its corresponding delay value $\tau$.

The transmitted sound signal needs to satisfy the following extra requirements to ensure both the resolution and the signal-to-noise ratio of the path estimation:

• Narrow autocorrelation main lobe width: The width of the main lobe is the number of points between the positions on each side of the lobe where the power has fallen to half (−3 dB) of its maximum value. A narrow main lobe leads to better time resolution of the sound propagation paths.
• Low baseband crest factor: The baseband crest factor is the ratio of the peak value to the effective value of the baseband signal. A signal with a low crest factor has higher energy than a signal with a high crest factor and the same peak power [2]. Therefore, it produces cross-correlation results with a higher signal-to-noise ratio while the peak power is still below the audible power threshold.


• High auto-correlation gain: The auto-correlation gain is the peak power of the main lobe divided by the average power of the auto-correlation function. A higher auto-correlation gain leads to a higher signal-to-noise ratio in the correlation result. Usually, a longer code sequence has a higher auto-correlation gain.
• Low auto-correlation side lobe level: Side lobes are the small peaks (local maxima) other than the main lobe in the auto-correlation function. A large side lobe level will cause interference in the impulse response estimation.

We compare the performance of transmission signals with different code sequence designs and interpolation methods. For code sequence design, we compare commonly used pseudo-noise (PN) sequences (i.e., the GSM training sequence, the Barker sequence, and the M-sequence) with a chirp-like polyphase sequence (the ZC sequence [18]) in Table 1. Note that the longest Barker sequence and GSM training sequence are 13 bits and 26 bits, respectively. For the M-sequence and the ZC sequence, we use a sequence length of 127 bits.

We interpolate the raw code sequences before transmitting them. The purpose of the interpolation is to reduce the bandwidth of the code sequence so that it fits into a narrow transmission band that is inaudible to humans. There are two methods to interpolate the sequence: the time domain method and the frequency domain method. For the time domain method [34], we first upsample the sequence by repeating each sample k times (usually k = 6 to 8) and then use a low-pass filter to ensure that the signal occupies the desired bandwidth. For the frequency domain method, we first perform a Fast Fourier Transform (FFT) of the raw sequence, perform zero padding in the frequency domain to increase the length of the signal, and then use an Inverse Fast Fourier Transform (IFFT) to convert the signal back into the time domain. For both methods, we reduce the bandwidth of all sequences to 6 kHz at a sampling rate of 48 kHz so that the modulated signal fits into the 17 to 23 kHz inaudible range supported by commercial devices.

Table 1: Performance of different types of sequences

| Sequence | Interpolation method | Auto-correlation main lobe width | Baseband crest factor | Auto-correlation gain | Auto-correlation side lobe level |
|---|---|---|---|---|---|
| GSM (26 bits) | Time domain | 14 samples | 8.10 dB | 11.80 dB | −4.64 dB |
| GSM (26 bits) | Frequency domain | 8 samples | 6.17 dB | 11.43 dB | −3.60 dB |
| Barker (13 bits) | Time domain | 16 samples | 10.50 dB | 11.81 dB | −9.57 dB |
| Barker (13 bits) | Frequency domain | 8 samples | 5.12 dB | 13.46 dB | −6.50 dB |
| M-sequence (127 bits) | Time domain | 16 samples | 5.04 dB | 12.04 dB | −11.63 dB |
| M-sequence (127 bits) | Frequency domain | 8 samples | 6.68 dB | 13.90 dB | −6.58 dB |
| ZC (127 bits) | Time domain | 16 samples | 3.85 dB | 12.14 dB | −12.45 dB |
| ZC (127 bits) | Frequency domain | 6 samples | 2.56 dB | 13.93 dB | −6.82 dB |

The performance of the different sound signals is summarized in Table 1. The ZC sequence has the best baseband crest factor and auto-correlation gain. Although the raw M-sequence has ideal auto-correlation performance and crest factor, the sharp transitions between "0" and "1" in the M-sequence make its interpolated version worse than chirp-like polyphase sequences [2]. In general, frequency domain interpolation is better than time domain interpolation, due to its narrower main lobe. While the side lobe level of frequency domain interpolation is higher than that of time domain interpolation, the side lobe level of −6.82 dB provided by the ZC sequence gives enough attenuation on side lobes for our system.

[Figure 4: Sound signal modulation structure]

Based on the above considerations, we choose the frequency domain interpolated ZC sequence as our transmitted signal. The root ZC sequence parametrized by $u$ is given by:

$$ZC[n] = e^{-j\frac{\pi u n (n + 1 + 2q)}{N_{ZC}}}, \qquad (1)$$

where $0 \le n < N_{ZC}$, $q$ is a constant integer, and $N_{ZC}$ is the length of the sequence. The parameter $u$ is an integer with $0 < u < N_{ZC}$ and $\gcd(N_{ZC}, u) = 1$.
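As an illustration of Equation (1), the following is a minimal sketch, not the authors' implementation. It generates a root ZC sequence and evaluates the cyclic auto-correlation defined in Section 4.1. The defaults `n_zc=127` and `u=63` are the values the paper reports for VSkin; `q=0` is an assumption, since the text only says that q is a constant integer. The function names are hypothetical.

```python
import numpy as np
from math import gcd


def zc_sequence(n_zc: int = 127, u: int = 63, q: int = 0) -> np.ndarray:
    """Root ZC sequence from Equation (1); q=0 is an assumed default."""
    if not (0 < u < n_zc) or gcd(n_zc, u) != 1:
        raise ValueError("u must satisfy 0 < u < n_zc and gcd(n_zc, u) == 1")
    n = np.arange(n_zc)
    # The phase is periodic in 2*n_zc, so reduce with integer arithmetic
    # before converting to float to limit rounding error.
    phase_idx = (u * n * (n + 1 + 2 * q)) % (2 * n_zc)
    return np.exp(-1j * np.pi * phase_idx / n_zc)


def cyclic_autocorr(s: np.ndarray) -> np.ndarray:
    """R(tau) = (1/N) * sum_n s[n] * conj(s[(n - tau) mod N]) for tau = 0..N-1."""
    # np.roll(s, tau)[n] equals s[(n - tau) mod N].
    return np.array([np.mean(s * np.conj(np.roll(s, tau)))
                     for tau in range(len(s))])


if __name__ == "__main__":
    zc = zc_sequence()
    r = cyclic_autocorr(zc)
    print("magnitude range:", np.abs(zc).min(), np.abs(zc).max())
    print("|R(0)|:", abs(r[0]), " largest side lobe:", np.abs(r[1:]).max())
```

For this raw (uninterpolated) sequence the magnitude is constant and the cyclic auto-correlation vanishes away from zero delay, up to floating-point error. The interpolated sequence that is actually transmitted no longer has this ideal behavior, which is why Table 1 reports non-zero side lobe levels.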
The ZC sequence has several nice properties [18] that are useful for sound signal modulation. For example, ZC sequences have constant magnitude. Therefore, the power of the transmitted sound is constant, so that we can measure its phase at high sampling rates, as shown in later sections. Note that, compared to the single-frequency scheme [28], the disadvantage of modulated signals, including the ZC sequence, is that they occupy a larger bandwidth and therefore require a stable frequency response from the microphone.

4.2 Modulation and Demodulation

We use a two-step modulation scheme to convert the raw ZC sequence into an inaudible sound signal, as illustrated in Figure 4. The first step is to use frequency domain interpolation to reduce the bandwidth of the sequence. We first perform an $N_{ZC}$-point FFT on the raw complex-valued ZC sequence, where $N_{ZC}$ is the length of the sequence. We then zero-pad the FFT result into $N'_{ZC} = N_{ZC} f_s / B$ points by


inserting zeros after the positive frequency components and before the negative frequency components, where $B$ is the target signal bandwidth (e.g., 6 kHz) and $f_s$ is the sampling rate of the sound (e.g., 48 kHz). In this way, the interpolated ZC sequence only occupies a small bandwidth of $B$ in the frequency domain. Finally, we use the IFFT to convert the interpolated signal back into the time domain.

In VSkin, we choose a ZC sequence length of 127 points with a parameter of $u = 63$. We pad the 127-point ZC sequence into 1024 points. Therefore, we have $B = 5.953$ kHz at the sampling rate of $f_s = 48$ kHz. The interpolated ZC sequence is a periodic complex-valued signal with a period of 1024 sample points (21.3 ms), as shown in Figure 5(a).

[Figure 5: Baseband signal of the ZC sequence. (a) Baseband signal in the time domain: normalized I and Q vs. samples. (b) Autocorrelation of baseband signal: absolute value vs. samples, with the main lobe at (0, 1.0) and a marked side lobe at (−11, 0.211).]

The second step of the modulation process is to up-convert the signal into the passband. In the up-conversion step, the interpolated ZC sequence is multiplied by a carrier of frequency $f_c$, as shown in Figure 4. The transmitted passband signal is $T(t) = \cos(2\pi f_c t)\, ZC_T^{I}(t) - \sin(2\pi f_c t)\, ZC_T^{Q}(t)$, where $ZC_T^{I}(t)$ and $ZC_T^{Q}(t)$ are the real part and the imaginary part of the time domain ZC sequence, respectively. We set $f_c$ to 20.25 kHz so that the transmitted signal occupies the bandwidth from 17.297 kHz to 23.25 kHz. This is because frequencies higher than 17 kHz are inaudible to most people [20].

The signal is transmitted through the speaker on the mobile device and recorded by the microphones using the same sampling frequency of 48 kHz. After receiving the sound signal, VSkin first demodulates the signal by down-converting the passband signal back into the complex-valued baseband signal.
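The two modulation steps and the down-conversion can be sketched as follows. This is a minimal illustration under stated assumptions rather than the VSkin implementation. The constants (127-point ZC, u = 63, 1024-point frame, 48 kHz sampling, 20.25 kHz carrier) come from the text. The choice q = 0, the amplitude scaling after the IFFT, and the brick-wall low-pass filter at 3 kHz used in `downconvert` are assumptions, since the text does not specify how the down-conversion filter is realized. All function names are hypothetical.

```python
import numpy as np

FS = 48_000      # sampling rate in Hz (from the text)
FC = 20_250      # carrier frequency in Hz (from the text)
N_ZC, U, N_FRAME = 127, 63, 1024


def zc_sequence(n_zc=N_ZC, u=U, q=0):
    """Root ZC sequence, Equation (1); q=0 is an assumption."""
    n = np.arange(n_zc)
    return np.exp(-1j * np.pi * ((u * n * (n + 1 + 2 * q)) % (2 * n_zc)) / n_zc)


def interpolate_freq(seq, n_out=N_FRAME):
    """Zero-pad the spectrum between positive and negative frequencies."""
    spec = np.fft.fft(seq)
    n_pos = (len(seq) + 1) // 2          # DC plus positive-frequency bins
    n_neg = len(seq) - n_pos             # negative-frequency bins
    padded = np.zeros(n_out, dtype=complex)
    padded[:n_pos] = spec[:n_pos]
    padded[n_out - n_neg:] = spec[n_pos:]
    # Rescale so the amplitude matches the raw sequence (assumed convention).
    return np.fft.ifft(padded) * (n_out / len(seq))


def upconvert(baseband, fc=FC, fs=FS):
    """T(t) = cos(2*pi*fc*t) * I(t) - sin(2*pi*fc*t) * Q(t)."""
    t = np.arange(len(baseband)) / fs
    return (np.cos(2 * np.pi * fc * t) * baseband.real
            - np.sin(2 * np.pi * fc * t) * baseband.imag)


def downconvert(passband, fc=FC, fs=FS, cutoff_hz=3_000):
    """Mix down with exp(-j*2*pi*fc*t), then low-pass (assumed brick-wall)."""
    t = np.arange(len(passband)) / fs
    mixed = 2 * passband * np.exp(-2j * np.pi * fc * t)
    spec = np.fft.fft(mixed)
    freqs = np.fft.fftfreq(len(mixed), d=1 / fs)
    spec[np.abs(freqs) > cutoff_hz] = 0   # remove the image near -2*fc
    return np.fft.ifft(spec)


if __name__ == "__main__":
    baseband = interpolate_freq(zc_sequence())
    tx = upconvert(baseband)
    rx = downconvert(tx)
    print("max round-trip error:", np.max(np.abs(rx - baseband)))
```

With these parameters the carrier falls exactly on an FFT bin of the 1024-sample frame (20,250 Hz / 46.875 Hz = 432), so a frequency-domain filter over one frame recovers the baseband signal without edge effects. A streaming implementation would more likely use an FIR or IIR low-pass filter instead.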
5 SOUND PATH SEPARATION AND MEASUREMENT

5.1 Multipath Propagation Model

The received baseband signal is a superposition of multiple copies of the transmitted signal with different delays due to multipath propagation. Suppose that the transmitted baseband signal is $ZC_T(t)$ and the system is a Linear Time-Invariant (LTI) system; then the received baseband signal can be represented as:

$$ZC_R(t) = \sum_{i=1}^{L} A_i e^{-j\phi_i}\, ZC_T(t - \tau_i) = h(t) * ZC_T(t), \qquad (2)$$

where $L$ is the number of propagation paths, $\tau_i$ is the delay of the $i$-th propagation path, and $A_i e^{-j\phi_i}$ represents the complex path coefficient (i.e., amplitude and phase) of the $i$-th propagation path. The received signal can be viewed as a circular convolution, $h(t) * ZC_T(t)$, of the Impulse Response $h(t)$ and the periodic transmitted signal $ZC_T(t)$. The Impulse Response (IR) function of the multipath propagation model is given by

$$h(t) = \sum_{i=1}^{L} A_i e^{-j\phi_i}\, \delta(t - \tau_i), \qquad (3)$$

where $\delta(t)$ is Dirac's delta function.

We use the cross-correlation, $\hat{h}(t) = ZC_R^{*}(-t) * ZC_T(t)$, of the received baseband signal $ZC_R(t)$ with the transmitted ZC sequence $ZC_T(t)$ as the estimation of the impulse response. Due to the ideal periodic auto-correlation property of the ZC code, where the auto-correlation of the ZC sequence is non-zero only at the point with a delay $\tau$ of zero, the estimation $\hat{h}(t)$ provides a good approximation of the IR function.

In our system, $\hat{h}(t)$ is sampled with an interval of $T_s = 1/f_s = 0.021$ ms, which corresponds to 0.7 cm (343 m/s × 0.021 ms) of propagation distance. The sampled version of the IR estimation, $\hat{h}[n]$, has 1024 taps with $n = 0, \ldots, 1023$. Therefore, the maximum unambiguous range of our system is 1024 × 0.7 / 2 = 358 cm, which is enough to avoid interference from nearby objects. Using the cross-correlation, we obtain one frame of IR estimation $\hat{h}[n]$ for each period of 1,024 sound samples (21.33 ms), as shown in Figure 3. Each peak in the IR estimation indicates one propagation path at the corresponding delay, i.e., a path with a delay of $\tau_i$ will lead to a peak at the $n_i = \tau_i / T_s$ sample point.

Table 2: Different propagation paths

| Path | Type | Speed | Distance | Delay | Amplitude |
|---|---|---|---|---|---|
| 1 | Structure (Mic 1) | >2,000 m/s | 4.5 cm | ≪0.13 ms | Large |
| 2 | Structure (Mic 2) | >2,000 m/s | 12 cm | ≪0.13 ms | Medium |
| 3 | LOS (Mic 1) | 343 m/s | 4.5 cm | 0.13 ms | Large |
| 4 | LOS (Mic 2) | 343 m/s | 12 cm | 0.34 ms | Medium |
| 5 | Reflection (Mic 1) | 343 m/s | >4.5 cm | >0.13 ms | Small |
| 6 | Reflection (Mic 2) | 343 m/s | >12 cm | >0.34 ms | Small |

5.2 Sound Propagation Model

In our system, there are three different kinds of propagation paths: the structure path, the LOS air path, and the reflection air path; see Figure 2.

Theoretically, we can estimate the delay and amplitude of different paths based on the speed and attenuation of


sound in different materials and the propagation distance. Table 2 lists the theoretical propagation delays and amplitude for the six different paths between the speaker and the two microphones on the example shown in Figure 2. Given the high speed of sound for the structure-borne sound, the two structure sound paths (Path 1 and Path 2) have similar delays even if their path lengths are slightly different. Since the acoustic attenuation coefficient of metal is close to air [26], the amplitude of structure sound path is close to the amplitude of the LOS air path. The LOS air paths (Path 3 and Path 4) have longer delays than the structure paths due to the slower speed of sound in the air. The reflection air paths (Path 5 and Path 6) arrive after the LOS air paths due to the longer path length. The amplitudes of reflection air paths are smaller than other two types of paths due to the attenuation along the reflection and propagation process. 5.3 Sound Propagation Separation Typical impulse response estimations of the two microphones are shown in Figure 3. Although the theoretical delay difference between Path 1 and Path 3 is 0.13 ms (6 samples), the time resolution of the interpolated ZC sequence is not enough to separate Path 1 and Path 3 on Mic 1. Thus, the first peak in the IR estimation of the Mic 1 represents the combination of Path 1 and Path 3. Due to the longer distance from the speaker to Mic 2, the theoretical delay difference between Path 2 and Path 4 is 0.34ms (17 samples). As a result, the Mic 2 has two peaks with similar amplitude, which correspond to the structure path (the first peak) and the LOS air path (the second peak), respectively. By locating the peaks of the IR estimation of the two microphones, we are able to separate different propagation paths. We use the IR estimation of both microphones to identify different propagation paths. 
[Figure 6: Path coefficient at different sampling rates. Axes: I (normalized) vs. Q (normalized); traces with and without sampling rate increasing.]

On commercial mobile devices, the starting point of the auto-correlation function is random due to the randomness in the hardware/system delay of sound playback and recording. The peaks corresponding to the structure propagation may appear at random positions every time the system restarts. Therefore, we need to first locate the structure paths in the IR estimations. Our key observation is that the two microphones are strictly synchronized, so their structure paths should appear at the same position in the IR estimations. Based on this observation, we first locate the highest peak of Mic 1, which corresponds to the combination of both Path 1 and Path 3. Then, we can locate the peaks of Path 2 and Path 4 in the IR estimation of Mic 2, as the position of Path 2 should be aligned with Path 1/Path 3. Since we focus on the movement around the mobile device, the reflection air path is 5 to 15 samples (3.5 to 10.7 cm) away from the LOS path for both microphones. In this way, we get the delays of (i) the combination of Path 1 and Path 3, (ii) Path 2, (iii) Path 4, and (iv) the range of the reflection air paths (Path 5 and Path 6), respectively. We call this process path delay calibration, which is performed once when the system starts transmitting and recording the sound signal. The path delay calibration is based on the first ten data segments (213 ms) of IR estimation. We use a 1-nearest neighbor algorithm to confirm the path delays based on the results of the ten segments. Note that the calibration time is 14.95 ms for one segment (21.3 ms). Thus, we can perform calibration for each segment in real time. To save computational cost, we only calibrate the LOS path and structure-borne path delays for the first ten segments (213 ms).
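The calibration procedure described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names, the LOS search window of 10 to 30 samples after the structure peak, and the 343 m/s speed of sound are assumptions, and a simple majority vote over the ten segments stands in for the 1-nearest neighbor step, whose details are not given in the text.

```python
from collections import Counter

FS_HZ = 48_000               # sampling rate stated in the text
SPEED_OF_SOUND_M_S = 343.0   # assumption: speed of sound in air


def cm_per_sample():
    """Distance sound travels in air during one sample period, in cm."""
    return SPEED_OF_SOUND_M_S * 100.0 / FS_HZ


def calibrate_segment(mic1_mag, mic2_mag, los_search=(10, 30)):
    """Return (structure_delay, mic2_los_delay) for one IR segment.

    The highest Mic 1 peak is the combined Path 1/Path 3. Path 2 on Mic 2 is
    assumed to sit at the same index, and Path 4 is the strongest tap in a
    hypothetical search window after it.
    """
    structure = max(range(len(mic1_mag)), key=mic1_mag.__getitem__)
    lo = structure + los_search[0]
    hi = min(structure + los_search[1], len(mic2_mag) - 1)
    los = max(range(lo, hi + 1), key=mic2_mag.__getitem__)
    return structure, los


def calibrate(segments):
    """Majority vote over per-segment results (stand-in for the 1-NN step)."""
    votes = Counter(calibrate_segment(m1, m2) for m1, m2 in segments)
    return votes.most_common(1)[0][0]


def make_ir(peaks, length=160):
    """Synthetic IR magnitude with a small noise floor and the given peaks."""
    mag = [0.01] * length
    for index, amplitude in peaks.items():
        mag[index] = amplitude
    return mag


# Nine clean segments and one outlier whose LOS peak is off by one sample.
clean = (make_ir({100: 1.0}), make_ir({100: 1.0, 117: 0.9}))
outlier = (make_ir({100: 1.0}), make_ir({100: 1.0, 118: 0.9}))
segments = [clean] * 9 + [outlier]

structure_delay, los_delay = calibrate(segments)
reflection_window_cm = (5 * cm_per_sample(), 15 * cm_per_sample())
```

With these assumptions, one sample corresponds to about 0.71 cm of air path, so the 5 to 15 sample reflection window spans roughly 3.6 to 10.7 cm, in line with the range quoted in the text.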
The path delay calibration is only performed once after the system initialization because holding styles hardly change the delays of the structure-borne path and the LOS path. For the reflection path delay, we adaptively estimate it as shown in Section 6.2 so that our system is robust to different holding styles.

5.4 Path Coefficient Measurement

After finding the delay of each propagation path, we measure the path coefficient of each path. For a path $i$ with a delay of $n_i$ samples in the IR estimation, the path coefficient is the complex value of $\hat{h}[n_i]$ on the corresponding microphone. The path coefficient indicates how the amplitude and phase of the given path change with time. Both the amplitude and the phase of the path coefficient are important for the later movement measurement and touch detection algorithms.

One key challenge in path coefficient measurement is that cross-correlations are measured at low sampling rates. The basic cross-correlation algorithm presented in Section 5.1 produces one IR estimation per frame of 1,024 samples. This converts to a sampling rate of 48,000/1,024 = 46.875 Hz. The low sampling rate may lead to ambiguity in fast movements where the path coefficient changes quickly. Figure 6 shows the path coefficient of a finger movement with a speed of 10 cm/s. We observe that there are only 2 to 3 samples in each phase cycle of 2π. As a phase difference of π can be caused either by a phase increase of π or by a phase decrease of π, the direction of the phase change cannot be determined by such low rate measurements.
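The ambiguity can be illustrated with a few lines of arithmetic. The sketch below is only an illustration under a simplifying assumption (a unit-amplitude path coefficient whose phase changes by exactly π between consecutive frames); the helper name `sampled_trace` is hypothetical.

```python
import cmath

FS_HZ = 48_000       # audio sampling rate from the text
FRAME_LEN = 1_024    # samples per IR estimation frame

# One IR estimate per frame gives the low measurement rate quoted above.
frame_rate_hz = FS_HZ / FRAME_LEN


def sampled_trace(step_rad, count):
    """Unit-amplitude path coefficient whose phase advances by step_rad per frame."""
    return [cmath.exp(1j * step_rad * k) for k in range(count)]


# Phase increasing by pi per frame versus decreasing by pi per frame.
forward = sampled_trace(+cmath.pi, 8)
backward = sampled_trace(-cmath.pi, 8)

# The two traces coincide sample for sample, so the direction is lost.
max_gap = max(abs(a - b) for a, b in zip(forward, backward))
```

Because the two traces are numerically identical, no processing of frame-rate samples alone can tell whether the path length is growing or shrinking, which motivates measuring the coefficient at a higher rate.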


[Figure 7: Path coefficient measurement for delay n. Block diagram: the transmitted baseband signal passes through a fixed cyclic shift of n samples, is multiplied with the received baseband signal, summed over the window x(t) to x(t − 1023), and low-pass filtered to give $\hat{h}_t[n]$.]

We use the property of the circular cross-correlation to upsample the path coefficient measurements. For a given delay of $n$ samples, the IR estimation at time $t$ is given by the circular cross-correlation of the received signal and the transmitted sequence:

$$\hat{h}_t[n] = \sum_{l=0}^{N'_{ZC}-1} ZC_R[t+l] \times ZC_T^{*}\big[(l-n) \bmod N'_{ZC}\big] \qquad (4)$$

This is equivalent to summing $N'_{ZC}$ points of the received signal multiplied by a conjugated ZC sequence cyclically shifted by $n$ points. The key observation is that the ZC sequence has constant power, i.e., $ZC[n] \times ZC^{*}[n] = 1, \forall n$. Thus, each point in the $N'_{ZC}$ multiplication results in Eq. (4) contributes equally to the estimation of $\hat{h}_t[n]$. In consequence, the summation over a window with a size of $N'_{ZC}$ can start from any value of $t$. Instead of advancing the value $t$ by a full frame of 1,024 sample points as in ordinary cross-correlation operations, we can advance $t$ one sample each time. In this way, we can obtain the path coefficient with a sampling rate of 48 kHz, which gives the details of changes in the path coefficient as shown in Figure 6.

The above upsampling scheme incurs a high computational cost. To obtain all path coefficients $\hat{h}_t[n]$ for delay $n$ ($n = 0 \sim 1023$), it requires 48,000 dot products per second, and each dot product is performed with two vectors of 1,024 samples. This cannot be easily carried out by mobile devices. To reduce the computational cost, we observe that not all taps in $\hat{h}_t[n]$ are useful. We are only interested in the taps corresponding to the structure propagation paths and the reflection air paths within a distance of 15 cm. Therefore, instead of calculating the full cross-correlation, we just calculate the path coefficients at given delays using a fixed cyclic shift of $n$.
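The idea of evaluating Eq. (4) at a single delay while advancing one sample at a time can be sketched as below. This is a minimal sketch under stated assumptions: it uses a plain odd-length Zadoff-Chu sequence (length 61, root 1) rather than the interpolated 1,024-point sequence of the paper, it omits the final low-pass filter, and the names `zc_sequence` and `path_coefficient` are hypothetical.

```python
import cmath


def zc_sequence(length, root=1):
    """Zadoff-Chu sequence of odd length; every sample has unit magnitude."""
    return [cmath.exp(-1j * cmath.pi * root * n * (n + 1) / length)
            for n in range(length)]


def path_coefficient(rx, tx, delay):
    """Per-sample path coefficient at one fixed delay.

    rx is the received baseband signal and tx is one period of the
    transmitted sequence. Each new sample costs one complex multiplication
    plus one addition and one subtraction on the running sum.
    """
    n_zc = len(tx)
    running = 0j
    products = []
    estimates = []
    for k, sample in enumerate(rx):
        # Multiply with the conjugate of the cyclically shifted transmit sample.
        prod = sample * tx[(k - delay) % n_zc].conjugate()
        products.append(prod)
        running += prod
        if k >= n_zc:
            running -= products[k - n_zc]   # drop the sample leaving the window
        if k >= n_zc - 1:
            estimates.append(running / n_zc)
    return estimates


# Synthetic channel: a single path with delay 7 and complex gain `gain`.
tx = zc_sequence(61)
gain = 0.5 * cmath.exp(0.3j)
rx = [gain * tx[(k - 7) % 61] for k in range(200)]

on_path = path_coefficient(rx, tx, 7)    # recovers the gain at every sample
off_path = path_coefficient(rx, tx, 9)   # stays near zero at other delays
```

In this toy setting the estimate at the true delay equals the path gain for every window position, which is the property that allows the window to start at any sample.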
Figure 7 shows the process of measuring the path coefficient at a given delay. First, we synchronize the transmitted signal and the received signal by cyclically shifting the transmitted signal with a fixed offset of $n_i$ corresponding to the delay of the given path. Second, we multiply each sample of the received baseband signal with the conjugate of the shifted transmitted sample. Third, we use a moving average with a window size of 1,024 to sum the complex values and get the path coefficients. Note that the moving average can be carried out with just two additions per sample. Fourth, we use a low-pass filter to remove high-frequency noise caused by imperfections of the interpolated ZC sequence. Finally, we get the path coefficient at a 48 kHz sampling rate. After the optimization, measuring the path coefficient at a given delay only incurs one multiplication and two additions for each sample.

6 MOVEMENT MEASUREMENT

6.1 Finger Movement Model

Finger movements incur both magnitude and phase changes in path coefficients. First, the delay for the peak corresponding to the reflection path of the finger changes when the finger moves. Figure 8(a) shows the magnitude of the IR estimations when the finger first moves away from the microphone and then moves back. The movement distance is 10 cm on the surface of the mobile device. A "hot" region indicates a peak at the corresponding distance in the IR estimation. While we can observe that there are several peaks in the raw IR estimation and that they change with the movement, it is hard to discern the reflection path as it is much weaker than the LOS path or the structure path. To amplify the changes, we take the difference of the IR estimation along the time axis to remove these static paths. Figure 8(b) shows the resulting differential IR estimations. We observe that the finger moves away from the microphone during 0.7 to 1.3 seconds and moves toward the microphone from 3 to 3.5 seconds.
The path length changes by about 20 cm (10 × 2) during the movement. In theory, we can track the position of the peak corresponding to the reflection path and measure the finger movement. However, the position of the peak is measured in terms of the number of samples, which gives a low resolution of around 0.7 cm per sample. Furthermore, estimation of the peak position is susceptible to noise, which leads to large errors in distance measurements.

We utilize phase changes in the path coefficient to measure movement distance so that we can achieve mm-level distance accuracy. Consider the case where the reflection path of the finger is path $i$ and its path coefficient is:

$$\hat{h}_t[n_i] = A_i e^{-j\left(\phi_i + 2\pi \frac{d_i(t)}{\lambda_c}\right)}, \qquad (5)$$

where $d_i(t)$ is the path length at time $t$. The phase for path $i$ is $\phi_i(t) = \phi_i + 2\pi \frac{d_i(t)}{\lambda_c}$, which changes by 2π when $d_i(t)$ changes by the sound wavelength $\lambda_c = c/f_c$ (≈1.69 cm) [28]. Therefore, we can measure the phase change of the reflection path to obtain mm-level accuracy in the path length $d_i(t)$.

6.2 Reflection Path Delay Estimation

The first step for measuring the finger movement is to estimate the delay of the reflection path. Due to the non-negligible main lobe width of the auto-correlation function, multiple IR estimations that are close to the reflection path have similar changes when the finger moves. We need to


adaptively select one of these IR estimations to represent the reflection path so that noise introduced by side lobes of other paths can be reduced.

[Figure 8: IR estimations for finger movement. (a) Magnitude of the raw IR estimations; (b) magnitude of differential IR estimations. Axes: time (seconds) vs. path length (cm).]

Our heuristic to determine the delay of the reflection path is based on the observation that the reflection path will have the largest change of magnitude compared to other paths. Consider the changes of magnitude in $\hat{h}_t[n_i]$:

$$\hat{h}_t[n_i] - \hat{h}_{t-\Delta t}[n_i] = A_i \left( e^{-j\left(\phi_i + 2\pi \frac{d_i(t)}{\lambda_c}\right)} - e^{-j\left(\phi_i + 2\pi \frac{d_i(t-\Delta t)}{\lambda_c}\right)} \right).$$

Here we assume that $A_i$ does not change during the short period of $\Delta t$. When the delay $n_i$ is exactly the same as that of the reflection path, the magnitude of $\hat{h}_t[n_i] - \hat{h}_{t-\Delta t}[n_i]$ is maximized. This is because the magnitude of $|A_i|$ is maximized at the peak corresponding to the auto-correlation of the reflection path, and the magnitude of $e^{-j\left(\phi_i + 2\pi \frac{d_i(t)}{\lambda_c}\right)} - e^{-j\left(\phi_i + 2\pi \frac{d_i(t-\Delta t)}{\lambda_c}\right)}$ is maximized due to the largest path length change at the reflection path delay.

In our implementation, we select l path coefficients with an interval of three samples between each other as the candidate reflection paths. The distance between these candidate reflection paths and the structure path is determined by the size of the phone, e.g., 5 to 15 samples for the bottom Mic. We keep monitoring the candidate path coefficients and select the path with the maximum magnitude in the time differential IR estimations as the reflection path. When the finger is static, our system still keeps track of the reflection path. In this way, we can use the changes in the selected reflection path to detect whether the finger moves or not.
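The selection heuristic can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the candidate spacing follows the three-sample interval and the 5 to 15 sample range mentioned above, and the synthetic coefficients (a strong static structure path plus a weaker reflection with main-lobe leakage into neighboring taps) are invented for the example.

```python
import cmath


def candidate_delays(structure_delay, first=5, last=15, step=3):
    """Candidate reflection delays, spaced three samples apart."""
    return list(range(structure_delay + first, structure_delay + last + 1, step))


def select_reflection_delay(prev_ir, curr_ir, candidates):
    """Pick the candidate whose coefficient changed most between two frames."""
    return max(candidates, key=lambda n: abs(curr_ir[n] - prev_ir[n]))


# Synthetic IR: static structure path at tap 100, finger reflection at tap 108.
prev_ir = [0j] * 130
prev_ir[100] = 10 + 0j            # strong static path, identical in both frames
prev_ir[105] = 0.3 + 0j           # leakage from the reflection main lobe
prev_ir[108] = 1.0 + 0j           # reflection path itself
prev_ir[111] = 0.3 + 0j           # leakage on the other side

rotation = cmath.exp(-0.5j)       # phase change caused by a small finger movement
curr_ir = list(prev_ir)
for tap in (105, 108, 111):
    curr_ir[tap] = prev_ir[tap] * rotation

candidates = candidate_delays(100)
best = select_reflection_delay(prev_ir, curr_ir, candidates)
```

The static structure path cancels in the difference, and the tap with the largest amplitude shows the largest differential magnitude, which is why the heuristic lands on the reflection delay rather than on its side lobes.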
6.3 Additive Noise Mitigation

Although the adaptive reflection path selection scheme gives high-SNR measurements of path coefficients, additive noise from other paths still interferes with the measured path coefficients. Figure 9 shows the trace of the complex path coefficient during a finger movement. In the ideal case, the path coefficient is $\hat{h}_t[n_i] = A_i e^{-j(\phi_i + 2\pi d_i(t)/\lambda_c)}$ with a constant attenuation of $A_i$ over a short period. Therefore, the trace of path coefficients should be a circle in the complex plane. However, due to additive noise, the trace in Figure 9 is not smooth enough for later phase measurements.

[Figure 9: Path coefficients for finger reflection path. Axes: I (normalized) vs. Q (normalized); traces with and without EKF.]

We propose to use the Extended Kalman Filter (EKF), a non-linear filter, to track the path coefficient and reduce the additive noise. The goal is to make the resulting path coefficient closer to the theoretical model so that the phase change incurred by the movement can be measured with higher accuracy. We use the sinusoid model to predict and update the signal of both I/Q components [8]. To save computational resources, we first detect whether the finger is moving or not as shown in Section 6.2. When we find that the finger is moving, we initialize the parameters of the EKF and run it. We also downsample the path coefficient to 3 kHz to make the EKF affordable for mobile devices. Figure 9 shows that the results after EKF are much smoother than the original signal.

6.4 Phase Based Movement Measurement

We use a curvature-based estimation scheme to measure the phase change of the path coefficient. Our estimation scheme assumes that the path coefficient is a superposition of a circularly changing dynamic component, which is caused by the moving finger, and a quasi-static component, which is caused by nearby static objects [28, 29, 34].
The algorithm estimates the phase of the dynamic component by measuring the curvature of the trace on the complex plane. The curvature-based scheme avoids the error-prone process of estimating the quasi-static component in LEVD [28] and is robust to noise interference.

Suppose that we use a trace in the two-dimensional plane $y(t) = (I_{\hat{h}_t}, Q_{\hat{h}_t})$ to represent the path coefficient of the reflection. As shown in Figure 9, the instantaneous signed curvature can be estimated as:

$$k(t) = \frac{\det\big(y'(t), y''(t)\big)}{\lVert y'(t) \rVert^{3}}, \qquad (6)$$

where $y'(t) = dy(t)/dt$ is the first derivative of $y(t)$ with respect to the parameter $t$, and $\det$ denotes the determinant


of the given matrix. We assume that the instantaneous curvature remains constant during the time period $t-1 \sim t$ and the phase change of the dynamic component is:

$$\Delta\theta_{t-1}^{t} = 2 \arcsin\left( \frac{k(t)\, \lVert y(t) - y(t-1) \rVert}{2} \right). \qquad (7)$$

The path length change for the time period $0 \sim t$ is:

$$d_i(t) - d_i(0) = -\frac{\sum_{i=1}^{t} \Delta\theta_{i-1}^{i}}{2\pi} \times \lambda_c, \qquad (8)$$

where $d_i(t)$ is the path length from the speaker reflected through the finger to the microphone.

6.5 From Path Length to Movements

The path length change for the reflection air path can be measured on both microphones. Depending on the type of gesture and the placement of the microphones, we can use the path length change to derive the actual movement distance. For example, for the phone in Figure 2, we can use the path length change of the reflection air path on the bottom microphone to measure the finger movement distance for the scrolling gesture (up/down movement). This is because the length of the reflection path on the bottom microphone changes significantly when the finger moves up/down on the back of the phone. The actual movement distance can be calculated by multiplying the path length change by a compensating factor as described in Section 8. For the gesture of swiping left/right, we can use the path length changes of the two microphones to determine the swiping direction, as swiping left and right will introduce the same path length change pattern on the bottom microphone but different path length change directions on the top microphone.

[Figure 10: Touching on different locations. (a) Magnitude of differential IR estimations when touch and release occur 7 cm away from the speaker; (b) magnitude of differential IR estimations when touch and release occur 1 cm away from the speaker. Axes: time (seconds) vs. delay (samples).]

7 TOUCH MEASUREMENT

7.1 Touch Signal Pattern

Touching the surface with fingers will change both the air-borne propagation and the structure-borne propagation of the sound.
When performing the tapping action, the finger movement in the air will change the air-borne propagation of the sound. Meanwhile, when the finger contacts the surface of the phone, the force applied on the surface will change the vibration pattern of the structure of the phone, which leads to changes in the structure-borne signal [25]. In other words, the structure-borne sound is able to distinguish whether the finger is hovering above the surface with a mm-level gap or is pressing on the surface. In VSkin, we mainly use the changes in the structure-borne signal to sense finger touches, as they provide distinctive information about whether the finger touches the surface or not. However, when force is applied at different locations on the surface, the changes of the structure-borne sound caused by touching will differ in magnitude and phase. Existing schemes only use the magnitude of the structure-borne sound [25], which has different change rates at different touch positions. They rely on the touchscreen to determine the position and the accurate time of the touch in order to measure the force level of the touch [25]. However, neither the location nor the time of the touch is available to VSkin. Therefore, the key challenge in touch sensing for VSkin is to perform joint touch detection and touch localization.

Touching events lead to unique patterns in the differential IR estimation. As an example, Figure 10 shows the differential IR estimations that are close to the structure-borne path of the top microphone in Figure 2 when the user touches the back of the phone. The y-axis is the number of samples relative to the structure-borne path, where the structure-borne path (Path 2 in Section 5.3) is at y = 0. When force is applied on the surface, the width of the peak corresponding to the structure-borne path increases. This leads to a small deviation of the peak position in the path coefficient changes from the original peak of the structure-borne propagation.
Figure 10(a) shows the resulting differential IR estimations when the user's finger touches/leaves the surface of the mobile device at a position that is 7 cm away from the rear speaker. We observe that the "hottest" region is not at the original peak of the structure-borne propagation. This is because the force applied on the surface changes the path of the structure-borne signal. To further explore the change of the structure-borne propagation, we ask the user to perform finger tapping at eleven different positions on the back of the device and measure the position of peaks in the path coefficient changes. Figure 11 shows the relationship between the touching position and the resulting peak position in coefficient changes, where the peak position is measured by the number of samples relative to the original structure-borne path. We observe that the larger the distance between the touching position and the speaker, the larger the delay in coefficient changes relative to the original structure-borne path (darker color means a larger delay). Thus, we utilize the magnitude and delay of differential IR estimations to detect and localize touch events. Note
