VSkin: Sensing Touch Gestures on Surfaces of Mobile Devices Using Acoustic Signals

Ke Sun, State Key Laboratory for Novel Software Technology, Nanjing University, China, kesun@smail.nju.edu.cn
Ting Zhao, State Key Laboratory for Novel Software Technology, Nanjing University, China, zhaoting@smail.nju.edu.cn
Wei Wang, State Key Laboratory for Novel Software Technology, Nanjing University, China, ww@nju.edu.cn
Lei Xie, State Key Laboratory for Novel Software Technology, Nanjing University, China, lxie@nju.edu.cn

ABSTRACT
Enabling touch gesture sensing on all surfaces of the mobile device, not limited to the touchscreen area, leads to new user interaction experiences. In this paper, we propose VSkin, a system that supports fine-grained gesture-sensing on the back of mobile devices based on acoustic signals. VSkin utilizes both the structure-borne sounds, i.e., sounds propagating through the structure of the device, and the air-borne sounds, i.e., sounds propagating through the air, to sense finger tapping and movements. By measuring both the amplitude and phase of each path of sound signals, VSkin detects tapping events with an accuracy of 99.65% and captures finger movements with an accuracy of 3.59 mm.

CCS CONCEPTS
• Human-centered computing → Interface design prototyping; Gestural input;

KEYWORDS
Touch gestures, Ultrasound

ACM Reference Format:
Ke Sun, Ting Zhao, Wei Wang, and Lei Xie. 2018. VSkin: Sensing Touch Gestures on Surfaces of Mobile Devices Using Acoustic Signals. In MobiCom '18: 24th Annual International Conference on Mobile Computing and Networking, October 29–November 2, 2018, New Delhi, India. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/XXXXXX.XXXXXX

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MobiCom '18, October 29–November 2, 2018, New Delhi, India
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5903-0/18/10...$15.00
https://doi.org/10.1145/XXXXXX.XXXXXX

Figure 1: Back-of-Device interactions. (a) Back-Swiping; (b) Back-Tapping; (c) Back-Scrolling.

1 INTRODUCTION
Touch gesture is one of the most important ways for users to interact with mobile devices. With the wide deployment of touchscreens, a set of user-friendly touch gestures, such as swiping, tapping, and scrolling, have become the de facto standard user interface for mobile devices. However, due to the high cost of touchscreen hardware, gesture-sensing is usually limited to the front surface of the device. Furthermore, touchscreens combine the function of gesture-sensing with the function of displaying. This leads to the occlusion problem [30], i.e., user fingers often block the content displayed on the screen during the interaction process.

Enabling gesture-sensing on all surfaces of the mobile device, not limited to the touchscreen area, leads to new user interaction experiences. First, new touch gestures solve the occlusion problem of the touchscreen.
For example, Back-of-Device (BoD) gestures use tapping or swiping on the back of a smartphone as a supplementary input interface [22, 35]. As shown in Figure 1, the screen is no longer blocked when the back-scrolling gesture is used for scrolling the content. BoD gestures also enrich the user experience of mobile games by allowing players to use the back surface as a touchpad. Second, defining new touch gestures on different surfaces helps the system better understand user intentions. On traditional touchscreens, touching a webpage on the screen could mean that the user wishes to click a hyperlink or that the user just wants to scroll down the page. Existing touchscreen schemes often confuse these two intentions, due to the overloaded actions on gestures that are similar to each other.
With the new types of touch gestures performed on different surfaces of the device, these actions can be assigned to distinct gestures, e.g., selecting an item should be performed on the screen while scrolling or switching should be performed on the back or the side of the device. Third, touch sensing on the side of the phone enables virtual side-buttons that could replace physical buttons and improve the waterproof performance of the device. Compared to in-air gestures, which also enrich the gesture semantics, touch gestures provide a better user experience, due to their accurate touch detection (for confirmation) coupled with useful haptic feedback.

Fine-grained gesture movement distance/speed measurements are vital for enabling touch gestures that users are already familiar with, including scrolling and swiping. However, existing accelerometer or structural vibration based touch sensing schemes only recognize coarse-grained activities, such as tapping events [5, 35]. Extra information on the tapping position or the tapping force levels usually requires intensive training and calibration processes [12, 13, 25] or additional hardware, such as a mirror on the back of the smartphone [31].

In this paper, we propose VSkin, a system that supports fine-grained gesture-sensing on the surfaces of mobile devices based on acoustic signals. Similar to a layer of skin on the surfaces of the mobile device, VSkin can sense both the finger tapping and the finger movement distance/direction on the surface of the device. Without modifying the hardware, VSkin utilizes the built-in speakers and microphones to send and receive sound signals for touch-sensing. More specifically, VSkin captures both the structure-borne sounds, i.e., sounds propagating through the structure of the device, and the air-borne sounds, i.e., sounds propagating through the air. As touching the surface can significantly change the structural vibration pattern of the device, the characteristics of structure-borne sounds are reliable features for touch detection, i.e., whether the finger contacts the surface or not [12, 13, 25]. While it is difficult to use the structure-borne sounds to sense finger movements, air-borne sounds can measure the movement with mm-level accuracy [14, 28, 34]. Therefore, by analyzing both the structure-borne and the air-borne sounds, it is possible to reliably recognize a rich set of touch gestures as if there were another touchscreen on the back of the phone. Moreover, VSkin does not require intensive training, as it uses the physical properties of the sound propagation to detect touch and measure finger movements.

The key challenge faced by VSkin is to measure both the structure-borne and the air-borne signals with high fidelity while the hand is very close to the mobile device. Given the small form factor of mobile devices, sounds traveling through different media and paths arrive at the microphone within a short time interval of 0.13∼0.34 ms, which is just 6∼16 sample points at a sampling rate of 48 kHz. With the limited inaudible sound bandwidth (around 6 kHz) available on commercial mobile devices, it is challenging to separate these paths. Moreover, to achieve accurate movement measurement and location-independent touch detection, we need to measure both the phase and the magnitude of each path.
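For concreteness, the delay figures above can be reproduced with a short computation; the 4.5 cm and 12 cm speaker-to-microphone distances are assumptions taken from the example geometry in Table 2 (Section 5.2), and the paper rounds the resulting values:

```python
# Quick sanity check of the timing numbers above.
FS = 48_000        # sound sampling rate (Hz)
SPEED_AIR = 343.0  # speed of sound in air (m/s)

for dist_cm in (4.5, 12.0):
    delay_s = (dist_cm / 100.0) / SPEED_AIR
    print(f"{dist_cm:>4} cm -> {delay_s * 1e3:.2f} ms "
          f"= {delay_s * FS:.1f} samples")
#  4.5 cm -> 0.13 ms = 6.3 samples
# 12.0 cm -> 0.35 ms = 16.8 samples
```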
To address this challenge, we design a system that uses the Zadoff-Chu (ZC) sequence to measure different sound paths. With the near-optimal auto-correlation function of the ZC sequence, which has a peak width of 6 samples, we can separate the structure-borne and the air-borne signals even when the distance between the speaker and the microphone is just 12 cm. Furthermore, we develop a new algorithm that measures the phase of each sound path at a rate of 3,000 samples per second. Compared to traditional impulsive signal systems that measure sound paths in a frame-by-frame manner (with frame rates <170 Hz [14, 34]), the higher sampling rate helps VSkin capture fast swiping and tapping events.

We implement VSkin on commercial smartphones as real-time Android applications. Experimental results show that VSkin achieves a touch detection accuracy of 99.65% and an accuracy of 3.59 mm for finger movement distances. Our user study shows that VSkin only slightly increases the movement time used for interaction tasks, e.g., scrolling and swiping, by 34% and 10%, respectively, when compared to touchscreens.

We made the following contributions in this work:
• We introduce a new approach for touch-sensing on mobile devices by separating the structure-borne and the air-borne sound signals.
• We design an algorithm that performs the phase and magnitude measurement of multiple sound paths at a high sampling rate of 3 kHz.
• We implement our system on the Android platform and perform real-world user studies to verify our design.

2 RELATED WORK
We categorize research related to VSkin into three classes: Back-of-Device interactions, tapping and force sensing, and sound-based gesture sensing.

Back-of-Device Interactions: Back-of-device interaction is a popular way to extend the user interface of mobile devices [5, 11, 31, 32, 35]. Gestures performed on the back of the device can be detected by the built-in camera [31, 32] or sensors [5, 35] on the mobile device. LensGesture [32] uses the rear camera to detect finger movements that are performed just above the camera. Back-Mirror [31] uses an additional mirror attached to the rear camera to capture BoD gestures in a larger region. However, due to the limited viewing angle of cameras, these approaches either have a limited sensing area or need extra hardware to extend the sensing range. BackTap [35] and βTap [5] use built-in sensors, such
as the accelerometer, to sense coarse-grained gestures. However, sensor readings only provide limited information about the gesture, and they cannot quantify the movement speed and distance. Furthermore, accelerometers are sensitive to vibrations caused by hand movements while the user is holding the device. Compared to camera-based and sensor-based schemes, VSkin incurs no additional hardware costs and can perform fine-grained gesture measurements.

Figure 2: Sound propagation paths on a smartphone. The rear speaker reaches the bottom microphone (Mic 1) and the top microphone (Mic 2) along the back surface of the phone via structure paths, LOS air paths, and reflection air paths (Paths 1–6).

Tapping and Force Sensing: Tapping and force applied to the surface can be sensed by different types of sensors [4, 7, 9, 10, 12, 13, 15, 19, 25]. TapSense [7] leverages the tapping sound to recognize whether the user touches the screen with a fingertip or a fist. ForceTap [9] measures the tapping force using the built-in accelerometer. VibWrite [13] and VibSense [12] use the vibration signal instead of the sound signal to sense the tapping position so that interference in air-borne propagation can be avoided. However, they require pre-trained vibration profiles for tapping localization. ForcePhone [25] uses linear chirp sounds to sense force and touch based on changes in the magnitude of the structure-borne signal. However, fine-grained phase information cannot be measured through chirps, and chirps only capture the magnitude of the structure-borne signal at a low sampling rate. In comparison, our system measures both the phase and the magnitude of multiple sound paths at a high sampling rate of 3 kHz so that we can perform robust tap sensing without intensive training.

Sound-based Gesture Sensing: Several sound-based gesture recognition systems have been proposed to recognize in-air gestures [1, 3, 6, 16, 17, 21, 23, 33, 37]. Soundwave [6], Multiwave [17], and AudioGest [21] use the Doppler effect to recognize predefined gestures. However, the Doppler effect only gives coarse-grained movement speeds. Thus, these schemes only recognize a small set of gestures that have distinctive speed characteristics. Recently, three state-of-the-art schemes (i.e., FingerIO [14], LLAP [28], and Strata [34]) use ultrasound to track fine-grained finger gestures. FingerIO [14] transmits OFDM-modulated sound frames and locates the moving finger based on the change of the echo profiles of two consecutive frames. LLAP [28] uses a Continuous Wave (CW) signal to track the moving target based on the phase information, which is susceptible to the dynamic multipath caused by other moving objects. Strata [34] combines the frame-based approach and the phase-based approach. Using the 26-bit GSM training sequence that has nice autocorrelation properties, Strata can track phase changes at different time delays so that objects that are more than 8.5 cm apart can be resolved. However, these schemes mainly focus on tracking in-air gestures that are performed more than 20 cm away from the mobile device [14, 23, 28, 34]. In comparison, our system uses both the structure-borne and the air-borne sound signals to sense gestures performed on the surfaces of the mobile device, which are very close (e.g., less than 12 cm) to both the speakers and the microphones. As sound reflections at a short distance are often submerged by the Line-of-Sight (LOS) signals, sensing gestures with an SNR of ≈2 dB at 5 cm is considerably harder than sensing in-air gestures with an SNR of ≈12 dB at 30 cm.
3 SYSTEM OVERVIEW
VSkin uses both the structure-borne and the air-borne sound signals to capture gestures performed on the surface of the mobile device. We transmit and record inaudible sounds using the built-in speakers and microphones on commodity mobile devices. As an example illustrated in Figure 2, sound signals transmitted by the rear speaker travel through multiple paths on the back of the phone to the top and bottom microphones. On both microphones, the structure-borne sound that travels through the body structure of the smartphone arrives first. This is because sound waves propagate much faster in solids (>2,000 m/s) than in the air (around 343 m/s) [24]. Multiple copies of air-borne sounds may arrive within a short interval following the structure-borne sound. The air-borne sounds include the LOS sound and the sounds reflected by surrounding objects, e.g., the finger or the table. All these sound signals are mixed at the recording microphones.

VSkin performs gesture-sensing based on the mixture of sound signals recorded by the microphones. The design of VSkin consists of the following four components:

Transmission signal design: We choose to use the Zadoff-Chu (ZC) sequence modulated by a sinusoid carrier as our transmitted sound signal. This transmission signal design meets three key design goals. First, the auto-correlation of the ZC sequence has a narrow peak width of 6 samples, so we can separate sound paths that arrive with a small time difference by locating the peaks corresponding to their different delays, see Figure 3. Second, we use interpolation schemes to reduce the bandwidth of the ZC sequence to less than 6 kHz so that it can be fitted into the narrow inaudible range of
17∼23 kHz provided by commodity speakers and microphones. Third, we choose to modulate the ZC sequence so that we can extract the phase information, which cannot be measured by traditional chirp-like sequences such as FMCW sequences.

Figure 3: IR estimation of the dual microphones. (a) Bottom microphone (Mic 1): a dominant peak at sample 301 combining Path 1 and Path 3, followed by Path 5. (b) Top microphone (Mic 2): peaks at samples 301 (Path 2) and 313 (Path 4), followed by Path 6.

Sound path separation and measurement: To separate different sound paths at the receiving end, we first use cross-correlation to estimate the Impulse Response (IR) of the mixed sound. Second, we locate the candidate sound paths using the amplitude of the IR estimation. Third, we identify the structure-borne path, the LOS path, and the reflection path by aligning candidate paths on different microphones based on the known microphone positions. Finally, we use an efficient algorithm to calculate the phase and amplitude of each sound path at a high sampling rate of 48 kHz.

Finger movement measurement: The finger movement measurement is based on the phase of the air-borne path reflected by the finger. To detect the weak reflections of the finger, we first calculate the differential IR estimations so that changes caused by finger movements are amplified. Second, we use an adaptive algorithm to determine the delay of the reflection path so that the phase and amplitude can be measured with high SNR. Third, we use an Extended Kalman Filter to further amplify the sound signal based on the finger movement model. Finally, the finger movement distance is calculated by measuring the phase change of the corresponding reflection path.

Touch measurement: We use the structure-borne path to detect touch events, since the structure-borne path is mainly determined by whether the user's finger is pressing on the surface or not. To detect touch events, we first calculate the differential IR estimations of the structure-borne path. We then use a threshold-based scheme to detect the touch and release events. To locate the touch position, we found that the delay of the changes in the structure-borne sound is closely related to the distance from the touch position to the speaker. Using this observation, we classify the touch event into three different regions with an accuracy of 87.8%.

Note that finger movement measurement and touch measurement can use the signal captured by the top microphone, the bottom microphone, or both. How these measurements are used in specific gestures, such as scrolling and swiping, depends on both the type of the gesture and the placement of the microphones on the given device; see Section 6.5.

4 TRANSMISSION SIGNAL DESIGN

4.1 Baseband Sequence Selection
Sound signals propagated through the structure path, the LOS path, and the reflection path arrive within a very small time interval of less than 0.34 ms, due to the small size of a smartphone (<20 cm). One way to separate these paths is to transmit short impulses of sound so that the reflected impulses do not overlap with each other. However, impulses with short time durations have very low energy, so the received signals, especially those reflected by the finger, are too weak to be reliably measured.

In VSkin, we choose to transmit a periodical high-energy signal and rely on the auto-correlation properties of the signal to separate the sound paths.
A continuous periodical signal has higher energy than impulses, so the weak reflections can be reliably measured. The cyclic auto-correlation function of the signal s[n] is defined as $R(\tau) = \frac{1}{N}\sum_{n=1}^{N} s[n]\,s^*[(n-\tau) \bmod N]$, where N is the length of the signal, τ is the delay, and $s^*[n]$ is the conjugate of the signal. The cyclic auto-correlation function is maximized around τ = 0 and we define the peak at τ = 0 as the main lobe of the auto-correlation function, see Figure 5(b). When the cyclic auto-correlation function has a single narrow peak, i.e., $R(\tau) \approx 0$ for $\tau \neq 0$, we can separate multiple copies of s[n] that arrive at different delays τ by performing cross-correlation of the mixed signal with the cyclically shifted s[n]. For the cross-correlation results shown in Figure 3, each delayed copy of s[n] in the mixed signal leads to a peak at its corresponding delay value of τ.

The transmitted sound signal needs to satisfy the following extra requirements to ensure both the resolution and the signal-to-noise ratio of the path estimation:
• Narrow auto-correlation main lobe width: The width of the main lobe is the number of points on each side of the lobe where the power has fallen to half (−3 dB) of its maximum value. A narrow main lobe leads to better time resolution in sound propagation paths.
• Low baseband crest factor: The baseband crest factor is the ratio of the peak value to the effective (RMS) value of the baseband signal. A signal with a low crest factor has higher energy than a high-crest-factor signal with the same peak power [2]. Therefore, it produces cross-correlation results with a higher signal-to-noise ratio while the peak power stays below the audible power threshold.
• High auto-correlation gain: The auto-correlation gain is the peak power of the main lobe divided by the average power of the auto-correlation function. A higher auto-correlation gain leads to a higher signal-to-noise ratio in the correlation result. Usually, a longer code sequence has a higher auto-correlation gain.
• Low auto-correlation side lobe level: Side lobes are the small peaks (local maxima) other than the main lobe in the auto-correlation function. A large side lobe level will cause interference in the impulse response estimation.
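The single-narrow-peak condition above is easy to check numerically. A minimal sketch (Python/NumPy) using the 127-point ZC sequence that VSkin adopts, formally defined in Eq. (1) below:

```python
import numpy as np

# Evaluate R(tau) = (1/N) * sum_n s[n] * conj(s[(n - tau) mod N])
# for a 127-point ZC sequence (u = 63, q = 0; see Eq. (1) below).
N, u = 127, 63
n = np.arange(N)
s = np.exp(-1j * np.pi * u * n * (n + 1) / N)

# np.roll(s, tau)[n] == s[(n - tau) mod N], i.e., a cyclic shift.
R = np.array([np.mean(s * np.conj(np.roll(s, tau))) for tau in range(N)])
print(abs(R[0]))         # main lobe: 1.0
print(abs(R[1:]).max())  # every other lag: ~0 (single narrow peak)
```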
We compare the performance of the transmission signals with different code sequence designs and interpolation methods. For the code sequence design, we compare commonly used pseudo-noise (PN) sequences (i.e., the GSM training sequence, the Barker sequence, and the M-sequence) with a chirp-like polyphase sequence (the ZC sequence [18]) in Table 1. Note that the longest Barker sequence and GSM training sequence are 13 bits and 26 bits, respectively. For the M-sequence and the ZC sequence, we use a sequence length of 127 bits.

We interpolate the raw code sequences before transmitting them. The purpose of the interpolation is to reduce the bandwidth of the code sequence so that it can fit into a narrow transmission band that is inaudible to humans. There are two methods to interpolate the sequence: the time domain method and the frequency domain method. For the time domain method [34], we first upsample the sequences by repeating each sample k times (usually k = 6∼8) and then use a low-pass filter to ensure that the signal occupies the desired bandwidth. For the frequency domain method, we first perform a Fast Fourier Transform (FFT) of the raw sequence, zero-pad it in the frequency domain to increase the length of the signal, and then use an Inverse Fast Fourier Transform (IFFT) to convert the signal back into the time domain. For both methods, we reduce the bandwidth of all sequences to 6 kHz at a sampling rate of 48 kHz so that the modulated signal fits into the 17∼23 kHz inaudible range supported by commercial devices.

Figure 4: Sound signal modulation structure.

Sequence               Interpolation     Auto-corr. main  Baseband      Auto-corr.  Auto-corr. side
                       method            lobe width       crest factor  gain        lobe level
GSM (26 bits)          Time domain       14 samples       8.10 dB       11.80 dB    -4.64 dB
GSM (26 bits)          Frequency domain   8 samples       6.17 dB       11.43 dB    -3.60 dB
Barker (13 bits)       Time domain       16 samples       10.50 dB      11.81 dB    -9.57 dB
Barker (13 bits)       Frequency domain   8 samples       5.12 dB       13.46 dB    -6.50 dB
M-sequence (127 bits)  Time domain       16 samples       5.04 dB       12.04 dB    -11.63 dB
M-sequence (127 bits)  Frequency domain   8 samples       6.68 dB       13.90 dB    -6.58 dB
ZC (127 bits)          Time domain       16 samples       3.85 dB       12.14 dB    -12.45 dB
ZC (127 bits)          Frequency domain   6 samples       2.56 dB       13.93 dB    -6.82 dB
Table 1: Performance of different types of sequences

The performance of different sound signals is summarized in Table 1. The ZC sequence has the best baseband crest factor and auto-correlation gain. Although the raw M-sequences have ideal auto-correlation performance and crest factor, the sharp transitions between "0" and "1" in M-sequences make the interpolated version worse than chirp-like polyphase sequences [2]. In general, frequency domain interpolation is better than time domain interpolation, due to its narrower main lobe width. While the side lobe level of frequency domain interpolation is higher than that of time domain interpolation, the side lobe level of −6.82 dB provided by the ZC sequence gives enough attenuation on side lobes for our system.
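As an illustration of how the metrics in Table 1 can be computed, the sketch below scores an arbitrary complex baseband sequence. The definitions are paraphrased from the requirement list above; the authors' exact evaluation scripts are not given, so values obtained this way need not match Table 1 digit for digit:

```python
import numpy as np

def crest_factor_db(s):
    # Baseband crest factor: peak amplitude over RMS amplitude.
    # A constant-magnitude sequence (e.g., a raw ZC) scores 0 dB.
    return 20 * np.log10(np.max(np.abs(s)) / np.sqrt(np.mean(np.abs(s) ** 2)))

def autocorr_metrics_db(s):
    # Cyclic auto-correlation via FFT: R = (1/N) * IFFT(|FFT(s)|^2).
    N = len(s)
    P = np.abs(np.fft.ifft(np.abs(np.fft.fft(s)) ** 2) / N) ** 2  # lag powers
    main = P[0]
    gain = 10 * np.log10(main / np.mean(P))     # auto-correlation gain
    side = 10 * np.log10(np.max(P[1:]) / main)  # worst side lobe level
    width = int(np.sum(P >= main / 2))          # lags within -3 dB of peak
    return width, gain, side
```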
Based on the above considerations, we choose to use the frequency domain interpolated ZC sequence as our transmitted signal. The root ZC sequence parameterized by u is given by:

$ZC[n] = e^{-j\frac{\pi u n(n+1+2q)}{N_{ZC}}}$,  (1)

where $0 \leqslant n < N_{ZC}$, q is a constant integer, and $N_{ZC}$ is the length of the sequence. The parameter u is an integer with $0 < u < N_{ZC}$ and $\gcd(N_{ZC}, u) = 1$. The ZC sequence has several nice properties [18] that are useful for sound signal modulation. For example, ZC sequences have constant magnitudes. Therefore, the power of the transmitted sound is constant, so we can measure its phase at high sampling rates, as shown in later sections. Note that, compared to the single frequency scheme [28], the disadvantage of modulated signals, including the ZC sequence, is that they occupy a larger bandwidth and therefore require a stable frequency response from the microphone.
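The single-narrow-peak property verified numerically above follows directly from Eq. (1). A short derivation (assuming $N_{ZC}$ odd, as in VSkin's choice of 127, so that $ZC[n]$ is $N_{ZC}$-periodic and the modulo can be dropped):

```latex
% Writing N = N_ZC, the phase difference between ZC[n] and the
% cyclically shifted ZC[n - tau] is linear in n:
\begin{align*}
R(\tau) &= \frac{1}{N}\sum_{n=1}^{N} ZC[n]\, ZC^{*}[(n-\tau) \bmod N] \\
        &= \frac{1}{N}\sum_{n=1}^{N}
           e^{-j\frac{\pi u}{N}\left[\,n(n+1+2q)-(n-\tau)(n-\tau+1+2q)\,\right]} \\
        &= \frac{1}{N}\, e^{\,j\frac{\pi u}{N}\tau(\tau-1-2q)}
           \sum_{n=1}^{N} e^{-j\frac{2\pi u \tau}{N} n}.
\end{align*}
% Because gcd(u, N) = 1, the geometric sum vanishes for every tau
% that is not a multiple of N, so |R(tau)| = 1 at tau = 0 and
% |R(tau)| = 0 at all other lags.
```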
4.2 Modulation and Demodulation
We use a two-step modulation scheme to convert the raw ZC sequence into an inaudible sound signal, as illustrated in Figure 4. The first step is to use frequency domain interpolation to reduce the bandwidth of the sequence. We first perform an $N_{ZC}$-point FFT on the raw complex-valued ZC sequence, where $N_{ZC}$ is the length of the sequence. We then zero-pad the FFT result to $N_{ZC} f_s / B$ points by inserting zeros after the positive frequency components and before the negative frequency components, where B is the target signal bandwidth (e.g., 6 kHz) and $f_s$ is the sampling rate of the sound (e.g., 48 kHz). In this way, the interpolated ZC sequence only occupies a small bandwidth of B in the frequency domain. Finally, we use an IFFT to convert the interpolated signal back into the time domain.

In VSkin, we choose a ZC sequence length of 127 points with a parameter of u = 63. We pad the 127-point ZC sequence into 1024 points. Therefore, we have B = 5.953 kHz at the sampling rate of $f_s$ = 48 kHz. The interpolated ZC sequence is a periodical complex-valued signal with a period of 1024 sample points (21.3 ms), as shown in Figure 5(a).

Figure 5: Baseband signal of the ZC sequence. (a) Baseband signal in the time domain (normalized I/Q components); (b) autocorrelation of the baseband signal, with the main lobe at (0, 1.0) and the largest side lobe around (−11, 0.211).

The second step of the modulation process is to up-convert the signal into the passband. In the up-convert step, the interpolated ZC sequence is multiplied with a carrier of frequency $f_c$, as shown in Figure 4. The transmitted passband signal is $T(t) = \cos(2\pi f_c t)\,ZC_T^I(t) - \sin(2\pi f_c t)\,ZC_T^Q(t)$, where $ZC_T^I(t)$ and $ZC_T^Q(t)$ are the real and imaginary parts of the time domain ZC sequence, respectively. We set $f_c$ to 20.25 kHz so that the transmitted signal occupies the bandwidth from 17.297 kHz to 23.25 kHz. This is because frequencies higher than 17 kHz are inaudible to most people [20].

The signal is transmitted through the speaker on the mobile device and recorded by the microphones using the same sampling frequency of 48 kHz. After receiving the sound signal, VSkin first demodulates the signal by down-converting the passband signal back into the complex-valued baseband signal.
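To make the two steps concrete, here is a minimal end-to-end sketch of the modulation and demodulation pipeline with the paper's parameters ($N_{ZC} = 127$, $u = 63$, 1024-point interpolation, $f_s = 48$ kHz, $f_c = 20.25$ kHz). The low-pass step is an idealized FFT-bin mask, an assumption for illustration rather than the filter used by the actual implementation:

```python
import numpy as np

FS, FC, NZC, NP = 48_000, 20_250.0, 127, 1024

# Step 1: frequency-domain interpolation (FFT, zero-pad, IFFT).
n = np.arange(NZC)
zc = np.exp(-1j * np.pi * 63 * n * (n + 1) / NZC)   # u = 63, q = 0
spec = np.fft.fft(zc)
pad = np.zeros(NP, dtype=complex)
pad[: NZC // 2 + 1] = spec[: NZC // 2 + 1]  # DC + positive frequencies
pad[-(NZC // 2):] = spec[-(NZC // 2):]      # negative frequencies
zc_b = np.fft.ifft(pad)                     # 1024-point baseband, B ~ 5.953 kHz

# Step 2: up-convert to the passband around f_c = 20.25 kHz.
t = np.arange(NP) / FS
tx = np.cos(2 * np.pi * FC * t) * zc_b.real - np.sin(2 * np.pi * FC * t) * zc_b.imag

# Receiver: down-convert, then low-pass (ideal FFT-bin mask) to drop
# the image at 2 * f_c and recover the complex baseband.
bb = np.fft.fft(tx * np.exp(-2j * np.pi * FC * t))
cut = int(3_000 / FS * NP)                  # keep |f| < ~3 kHz
bb[cut:NP - cut] = 0
rx_b = 2 * np.fft.ifft(bb)
print(np.max(np.abs(rx_b - zc_b)))          # ~0: baseband recovered
```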
Due to the ideal periodic auto-correlation property of the ZC code, where the auto-correlation of the ZC sequence is non-zero only at the point with a delay τ of zero, the estimation $\hat{h}(t)$ provides a good approximation of the IR function. In our system, $\hat{h}(t)$ is sampled with an interval of $T_s = 1/f_s = 0.021$ ms, which corresponds to 0.7 cm (343 m/s × 0.021 ms) of propagation distance. The sampled version of the IR estimation, $\hat{h}[n]$, has 1,024 taps with n = 0 ∼ 1023. Therefore, the maximum unambiguous range of our system is 1024 × 0.7/2 = 358 cm, which is enough to avoid interference from nearby objects. Using the cross-correlation, we obtain one frame of IR estimation $\hat{h}[n]$ for each period of 1,024 sound samples (21.33 ms), as shown in Figure 3. Each peak in the IR estimation indicates one propagation path at the corresponding delay, i.e., a path with a delay of $\tau_i$ will lead to a peak at the $n_i = \tau_i/T_s$ sample point.

5.2 Sound Propagation Model

In our system, there are three different kinds of propagation paths: the structure path, the LOS air path, and the reflection air path; see Figure 2. Theoretically, we can estimate the delay and amplitude of different paths based on the speed and attenuation of
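As a sketch of the per-frame estimation above, the circular cross-correlation can be computed in the frequency domain; the function and constant names here are ours, and the frame is assumed to be aligned to the start of a transmitted ZC period:

```python
import numpy as np

TS = 1.0 / 48000.0          # sampling interval: one IR tap = 0.021 ms
CM_PER_TAP = 34300.0 * TS   # ~0.7 cm of path length per tap (343 m/s)

def ir_estimate(rx_frame: np.ndarray, zc_t: np.ndarray) -> np.ndarray:
    """One frame of IR estimation h[n] via circular cross-correlation.

    rx_frame: 1,024 complex baseband samples (one 21.33 ms period).
    zc_t:     the 1,024-point interpolated baseband ZC sequence.
    """
    # Circular correlation theorem:
    # corr[n] = sum_l rx[l] * conj(zc_t[(l - n) mod N])
    return np.fft.ifft(np.fft.fft(rx_frame) * np.conj(np.fft.fft(zc_t)))

# Each peak of |h| marks one propagation path; a tap index n_i maps to a
# propagation distance of roughly n_i * 0.7 cm:
#   h = ir_estimate(frame, zc_t)
#   n_i = int(np.argmax(np.abs(h)))
#   path_length_cm = n_i * CM_PER_TAP
```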
sound in different materials and the propagation distance. Table 2 lists the theoretical propagation delays and amplitudes for the six different paths between the speaker and the two microphones in the example shown in Figure 2. Given the high speed of the structure-borne sound, the two structure sound paths (Path 1 and Path 2) have similar delays even though their path lengths are slightly different. Since the acoustic attenuation coefficient of metal is close to that of air [26], the amplitude of the structure sound path is close to the amplitude of the LOS air path. The LOS air paths (Path 3 and Path 4) have longer delays than the structure paths due to the slower speed of sound in the air. The reflection air paths (Path 5 and Path 6) arrive after the LOS air paths due to their longer path lengths. The amplitudes of the reflection air paths are smaller than those of the other two types of paths due to the attenuation along the reflection and propagation process.

5.3 Sound Propagation Separation

Typical impulse response estimations of the two microphones are shown in Figure 3. Although the theoretical delay difference between Path 1 and Path 3 is 0.13 ms (6 samples), the time resolution of the interpolated ZC sequence is not enough to separate Path 1 and Path 3 on Mic 1. Thus, the first peak in the IR estimation of Mic 1 represents the combination of Path 1 and Path 3. Due to the longer distance from the speaker to Mic 2, the theoretical delay difference between Path 2 and Path 4 is 0.34 ms (17 samples). As a result, Mic 2 has two peaks with similar amplitudes, which correspond to the structure path (the first peak) and the LOS air path (the second peak), respectively. By locating the peaks of the IR estimations of the two microphones, we are able to separate different propagation paths.

We use the IR estimations of both microphones to identify different propagation paths. On commercial mobile devices, the starting point of the auto-correlation function is random due to the randomness in the hardware/system delay of sound playback and recording. The peaks corresponding to the structure propagation may appear at random positions every time the system restarts. Therefore, we need to first locate the structure paths in the IR estimations. Our key observation is that the two microphones are strictly synchronized, so their structure paths should appear at the same position in the IR estimations. Based on this observation, we first locate the highest peak of Mic 1, which corresponds to the combination of Path 1 and Path 3. Then, we can locate the peaks of Path 2 and Path 4 in the IR estimation of Mic 2, as the position of Path 2 should be aligned with Path 1/Path 3. Since we focus on movements around the mobile device, the reflection air path is 5 ∼ 15 samples (3.5 ∼ 10.7 cm) away from the LOS path for both microphones. In this way, we get the delays of (i) the combination of Path 1 and Path 3, (ii) Path 2, (iii) Path 4, and (iv) the range of the reflection air paths (Path 5 and Path 6), respectively. We call this process path delay calibration; it is performed once when the system starts transmitting and recording the sound signal. The path delay calibration is based on the first ten data segments (213 ms) of IR estimation. We use a 1-nearest-neighbor algorithm to confirm the path delays based on the results of the ten segments. Note that the calibration computation takes 14.95 ms for one segment (21.3 ms). Thus, we can perform calibration for each segment in real-time.
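The calibration logic above can be summarized in a short sketch. This is our simplified reading, not the authors' code: the peak-search window and the majority vote (standing in for the 1-nearest-neighbor confirmation) are assumptions.

```python
import numpy as np
from collections import Counter

def calibrate_path_delays(h_mic1: np.ndarray, h_mic2: np.ndarray) -> dict:
    """Estimate path delays from ten IR segments (arrays of shape 10 x 1024)."""
    per_segment = []
    for h1, h2 in zip(np.abs(h_mic1), np.abs(h_mic2)):
        n_struct = int(np.argmax(h1))    # Path 1 + Path 3 merged on Mic 1
        # Path 2 on Mic 2 is time-aligned with the structure peak of Mic 1;
        # Path 4 (LOS) is the strongest peak ~17 samples behind it.
        n_path2 = n_struct
        search = h2[n_path2 + 1 : n_path2 + 25]   # window size is our choice
        n_path4 = n_path2 + 1 + int(np.argmax(search))
        per_segment.append((n_struct, n_path2, n_path4))
    # Keep the most frequent delay triple across the ten segments.
    n_struct, n_path2, n_path4 = Counter(per_segment).most_common(1)[0][0]
    return {
        "mic1_struct_los": n_struct,
        "mic2_struct": n_path2,
        "mic2_los": n_path4,
        # Reflection paths are searched 5-15 samples beyond the LOS peak.
        "reflection_range": (n_path4 + 5, n_path4 + 15),
    }
```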
To save computational cost, we only calibrate the LOS path and structure-borne path delays for the first ten segments (213 ms). The path delay calibration is only performed once after system initialization, because holding styles hardly change the delays of the structure-borne path and the LOS path. For the reflection path delay, we adaptively estimate it as shown in Section 6.2, so that our system is robust to different holding styles.

5.4 Path Coefficient Measurement

After finding the delay of each propagation path, we measure the path coefficient of each path. For a path i with a delay of $n_i$ samples in the IR estimation, the path coefficient is the complex value of $\hat{h}[n_i]$ on the corresponding microphone. The path coefficient indicates how the amplitude and phase of the given path change with time. Both the amplitude and the phase of the path coefficient are important for the later movement measurement and touch detection algorithms.

One key challenge in path coefficient measurement is that cross-correlations are measured at low sampling rates. The basic cross-correlation algorithm presented in Section 5.1 produces one IR estimation per frame of 1,024 samples. This converts to a sampling rate of 48,000/1,024 = 46.875 Hz. The low sampling rate may lead to ambiguity in fast movements where the path coefficient changes quickly. Figure 6 shows the path coefficient of a finger movement with a speed of 10 cm/s. We observe that there are only 2 ∼ 3 samples in each phase cycle of 2π. As a phase difference of π can be caused either by a phase increase of π or a phase decrease of π, the direction of the phase change cannot be determined by such low-rate measurements.

We use the property of the circular cross-correlation to upsample the path coefficient measurements. For a given delay of n samples, the IR estimation at time t is given by the circular cross-correlation of the received signal and the transmitted sequence:

$$\hat{h}_t[n] = \sum_{l=0}^{N_{ZC}-1} ZC_R[t + l] \times ZC_T^*[(l - n) \bmod N_{ZC}] \qquad (4)$$

This is equivalent to taking the summation of $N_{ZC}$ points of the received signal multiplied by a conjugated ZC sequence
cyclically shifted by n points. The key observation is that the ZC sequence has constant power, i.e., $ZC[n] \times ZC^*[n] = 1, \forall n$. Thus, each point in the $N_{ZC}$ multiplication results in Eq. (4) contributes equally to the estimation of $\hat{h}_t[n]$. In consequence, the summation over a window with a size of $N_{ZC}$ can start from any value of t. Instead of advancing the value t by a full frame of 1,024 sample points as in ordinary cross-correlation operations, we can advance t by one sample each time. In this way, we can obtain the path coefficient with a sampling rate of 48 kHz, which reveals the details of changes in the path coefficient, as shown in Figure 6.

[Figure 6: Path coefficient at different sampling rates — I/Q trace with and without sampling rate increasing.]

[Figure 7: Path coefficient measurement for delay n — the received baseband signal x(t) is multiplied by the conjugated transmitted baseband signal under a fixed cyclic shift of n, then passed through a moving-average window and a low-pass filter to produce ĥ_t[n].]

The above upsampling scheme incurs a high computational cost. Obtaining all path coefficients $\hat{h}_t[n]$ for delays n = 0 ∼ 1023 requires 48,000 dot products per second, each performed on two vectors of 1,024 samples. This cannot be easily carried out by mobile devices. To reduce the computational cost, we observe that not all taps in $\hat{h}_t[n]$ are useful. We are only interested in the taps corresponding to the structure propagation paths and the reflection air paths within a distance of 15 cm. Therefore, instead of calculating the full cross-correlation, we just calculate the path coefficients at given delays using a fixed cyclic shift of n. Figure 7 shows the process of measuring the path coefficient at a given delay. First, we synchronize the transmitted and received signals by cyclically shifting the transmitted signal with a fixed offset of $n_i$, corresponding to the delay of the given path. Second, we multiply each sample of the received baseband signal with the conjugate of the shifted transmitted sample. Third, we use a moving average with a window size of 1,024 to sum the complex values and get the path coefficients. Note that the moving average can be carried out with just two additions per sample. Fourth, we use a low-pass filter to remove high-frequency noises caused by imperfections of the interpolated ZC sequence. Finally, we get the path coefficient at a 48 kHz sampling rate. After this optimization, measuring the path coefficient at a given delay only incurs one multiplication and two additions for each sample.
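A compact sketch of this pipeline follows. It uses a vectorized cumulative sum in place of the streaming two-additions update and omits the final low-pass filter; the names and the 1/N normalization are ours, and the received signal is assumed to be aligned to the start of a transmitted period (per the calibration in Section 5.3):

```python
import numpy as np

def path_coefficient(rx: np.ndarray, zc_t: np.ndarray, n_i: int) -> np.ndarray:
    """Per-sample path coefficient at delay n_i (the Figure 7 pipeline).

    rx:   received complex baseband samples.
    zc_t: 1,024-point interpolated baseband ZC sequence (one period).
    n_i:  tap delay (in samples) of the path of interest.
    """
    N = len(zc_t)  # 1024
    # Conjugated ZC sequence cyclically shifted by n_i, tiled to rx length.
    ref = np.conj(np.roll(zc_t, n_i))
    prod = rx * np.tile(ref, len(rx) // N + 1)[: len(rx)]
    # Moving average over one ZC period; a streaming version needs only
    # one multiplication and two additions per incoming sample.
    csum = np.cumsum(prod)
    coeff = (csum[N - 1 :] - np.concatenate(([0], csum[:-N]))) / N
    # (A low-pass filter would follow here to suppress interpolation noise.)
    return coeff
```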
6 MOVEMENT MEASUREMENT

6.1 Finger Movement Model

Finger movements incur both magnitude and phase changes in the path coefficients. First, the delay of the peak corresponding to the reflection path of the finger changes when the finger moves. Figure 8(a) shows the magnitude of the IR estimations when the finger first moves away from the microphone and then moves back. The movement distance is 10 cm on the surface of the mobile device. A "hot" region indicates a peak at the corresponding distance in the IR estimation. While we can observe several peaks in the raw IR estimation that change with the movement, it is hard to discern the reflection path, as it is much weaker than the LOS path or the structure path. To amplify the changes, we take the difference of the IR estimation along the time axis to remove these static paths. Figure 8(b) shows the resulting differential IR estimations. We observe that the finger moves away from the microphone during 0.7 to 1.3 seconds and moves toward the microphone from 3 to 3.5 seconds. The path length changes by about 20 cm (10 × 2) during the movement. In theory, we can track the position of the peak corresponding to the reflection path and measure the finger movement. However, the position of the peak is measured in terms of the number of samples, which gives a low resolution of around 0.7 cm per sample. Furthermore, the estimation of the peak position is susceptible to noise, which leads to large errors in distance measurements.

[Figure 8: IR estimations for finger movement — (a) magnitude of the raw IR estimations; (b) magnitude of the differential IR estimations (path length vs. time).]

We utilize phase changes in the path coefficient to measure the movement distance so that we can achieve mm-level distance accuracy. Consider the case where the reflection path of the finger is path i; its path coefficient is:

$$\hat{h}_t[n_i] = A_i e^{-j\left(\phi_i + 2\pi \frac{d_i(t)}{\lambda_c}\right)}, \qquad (5)$$

where $d_i(t)$ is the path length at time t. The phase for path i is $\phi_i(t) = \phi_i + 2\pi \frac{d_i(t)}{\lambda_c}$, which changes by $2\pi$ when $d_i(t)$ changes by the sound wavelength $\lambda_c = c/f_c$ (≈1.69 cm) [28]. Therefore, we can measure the phase change of the reflection path to obtain mm-level accuracy in the path length $d_i(t)$.
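Under Eq. (5), a simple way to see the phase-to-distance mapping is to unwrap the phase of the coefficient series. This naive unwrapping only illustrates the mapping; it ignores the quasi-static component that the curvature-based scheme of Section 6.4 is designed to handle:

```python
import numpy as np

C, FC = 343.0, 20250.0    # speed of sound (m/s) and carrier (Hz)
LAMBDA_C = C / FC         # carrier wavelength, ~1.69 cm

def path_length_change(coeff: np.ndarray) -> np.ndarray:
    """Relative path length change (meters) from a complex coefficient series.

    coeff: samples of h_t[n_i] for the finger reflection path.
    """
    # Eq. (5) uses the e^{-j(...)} convention, so a growing path length
    # decreases the measured angle; the leading minus undoes that.
    phase = np.unwrap(np.angle(coeff))
    return -(phase - phase[0]) / (2 * np.pi) * LAMBDA_C
```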
6.2 Reflection Path Delay Estimation

The first step in measuring the finger movement is to estimate the delay of the reflection path. Due to the non-negligible main lobe width of the auto-correlation function, multiple IR estimations that are close to the reflection path have similar changes when the finger moves. We need to adaptively select one of these IR estimations to represent the reflection path, so that noises introduced by the side lobes of other paths can be reduced.

Our heuristic for determining the delay of the reflection path is based on the observation that the reflection path has the largest change of magnitude compared to other paths. Consider the change of magnitude in $\hat{h}_t[n_i]$:

$$\hat{h}_t[n_i] - \hat{h}_{t-\Delta t}[n_i] = A_i \left( e^{-j\left(\phi_i + 2\pi \frac{d_i(t)}{\lambda_c}\right)} - e^{-j\left(\phi_i + 2\pi \frac{d_i(t-\Delta t)}{\lambda_c}\right)} \right).$$

Here we assume that $A_i$ does not change during the short period $\Delta t$. When the delay $n_i$ is exactly the same as that of the reflection path, the magnitude of $\hat{h}_t[n_i] - \hat{h}_{t-\Delta t}[n_i]$ is maximized. This is because the magnitude $|A_i|$ is maximized at the peak corresponding to the auto-correlation of the reflection path, and the magnitude of $e^{-j(\phi_i + 2\pi d_i(t)/\lambda_c)} - e^{-j(\phi_i + 2\pi d_i(t-\Delta t)/\lambda_c)}$ is maximized due to the largest path length change at the reflection path delay.

In our implementation, we select l path coefficients with an interval of three samples between each other as candidates for the reflection path. The distance between these candidate reflection paths and the structure path is determined by the size of the phone, e.g., 5 ∼ 15 samples for the bottom mic. We keep monitoring the candidate path coefficients and select the path with the maximum magnitude in the time-differential IR estimations as the reflection path. When the finger is static, our system still keeps track of the reflection path. In this way, we can use the changes in the selected reflection path to detect whether the finger moves or not.

6.3 Additive Noise Mitigation

Although the adaptive reflection path selection scheme gives high-SNR measurements of the path coefficients, the additive noises from other paths still interfere with the measured path coefficients. Figure 9 shows the trace of the complex path coefficient during a finger movement. In the ideal case, the path coefficient is $\hat{h}_t[n_i] = A_i e^{-j(\phi_i + 2\pi d_i(t)/\lambda_c)}$ with a constant attenuation $A_i$ in a short period. Therefore, the trace of the path coefficients should be a circle in the complex plane. However, due to additive noises, the trace in Figure 9 is not smooth enough for the later phase measurements.

[Figure 9: Path coefficients for the finger reflection path — I/Q trace with and without EKF.]

We propose to use the Extended Kalman Filter (EKF), a non-linear filter, to track the path coefficient and reduce the additive noise. The goal is to make the resulting path coefficient closer to the theoretical model so that the phase change incurred by the movement can be measured with higher accuracy. We use the sinusoid model to predict and update the signal of both I/Q components [8]. To save computational resources, we first detect whether the finger is moving or not, as described in Section 6.2. When we find that the finger is moving, we initialize the parameters of the EKF and run it. We also downsample the path coefficient to 3 kHz to make the EKF affordable for mobile devices. Figure 9 shows that the results after the EKF are much smoother than the original signal.
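The paper defers the exact sinusoid model to [8]. As one plausible formulation, the sketch below tracks the state (amplitude, phase, angular rate) with the noisy I/Q pair as the observation, assuming the quasi-static component has been removed and the coefficient has been downsampled to 3 kHz; the class name and noise covariances are our assumptions:

```python
import numpy as np

class PhasorEKF:
    """EKF tracking a rotating phasor: state x = [A, theta, omega]."""

    def __init__(self, a0, theta0, fs=3000.0, q=1e-4, r=1e-2):
        self.x = np.array([a0, theta0, 0.0])
        self.P = np.eye(3)
        # Sinusoid model: theta advances by omega/fs each step,
        # amplitude and angular rate are nearly constant.
        self.F = np.array([[1, 0, 0], [0, 1, 1.0 / fs], [0, 0, 1]])
        self.Q = q * np.eye(3)   # process noise (assumed)
        self.R = r * np.eye(2)   # measurement noise (assumed)

    def step(self, z):
        """z = (I, Q) sample of the path coefficient; returns filtered (I, Q)."""
        # Predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        a, th, _ = self.x
        hx = np.array([a * np.cos(th), a * np.sin(th)])
        # Jacobian of the observation h(x) = (A cos(theta), A sin(theta))
        H = np.array([[np.cos(th), -a * np.sin(th), 0],
                      [np.sin(th),  a * np.cos(th), 0]])
        # Update
        S = H @ self.P @ H.T + self.R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z) - hx)
        self.P = (np.eye(3) - K @ H) @ self.P
        a, th, _ = self.x
        return a * np.cos(th), a * np.sin(th)
```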
6.4 Phase Based Movement Measurement

We use a curvature-based estimation scheme to measure the phase change of the path coefficient. Our estimation scheme assumes that the path coefficient is a superposition of a circularly changing dynamic component, which is caused by the moving finger, and a quasi-static component, which is caused by nearby static objects [28, 29, 34]. The algorithm estimates the phase of the dynamic component by measuring the curvature of the trace on the complex plane. The curvature-based scheme avoids the error-prone process of estimating the quasi-static component in LEVD [28] and is robust to noise interferences.

Suppose that we use a trace in the two-dimensional plane, $y(t) = (I_{\hat{h}_t}, Q_{\hat{h}_t})$, to represent the path coefficient of the reflection. As shown in Figure 9, the instantaneous signed curvature can be estimated as:

$$k(t) = \frac{\det\left(y'(t), y''(t)\right)}{\left\|y'(t)\right\|^3}, \qquad (6)$$

where $y'(t) = dy(t)/dt$ is the first derivative of $y(t)$ with respect to the parameter t, and det takes the determinant of the given matrix. We assume that the instantaneous curvature remains constant during the time period $t-1 \sim t$, and the phase change of the dynamic component is:

$$\Delta\theta_{t-1}^{t} = 2 \arcsin\left( \frac{k(t)\,\|y(t) - y(t-1)\|}{2} \right). \qquad (7)$$

The path length change for the time period $0 \sim t$ is:

$$d_i(t) - d_i(0) = -\frac{\sum_{k=1}^{t} \Delta\theta_{k-1}^{k}}{2\pi} \times \lambda_c, \qquad (8)$$

where $d_i(t)$ is the path length from the speaker, reflected through the finger, to the microphone.
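A literal finite-difference rendering of Eqs. (6)–(8) might look as follows; the gradient-based derivatives and the clipping inside arcsin are our numerical choices, not part of the paper:

```python
import numpy as np

def movement_from_trace(I: np.ndarray, Q: np.ndarray,
                        lambda_c: float = 0.0169) -> np.ndarray:
    """Curvature-based path length change (meters), following Eqs. (6)-(8).

    I, Q: real and imaginary parts of the reflection-path coefficient trace.
    """
    y = np.stack([I, Q], axis=1).astype(float)
    y1 = np.gradient(y, axis=0)    # y'(t) by finite differences
    y2 = np.gradient(y1, axis=0)   # y''(t)
    speed = np.linalg.norm(y1, axis=1)
    # Signed curvature k(t) = det(y', y'') / |y'|^3   (Eq. 6)
    k = (y1[:, 0] * y2[:, 1] - y1[:, 1] * y2[:, 0]) / np.maximum(speed, 1e-9) ** 3
    # Chord between consecutive samples subtends the angle
    # Delta-theta = 2 * arcsin(k * |y(t) - y(t-1)| / 2)   (Eq. 7)
    chord = np.linalg.norm(np.diff(y, axis=0), axis=1)
    dtheta = 2 * np.arcsin(np.clip(k[1:] * chord / 2, -1, 1))
    # Accumulated phase maps to path length change   (Eq. 8)
    return -np.cumsum(dtheta) / (2 * np.pi) * lambda_c
```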
6.5 From Path Length to Movements

The path length change for the reflection air path can be measured on both microphones. Depending on the type of gesture and the placement of the microphones, we can use the path length change to derive the actual movement distance. For example, for the phone in Figure 2, we can use the path length change of the reflection air path on the bottom microphone to measure the finger movement distance for the scrolling gesture (up/down movement). This is because the length of the reflection path on the bottom microphone changes significantly when the finger moves up/down on the back of the phone. The actual movement distance can be calculated by multiplying the path length change with a compensating factor, as described in Section 8. For the gesture of swiping left/right, we can use the path length changes of the two microphones to determine the swiping direction, as swiping left and right introduce the same path-length-change pattern on the bottom microphone but different path-length-change directions on the top microphone.

7 TOUCH MEASUREMENT

7.1 Touch Signal Pattern

Touching the surface with fingers changes both the air-borne propagation and the structure-borne propagation of the sound. When performing the tapping action, the finger movement in the air changes the air-borne propagation of the sound. Meanwhile, when the finger contacts the surface of the phone, the force applied on the surface changes the vibration pattern of the structure of the phone, which leads to changes in the structure-borne signal [25]. In other words, the structure-borne sound is able to distinguish whether the finger is hovering above the surface with a mm-level gap or is pressing on the surface. In VSkin, we mainly use the changes in the structure-borne signal to sense finger touching, as they provide distinctive information about whether the finger touches the surface or not. However, when force is applied at different locations on the surface, the changes of the structure-borne sound caused by touching differ in magnitude and phase. Existing schemes only use the magnitude of the structure-borne sound [25], which has different change rates at different touch positions. They rely on the touchscreen to determine the position and the accurate time of the touch in order to measure the force level of the touch [25]. However, neither the location nor the time of the touch is available to VSkin. Therefore, the key challenge in touch sensing for VSkin is to perform joint touch detection and touch localization.

Touching events lead to unique patterns in the differential IR estimation.

[Figure 10: Touching at different locations — (a) magnitude of the differential IR estimations when touching and releasing at 7 cm away from the speaker; (b) magnitude of the differential IR estimations when touching and releasing at 1 cm away from the speaker.]
As an example, Figure 10 shows the differential IR estimations close to the structure-borne path of the top microphone in Figure 2, when the user touches the back of the phone. The y-axis is the number of samples relative to the structure-borne path, where the structure-borne path (Path 2 in Section 5.3) is at y = 0. When force is applied on the surface, the width of the peak corresponding to the structure-borne path increases. This leads to a small deviation of the peak position in the path coefficient changes from the original peak of the structure-borne propagation. Figure 10(a) shows the resulting differential IR estimations when the user's finger touches/leaves the surface of the mobile device at a position 7 cm away from the rear speaker. We observe that the "hottest" region is not at the original peak of the structure-borne propagation. This is because the force applied on the surface changes the path of the structure-borne signal. To further explore the change of the structure-borne propagation, we ask the user to perform finger tapping at eleven different positions on the back of the device and measure the position of the peaks in the path coefficient changes. Figure 11 shows the relationship between the touching position and the resulting peak position in the coefficient changes, where the peak position is measured by the number of samples relative to the original structure-borne path. We observe that the larger the distance between the touching position and the speaker, the larger the delay of the coefficient changes relative to the original structure-borne path (darker color means a larger delay). Thus, we utilize the magnitude and delay of the differential IR estimations to detect and localize touch events. Note