[Figure 2: Sound propagation paths on a smartphone. The rear speaker reaches the top mic (Mic 2) and the bottom mic (Mic 1) over multiple paths (Paths 1-6) on the back surface of the phone: the structure path, the LOS air path, and reflection air paths.]

as the accelerometer, to sense coarse-grained gestures. However, sensor readings only provide limited information about the gesture, and they cannot quantify the movement speed and distance. Furthermore, accelerometers are sensitive to vibrations caused by hand movements while the user is holding the device.
Compared to camera-based and sensor-based schemes, VSkin incurs no additional hardware cost and can perform fine-grained gesture measurements.

Tapping and Force Sensing: Tapping and force applied to the surface can be sensed by different types of sensors [4, 7, 9, 10, 12, 13, 15, 19, 25]. TapSense [7] leverages the tapping sound to recognize whether the user touches the screen with a fingertip or a fist. ForceTap [9] measures the tapping force using the built-in accelerometer. VibWrite [13] and VibSense [12] use vibration signals instead of sound signals to sense the tapping position so that interference in air-borne propagation can be avoided. However, they require pre-trained vibration profiles for tap localization. ForcePhone [25] uses linear chirp sounds to sense force and touch based on changes in the magnitude of the structure-borne signal. However, fine-grained phase information cannot be measured through chirps, and chirps only capture the magnitude of the structure-borne signal at a low sampling rate. In comparison, our system measures both the phase and the magnitude of multiple sound paths at a high sampling rate of 3 kHz, so we can perform robust tap sensing without intensive training.

Sound-based Gesture Sensing: Several sound-based gesture recognition systems have been proposed to recognize in-air gestures [1, 3, 6, 16, 17, 21, 23, 33, 37]. Soundwave [6], Multiwave [17], and AudioGest [21] use the Doppler effect to recognize predefined gestures. However, the Doppler effect only gives coarse-grained movement speeds, so these schemes recognize only a small set of gestures with distinctive speed characteristics. Recently, three state-of-the-art schemes (i.e., FingerIO [14], LLAP [28], and Strata [34]) use ultrasound to track fine-grained finger gestures. FingerIO [14] transmits OFDM-modulated sound frames and locates the moving finger based on changes in the echo profiles of two consecutive frames.
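A back-of-envelope calculation illustrates why Doppler-based schemes only recover coarse-grained speeds. The sketch below uses the standard two-way Doppler relation for a reflecting target; the 20 kHz carrier and finger speed are assumed values for illustration, not taken from the text.

```python
# Doppler shift for a reflecting target: f_d = 2 * v * f0 / c.
c = 343.0        # m/s, speed of sound in air
f0 = 20000.0     # Hz, assumed inaudible carrier frequency
v = 0.10         # m/s, assumed slow finger movement

f_d = 2 * v * f0 / c
print(f"Doppler shift: {f_d:.1f} Hz")  # ~11.7 Hz
# Resolving a shift this small requires an FFT window on the order of
# 1/f_d (tens of milliseconds), so both the speed estimate and its time
# resolution are coarse.
```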
LLAP [28] uses a Continuous Wave (CW) signal to track the moving target based on phase information, which is susceptible to dynamic multipath caused by other moving objects. Strata [34] combines the frame-based and the phase-based approaches. Using the 26-bit GSM training sequence, which has nice autocorrelation properties, Strata can track phase changes at different path time delays so that objects that are more than 8.5 cm apart can be resolved. However, these schemes mainly focus on tracking in-air gestures performed more than 20 cm away from the mobile device [14, 23, 28, 34]. In comparison, our system uses both the structure-borne and the air-borne sound signals to sense gestures performed on the surface of the mobile device, which are very close (e.g., less than 12 cm) to both the speakers and the microphones. As sound reflections at such a short distance are often submerged by the Line-of-Sight (LOS) signals, sensing gestures with SNR ≈ 2 dB at 5 cm is considerably harder than sensing in-air gestures with SNR ≈ 12 dB at 30 cm.

3 SYSTEM OVERVIEW

VSkin uses both the structure-borne and the air-borne sound signals to capture gestures performed on the surface of the mobile device. We transmit and record inaudible sounds using the built-in speakers and microphones on commodity mobile devices. In the example illustrated in Figure 2, sound signals transmitted by the rear speaker travel through multiple paths on the back of the phone to the top and bottom microphones. On both microphones, the structure-borne sound that travels through the body structure of the smartphone arrives first, because sound propagates much faster in solids (>2,000 m/s) than in air (around 343 m/s) [24]. Multiple copies of the air-borne sound may arrive within a short interval following the structure-borne sound. The air-borne sounds include the LOS sound and the reflections from surrounding objects, e.g., the finger or the table.
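The arrival-time gap between the structure-borne and air-borne copies can be estimated from the propagation speeds above. This is a rough sketch: the 12 cm path length is the upper bound quoted in the text, and the 48 kHz sampling rate is an assumed (but common) mobile audio rate.

```python
# Arrival-time gap between structure-borne and air-borne sound over the
# same speaker-to-mic distance.
v_solid = 2000.0   # m/s, lower bound for sound speed in the phone body
v_air = 343.0      # m/s, sound speed in air
d = 0.12           # m, speaker-to-mic distance (upper bound from the text)
fs = 48000         # Hz, assumed audio sampling rate

t_structure = d / v_solid
t_air = d / v_air
gap = t_air - t_structure
print(f"structure-borne: {t_structure * 1e6:.0f} us, "
      f"air-borne: {t_air * 1e6:.0f} us")
print(f"gap: {gap * 1e6:.0f} us = {gap * fs:.1f} samples at {fs} Hz")
```

The gap is only about 14 samples at 48 kHz, which is why the transmitted signal needs a very narrow correlation peak to separate the paths.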
All these sound signals are mixed at the recording microphones.

VSkin performs gesture sensing based on the mixture of sound signals recorded by the microphones. The design of VSkin consists of the following four components:

Transmission signal design: We choose the Zadoff-Chu (ZC) sequence modulated by a sinusoidal carrier as our transmitted sound signal. This transmission signal design meets three key goals. First, the auto-correlation of the ZC sequence has a narrow peak width of 6 samples, so we can separate sound paths that arrive with a small time difference by locating the peaks corresponding to their different delays (see Figure 3). Second, we use interpolation schemes to reduce the bandwidth of the ZC sequence to less than 6 kHz so that it can be fitted into the narrow inaudible range of
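The sharp auto-correlation property of the ZC sequence can be checked with a short sketch. This shows the ideal baseband behavior before carrier modulation and interpolation; the sequence length and root index are illustrative choices, not the system's actual parameters.

```python
import numpy as np

def zadoff_chu(u, N):
    """Length-N root-u Zadoff-Chu sequence (N odd, gcd(u, N) = 1)."""
    n = np.arange(N)
    return np.exp(-1j * np.pi * u * n * (n + 1) / N)

N, u = 127, 1
zc = zadoff_chu(u, N)

# Cyclic autocorrelation via FFT: for a ZC sequence this is (ideally)
# a single impulse at zero lag, i.e., a one-sample-wide peak.
F = np.fft.fft(zc)
acf = np.abs(np.fft.ifft(F * np.conj(F))) / N

print(f"peak at lag 0: {acf[0]:.3f}")          # ~1.0
print(f"max off-peak:  {acf[1:].max():.3e}")   # ~0 (numerical noise)
```

In practice, carrier modulation and band-limiting widen this ideal impulse into the finite-width peak described above.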