for deep finger tappings. Suppose that the initial finger length is $l(0)$ at time 0 and the shortest finger length during the tapping is $l(t)$ at time $t$. Then, the bending angle of the finger is given by:

$$\theta(t) = \arccos\frac{l(t)}{l(0)}. \qquad (6)$$

Gentle finger tapping: When the bending angle of the finger is small, the duration of the finger tap is short and the camera cannot capture enough video frames during the tapping. Furthermore, due to the small bending angle, the change in finger length can hardly be detected in the video frames: the finger length changes by less than 10 pixels at 352×288 resolution. Therefore, we use the ultrasound phase change to estimate the bending angle for gentle finger tapping. The propagation path change during the finger tap from time 0 to $t$ can be measured from the phase change as:

$$\Delta d = d(t) - d(0) = -\frac{\phi_d(t) - \phi_d(0)}{2\pi}\,\lambda, \qquad (7)$$

where $\lambda$ is the wavelength of the ultrasound, and $\phi_d(0)$ and $\phi_d(t)$ are the initial and final phases of the ultrasound, respectively. However, the propagation path change is not the same as the depth of the finger tap. As shown in Figure 10(b), the propagation path change during the finger tap is

$$\Delta d = \left(\left\|\overrightarrow{SF_0}\right\| + \left\|\overrightarrow{F_0 M}\right\|\right) - \left(\left\|\overrightarrow{SF_t}\right\| + \left\|\overrightarrow{F_t M}\right\|\right), \qquad (8)$$

where $M(l_1, 0, 0)$ is the location of the microphone, $S(l_2, 0, 0)$ is the location of the speaker, $F_0(x(0), y(0), z(0))$ is the location of the fingertip at time 0, and $F_t(x(t), y(t), z(t))$ is the location of the fingertip at time $t$. Let $d$ denote the Euclidean distance between $F_0$ and $F_t$. According to the triangle inequality, $d > \Delta d/2$. Meanwhile, different finger locations correspond to different $F_0(x(0), y(0), z(0))$, which results in different values of $\Delta d$ for the same tapping depth $d$. When users tap gently, we can assume that $x(0) \approx x(t)$ and $z(0) \approx z(t)$ during the finger tap. As a result, the final position is $F_t(x(0), y(0) - d, z(0))$, and the relationship between $d$ and $\Delta d$ is:

$$\Delta d = \left\|(x(0) - l_2,\ y(0),\ z(0))\right\| + \left\|(l_1 - x(0),\ -y(0),\ -z(0))\right\| - \left\|(x(0) - l_2,\ y(0) - d,\ z(0))\right\| - \left\|(l_1 - x(0),\ d - y(0),\ -z(0))\right\|, \quad \Delta d > d > \Delta d/2. \qquad (9)$$

To evaluate Eq. (9), we obtain $x(0)$ and $z(0)$ from the locations of the fingertips in Section 4 and set the parameter $y(0)$ adaptively based on the finger size in the frame. As a result, we can recover $d$ from $\Delta d$ through Eq. (9), which compensates for the different locations of $F_0$. Consequently, the bending angle is given by:

$$\theta(t) = 2\arccos\frac{d/2}{l(0)}. \qquad (10)$$
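To make this depth recovery concrete, the sketch below converts the measured phase change into the path-length change $\Delta d$ following Eq. (7) and then inverts Eq. (9) numerically to obtain the tapping depth $d$. The paper does not state which solver is used; a bisection search over the bracket $\Delta d/2 < d < \Delta d$ given in Eq. (9) is shown here as one straightforward option, and all function and variable names are illustrative rather than taken from the actual implementation.

```cpp
// Minimal sketch (not the authors' code): recover the tapping depth d from the
// ultrasound path-length change, following Eqs. (7)-(9). Names are illustrative.
#include <cmath>

struct Vec3 { double x, y, z; };

static double norm3(const Vec3 &v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Eq. (7): path-length change from the measured phase change (radians).
double pathChangeFromPhase(double phi0, double phiT, double wavelength) {
    const double kPi = 3.14159265358979323846;
    return -(phiT - phi0) / (2.0 * kPi) * wavelength;
}

// Right-hand side of Eq. (9): the path-length change produced by a vertical tap
// of depth d, with fingertip F0 = (x0, y0, z0), speaker S = (l2, 0, 0) and
// microphone M = (l1, 0, 0).
double modelPathChange(double d, double x0, double y0, double z0,
                       double l1, double l2) {
    Vec3 sf0 = {x0 - l2, y0, z0};        // S  -> F0
    Vec3 f0m = {l1 - x0, -y0, -z0};      // F0 -> M
    Vec3 sft = {x0 - l2, y0 - d, z0};    // S  -> Ft, with Ft = (x0, y0 - d, z0)
    Vec3 ftm = {l1 - x0, d - y0, -z0};   // Ft -> M
    return norm3(sf0) + norm3(f0m) - norm3(sft) - norm3(ftm);
}

// Invert Eq. (9) by bisection. The modelled path change increases with d while
// the fingertip stays above the speaker/microphone plane (0 <= d <= y0), and the
// constraint in Eq. (9) bounds the solution by deltaD/2 < d < deltaD.
double solveTapDepth(double deltaD, double x0, double y0, double z0,
                     double l1, double l2) {
    double lo = deltaD / 2.0, hi = deltaD;
    for (int i = 0; i < 60; ++i) {
        double mid = 0.5 * (lo + hi);
        if (modelPathChange(mid, x0, y0, z0, l1, l2) < deltaD)
            lo = mid;   // modelled change too small: the tap must be deeper
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

With the speaker and microphone offsets $l_1$, $l_2$ from the device geometry and $F_0$ from the fingertip localization, the depth is obtained as solveTapDepth(pathChangeFromPhase(phi0, phiT, lambda), x0, y0, z0, l1, l2), and the bending angle then follows from Eq. (10).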
7 KEYSTROKE LOCALIZATION

The final step is to map the finger tapping to the virtual key that is pressed by the user. When the user uses only a single finger to perform tapping gestures, we can determine the identity of the virtual key at very low cost: the identity is determined by calculating the location of the moving fingertip during the "locating state".

Locating the keystroke when the user uses multiple fingers to perform tapping gestures is quite challenging, because more than one finger may move at the same time even if the user only intends to use a single finger to press the key. For example, for most people, when they press with their little finger, the ring finger moves along with it. Therefore, both the video and the ultrasound detect multiple fingers moving at the same time. To determine the exact finger that is used for pressing, we use the depth information measured in Section 6.3. The finger used for pressing the key always has a larger bending angle than the other moving fingers. Therefore, we calculate the bending angle for all moving fingers in the view, using the geometric model shown in Figure 10(a), and take the finger with the largest bending angle as the pressing finger. Once we have identified the pressing finger, we use the same method as in the single-finger case to locate the keystroke.

8 EXPERIMENTAL RESULTS

8.1 Implementation and Evaluation Setup

We implemented our system on both the Android and MacOS platforms. On the Android platform, our implementation works as an app that allows the user to tap in the air in real time on recent Android devices, e.g., a Samsung Galaxy S5 running Android 5.0. For video capturing, the maximum frame rate is limited to 30 fps by the hardware. To save computational resources, we set the video resolution to 352 × 288; the average frame rate under this setting is 25 fps. We emit a continuous-wave signal $A\cos(2\pi f t)$, where $A$ is the amplitude and $f$ is the sound frequency, which lies in the range of 17 to 22 kHz. For audio capturing, we chose a data segment size of 512 samples, which corresponds to a duration of 10.7 ms at a sampling rate of 48 kHz (a minimal sketch of this audio configuration is given at the end of this subsection). We implemented most of the signal processing and computer vision algorithms as C/C++ functions using the Android NDK for better efficiency, and used the open-source OpenCV C++ interfaces for the computer vision processing on the Android platform. On the MacOS platform, we implemented the system in C++ using the camera of a MacBook and streamed the audio signal through a smartphone. On the laptop, both the video and the virtual keyboard are displayed on the screen in real time, and the user operates in front of the laptop screen. To obtain the ground truth in our user study, we also captured the user's movement with a 120 fps camera; the high-speed video was processed offline and manually annotated to serve as the ground truth. Due to the speaker placement (near the ear instead of facing forward) and SDK limitations on commercial AR devices, we were unable to implement our system on existing devices such as the HoloLens [22]. Instead, we use a cardboard VR setup, as shown in Figure 1(b), in our case study.

We conducted experiments on a Samsung Galaxy S5 smartphone, using its rear speaker, two microphones, and the rear camera, in both office and home environments, as shown in Figure 11. The experiments were conducted by eight users, all graduate students aged 22 to 26 years. Five of the eight users had prior experience with VR/AR devices. The users interacted with the phone using their bare hands behind the rear camera, without wearing any accessory. The performance evaluation lasted 90 minutes, divided into 18 five-minute sessions, with a five-minute break between consecutive sessions. Unless otherwise specified, the smartphone was fixed on a selfie stick during the experiments.
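As a rough illustration of the audio front end described in this subsection (not the actual app code), the sketch below generates the continuous-wave tone $A\cos(2\pi f t)$ at a 48 kHz sampling rate and consumes the captured microphone stream in 512-sample segments of about 10.7 ms each. The 18 kHz carrier is an arbitrary choice within the stated 17–22 kHz range, and the function names and buffer handling are assumptions.

```cpp
// Illustrative sketch of the audio configuration in Section 8.1: a continuous
// 17-22 kHz tone played at 48 kHz and a capture stream processed in 512-sample
// segments (~10.7 ms each). Names and structure are assumptions, not app code.
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int    kSampleRate  = 48000;   // 48 kHz sampling rate
constexpr int    kSegmentSize = 512;     // 512 samples ~ 10.7 ms per segment
constexpr double kToneFreq    = 18000.0; // assumed carrier within 17-22 kHz
constexpr double kAmplitude   = 0.5;     // "A" in A*cos(2*pi*f*t), full scale = 1.0

// Generate `seconds` seconds of the continuous-wave signal A*cos(2*pi*f*t)
// as 16-bit PCM samples.
std::vector<int16_t> makeTone(double freq, double amplitude, int seconds) {
    const double kTwoPi = 2.0 * 3.14159265358979323846;
    std::vector<int16_t> pcm(static_cast<size_t>(kSampleRate) * seconds);
    for (size_t n = 0; n < pcm.size(); ++n) {
        double t = static_cast<double>(n) / kSampleRate;
        pcm[n] = static_cast<int16_t>(amplitude * 32767.0 *
                                      std::cos(kTwoPi * freq * t));
    }
    return pcm;
}

// Walk the captured stream segment by segment; each 512-sample segment would be
// handed to the phase-measurement step that produces phi_d in Eq. (7).
void processCapture(const std::vector<int16_t> &capture) {
    for (size_t off = 0; off + kSegmentSize <= capture.size();
         off += kSegmentSize) {
        const int16_t *segment = &capture[off];
        (void)segment;  // phase extraction over these kSegmentSize samples
    }
}
```

For example, makeTone(kToneFreq, kAmplitude, 1) yields one second of the transmitted tone, while processCapture consumes the recorded samples at the same 48 kHz rate.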