System               | Sensing method   | Sensors                  | Range       | Depth accuracy | Interaction
Kinect v1 [21, 34]   | Light Coding     | IR projector & IR camera | 0.8 ∼ 4 m   | about 4 cm     | Human pose
Kinect v2 [21, 34]   | Time of Flight   | IR projector & IR camera | 0.5 ∼ 4.5 m | about 1 cm     | Human pose
Leap Motion [25, 41] | Binocular camera | IR cameras & IR LEDs     | 2.5 ∼ 60 cm | about 0.7 mm   | Hand tracking and gesture
HoloLens [22]        | Time of Flight   | IR projector & IR camera | 10 ∼ 60 cm  | about 1 cm     | Hand gesture and gaze
RealSense [15]       | Light Coding     | IR projector & IR camera | 20 ∼ 120 cm | about 1 cm     | Hand tracking and gesture
Air+Touch [8]        | Infrared image   | IR projector & IR camera | 5 ∼ 20 cm   | about 1 cm     | Single finger gesture
Our scheme           | Phase change     | Microphone & mono-camera | 5 ∼ 60 cm   | 4.32 mm        | Hand tracking and gesture

Table 1: Existing interface schemes for augmented reality systems

2 RELATED WORK

Related work can be categorized into four classes: AR/VR gesture recognition, in-air tapping-based interaction on virtual displays, tapping-based interaction for mobile devices, and device-free gesture recognition and tracking.

AR/VR Gesture Recognition: Most existing AR/VR devices use IR projectors and IR cameras to capture depth information for gesture recognition, based on structured light [2] or time of flight [10], as shown in Table 1. Structured light has been widely used for 3D scene reconstruction [2]; its accuracy depends on the width of the projected stripes and their optical quality. A time-of-flight (ToF) camera [10] is a range imaging system that resolves distance from the round-trip time of a light signal between the camera and the subject for each point of the image. However, neither approach focuses on moving-object detection, and both often incur high computational cost. Other interaction schemes exist, including gaze-based interaction [33], voice-based interaction [4, 46], and brain-computer interfaces [32]. Nevertheless, tapping on virtual buttons remains one of the most natural ways for users to input text on AR/VR devices.

In-air Tapping-based Interaction on Virtual Displays: Existing interaction schemes for VR/AR environments are usually based on in-air tapping [14, 15, 21, 22, 25, 42]. Due to their high computational cost and low frame rates, commercial schemes are inconvenient for users [15, 21, 22, 25]. Higuchi et al. used 120 fps video cameras to capture gestures and enable a multi-finger AR typing interface [14]. However, because of the high computational cost, the video frames are processed on a PC instead of the mobile device. Compared with such systems, our scheme uses a light-weight approach that achieves a high tapping speed and low latency on widely available mobile devices.
Tapping Based Interaction for Mobile Devices: Recently, various novel tapping-based approaches for mobile devices have been proposed, such as camera-based schemes [26, 43], acoustic-signal-based schemes [18, 37], and Wi-Fi-based schemes [3, 6]. These approaches focus on exploring alternatives for tapping on physical materials in the 2D space [3, 6, 18, 26, 37, 43]. In comparison, our approach is an in-air tapping scheme that addresses the 3D localization problem, which is more challenging and provides more flexibility for AR/VR.

Device-free Gesture Recognition and Tracking: Device-free gesture recognition is widely used for human-computer interaction and mainly includes vision-based [8, 21, 22, 25, 35, 45], RF-based [1, 11, 16, 20, 31, 36, 39, 40], and sound-based [7, 13, 27, 38, 44] approaches. Vision-based systems have been widely used in AR/VR systems that have enough computational resources [8, 21, 22, 25, 35]. However, they incur high computational cost and have limited frame rates, so they cannot be easily ported to mobile devices. RF-based systems use the radio waves reflected by hands to recognize predefined gestures [1, 7, 13, 16, 20]. However, they cannot provide high-accuracy tracking, which is crucial for in-air tapping. In comparison, our scheme provides fine-grained localization for fingertips and can measure the bending angle of the moving finger. Sound-based systems, such as LLAP [38] and Strata [44], use phase changes to track hands and achieve cm-level accuracy for 1D and 2D tracking, respectively. FingerIO [27] proposes an OFDM-based hand tracking system that achieves a hand location accuracy of 8 mm and allows 2D drawing in the air using COTS mobile devices. However, these schemes treat the hand as a single object and only provide tracking in the 2D space. The key advantage of our scheme is that it achieves fine-grained multi-finger tracking in the 3D space by fusing information from both ultrasound and vision.

3 SYSTEM OVERVIEW

Our system is a tapping-in-the-air scheme for virtual displays. It uses a mono-camera, a speaker, and two microphones to sense in-air tapping. The camera captures video of the user's fingers at 30 fps, without depth information. The speaker emits human-inaudible ultrasound at a frequency in the range of 18 ∼ 22 kHz. The microphones capture the ultrasound signals reflected by the user's fingers to detect finger movements. The system architecture consists of four components, as shown in Figure 3.

Fingertip Localization (Section 4): Our system uses a light-weight fingertip localization algorithm in video processing. We first use skin color to separate the hand from the background and detect the contour of the hand, which is a commonly used technique for hand recognition [30]. Then, we use a light-weight algorithm to locate all the fingertips captured in the video frame (a sketch of this step is given below).

Ultrasound Signal Phase Extraction (Section 5): First, we down-convert the ultrasound signal. Second, we extract the phase of the reflected ultrasound signal. The ultrasound phase change corresponds to the movement distance of the fingers in the depth direction.
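The video-processing step can be illustrated with a short Python/OpenCV sketch: segment the hand by skin color, keep the largest contour, and take convex-hull points far from the palm center as fingertip candidates. The YCrCb thresholds, the convex-hull heuristic, and the name find_fingertips are illustrative assumptions for exposition, not the exact fingertip localization algorithm of Section 4.

```python
import cv2
import numpy as np

def find_fingertips(frame_bgr, max_fingers=5):
    """Return up to `max_fingers` fingertip candidates as (x, y) pixels."""
    # 1. Skin segmentation in YCrCb space (typical skin range; tune per scene).
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

    # 2. Take the largest skin-colored contour as the hand (OpenCV >= 4 API).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return []
    hand = max(contours, key=cv2.contourArea)

    # 3. Fingertip candidates: convex-hull points farthest from the palm center.
    hull = cv2.convexHull(hand).reshape(-1, 2).astype(float)
    m = cv2.moments(hand)
    if m["m00"] == 0:
        return []
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
    dists = np.hypot(hull[:, 0] - cx, hull[:, 1] - cy)
    order = np.argsort(-dists)[:max_fingers]
    return [tuple(map(int, hull[i])) for i in order]
```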
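Similarly, the down-conversion and phase extraction can be sketched as follows, assuming a single continuous-wave tone at 20 kHz sampled at 48 kHz (both values are assumptions within the 18 ∼ 22 kHz band, not the paper's exact parameters): the received signal is mixed to baseband I/Q, low-pass filtered, and the unwrapped phase is mapped to displacement, since the reflected path changes by twice the finger movement.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 48000       # microphone sampling rate in Hz (assumed)
FC = 20000       # ultrasound carrier frequency in Hz (assumed)
C_SOUND = 343.0  # speed of sound in m/s

def phase_to_displacement(mic_samples):
    """Map the phase of the received tone to finger displacement in meters."""
    n = np.arange(len(mic_samples))
    # 1. Coherent down-conversion: mix with the local carrier to get I/Q.
    i = mic_samples * np.cos(2 * np.pi * FC * n / FS)
    q = -mic_samples * np.sin(2 * np.pi * FC * n / FS)

    # 2. Low-pass filter to remove the 2*FC mixing term and out-of-band noise.
    b, a = butter(4, 200 / (FS / 2))   # ~200 Hz cutoff (assumed)
    i_bb, q_bb = filtfilt(b, a, i), filtfilt(b, a, q)
    # In practice the static direct-path component is also removed (e.g., by
    # subtracting a slowly updated mean) before taking the phase.

    # 3. Unwrapped baseband phase.
    phase = np.unwrap(np.arctan2(q_bb, i_bb))

    # 4. The reflected path changes by twice the finger displacement, so a
    #    2*pi phase change corresponds to a displacement of lambda / 2.
    wavelength = C_SOUND / FC          # about 1.7 cm at 20 kHz
    return (phase - phase[0]) * wavelength / (4 * np.pi)
```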
Tapping Detection and Tapping Depth Measurement (Section 6): We use a finite-state-machine-based algorithm to detect the start of a finger tapping action from the ultrasound phase information. Once a finger tapping action is detected, we trace back the last few video frames to confirm the tapping motion. To measure the strength of the tapping, we combine the depth acquired from the ultrasound phase change with the depth acquired from the video frames to obtain the bending angle of the finger.

Keystroke Localization (Section 7): When the user tries to press a key, both the finger that presses the key and the neighboring fingers move at the same time. Therefore, we combine the tapping depth measurement with the video to determine the finger with the largest bending angle and thereby recognize the pressed key.
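One way to make the detection logic concrete is the finite state machine sketched below, driven by the displacement stream produced by the phase extraction sketch above: a candidate tap starts when the fingertip moves toward the device faster than a threshold and is reported when the motion stops. The two states, the thresholds, and the sign convention are assumptions for illustration, not the exact state machine of Section 6; in the full system the reported tap would trigger the trace-back over the buffered video frames.

```python
IDLE, PRESSING = range(2)

class TapDetector:
    """Detect a tap from the ultrasound-derived displacement stream.

    Convention (assumed): displacement decreases as the finger moves toward
    the device. Thresholds and sample interval are illustrative.
    """
    def __init__(self, press_speed=0.05, stop_speed=0.01, dt=0.01):
        self.press_speed = press_speed  # m/s toward the device to start a tap
        self.stop_speed = stop_speed    # m/s below which the finger has stopped
        self.dt = dt                    # seconds between displacement samples
        self.state = IDLE
        self.prev = None

    def update(self, displacement):
        """Feed one displacement sample (m); return True when a tap completes."""
        if self.prev is None:
            self.prev = displacement
            return False
        speed = (displacement - self.prev) / self.dt
        self.prev = displacement

        if self.state == IDLE and speed < -self.press_speed:
            self.state = PRESSING       # finger starts moving toward the device
        elif self.state == PRESSING and abs(speed) < self.stop_speed:
            self.state = IDLE           # finger stops: report the tap
            return True                 # caller then traces back video frames
        return False
```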
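Keystroke localization then reduces to selecting the finger with the largest bending angle and mapping its fingertip position to a key. The helper below, including its argument names and the key_at mapping, is a hypothetical illustration of that selection, not the paper's implementation.

```python
def locate_keystroke(bending_angles, fingertip_positions, key_at):
    """bending_angles: {finger_id: angle in degrees};
    fingertip_positions: {finger_id: (x, y) pixel coordinates};
    key_at: callable mapping (x, y) to the key shown at that screen position."""
    pressing_finger = max(bending_angles, key=bending_angles.get)
    x, y = fingertip_positions[pressing_finger]
    return pressing_finger, key_at(x, y)
```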