Figure 2: Comparison between video and audio streams. (The original plot shows the normalized I/Q components of the received ultrasound over time, in milliseconds.)

… users and degrades user experience [23]; however, it is challenging to provide visual feedback within 100ms, because vision-based schemes require a series of high-latency operations, such as capturing the video signal, recognizing gestures using computer vision algorithms, and rendering the virtual object on the display. While high-end cameras on smartphones can now provide high-speed video capture at more than 120 fps, the high computational cost still limits real-time processing to a low frame rate, e.g., 15 fps [43]. This explains why commercial AR systems such as Leap Motion [25] rely on the computational power of a desktop and cannot easily be implemented on low-end mobile devices.

In this paper, we propose a fine-grained depth-aware tapping scheme for AR/VR systems that allows users to tap in-the-air, as shown in Figure 1. Our basic idea is to use light-weight ultrasound-based sensing, along with one Commercial Off-The-Shelf (COTS) mono-camera, to enable 3D tracking of users' fingers. To track fingers in the 2D space, the mono-camera alone is enough when combined with light-weight computer vision algorithms. To capture depth information in the 3D space, however, the mono-camera is no longer sufficient. Prior vision-based schemes require extra cameras and complex computer vision algorithms to obtain the depth information [17, 41]. In this paper, we propose to use light-weight ultrasound-based sensing to get the depth information. Using the speakers and microphones that already exist on most AR/VR devices, we emit an inaudible sound wave from the speaker and capture the signal reflected by the finger with the microphone. We first use the ultrasound information to detect that some finger is performing a tapping-down motion, and then use the vision information to distinguish which finger performs it. By measuring the phase changes of the ultrasound signals, we accurately measure fine-grained finger movements in the depth direction and estimate the bending angles of finger tappings. With fast and light-weight ultrasound signal processing algorithms, we can track finger movements within the gap between two video frames. Therefore, both detecting the finger tapping motion and updating the virtual objects on the virtual display can be achieved within one video-frame latency. This fast feedback is crucial for tapping-in-the-air, as the system can immediately highlight the object being pressed on the user's display right after detecting a tapping motion.
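To make the phase-based depth sensing concrete, the following is a minimal sketch of how an inaudible tone reflected by the finger can be converted into depth movement via I/Q demodulation. The carrier frequency, sampling rate, and filter are illustrative assumptions, not the parameters of the actual implementation.

```python
import numpy as np

FS = 48_000       # assumed microphone sampling rate (Hz)
F_TONE = 20_000   # assumed inaudible carrier frequency (Hz)
C = 343.0         # speed of sound (m/s)

def iq_demodulate(samples, fs=FS, f_tone=F_TONE, cutoff=200):
    """Mix the received signal down to baseband and low-pass it to obtain I/Q."""
    t = np.arange(len(samples)) / fs
    i = samples * np.cos(2 * np.pi * f_tone * t)
    q = -samples * np.sin(2 * np.pi * f_tone * t)
    win = int(fs / cutoff)                      # crude moving-average low-pass
    kernel = np.ones(win) / win
    return np.convolve(i, kernel, 'same'), np.convolve(q, kernel, 'same')

def finger_movement_mm(i, q, f_tone=F_TONE, c=C):
    """Unwrapped phase change -> movement of the finger in the depth direction.
    A 2*pi phase change corresponds to one wavelength of change in the
    speaker-finger-microphone path; the finger displacement is roughly half
    of the path change because the sound travels to the finger and back."""
    phase = np.unwrap(np.arctan2(q, i))
    wavelength = c / f_tone                     # ~17 mm at 20 kHz
    path_change = (phase - phase[0]) / (2 * np.pi) * wavelength
    return 1000 * path_change / 2               # millimeters
```

Because the phase is sampled at the audio rate, such a depth estimate can be updated many times within the roughly 33ms gap between consecutive 30-fps video frames.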
There are three challenges in implementing a fine-grained depth-aware tapping scheme. The first challenge is to achieve high recognition accuracy and fine-grained depth measurements for finger tappings. Using either the video or the ultrasound alone is not enough to achieve the desired detection accuracy. For the camera-based approach, the detection accuracy is limited by the low frame rate, so the tapping gesture is captured in only a few video frames. For the ultrasound-based approach, the detection accuracy is limited by the interference of finger movements, because it is difficult to tell whether an ultrasound phase change is caused by finger tapping or by lateral finger movements. To address this challenge, we combine the ultrasound and the camera data to achieve higher tapping detection accuracy. We first detect the finger movements using the ultrasound signal. We then look back at the results of previously captured video frames to determine which finger is moving and the movement direction of that finger. Our joint finger tapping detection algorithm improves the detection accuracy for gentle finger tappings from 58.2% (camera-only) to 97.6%.
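The following sketch illustrates the "look back" step described above: once the ultrasound stream flags a candidate tapping event, a small buffer of recently processed video frames is scanned to decide which fingertip moved and in which direction. The data structures, window length, and threshold are illustrative assumptions rather than the exact algorithm.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Frame:
    t: float       # capture timestamp in seconds
    tips: dict     # finger id -> (x, y) fingertip position in pixels

frame_buffer: deque = deque(maxlen=8)  # roughly the last 260 ms at 30 fps

def attribute_tap(t_event: float, window: float = 0.15, min_move: float = 5.0):
    """Given the time of an ultrasound-detected tapping candidate, return the
    finger that moved the most in the preceding `window` seconds, together
    with its displacement vector, or None if no finger moved enough."""
    recent = [f for f in frame_buffer if t_event - window <= f.t <= t_event]
    if len(recent) < 2:
        return None
    first, last = recent[0], recent[-1]
    best, best_dist = None, min_move
    for fid, (x0, y0) in first.tips.items():
        if fid not in last.tips:
            continue
        x1, y1 = last.tips[fid]
        dist = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        if dist > best_dist:
            best, best_dist = (fid, (x1 - x0, y1 - y0)), dist
    return best
```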
The second challenge is to achieve low-latency finger tapping detection. In our experiments, the average duration of finger tapping gestures is 354ms, where the tapping down (from the initial movement to "touching" the virtual key) lasts 152ms and the tapping up (moving back from the virtual key to the normal position) lasts 202ms. Therefore, a 30-fps camera captures fewer than 4 frames of the tapping-down gesture in the worst case. However, the feedback should be provided to the user as soon as the finger "touches" the virtual key; otherwise, the user tends to move an extra distance on each tap, which slows down the tapping process and worsens the user experience. To provide fast feedback, a system should detect finger movements during the tapping-down stage. Accurately recognizing such detailed movements in just four video frames is challenging, while waiting for more video frames leads to higher feedback latency. To address this challenge, we use the ultrasound to capture the detailed movement information, as shown in Figure 2. We design a state machine to capture the different movement states of the user's fingers. As soon as the state machine enters the "tapping state", we analyze both the ultrasound signal and the captured video frames to provide a robust and prompt decision on the tapping event. Thus, our scheme can give feedback at the precise moment of "touching", rather than waiting for more frames to see that the finger has started moving back.
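The sketch below shows one plausible shape for such a movement state machine, driven by the per-update depth velocity derived from the ultrasound phase. The states, thresholds, and reset conditions are assumptions chosen for illustration; the actual state machine used by the system may differ.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    TAPPING_DOWN = auto()   # finger moving toward the virtual key
    TOUCHED = auto()        # accumulated depth travel exceeded the tap threshold
    MOVING_BACK = auto()    # finger returning to its rest position

class TapStateMachine:
    def __init__(self, down_speed=0.5, tap_depth=8.0, back_speed=-0.3):
        self.state = State.IDLE
        self.depth = 0.0                 # accumulated travel toward the device (mm)
        self.down_speed = down_speed     # velocity (mm/update) that starts a tap
        self.tap_depth = tap_depth       # travel (mm) treated as a "touch"
        self.back_speed = back_speed     # negative velocity = moving back

    def update(self, velocity_mm: float) -> bool:
        """Feed one ultrasound velocity sample; return True when a tap fires."""
        if self.state == State.IDLE:
            if velocity_mm > self.down_speed:
                self.state, self.depth = State.TAPPING_DOWN, velocity_mm
        elif self.state == State.TAPPING_DOWN:
            self.depth += max(velocity_mm, 0.0)
            if velocity_mm < self.back_speed:        # movement reversed: not a tap
                self.state = State.MOVING_BACK
            elif self.depth > self.tap_depth:
                self.state = State.TOUCHED
                return True                          # give feedback right at "touch"
        elif self.state == State.TOUCHED:
            if velocity_mm < self.back_speed:
                self.state = State.MOVING_BACK
        elif self.state == State.MOVING_BACK:
            if abs(velocity_mm) < 0.1:               # finger has settled
                self.state, self.depth = State.IDLE, 0.0
        return False
```

Firing the tap at the transition into the "touched" state, rather than after the finger has visibly moved back, is what allows feedback to be produced during the 152ms tapping-down stage instead of after the full 354ms gesture.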
The third challenge is to achieve affordable hardware and computational cost on mobile devices. Traditional depth-camera-based approaches need a dual camera or extra time-of-flight depth sensors [2, 10]. Furthermore, the computer vision algorithms for 3D fingertip localization incur high computational costs. It is challenging to achieve 30-fps 3D finger localization, especially on mobile devices such as Head-Mounted Displays (HMDs) or mobile phones. To address this challenge, we use the speakers/microphones as the depth sensor and combine them with the 2D position information obtained from an ordinary mono-camera using light-weight computer vision algorithms (see the sketch at the end of this section). Thus, the 3D finger location can be measured using existing sensors on mobile devices at affordable computational cost.

We implemented and evaluated our scheme using commercial smartphones without any hardware modification. Compared to the video-only scheme, our scheme improves the detection accuracy for gentle finger tappings from 58.2% to 97.6% and reduces the detection latency by 57.7ms. Our scheme achieves 98.4% detection accuracy with an FPR of 1.6% and an FNR of 1.4%. Furthermore, the fine-grained bending angle measurements provided by our scheme enable new dimensions for 3D interaction, as shown by our case study. However, compared to a video-only solution, our system incurs a significant power consumption overhead of 48.4% on a Samsung Galaxy S5.
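As an illustration of the combination described in the third challenge, the sketch below shows one simple way the 2D fingertip position from the mono-camera could be fused with an ultrasound-derived depth value to produce a 3D location, assuming a pinhole camera model and an absolute depth estimate. The intrinsic parameters are placeholders, not calibration values from the prototype.

```python
import numpy as np

def pixel_to_3d(u: float, v: float, depth_m: float,
                fx: float = 1400.0, fy: float = 1400.0,
                cx: float = 960.0, cy: float = 540.0) -> np.ndarray:
    """Back-project pixel (u, v) at the given depth (meters) into camera
    coordinates (x, y, z), using assumed pinhole intrinsics."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example: a fingertip detected at pixel (1100, 620), about 0.35 m away.
print(pixel_to_3d(1100, 620, 0.35))
```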