DynaKey: Dynamic Keystroke Tracking using a Head-Mounted Camera Device

Hao Zhang, Student Member, IEEE, Yafeng Yin, Member, IEEE, Lei Xie, Member, IEEE, Tao Gu, Senior Member, IEEE, Minghui You, and Sanglu Lu, Member, IEEE

Abstract—Mobile and wearable devices have become increasingly popular. However, their tiny touch screens lead to inefficient interaction, especially for text input. In this paper, we propose DynaKey, which allows people to type on a virtual keyboard printed on a piece of paper or drawn on a desk, for inputting text into a head-mounted camera device (e.g., smart glasses). Using the built-in camera and gyroscope, we capture image frames during typing and detect possible head movements, then track keys, detect fingertips, and locate keystrokes. To track the changes of keys' coordinates in images caused by natural head (i.e., camera) movements, we introduce perspective transformation to transform keys' coordinates among different frames. To detect and locate keystrokes, we utilize the variation of fingertips' coordinates across multiple frames. To reduce the time cost, we combine the gyroscope and camera to adaptively track the keys, and introduce a series of optimizations such as keypoint detection, frame skipping, and multi-thread processing. Finally, we implement DynaKey on Android-powered devices. Extensive experimental results show that our system can efficiently track and locate keystrokes in real time. Specifically, the average tracking deviation of the keyboard layout is less than 3 pixels, the intersection over union (IoU) of a key in two consecutive images is above 93%, and the average keystroke localization accuracy reaches 95.5%.

Index Terms—Dynamic keystroke tracking, Camera, Inertial sensor, Head-mounted device

I. INTRODUCTION

Recent years have witnessed an ever-growing popularity of mobile and wearable devices such as smartphones, smart watches and smart glasses. These devices usually adopt a small form factor so that they can be carried everywhere conveniently. The portable design brings mobility, but it also creates many challenges for human-computer interaction, especially for text input. Some of these devices adopt an on-screen virtual keyboard [1], [2] for text input, but others require more intelligent methods due to the tiny screen or even no screen.

Manuscript received XX XX, 2021. This work is supported by National Key R&D Program of China under Grant No. 2018AAA0102302; National Natural Science Foundation of China under Grant Nos. 61802169, 61872174, 61832008, 61906085; and JiangSu Natural Science Foundation under Grant No. BK20180325. This work is partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization. (Corresponding author: Yafeng Yin.)
Hao Zhang, Yafeng Yin, Lei Xie, Minghui You, and Sanglu Lu are with the State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing 210023, China (e-mail: yafeng@nju.edu.cn). Tao Gu is with the Department of Computing at Macquarie University, Sydney, Australia.

Fig. 1. Typing on a virtual keyboard in dynamic scenarios. Camera movements change the keys' coordinates in image frames and lead to the mismatch between fingertip and key.

Based on the observation that each finger's typing movement is associated with a unique keystroke, recognizing finger movements has been proposed as a novel text input method, which is achieved by additional wearable sensors (e.g., finger-mounted sensors [3]-[7]) and incurs an additional cost. Considering users' habits of typing on a common QWERTY keyboard layout, a projection keyboard [24], [27], generated by casting the standard keyboard layout onto a surface via a projector, has also been proposed to recognize keystrokes based on light reflection; it depends on dedicated projection equipment. Recently, with the advance of contactless sensing, recognition of keystrokes can be done via WiFi signals or acoustic signals. For example, WiFi CSI signals have been explored in [8], [19] to capture keystrokes' typing patterns, and the built-in microphone of a smartphone has been used in [16], [20] to infer keystrokes on a solid surface. However, contactless sensing is usually vulnerable to environmental noise, which limits its performance in real-world applications. Therefore, camera-based approaches [26], [29], [30] have also been proposed to recognize keystrokes on a predefined keyboard layout using image processing.

However, existing camera-based text input methods assume a fixed camera, so the coordinates of the keyboard layout keep unchanged in the fixed camera view. In reality, the camera of a head-mounted device can hardly keep still. Existing methods may not work in such dynamic moving scenes where the camera suffers from unavoidable movements. Specifically, as shown in Fig. 1, head movements cause camera jitters, which change the keyboard's coordinates in image frames and eventually cause a mismatch between fingertip and key. This limitation of existing camera-based text input methods strongly motivates the work in this paper.

In this paper, we propose a novel scheme named DynaKey using camera and gyroscope for text input on a virtual keyboard in dynamic moving scenes. DynaKey does not impose a fixed camera, hence it works in more realistic scenarios.
Fig. 1 illustrates a typical scenario where a user wears a head-mounted camera device (e.g., smart glasses), while a standard keyboard layout is printed on a piece of paper or drawn on a desk surface. DynaKey combines the embedded camera and gyroscope to track finger movements and recognize keystrokes in real time. Specifically, while the user types on the virtual keyboard, DynaKey utilizes the camera to capture image frames continuously, then detects fingertips and locates keystrokes using image processing techniques. During the typing process, when a head movement is detected by the gyroscope, DynaKey needs to track the changes of keyboard coordinates caused by camera movements. This keyboard tracking is crucial because of natural head movements in real application scenarios.

The design of DynaKey creates three key challenges that we aim to address in this paper.

The first challenge is how to track changes of the keyboard's coordinates accurately so that DynaKey is able to adapt to dynamic moving scenes. In reality, the camera moves naturally along with the head. Such movements cause dynamic changes of the camera coordinate system. The different camera views and unavoidable image distortion eventually result in changes of the keyboard's coordinates in image frames. An intuitive solution is to re-extract the keyboard layout from each image, but it is costly. In addition, we may not be able to obtain the keyboard layout from each image properly due to unavoidable occlusion by hands. This raises a fundamental question: can we build a fixed coordinate system no matter how the keyboard coordinates change? In DynaKey, we propose a Perspective Transformation-based technique that converts any previous coordinate to the current coordinate system. To obtain appropriate feature point pairs for the transformation, we propose a keypoint selection method that dynamically selects appropriate cross point pairs from the keyboard layout while tolerating occlusion of the keyboard.

The second challenge is how to detect and locate keystrokes efficiently and accurately from a single camera view. This is a non-trivial task due to the lack of depth information of fingertips from a single camera view. With a head-mounted camera and a keyboard located in front of and below the camera, the camera view from top and behind can hardly capture the perpendicular distance between the fingertip and the keyboard plane, i.e., it is difficult to determine whether a finger is typing and which finger is typing. To address this challenge, we utilize the variation of a fingertip's coordinates across multiple frames to detect a keystroke, i.e., whether a finger is typing. In addition to the fingertip movement, we further match a key's coordinates with the fingertip's coordinates to locate which finger is typing, as sketched below.
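To make this idea concrete, the following is a minimal sketch of such multi-frame keystroke detection and key matching, assuming fingertip coordinates per frame are already available; the function names, thresholds, and the press-then-pause heuristic are illustrative rather than DynaKey's exact algorithm.

```python
# Sketch of keystroke detection from fingertip coordinates across frames.
# Assumes fingertip positions (one (x, y) per frame for one fingertip) have
# already been extracted; all names and thresholds are illustrative.
import numpy as np

def detect_keystroke(track, still_frames=4, still_eps=3.0):
    """track: list of (x, y) image coordinates of one fingertip in recent frames.
    A keystroke is reported when the fingertip first moves toward the keys
    (y grows in image coordinates) and then stays almost stationary."""
    if len(track) < still_frames + 2:
        return None
    pts = np.asarray(track, dtype=float)
    recent = pts[-still_frames:]
    # Stationary check: the fingertip barely moves while pressing a key.
    if np.linalg.norm(recent.max(axis=0) - recent.min(axis=0)) > still_eps:
        return None
    # Approach check: before becoming stationary, the fingertip moved downward.
    if pts[-still_frames][1] - pts[-still_frames - 2][1] <= 0:
        return None
    return tuple(recent.mean(axis=0))   # pressing position in image coordinates

def locate_key(press_xy, key_boxes):
    """key_boxes: dict mapping a character to its (x0, y0, x1, y1) box in the
    current frame (kept up to date by key tracking). Returns the matched key."""
    x, y = press_xy
    for ch, (x0, y0, x1, y1) in key_boxes.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return ch
    return None
```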
The third challenge is how to trade off between dynamic tracking of the keyboard and tracking cost for resource-constrained devices. If the camera does not move or has negligible movements, tracking the keyboard's coordinates is unnecessary. To achieve the best trade-off for resource-constrained head-mounted devices, we introduce a gyroscope-based lightweight method to detect non-negligible camera movements, including short-time sharp movements and long-time accumulated micro movements. Only the detected non-negligible camera movements trigger the keyboard tracking module, which ensures that DynaKey works dynamically in real time.

In summary, we make three main contributions in this paper. 1) To the best of our knowledge, this is the first work focusing on efficient text input using the built-in camera of a head-mounted device (e.g., smart glasses) in dynamic moving scenes. To adapt to the dynamic camera views, we propose a Perspective Transformation-based technique to track the changes of the keyboard's coordinates. Besides, without the depth information of fingertips in a single camera view, we utilize the variation of fingertips' coordinates across multiple frames for keystroke detection. 2) To ensure real-time response, DynaKey adopts a gyroscope-based lightweight design to adaptively detect camera movements and remove unnecessary image processing for keyboard tracking. Besides, we introduce a series of optimizations such as keypoint selection, frame skipping and multi-thread processing for image processing. 3) We implement DynaKey on off-the-shelf Android devices, and conduct comprehensive experiments to evaluate its performance. Results show that the average tracking deviation of the keyboard layout is less than 3 pixels and the intersection over union (IoU) [25] of a key in two consecutive images is above 93%. The accuracy of keystroke localization reaches 95.5% on average. The response time is 63 ms, a latency below human response time [23].

II. RELATED WORK

Virtual keyboards have been used as an alternative to on-screen keyboards [1], [2] to support text input for mobile or wearable devices with a small screen or no screen. These virtual keyboards can be mainly classified into five categories, i.e., wearable sensor-based, projection-based, WiFi-based, acoustic-based, and camera-based keyboards.

Wearable sensor-based keyboards: Wearable sensors have been used to capture the movements of fingers for text input. iKey [4] utilizes a wrist-worn piezoelectric ceramic sensor to recognize keystrokes on the back of the hand. DigiTouch [5] introduces a glove-based input device which enables thumb-to-finger touch interaction by sensing touch position and pressure. MagBoard [3] leverages the triaxial magnetometer embedded in mobile phones to locate a magnet on a printed keyboard. FingerSound [6] utilizes a thumb-mounted ring consisting of a microphone and a gyroscope to recognize unistroke thumb gestures for text input. These approaches introduce additional hardware to capture typing behaviors.

Projection-based keyboards: Projection keyboards [24], [27] have been proposed for mobile devices, by adopting a conventional QWERTY keyboard layout. They usually require a light projector to cast a keyboard layout onto a flat surface, and then recognize keystrokes based on light reflection. This approach requires dedicated equipment. Microsoft Hololens [13] provides a projection keyboard in front of a user using a pair of mixed-reality smart glasses. During text input, the user needs to move her/his head to pick a key and then make a specific 'tap' gesture to select the character.
This tedious process may slow down text input and affect user experience.

WiFi-based keyboards: By utilizing the unique patterns of channel state information (CSI) in time series, WiFinger [19]
is designed to recognize a set of fine-grained finger gestures to input text for off-the-shelf WiFi devices. Similarly, when a user types on a keyboard, WiKey [8] recognizes the typed keys based on how the CSI values change at the WiFi signal receiver. However, WiFi-based approaches can be easily affected by the environment, such as changes of the transceiver's orientation or location, and unexpected human motions in surrounding areas. They are often expected to work in controlled environments, rather than real-world scenarios.

Acoustic-based keyboards: By utilizing the built-in microphones of mobile and wearable devices, acoustic-based keyboards have been proposed recently. UbiTap [16] presents an input method that turns a solid surface into a touch input space, based on the sound collected by microphones. To infer a keystroke's position, it requires three phones to estimate the arrival time of acoustic signals. KeyListener [20] infers keystrokes on the QWERTY keyboard of a touch screen by leveraging the microphones of a smartphone; however, it is designed for indirect eavesdropping attacks, and its keystroke inference accuracy is usually not sufficient for text input. UbiK [28] leverages the microphone of a mobile device to locate keystrokes, but it requires the user to click a key with the fingertip and nail margin, which may not be a typical typing gesture. Some Automatic Speech Recognition (ASR) tools [31] are also designed for text input by decoding the speaker's voice, but they can be vulnerable to environmental sounds and are not suitable for public spaces where quiet is required.

Camera-based keyboards: By using a built-in camera, TiPoint [18] detects keystrokes for interactions with smart glasses; it requires a finger to move and click on a mini-trackball to input a character. However, its input speed and user experience need further improvement for real applications. Chang et al.
[11] design a text input system for HMDs by cutting a keyboard into two parts; its performance is almost comparable to that of single-hand text input on tablet computers. K. Sun et al. [26] propose a depth-aware tapping scheme for VR/AR devices by combining a microphone with a COTS mono-camera; it tracks the user's fingers based on ultrasound and image frames. Yin et al. [29] leverage the built-in camera of a mobile device to recognize keystrokes by comparing the fingertip's location with a key's location in image frames. However, these methods assume that the text input space has a fixed location in the camera view, i.e., the coordinates of the keyboard or keys keep unchanged.

Our work is motivated by the recent advance of camera-based text input methods. We move an important step towards dynamic scenarios where the camera moves naturally with the user's head. In our work, the keyboard coordinates in the camera's view change dynamically, creating more challenges in achieving high keystroke localization accuracy and low latency for resource-limited head-mounted devices.

III. OBSERVATIONS

We first conduct preliminary experiments to study how the changes of keyboard coordinates affect key tracking and keystroke localization in a dynamic scenario. In our experiments, we use a Samsung Galaxy S9 smartphone as a head-mounted camera device, as shown in Fig. 2(a). We use an A4-sized paper keyboard with the Microsoft Hololens [13] keyboard layout and keep its location unchanged. Unless otherwise specified, the frame rate of the camera is set to 30 fps and the sampling rate of the gyroscope is set to 200 Hz.

Fig. 2. Observations about coordinate changes of keys, captured frames for keystrokes, and time cost in image processing. (a) Experimental setup. (b) Unconscious head movements can lead to large coordinate deviations of keys. (c) Extracting all keys from each image leads to unacceptable time cost for real-time systems. (d) Head movements occur occasionally and last for several frames instead of all frames. (e) A frame from the camera view can hardly detect the depth information of fingertips.

Observation 1. Unconscious head movements can lead to large coordinate deviations of the keyboard. As shown in Fig. 2(a), the head-mounted camera moves along with the head. Head movements lead to dynamic changes of the camera view. While the location of the keyboard keeps unchanged, the camera view changes lead to changes of the keyboard coordinates in the image frames. As shown in Fig. 2(b), the paper keyboard is represented as K, and the captured keyboard from the camera view is K1. We take the case of rotating around the y-axis (marked in Fig. 2(a)) as an example of head (i.e., camera) movements. When the camera slightly rotates Δθ = 5° around the y-axis anticlockwise, the image frame changes from the x-y plane to the x'-y' plane, and the captured keyboard in the image frame changes to K2. Correspondingly, the location offset of the keyboard reaches (Δdx, Δdy) = (78, 27) pixels, which can lead to the mismatch between coordinates and keys. As shown in the right part of Fig. 2(b), due to the camera movement, the captured keyboard in the current image is shown in blue, while that in the original image is shown in black. In the current frame, i.e., the blue keyboard, the user types the letter 'y'. When using the coordinates of keys in the original frame, the keystroke may be mismatched to the letter 'h'.
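As a rough sanity check on this magnitude (our own back-of-envelope estimate, not taken from the paper), consider a pinhole camera model with focal length f expressed in pixels: a pure camera rotation of Δθ about the y-axis shifts the projection of a point near the image center by approximately

$$ \Delta d_x \approx f \cdot \tan(\Delta\theta), $$

so the measured 78-pixel shift at Δθ = 5° corresponds to f ≈ 78 / tan 5° ≈ 890 pixels, a plausible focal length for a smartphone camera. This explains why even a slight head rotation displaces the keyboard by tens of pixels in the image.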
Observation 2. Extracting all keys from each image suffers from unavoidable occlusion by hands and has an unacceptable processing cost. To track the coordinate changes of keys, an intuitive solution is to extract keys from each image frame. However, considering the unavoidable hand occlusion shown in Fig. 2(b), it is difficult to extract each key from the image frame accurately. Besides, considering the limited resources of a head-mounted device and the real-time requirement of text input, the processing cost of extracting keys from each image frame is expensive. Specifically, we use Δt to represent the processing cost of key extraction from an image frame, i.e., processing an input image and extracting all keys from the image. Fig. 2(c) shows the cost of key extraction in 100 different frames. The result shows that the processing cost Δt ranges from 40 ms to 60 ms, with an average of 49 ms, which is larger than the inter-frame duration (i.e., 33 ms). Therefore, extracting all keys from each image frame to track the coordinates of keys may be unacceptable for real applications. More time-efficient key tracking methods are expected.

Observation 3. Head movements occur occasionally and last for several frames instead of all frames. According to Observation 2, extracting all keys in each image can hardly work. In fact, we find that performing key extraction in each frame is unnecessary. Although the user's head moves during typing, the ratio of head movement duration to the whole typing duration is small. Fig. 2(d) shows that head movements cause peaks in the gyroscope data during a 3-minute typing process, and the total duration of the three head movements is less than 1 minute. This implies that during the typing process, the coordinates of keys in the image frames keep unchanged for more than 67% of the time. Consequently, we only need to re-extract the coordinates of keys when head movements are detected, rather than performing key extraction in each frame, as sketched below.
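The following is a minimal sketch of such gyroscope-gated re-tracking, assuming angular-velocity samples are available between consecutive frames; the thresholds, the crude integration, and the class interface are our own illustration and not the values used by DynaKey.

```python
# Sketch of gyroscope-gated key tracking, following Observation 3: the keys'
# coordinates are only re-tracked when the head (camera) actually moves.
import numpy as np

SHARP_THRESH = 0.2      # rad/s, short-time sharp movement (illustrative)
ACCUM_THRESH = 0.03     # rad, long-time accumulated micro movement (illustrative)

class MovementDetector:
    def __init__(self):
        self.accum = np.zeros(3)          # integrated rotation since the last re-tracking

    def update(self, gyro_samples, dt):
        """gyro_samples: (N, 3) angular velocities since the previous frame,
        dt: sampling interval in seconds. Returns True if key tracking should run."""
        w = np.asarray(gyro_samples, dtype=float)
        self.accum += w.sum(axis=0) * dt  # crude integration of the rotation angle
        sharp = np.abs(w).max() > SHARP_THRESH
        drifted = np.abs(self.accum).max() > ACCUM_THRESH
        if sharp or drifted:
            self.accum[:] = 0.0           # tracking will refresh the key coordinates
            return True
        return False                      # reuse the previously tracked coordinates
```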
Observation 4. A frame from the camera view is insufficient to detect the depth information of fingertips. To decide whether a keystroke is occurring, it is critical to determine whether a fingertip is pressing on a key. However, different from a front camera view, the camera view from top and behind can hardly detect the depth of an object, i.e., the perpendicular distance between the fingertip and the keyboard plane. As shown in Fig. 2(e), all fingers hover above the keyboard. Some fingertips appear above the keys from the camera view, hence it is easy to mistakenly recognize non-keystrokes as keystrokes. To address the confusion, we may dynamically track the moving patterns of a fingertip and detect a keystroke from several frames instead of a single frame. We also need an efficient way to distinguish the fingertip pressing a key from the other fingertips.

IV. SYSTEM DESIGN

We now present the design of DynaKey, which provides a text-input scheme for a head-mounted camera device in dynamic scenes, as shown in Fig. 1. DynaKey works in realistic scenarios where a user types on a virtual keyboard with natural head movements. The keyboard layout can be printed on a piece of paper or drawn on a desk surface. Unless otherwise specified, we use an Android smartphone as the head-mounted camera device, where the embedded camera is used to capture the user's typing behaviors, then track and locate keystrokes. The embedded gyroscope is used to detect head movements. The keyboard layout is printed on a piece of paper, as shown in Fig. 2(a).

A. System Overview

Fig. 3. Architecture of DynaKey.

Fig. 3 shows the framework of DynaKey. The inputs are image frames captured by the camera and the angular velocity collected by the gyroscope, while the output is the character of the pressed key. Initially, the user keeps the head still and moves the hands out of the camera view for about 3 seconds, while Key Tracking detects the keyboard and extracts each key from the initial image frame. When the screen shows "Please TYPE", the user begins typing. During the typing process, we use Key Tracking to select keypoints of images to transform the coordinates of keys among different frames. At the same time, we use Adaptive Tracking to analyze the angular velocity from the gyroscope to detect head (i.e., camera) movements, and then determine whether to update the coordinates of keys. In addition, DynaKey uses Fingertip Detection to segment the hand region from the frame and detect the fingertips. After that, we use Keystroke Detection and Localization to detect that a keystroke has occurred and to locate it. To ensure that DynaKey works in real time, we adopt three threads to run the image capturing, the image processing (i.e., key tracking, fingertip detection, keystroke detection and localization), and the adaptive tracking in parallel, as sketched below.
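A minimal sketch of this three-thread organization is shown below, assuming hypothetical `camera`, `gyro`, `detector`, and `pipeline` wrappers around the capture, sensing, and processing steps; the queue/event structure is our own illustration of the described design, not the actual implementation.

```python
# Sketch of the three-thread organization described in the system overview:
# one thread captures frames, one runs the image-processing pipeline, and one
# watches the gyroscope. All wrapper objects here are hypothetical placeholders.
import queue
import threading

frame_q = queue.Queue(maxsize=2)          # capture -> image processing
movement_flag = threading.Event()         # adaptive tracking -> image processing
stop_flag = threading.Event()

def capture_loop(camera):
    while not stop_flag.is_set():
        frame_q.put(camera.read())        # hypothetical camera wrapper

def adaptive_tracking_loop(gyro, detector):
    while not stop_flag.is_set():
        if detector.update(gyro.read(), dt=0.005):   # 200 Hz gyroscope sampling
            movement_flag.set()           # ask the processing thread to re-track keys

def processing_loop(pipeline):
    while not stop_flag.is_set():
        frame = frame_q.get()
        if movement_flag.is_set():
            pipeline.track_keys(frame)    # perspective-transformation based tracking
            movement_flag.clear()
        pipeline.detect_fingertips(frame)
        pipeline.detect_and_locate_keystroke(frame)

# Each loop would be launched with threading.Thread(target=..., args=(...,)).start().
```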
B. Key Tracking

Before typing, we first need to extract keys from the image. With possible head movements, i.e., camera view changes, we then need to track the coordinates of keys in the following frames, as mentioned in Observation 1 of Section III. Key tracking in DynaKey consists of key extraction and coordinate transformation, as described below.

Fig. 5. Process of extracting keys. (a) An input frame. (b) Edge detection result. (c) All detected contours. (d) Corner point detection. (e) Keyboard with corner points. (f) Key extraction result.

1) Key Extraction: We adopt a common QWERTY keyboard layout, which is printed in black and white on a piece of paper, as shown in Fig. 5(a). Given the input image in Fig. 5(a), we use the Canny edge detection algorithm [10], [29] to obtain all edges, and then find all possible contours from the detected edges, as shown in Fig. 5(b) and Fig. 5(c), respectively. The largest contour with four corners (i.e., the green contour shown in Fig. 5(c)) corresponds to the keyboard, where the corners are detected based on the angles formed by consecutive contour segments, as the red points shown in Fig. 5(d). Once the keyboard location is fixed, i.e., the four corner points are fixed, as shown in Fig. 5(e), we can detect the keys within the keyboard. Specifically, among the small contours (i.e., the red contours shown in Fig. 5(c)) located in the keyboard, we utilize the area of a key to eliminate spurious contours and then extract each key from the keyboard, as shown in Fig. 5(f). Finally, we map the extracted keys to characters based on the relative locations among keys, i.e., the known keyboard layout, as sketched below.
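Below is a minimal sketch of this extraction step using OpenCV (assuming OpenCV 4.x); it simplifies the corner detection to a polygon approximation of the largest contour rather than the angle test on consecutive contour segments, and all thresholds are illustrative.

```python
# Sketch of the key-extraction step: Canny edges, contour detection, picking the
# largest four-corner contour as the keyboard, and keeping key-sized contours
# whose centers lie inside it. Thresholds are illustrative.
import cv2
import numpy as np

def extract_keys(image, min_key_area=300, max_key_area=5000):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    # Keyboard: the largest contour that approximates a quadrilateral.
    keyboard = None
    for c in sorted(contours, key=cv2.contourArea, reverse=True):
        approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
        if len(approx) == 4:
            keyboard = approx.reshape(4, 2)
            break
    if keyboard is None:
        return None, []
    kb_poly = keyboard.reshape(-1, 1, 2).astype(np.float32)

    # Keys: small contours whose area matches a key and whose center lies
    # inside the keyboard quadrilateral.
    keys = []
    for c in contours:
        area = cv2.contourArea(c)
        if min_key_area < area < max_key_area:
            x, y, w, h = cv2.boundingRect(c)
            cx, cy = x + w / 2.0, y + h / 2.0
            if cv2.pointPolygonTest(kb_poly, (float(cx), float(cy)), False) > 0:
                keys.append((x, y, x + w, y + h))
    return keyboard, keys
```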
2) Coordinate Transformation: Due to head movements, it is essential to track the coordinates of keys among different frames. Besides, the changes of the camera view also bring in distortion of the keyboard in images, as the two captured quadrilaterals P0P1P3P2 and Q0Q1Q3Q2 shown in Fig. 4. To tolerate the camera movement and image distortion, we propose a perspective transformation based method to track the coordinates of keys.

Perspective Transformation: As shown in Fig. 4, for a fixed point G_i in the physical space, when we obtain its projection point (X_i, Y_i) in the jth frame, perspective transformation [21] can use a transformation matrix C = (C_00, C_01, C_02; C_10, C_11, C_12; C_20, C_21, C_22) to calculate its projection (U'_i, V'_i) in the kth frame. Therefore, when the paper keyboard is fixed, we can use the known keyboard/key locations in the previous frames to infer the keyboard/key locations in the following frames, without keyboard detection and key extraction. Specifically, with the known projection point (X_i, Y_i) in the jth frame, we first use C to calculate the 3D coordinate (U_i, V_i, W_i) related to (X_i, Y_i) in the physical space, as described in Eq. (1). We then introduce a division operation to obtain its corresponding projection point (U'_i, V'_i) in the kth frame, as described in Eq. (2).

\[
\begin{pmatrix} U_i \\ V_i \\ W_i \end{pmatrix}
=
\begin{pmatrix} C_{00} & C_{01} & C_{02} \\ C_{10} & C_{11} & C_{12} \\ C_{20} & C_{21} & C_{22} \end{pmatrix}
\cdot
\begin{pmatrix} X_i \\ Y_i \\ 1 \end{pmatrix}
\tag{1}
\]

\[
U'_i = \frac{U_i}{W_i} = \frac{C_{00} X_i + C_{01} Y_i + C_{02}}{C_{20} X_i + C_{21} Y_i + C_{22}}, \qquad
V'_i = \frac{V_i}{W_i} = \frac{C_{10} X_i + C_{11} Y_i + C_{12}}{C_{20} X_i + C_{21} Y_i + C_{22}}
\tag{2}
\]

Here, the projection points of the keyboard or keys in the previous frame can be obtained through key extraction, as mentioned in Section IV-B1. Thus the main challenge lies in the calculation of the transformation matrix C, which is described below.

Keypoint Selection: In the transformation matrix C, C_22 is a scale factor and is usually set to C_22 = 1, thus we only need to calculate the other eight variables, which can be solved by selecting four non-colinear feature point pairs (e.g., P_i(X_i, Y_i) and Q_i(U'_i, V'_i), i ∈ [0, 3], shown in Fig. 4). The specific formula for calculating C with four feature point pairs is shown in Eq. (3).

\[
\begin{pmatrix}
X_0 & Y_0 & 1 & 0 & 0 & 0 & -X_0 U'_0 & -Y_0 U'_0 \\
X_1 & Y_1 & 1 & 0 & 0 & 0 & -X_1 U'_1 & -Y_1 U'_1 \\
X_2 & Y_2 & 1 & 0 & 0 & 0 & -X_2 U'_2 & -Y_2 U'_2 \\
X_3 & Y_3 & 1 & 0 & 0 & 0 & -X_3 U'_3 & -Y_3 U'_3 \\
0 & 0 & 0 & X_0 & Y_0 & 1 & -X_0 V'_0 & -Y_0 V'_0 \\
0 & 0 & 0 & X_1 & Y_1 & 1 & -X_1 V'_1 & -Y_1 V'_1 \\
0 & 0 & 0 & X_2 & Y_2 & 1 & -X_2 V'_2 & -Y_2 V'_2 \\
0 & 0 & 0 & X_3 & Y_3 & 1 & -X_3 V'_3 & -Y_3 V'_3
\end{pmatrix}
\cdot
\begin{pmatrix}
C_{00} \\ C_{01} \\ C_{02} \\ C_{10} \\ C_{11} \\ C_{12} \\ C_{20} \\ C_{21}
\end{pmatrix}
=
C_{22} \cdot
\begin{pmatrix}
U'_0 \\ U'_1 \\ U'_2 \\ U'_3 \\ V'_0 \\ V'_1 \\ V'_2 \\ V'_3
\end{pmatrix}
\tag{3}
\]
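To make Eqs. (1)–(3) concrete, the following NumPy sketch solves the 8 × 8 linear system of Eq. (3) for C (with C_22 = 1) and then applies Eqs. (1)–(2) to project a point from the jth frame into the kth frame; cv2.getPerspectiveTransform performs an equivalent computation.

import numpy as np

def solve_transformation_matrix(src_pts, dst_pts):
    """Solve Eq. (3) for C given four non-colinear point pairs.
    src_pts, dst_pts: arrays of shape (4, 2), mapping (X_i, Y_i) -> (U'_i, V'_i)."""
    A = np.zeros((8, 8))
    b = np.zeros(8)
    for i, ((X, Y), (U, V)) in enumerate(zip(src_pts, dst_pts)):
        A[i]     = [X, Y, 1, 0, 0, 0, -X * U, -Y * U]
        A[i + 4] = [0, 0, 0, X, Y, 1, -X * V, -Y * V]
        b[i], b[i + 4] = U, V          # right-hand side of Eq. (3) with C22 = 1
    c = np.linalg.solve(A, b)          # (C00, C01, C02, C10, C11, C12, C20, C21)
    return np.append(c, 1.0).reshape(3, 3)

def transform_point(C, pt):
    """Apply Eq. (1) and the division of Eq. (2) to a point (X_i, Y_i)."""
    U, V, W = C @ np.array([pt[0], pt[1], 1.0])
    return U / W, V / W

Once C is computed from the selected keypoint pairs, the key locations tracked in a previous frame can be mapped into the current frame by transform_point, avoiding repeated keyboard detection and key extraction.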
Fig. 6. Feature point selection and time cost of the FLANN-based matcher: (a) keypoint selection by the FLANN-based matcher; (b) the time cost of the two keypoint selection methods.
Fig. 7. Hands move on the keyboard and lead to different occlusions in the process of pressing 'I' and 'Y'.

To get the feature point pairs, a FLANN-based matcher [22] is often adopted, which finds an approximate (possibly not the best) nearest neighbor point in image p for each point in image q, and then pairs the two points. Considering possible wrongly-selected feature point pairs, such as P_i and P_j in Fig. 6(a), the FLANN-based method often needs to detect a large number of feature point pairs, and then selects the top-k (k is usually larger than 4) feature point pairs to calculate the transformation matrix with the least squares method. However, selecting a larger number of feature points leads to a non-negligible time latency (e.g., 60 ms), which is larger than the inter-frame duration (i.e., 33 ms) and is unacceptable in a real-time system, as shown in Fig. 6(b). Therefore, it is necessary to quickly and accurately select an appropriate number of feature point pairs for the transformation matrix calculation.

To achieve the above goal, we introduce keypoint selection to calculate C with only four keypoint pairs, where keypoints mean cross points of lines in the keyboard. As shown in Fig. 7, due to the size difference between the keyboard and the hands, wherever the occlusion is located, the cross points in the keyboard will not be occluded completely at the same time. In addition, during a typing process, we observe that the hand movements between two consecutive frames are not violent, i.e., there often exist several common cross points in the two frames, as the green points shown in Fig. 7(a) and Fig. 7(b). Therefore, we can detect the common cross points appearing in both of the two consecutive frames (i.e., cross point pairs), and select four non-colinear keypoint pairs for perspective transformation, as shown in Alg. 1.

1) Line Detection: With an input image as shown in Fig. 8(a), which equals Fig. 7(b), we first utilize skin segmentation [29] to segment the hand region from the image, shown as the white region in Fig. 8(b). Then, we get the edges of Fig. 8(b) using the Canny edge detector [10], as shown in Fig. 8(c). After that, to detect the lines of the keyboard and reduce the interference of other edges, we use the Hough transformation [15] to detect the long lines in the image, shown as the red lines in Fig. 8(c). However, there are too many lines, which may confuse the cross point detection. Therefore, we merge the detected lines. For convenience, we represent each line in polar coordinates with a vector (ρ, θ). For the lines close to each other, which satisfy ∆ρ < 50 pixels and ∆θ < 5.7°, we only select one of them. The optimized line detection result for Fig. 8(c) is shown in Fig. 8(d).
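A condensed sketch of this line-detection step is given below (the hand segmentation and Canny stages are omitted; the Hough accumulator threshold is a placeholder, while the merging tolerances follow the 50-pixel and 5.7° values above):

import cv2
import numpy as np

def detect_keyboard_lines(edges, rho_tol=50, theta_tol=np.deg2rad(5.7)):
    """Detect long lines with the Hough transform and merge near-duplicates
    whose polar parameters (rho, theta) are close to each other."""
    raw = cv2.HoughLines(edges, 1, np.pi / 180, 120)     # red lines in Fig. 8(c)
    if raw is None:
        return []
    merged = []                                          # result of Fig. 8(d)
    for rho, theta in raw[:, 0, :]:
        if all(abs(rho - r) >= rho_tol or abs(theta - t) >= theta_tol
               for r, t in merged):
            merged.append((float(rho), float(theta)))
    return merged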
2) Corner Point Detection: As shown in Fig. 8(d), not all lines of the keyboard (i.e., not all cross points) can be detected, due to the occlusion of hands. Correspondingly, the location of a detected cross point cannot be directly inferred. To solve this problem, we introduce the corner point of the keyboard to infer the location of a detected cross point, based on the relative position between the corner point and the other cross points. Specifically, we observe that there usually exist one or more corner points in the captured images during typing, as shown in Fig. 7. Particularly, the top left or the top right corner often exists. Therefore, we examine the top, leftmost, rightmost and bottom detected lines, in order of priority, to detect possible corner points, until one corner point is detected.

Take the top line as an example. We trace the points of the top line from the leftmost point to the right, to detect the top left corner point. As shown in Fig. 8(e), for a point P'_l(x'_l, y'_l) in the top line, we use a square area S'_l = {(x_i, y_i) : |x_i − x'_l| ≤ δx, |y_i − y'_l| ≤ δy} to verify whether P'_l is a corner point. When the ratio of the number of black pixels (i.e., the possible contour of a corner) to the number of all pixels in S'_l is larger than ∆ρc, P'_l can be a candidate corner point. After that, we fit a line for the black pixels satisfying |x_i − x'_l| ≤ δx and a line for the black pixels satisfying |y_i − y'_l| ≤ δy, respectively, as shown in Fig. 8(e). If the angle γ between the two fitted lines satisfies |γ − 90°| < ∆, P'_l will be selected as the top left corner (i.e., P_l). Based on extensive experiments, we set ∆ρc = 0.25, ∆ = 6°, and δx = δy = 5 by default. It is worth noting that if no border (i.e., no corner point) of the keyboard can be detected, we skip this frame, because there are usually no valid keystrokes when all borders are blocked. Otherwise, if any border of the keyboard is detected, we then detect the corner points for key tracking.

3) Common Cross Point Detection: For the other detected lines, we extend the length of each line to detect the cross points, as the green points shown in Fig. 8(e). To extract the common cross point set detected in two frames, we first utilize the detected top-left corner point P_l to infer the location of a cross point. Specifically, we represent the location of a cross point P_i with a distance d_i and an angle θ_i. As shown in Fig. 8(f), the distance d_i is measured as the Euclidean distance between P_i and P_l, and θ_i is computed as the angle between the vectors from P_l to P_i and from P_l to P. Here, the point P is a randomly selected point on the right of P_l in the top line. By comparing d_i and θ_i of each point in the two frames, we pair two keypoints with similar distance and angle, i.e., the distance difference in the two frames satisfies δd < 20 pixels while the angle difference satisfies δθ < 4°. In Fig. 8(f), the yellow and the green keypoints are selected as common cross points.

Fig. 8. Process of selecting keypoints: (a) an input frame; (b) hand segmentation; (c) line detection; (d) optimized line detection; (e) corner point detection; (f) keypoint selection.
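The pairing of common cross points then reduces to comparing each point's distance and angle relative to the detected corner; a sketch follows, assuming the corner point P_l and the reference point P on the top line are already available in both frames (the 20-pixel and 4° tolerances are the defaults above):

import numpy as np

def polar_around_corner(points, corner, ref):
    """Represent each cross point by its distance d_i to the corner and the
    angle theta_i between corner->point and corner->ref (Fig. 8(f))."""
    v_ref = np.asarray(ref, float) - np.asarray(corner, float)
    out = []
    for p in points:
        v = np.asarray(p, float) - np.asarray(corner, float)
        d = np.linalg.norm(v)
        cos = np.dot(v, v_ref) / (d * np.linalg.norm(v_ref) + 1e-9)
        out.append((d, np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))))
    return out

def match_cross_points(prev_pts, prev_polar, cur_pts, cur_polar,
                       d_tol=20.0, theta_tol=4.0):
    """Pair keypoints whose (d_i, theta_i) agree across two consecutive frames."""
    pairs = []
    for p, (d1, t1) in zip(prev_pts, prev_polar):
        for q, (d2, t2) in zip(cur_pts, cur_polar):
            if abs(d1 - d2) < d_tol and abs(t1 - t2) < theta_tol:
                pairs.append((p, q))
                break
    return pairs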
4) Keypoint Pair Determination: Finally, we select four non-colinear cross point pairs as keypoint pairs, {(X_i, Y_i), (U'_i, V'_i) | i ∈ [0, 3]}, which will be used for calculating the transformation matrix. As shown in Fig. 8, by only detecting several intersection points instead of a large number of feature points, we reduce the time of processing one image for key tracking from 60 ms to 26 ms, as shown in Fig. 6(b), which is smaller than the inter-frame interval and satisfies the real-time requirement.

Algorithm 1: Keypoint selection
Input: An image frame.
  Use skin segmentation to extract the hand regions.
  Use the Canny edge detector to get the top, leftmost, rightmost, and bottom lines {L_k}, k ∈ [1, 4].
  Corner point set P = ∅.
  for each line L_k, k = 1, ..., 4, while P = ∅ do
    for each point P'_l ∈ L_k while P = ∅ do
      The square area containing P'_l is S'_l = {(x_i, y_i) : |x_i − x'_l| ≤ δx, |y_i − y'_l| ≤ δy}.
      The ratio of black pixels in S'_l is ρ_l.
      if ρ_l > ∆ρc then
        Fit a line l_y for the black pixels satisfying |x_i − x'_l| ≤ δx.
        Fit a line l_x for the black pixels satisfying |y_i − y'_l| ≤ δy.
        The angle between l_y and l_x is γ.
        if |γ − 90°| < ∆ then P = {P'_l}, i.e., P'_l is selected as the corner point.
  Extend the other detected lines to obtain the cross points, and represent each cross point P_i by (d_i, θ_i) relative to the corner point.
  Pair the cross points of two consecutive frames whose distance and angle differences satisfy δd < 20 pixels and δθ < 4°.
Output: Four non-colinear keypoint pairs {(X_i, Y_i), (U'_i, V'_i)}, i ∈ [0, 3].

C. Adaptive Tracking

Based on Observation 3, head movements occur occasionally, thus it is unnecessary to track keys in every image. To reduce the unnecessary computation overhead of image processing, we present an adaptive key tracking scheme by introducing a gyroscope, and only activate the aforementioned Key Tracking module when a head movement is detected.

In Fig. 9(a), we show the gyroscope data and the coordinate changes of keys during a typing process. When a head movement occurs, there is a sharp increase of angular velocity. Considering that the size of a key in an image is small (about 45 × 25 pixels), even a slight head movement can lead to a non-negligible location change of a key; therefore, when the angular velocity satisfies ω > εg, we activate the Key Tracking module, where we set εg = 2.9°/s by default.

In fact, in addition to the sharp increase of angular velocity caused by non-negligible head movements, long-time micro camera movements also lead to coordinate changes of keys. As shown in Fig. 9(b), there is no sharp increase of gyroscope data, while the rotation angle accumulated since the last key tracking can lead to non-negligible coordinate changes. In this case, we introduce ∆θ_j = ∫_{t_0}^{t_i} ω(t) dt, where t_0 means the last time of key tracking and t_i means the current time. If ∆θ_j > εr, we activate the Key Tracking module, where we set εr = 3.1° by default. By introducing adaptive tracking, we remove unnecessary image processing for key tracking.

Fig. 9. Sharp and micro camera movements vs. coordinate changes of keys: (a) short-time sharp camera movements; (b) long-time micro camera movements.
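A sketch of this adaptive trigger is given below: key tracking is re-run only when the instantaneous angular velocity exceeds εg or the rotation accumulated since the last tracked frame exceeds εr (gyro_samples is a hypothetical list of (timestamp, angular speed) readings in seconds and rad/s; taking the absolute value of a single angular speed is a simplification of integrating the three-axis gyroscope readings):

import math

EPS_G = math.radians(2.9)   # sharp-movement threshold, 2.9 deg/s in rad/s
EPS_R = math.radians(3.1)   # accumulated-rotation threshold, 3.1 deg in rad

def should_track_keys(gyro_samples, t_last_tracking):
    """Return True when the Key Tracking module should be re-run."""
    accumulated = 0.0
    prev_t = t_last_tracking
    for t, omega in gyro_samples:              # samples since the last tracking
        if t < t_last_tracking:
            continue
        if abs(omega) > EPS_G:                 # sharp head movement (Fig. 9(a))
            return True
        accumulated += abs(omega) * (t - prev_t)   # integrate omega(t) dt
        prev_t = t
    return accumulated > EPS_R                 # slow drift case (Fig. 9(b))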
D. Fingertip Detection

After we obtain the coordinates of the keys, we need to detect the fingertips for further keystroke detection and localization. Given an input image, as shown in Fig. 10(a), we first utilize skin segmentation [29] to extract the hand region from the image, as shown in Fig. 10(b). We then use the hand contours shown in Fig. 10(c) and the shape feature [29] of a fingertip to detect the possible fingertips, as shown in Fig. 10(d). After that, we move along the hand's contour in Fig. 10(c) to remove the pitfall points corresponding to fingerwebs. Specifically, we use F_i to represent a possible fingertip point, while using F_{i−k} and F_{i+k} to represent the points visited before and after F_i. As shown in Fig. 10(d), if the cross product of the vectors from F_i to F_{i−k} and from F_i to F_{i+k} is larger than 0, F_i can be treated as a fingertip. Otherwise, it is a pitfall point in the fingerweb and will be eliminated, as shown in Fig. 10(e). Besides, we introduce the distance between the possible fingertip and the center of the hand, to further remove the pitfall points with distances smaller than ∆r. Finally, for each cluster of points related to a fingertip, we choose the middle point to represent the final detected fingertip, as shown in Fig. 10(f). Unless otherwise specified, we set k = 50 and ∆r = 100 pixels by default.

Fig. 10. Fingertip detection: (a) an input frame; (b) hand segmentation; (c) hand contour.
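A sketch of this contour-based fingertip test follows, assuming the hand contour is an (N, 2) array of points and the contour orientation makes the cross product positive at a convex tip; k = 50 and ∆r = 100 pixels follow the defaults above:

import numpy as np

def detect_fingertips(contour, hand_center, k=50, min_dist=100.0):
    """Keep contour points that look like fingertips (convex, far from the palm)."""
    center = np.asarray(hand_center, float)
    n = len(contour)
    candidates = []
    for i in range(n):
        f_i = contour[i].astype(float)
        v1 = contour[(i - k) % n] - f_i          # F_i -> F_{i-k}
        v2 = contour[(i + k) % n] - f_i          # F_i -> F_{i+k}
        cross = v1[0] * v2[1] - v1[1] * v2[0]    # 2-D cross product
        far_enough = np.linalg.norm(f_i - center) > min_dist
        if cross > 0 and far_enough:             # convex tip, not a fingerweb
            candidates.append(i)
    # One fingertip per cluster of consecutive candidate indices: keep the middle point.
    tips, cluster = [], []
    for idx in candidates:
        if cluster and idx - cluster[-1] > 1:
            tips.append(tuple(contour[cluster[len(cluster) // 2]]))
            cluster = []
        cluster.append(idx)
    if cluster:
        tips.append(tuple(contour[cluster[len(cluster) // 2]]))
    return tips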
Fig. 11. Sampled frames in the process of pressing 'U' with fingertip 7. The yellow and red points are the locations of fingertip 7 in the previous and current sampled frames, respectively. The green arrow indicates the trend of the fingertip's movement. The blue circles and points are the locations of the other fingertips in the previous and current sampled frames, respectively. ('Fm' is short for frame.)

E. Keystroke Detection and Localization

After obtaining the coordinates of keys and detecting fingertips, we detect and locate keystrokes. Specifically, we first determine whether a typing operation occurs, i.e., keystroke detection, and then determine which fingertip is pressing the key, i.e., keystroke localization, as shown in Alg. 2.

1) Keystroke Detection: According to Observation 4, the depth information of fingertips can hardly be obtained from a single image, thus we detect a keystroke from multiple consecutive frames. Specifically, a keystroke operation involves several steps: first the fingertip moves towards the key, then it stays on the key for a short duration, and finally it moves away from the key. An example is shown in Fig. 11 (i.e., the seventh fingertip). Therefore, the coordinate changes of fingertips can be used to detect possible keystrokes. Additionally, to reduce the processing cost, we introduce a frame-skipping scheme for keystroke detection, instead of detecting the coordinates of fingertips in every image frame.

To capture enough information for keystroke detection and localization, we first set the frame rate of the camera to 30 fps, which is the maximum/default frame rate of off-the-shelf mobile devices. According to [9], a keystroke usually lasts 185 ms, which is about the duration of capturing 5 or 6 frames. Therefore, we process every 5 image frames and compare each fingertip's coordinates.
For convenience, we use T_i^{(j)}(x_i^{(j)}, y_i^{(j)}) and T_i^{(k)}(x_i^{(k)}, y_i^{(k)}) to represent the ith fingertip's coordinates in the jth and the kth frames, where k = j + 5. Considering that a camera movement may happen between the jth frame and the kth frame, T_i^{(j)} is transformed to the coordinate system of the kth frame as T_i^{(j)'}(x_i^{(j)'}, y_i^{(j)'}), based on perspective transformation. If the coordinate change δd = √((x_i^{(j)'} − x_i^{(k)})² + (y_i^{(j)'} − y_i^{(k)})²) is less than ε_d, the fingertip is considered unchanged; otherwise it is moving. We set ε_d = 15 pixels by default.

After obtaining the coordinate changes of a fingertip over every 5 frames, we further need to determine whether the fingertip is pressing a key. As mentioned before, a keystroke usually lasts for 185 ms. If the coordinates of a fingertip T_i^{(j)'} and T_i^{(k)} from the jth to the kth (k = j + 5) frame keep unchanged, it implies that the fingertip keeps staying on the pressed key during the last five frames, because pressing a key in the jth frame, moving out and coming back to the same key usually takes longer than 185 ms (i.e., more than the duration of 5 frames). At this time, if we have already processed a keystroke in the jth frame, we do not process the keystroke in the kth frame repeatedly. Otherwise, we detect a new possible keystroke in the kth frame. Differently, if the coordinates of the fingertip T_i^{(j)'} and T_i^{(k)} change from the jth to the kth frame, we need to further determine whether there is a keystroke in the kth frame. At this time, we introduce the (k − 1)th frame, detect the coordinate of the fingertip as T_i^{(k−1)}, and transform T_i^{(k−1)} to the coordinate system of the kth frame as T_i^{(k−1)'}. Then, we calculate the coordinate change of the fingertip δd' = √((x_i^{(k−1)'} − x_i^{(k)})² + (y_i^{(k−1)'} − y_i^{(k)})²) between the (k − 1)th and the kth frames. If δd' > ε_d, the fingertip keeps moving and there is no keystroke. Otherwise, we detect a possible keystroke in the kth frame, and keystroke localization is described in the following subsection.

Fig. 12. Coordinate changes in y-axis of the ten fingertips.

2) Keystroke Localization: Keystroke localization is to detect which finger is typing. As shown in Fig. 11, although all fingertips move together during a keystroke, the fingertip T_k pressing a key often has the largest coordinate change, especially along the y-axis. This is because T_k needs to move towards the target key, stay on the key, and then move away, while the other fingertips often keep hovering or staying on the keyboard without large coordinate variations. As shown in Fig. 12, 'fingertip 7', which presses a key, has the largest variation of coordinates along the y-axis. For the detected fingertip pressing a key, we further match the coordinate of the fingertip and the location of a key to locate the keystroke.
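The frame-skipping check and the localization rule can be summarized by the following sketch, operating on fingertip coordinates that have already been transformed into frame k's coordinate system (tips_j, tips_k1 and tips_k correspond to frames j, k − 1 and k; key_boxes is a hypothetical mapping from characters to axis-aligned key rectangles, whereas DynaKey matches against the tracked key regions):

import numpy as np

EPS_D = 15.0  # pixels, default from the text

def _dist(p, q):
    return float(np.hypot(p[0] - q[0], p[1] - q[1]))

def detect_and_locate(tips_j, tips_k1, tips_k, j_had_keystroke, key_boxes):
    """Frames j, k-1 and k (k = j + 5); returns the pressed character or None."""
    new_keystroke = False
    for p_j, p_k1, p_k in zip(tips_j, tips_k1, tips_k):
        if _dist(p_j, p_k) < EPS_D:
            # fingertip kept resting on a key during the whole window:
            # only count it if frame j was not already handled as a keystroke
            new_keystroke |= not j_had_keystroke
        elif _dist(p_k1, p_k) < EPS_D:
            # fingertip moved since frame j but has settled by frame k
            new_keystroke = True
    if not new_keystroke:
        return None
    # Localization: the pressing fingertip shows the largest y-axis variation.
    dy = [abs(pk[1] - pj[1]) for pj, pk in zip(tips_j, tips_k)]
    tip = tips_k[int(np.argmax(dy))]
    for ch, (x, y, w, h) in key_boxes.items():
        if x <= tip[0] <= x + w and y <= tip[1] <= y + h:
            return ch
    return None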
3) Adaptive Calibration: However, considering the possible errors in keystroke detection and localization, we introduce an adaptive calibration scheme for a better typing experience. Firstly, in the user interface, we keep the 'ADD' and 'DELETE' operations. If a typing operation is not detected, the user can use the 'ADD' button in the top-right corner of the user interface to input the character via the screen. If a typing operation is wrongly detected or located, the 'DELETE' button in the top-right corner of the user interface can be used to remove the character. Secondly, considering the language rules in regular text, we introduce the Bayesian method [33] to correct a wrong keystroke sequence. That is to say, given the keystroke sequence, we calculate the likelihood of each possible word and finally select the word with the largest likelihood. In this way, we can tolerate errors such as false-negative keystrokes, false-positive keystrokes and wrongly-detected keystrokes.
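As one possible realization of this word-level correction, the sketch below scores each candidate word under a simple noisy-channel assumption, i.e., a word prior multiplied by the probability of producing the observed key sequence; word_freq and the per-character confusion model p_obs are hypothetical inputs and are not part of DynaKey itself:

import math

def correct_word(observed, vocabulary, word_freq, p_obs):
    """Pick the word w maximizing P(w) * P(observed | w).
    p_obs(o, c) is the probability of observing key o when character c was intended."""
    best_word, best_score = observed, float('-inf')
    for w in vocabulary:
        if len(w) != len(observed):        # insertions/deletions ignored in this sketch
            continue
        score = math.log(word_freq.get(w, 1e-9))
        for o, c in zip(observed, w):
            score += math.log(max(p_obs(o, c), 1e-9))
        if score > best_score:
            best_word, best_score = w, score
    return best_word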
Algorithm 2: Keystroke detection and localization
Input: The consecutive frames. The ith fingertip in the jth frame is (x_i^{(j)}, y_i^{(j)}), which is transformed to (x_i^{(j)'}, y_i^{(j)'}) in the kth frame, k = j + 5.
  if √((x_i^{(j)'} − x_i^{(k)})² + (y_i^{(j)'} − y_i^{(k)})²) < ε_d then
    if the jth frame has no keystroke then
      Detect a new keystroke.
  else if √((x_i^{(k−1)'} − x_i^{(k)})² + (y_i^{(k−1)'} − y_i^{(k)})²) < ε_d then
    Detect a new keystroke.
  if a new keystroke is detected then
    Select the fingertip with the largest coordinate variation.
    Match the fingertip with a key by their coordinates.
Output: The located keystroke.

V. PERFORMANCE EVALUATION

We deploy DynaKey on a Samsung Galaxy S9 smartphone which is used as a head-mounted camera device, as shown in Fig. 2(a). The smartphone runs Android OS 9.0. We use a Microsoft HoloLens [13] keyboard layout and print it on a piece of A4-sized paper. Unless otherwise specified, the frame rate of the camera is set to 30 fps, the sampling rate of the gyroscope is set to 200 Hz, and the image size is set to 800 × 480 pixels. We conduct our experiments in an office environment. We recruit twelve volunteers to participate in the experiments and each subject types a set of pre-defined 1600 characters. Data sanitization is performed to ensure that no private or identity information is kept. We first evaluate the performance of key tracking and keystroke localization. Then we evaluate how camera jitters, frame sizes and frame rates affect the performance of key tracking and keystroke localization. We also evaluate the performance of DynaKey in complex scenarios to explore its usage modes. After that, we evaluate the latency and energy consumption of DynaKey. Finally, we evaluate DynaKey on text input, and compare DynaKey with the state-of-the-art text input methods.

A. Performance Metrics

To measure the accuracy of key tracking, we use E_r = (1/z) Σ_{i=1}^{z} √((x_{m_i} − x_{g_i})² + (y_{m_i} − y_{g_i})²) to represent the average pixel deviation between the calculated coordinates of the cross points forming the keyboard layout and the ground truth, and IoU = (1/n) Σ_{i=1}^{n} IoU_i to represent the average intersection over union [25] between the calculated keys' areas and the ground truth. The smaller E_r the better, and the larger IoU the better. Here, (x_{m_i}, y_{m_i}) and (x_{g_i}, y_{g_i}) represent the calculated coordinate and the ground truth of the ith cross point, respectively, and IoU_i represents the intersection over union between the calculated ith key's area and its ground truth. z = 55 is the number of cross points, while n = 40 is the number of keys, as shown in Fig. 5(a). We obtain the ground truth by manually detecting the pre-marked coordinates of cross points and keys in each image frame. To measure the performance of keystroke localization, we use several metrics: localization accuracy, localization error, false positive rate (FPR) and false negative rate (FNR). The localization accuracy is the ratio of correctly located keystrokes to the number of keystrokes performed by the subject. The localization error is the ratio of falsely located keystrokes to the number of keystrokes performed by the subject. FPR and FNR are defined as the ratio of falsely detected keystrokes and missed keystrokes to the number of keystrokes performed by the subject, respectively.
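For reference, the two key-tracking metrics can be computed as in the following sketch (keys are approximated here by axis-aligned (x, y, w, h) rectangles, which is a simplification of the key regions used in the evaluation):

import numpy as np

def pixel_deviation(calc_pts, gt_pts):
    """E_r: mean Euclidean deviation between calculated and ground-truth
    cross points, both given as (z, 2) arrays."""
    return float(np.mean(np.linalg.norm(calc_pts - gt_pts, axis=1)))

def mean_iou(calc_keys, gt_keys):
    """Mean IoU over keys; each key is an axis-aligned (x, y, w, h) box here."""
    ious = []
    for (x1, y1, w1, h1), (x2, y2, w2, h2) in zip(calc_keys, gt_keys):
        ix = max(0.0, min(x1 + w1, x2 + w2) - max(x1, x2))
        iy = max(0.0, min(y1 + h1, y2 + h2) - max(y1, y2))
        inter = ix * iy
        union = w1 * h1 + w2 * h2 - inter
        ious.append(inter / union if union > 0 else 0.0)
    return float(np.mean(ious))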
Fig. 13. Performance of key tracking and keystroke localization: (a) key tracking accuracy over 100 frames; (b) keystroke localization performance with and without key tracking.

B. Accuracy of Key Tracking and Keystroke Localization

In this experiment, a subject is instructed to type on the keyboard in her/his own way. She/he may move her/his head naturally during the typing process. We evaluate the accuracy of key tracking by the aforementioned pixel deviation E_r and the average intersection over union IoU over 100 frames. As shown in Fig. 13(a), the pixel deviation E_r in an image ranges from 0 to 5 pixels, and the average pixel deviation E_r over the frames is less than 3 pixels. When compared with the key size, i.e., 45 × 25 pixels, a deviation of less than 3 pixels can be neglected. Meanwhile, the average IoU is above 93%, indicating that the area of each calculated key coincides with the ground truth to a high degree. To conclude, DynaKey accurately tracks the coordinate changes of keys across frames while tolerating head movements during the typing process.

To evaluate the performance of keystroke localization and tracking in dynamic scenes, we instruct a subject to press all the keys on the keyboard without and with the key tracking module. Fig. 13(b) shows that the keystroke localization accuracy without the key tracking module is only about 66.9%, while the localization error and false negative rate are also high. This may be mainly due to the mismatch between a key's location and its coordinates in dynamic camera views. With the key tracking module, the keystroke localization accuracy increases significantly, i.e., from 66.9% to 95.5%, while the localization error, false positive rate and false negative rate are 1.9%, 2.1% and 2.6%, respectively. The results demonstrate that DynaKey accurately locates keystrokes, and that the key tracking module plays a critical role in keystroke localization in dynamic scenes.
Fig. 14. Performance of key tracking under different ranges and speeds of jitters, frame sizes and frame rates: (a) key tracking vs. range of jitters; (b) key tracking vs. speed of jitters; (c) key tracking vs. frame sizes; (d) key tracking vs. sampling interval.
Fig. 15. Performance of keystroke localization under different ranges and speeds of jitters, frame sizes and frame rates: (a) keystroke localization vs. range of jitters; (b) keystroke localization vs. speed of jitters; (c) keystroke localization vs. frame sizes; (d) keystroke localization vs. sampling intervals.

C. Effect of Camera Jitters

In this experiment, we evaluate the performance of key tracking and keystroke localization under different camera jitters. Firstly, we change the range of camera jitters, i.e., from stationary, slight (1.28° ± 0.26°), obvious (6.8° ± 4.4°) to large (18.4° ± 3.7°). Here, stationary means the device keeps unchanged during typing, while the other settings mean different ranges of camera movements, which are controlled by attaching the device to a motor. The performance of key tracking and keystroke localization is shown in Fig. 14(a) and Fig. 15(a), respectively. The results show good performance of key tracking and keystroke localization under slight and obvious ranges of jitters. When the camera jitter is obvious, the average pixel deviation is less than 3 pixels, the average IoU achieves 92.3%, and the localization accuracy reaches 93.7%. When the camera suffers from large jitters, the performance of key tracking reduces clearly, and the localization accuracy drops to 89.1%.
This may be caused by the mismatch between the detected fingertip and the key's coordinates during large jitters. In addition, we evaluate the performance of DynaKey by changing the frequency of jitters, i.e., keeping the device stationary or moving it at low (0.04° ± 0.03°/s), medium (0.09° ± 0.06°/s) and high (0.2° ± 0.15°/s) speed, respectively. The subject types the same text as in the above experiment. As shown in Fig. 14(b) and Fig. 15(b), DynaKey tolerates low and medium camera jitters well. When the camera moves at medium speed, the average pixel deviation is 3.2 pixels, the IoU is 92.2%, and the keystroke localization accuracy reaches 93.3%. In case of high-frequency jitters, it is hard to guarantee that the updated coordinates of keys perfectly match the fingertip pressing the key, and the tracking and localization performance decreases: the pixel deviation increases to 4.5 pixels, the IoU decreases to 88.9%, and the false negative rate of keystroke localization increases to 7.5%. However, considering the normal or unconscious head/camera movements during a typing process, large or high-frequency jitters are rare, thus DynaKey performs well in typical cases. Besides, when using a camera with a higher frame rate to capture fine-grained camera movements, it is possible to mitigate the effect of large or high-frequency jitters.

D. Effect of Frame Sizes and Frame Rates

In this experiment, we evaluate how image sizes affect the performance of DynaKey. When the frame size is small, e.g., 480 × 320 pixels, the keyboard in the captured frame contains too few pixels to be extracted accurately, leading to poor performance. When the frame size increases to 800 × 480 pixels, the performance shows good results. When the frame size keeps increasing to 1280 × 720 pixels, the performance decreases slightly. This may be because the higher image resolution leads to the keyboard containing more pixels, resulting in a larger pixel deviation for key tracking. Besides, the higher image resolution also causes a higher image processing cost, which may be too slow to process each keystroke and leads to a higher false negative rate. In practice, to minimize latency and power consumption while guaranteeing the keystroke localization performance, the frame size is set to 800 × 480 pixels.