of keyboard coordinates in the image frames. As shown in Fig. 2(b), the paper keyboard is represented as K, and the captured keyboard from the camera view is K1. We take the case of rotating around the y-axis (marked in Fig. 2(a)) as an example of head (i.e., camera) movements. When the camera slightly rotates ∆θ = 5° around the y-axis anticlockwise, the image frame changes from the x-y plane to the x′-y′ plane, and the captured keyboard in the image frame changes to K2. Correspondingly, the location offset of the keyboard reaches (∆dx, ∆dy) = (78, 27) pixels, which can lead to a mismatch between coordinates and keys. As shown in the right part of Fig. 2(b), due to the camera movement, the captured keyboard in the current image is shown in blue, while that in the original image is shown in black. In the current frame, i.e., the blue keyboard, the user types the letter 'y'. When using the coordinates of keys from the original frame, the keystroke may be mismatched to the letter 'h'.

Observation 2. Extracting all keys from each image suffers from unavoidable occlusion by the hands and has an unacceptable processing cost. To track the coordinate changes of keys, an intuitive solution is to extract the keys from each image frame. However, hand occlusion is unavoidable, as shown in Fig. 2(b), so it is difficult to extract each key from the image frame accurately. Besides, considering the limited resources of a head-mounted device and the real-time requirement of text input, the processing cost of extracting keys from each image frame is high. Specifically, we use ∆t to denote the processing cost of key extraction from an image frame, i.e., processing an input image and extracting all keys from it. Fig. 2(c) shows the cost of key extraction for 100 different frames. The result shows that the processing cost ∆t ranges from 40 ms to 60 ms, with an average of 49 ms, which is larger than the inter-frame duration (i.e., 33 ms). Therefore, extracting all keys from each image frame to track the coordinates of keys may be unacceptable for real applications. More time-efficient key tracking methods are needed.
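A minimal sketch of how this per-frame cost can be measured is shown below. It assumes OpenCV and a 30 fps camera, and extract_keys is a hypothetical stand-in for the full key-extraction step rather than DynaKey's actual implementation; if the average ∆t exceeds the 33 ms frame interval, per-frame extraction cannot keep up with the camera.

```python
import time
import cv2

FRAME_INTERVAL_MS = 1000.0 / 30.0  # ~33 ms between frames at 30 fps

def extract_keys(frame):
    # Hypothetical stand-in for full key extraction:
    # edge detection followed by a contour search over the whole frame.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    return contours

def measure_extraction_cost(frames):
    # Return the per-frame extraction cost (in ms) over a list of frames.
    costs = []
    for frame in frames:
        start = time.perf_counter()
        extract_keys(frame)
        costs.append((time.perf_counter() - start) * 1000.0)
    return costs

# Example: avg = sum(costs) / len(costs)
#          realtime_ok = avg < FRAME_INTERVAL_MS
```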
Observation 3. Head movements occur occasionally and last for several frames rather than all frames. According to Observation 2, extracting all keys from each image can hardly work. In fact, we find that performing key extraction in every frame is unnecessary. Although the user's head moves during typing, the ratio of the head movement duration to the whole typing duration is small. Fig. 2(d) shows that head movements cause the peaks in the gyroscope data during a typing process (i.e., 3 minutes); the total duration of the three head movements is less than 1 minute. This implies that, during the typing process, the coordinates of keys in the image frames remain unchanged for more than 67% of the time. Consequently, we only need to re-extract the coordinates of keys when head movements are detected, rather than performing key extraction in every frame.

Observation 4. A single frame from the camera view is insufficient to infer the depth of fingertips. To decide whether a keystroke is occurring, it is critical to determine whether a fingertip is pressing on a key. However, different from a front camera view, the camera view from top and behind can hardly capture the depth of an object, i.e., the perpendicular distance between the fingertip and the keyboard plane. As shown in Fig. 2(e), all fingers hover above the keyboard. Some fingertips appear above keys from the camera view, so non-keystrokes can easily be mistaken for keystrokes. To address this confusion, we may dynamically track the moving pattern of a fingertip and detect a keystroke from several frames instead of a single frame. We also need an efficient way to distinguish the fingertip pressing a key from the other fingertips.

IV. SYSTEM DESIGN

We now present the design of DynaKey, which provides a text-input scheme for a head-mounted camera device in dynamic scenes, as shown in Fig. 1. DynaKey works in realistic scenarios where a user types on a virtual keyboard with natural head movements. The keyboard layout can be printed on a piece of paper or drawn on a desk surface. Unless otherwise specified, we use an Android smartphone as the head-mounted camera device, where the embedded camera is used to capture the user's typing behaviors and then track and locate keystrokes, and the embedded gyroscope is used to detect head movements. The keyboard layout is printed on a piece of paper, as shown in Fig. 2(a).

A. System Overview

Fig. 3 shows the framework of DynaKey. The inputs are the image frames captured by the camera and the angular velocity collected by the gyroscope, while the output is the character of the pressed key. Initially, the user keeps the head still and moves the hands out of the camera view for about 3 seconds, while DynaKey uses Key Tracking to detect the keyboard and extract each key from the initial image frame. When the screen shows "Please TYPE", the user begins typing. During the typing process, we use Key Tracking to select keypoints of the images and transform the coordinates of keys between frames. At the same time, we use Adaptive Tracking to analyze the angular velocity from the gyroscope to detect head (i.e., camera) movements, and then determine whether to update the coordinates of keys. In addition, DynaKey uses Fingertip Detection to segment the hand region from the frame and detect the fingertips. After that, we use Keystroke Detection and Localization to detect whether a keystroke has occurred and to locate it. To ensure that DynaKey works in real time, we adopt three threads to run image capturing, image processing (i.e., key tracking, fingertip detection, and keystroke detection and localization), and adaptive tracking in parallel.
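To make the Adaptive Tracking decision concrete, the sketch below flags a head movement whenever the magnitude of the gyroscope's angular velocity exceeds a threshold and only then triggers an update of the key coordinates. The threshold value and the update_keys callback are assumptions for illustration, not the parameters or interfaces used in DynaKey.

```python
import math

# Assumed threshold (rad/s) separating deliberate head turns from small jitter;
# in practice it would be calibrated from gyroscope traces such as Fig. 2(d).
GYRO_THRESHOLD = 0.3

def head_moving(wx, wy, wz, threshold=GYRO_THRESHOLD):
    # A head (camera) movement shows up as a peak in angular velocity.
    return math.sqrt(wx * wx + wy * wy + wz * wz) > threshold

def process_frame(frame, gyro_sample, keys, update_keys):
    # Update key coordinates only when the gyroscope reports a head movement;
    # `update_keys` is a hypothetical callback that re-derives the coordinates
    # (e.g., by keypoint-based coordinate transformation) for the new view.
    if head_moving(*gyro_sample):
        keys = update_keys(frame, keys)
    return keys
```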
B. Key Tracking

Before typing, we first need to extract the keys from the image. With possible head movements, i.e., camera view changes, we then need to track the coordinates of the keys in the following frames, as mentioned in Observation 1 of Section III. Key tracking in DynaKey consists of key extraction and coordinate transformation, as described below.

1) Key Extraction: We adopt a common QWERTY keyboard layout, which is printed in black and white on a piece of paper, as shown in Fig. 5(a). Given the input image in Fig. 5(a), we use the Canny edge detection algorithm [10], [29] to obtain
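As a rough illustration of this step, the sketch below obtains an edge map with OpenCV's Canny detector and keeps key-sized contours as candidate key regions. The Canny thresholds, the blur kernel, and the area filter are assumed values for illustration; the paper only states that Canny edge detection [10], [29] is applied to the input image.

```python
import cv2

def extract_key_regions(image_bgr, low=50, high=150):
    # Detect edges on the printed keyboard and return candidate key regions.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)      # suppress paper texture
    edges = cv2.Canny(blurred, low, high)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    key_boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if 200 < w * h < 5000:                       # keep only key-sized regions (assumed bounds)
            key_boxes.append((x, y, w, h))
    return edges, key_boxes

# Usage: edges, keys = extract_key_regions(cv2.imread("keyboard.jpg"))
```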