CamK: Camera-Based Keystroke Detection and Localization for Small Mobile Devices

Yafeng Yin, Member, IEEE, Qun Li, Fellow, IEEE, Lei Xie, Member, IEEE, Shanhe Yi, Ed Novak, and Sanglu Lu, Member, IEEE

Abstract—Because of the smaller size of mobile devices, text entry with on-screen keyboards becomes inefficient. Therefore, we present CamK, a camera-based text-entry method, which can use a panel (e.g., a piece of paper) with a keyboard layout to input text into small devices. With the built-in camera of the mobile device, CamK captures images during the typing process and utilizes image processing techniques to recognize the typing behavior, i.e., extract the keys, track the user's fingertips, and detect and locate keystrokes. To achieve high accuracy of keystroke localization and a low false positive rate of keystroke detection, CamK introduces initial training and online calibration. To reduce the time latency, CamK optimizes computation-intensive modules by changing image sizes, focusing on target areas, introducing multiple threads, and removing image write/read operations. Finally, we implement CamK on mobile devices running Android. Our experimental results show that CamK can achieve above 95 percent accuracy in keystroke localization, with only a 4.8 percent false positive rate. When compared with on-screen keyboards, CamK achieves a 1.25X typing speedup for regular text input and a 2.5X speedup for random character input. In addition, we introduce word prediction to further improve the input speed for regular text by 13.4 percent.

Index Terms—Mobile text-entry, camera, keystroke detection and localization, small mobile devices

1 INTRODUCTION

In recent years, we have witnessed a rapid development of electronic devices and mobile technology. Mobile devices (e.g., smartphones, Apple Watch) have become smaller and smaller, in order to be carried everywhere easily, while avoiding carrying bulky laptops all the time.
However, the small size of the mobile device brings many new challenges; a typical example is inputting text into the small mobile device without a physical keyboard.

In order to get rid of the constraint of bulky physical keyboards, many virtual keyboards have been proposed, e.g., wearable keyboards, on-screen keyboards, projection keyboards, etc. However, wearable keyboards introduce additional equipment like rings [1] and gloves [2]. On-screen keyboards [3], [4] usually take up a large area on the screen and support only a single finger for text entry; typing on a small screen becomes inefficient. Projection keyboards [5], [6] often need a visible light projector or lasers to display the keyboard. To remove the additional hardware, audio-signal [7] and camera-based virtual keyboards [8], [9] have been proposed. However, UbiK [7] requires the user to click keys with their fingertips and nails, while the existing camera-based keyboards either slow the typing speed [8] or must be used in controlled environments [9]. It is difficult for the existing schemes to provide a user experience similar to that of physical keyboards.

To provide a PC-like text-entry experience, we propose CamK, a camera-based keyboard that offers a more natural and intuitive text-entry method. As shown in Fig. 1, CamK works with the front-facing camera of the mobile device and a paper keyboard. CamK takes pictures as the user types on the paper keyboard, and uses image processing techniques to detect and locate keystrokes. Then, CamK outputs the corresponding character of the pressed key. CamK can be used in a wide variety of scenarios, e.g., the office, coffee shops, outdoors, etc. However, to make CamK work well, we need to solve the following key technical challenges.

(1) Location Deviation: On a paper keyboard, the inter-key distance is only about two centimeters [7]. With image processing techniques, there may exist a position deviation between the real fingertip and the detected fingertip.
This deviation may lead to localization errors of keystrokes. To address this challenge, CamK introduces initial training to get the optimal parameters for image processing. Then, CamK uses an extended region to represent the detected fingertip, to tolerate the position deviation. Besides, CamK utilizes the features of a keystroke (e.g., the fingertip is located in the key for a certain duration, the pressed key is partially obstructed by the fingertip, etc.) to verify the validity of a keystroke.

Y. Yin, L. Xie, and S. Lu are with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. E-mail: {yafeng, lxie, sanglu}@nju.edu.cn.
Q. Li and S. Yi are with the Department of Computer Science, College of William and Mary, Williamsburg, VA 23187. E-mail: {liqun, syi}@cs.wm.edu.
E. Novak is with the Computer Science Department, Franklin and Marshall College, Lancaster, PA 17604. E-mail: enovak@fandm.edu.

Manuscript received 3 Feb. 2017; revised 24 Dec. 2017; accepted 15 Jan. 2018. Date of publication 25 Jan. 2018; date of current version 29 Aug. 2018. (Corresponding author: Lei Xie.) Digital Object Identifier no. 10.1109/TMC.2018.2798635. IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 17, NO. 10, OCTOBER 2018.

(2) False Positives: A false positive occurs when a non-keystroke (i.e., a period in which no fingertip is pressing any key) is recognized as a keystroke. Without the assistance of
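The extended-region idea in challenge (1) can be illustrated with a short sketch. This is a minimal illustration only, not CamK's actual implementation; the key layout, region size, and function name are assumed for the example:

```python
def key_under_fingertip(tip, keys, margin=8):
    """Return keys whose area intersects an extended square region
    around the detected fingertip, tolerating position deviation.

    tip    -- (x, y) detected fingertip position in image coordinates
    keys   -- dict mapping a character to its bounding box (x1, y1, x2, y2)
    margin -- half-width of the extended region, in pixels (assumed value)
    """
    x, y = tip
    rx1, ry1, rx2, ry2 = x - margin, y - margin, x + margin, y + margin
    candidates = []
    for ch, (kx1, ky1, kx2, ky2) in keys.items():
        # Axis-aligned rectangle intersection test.
        if rx1 <= kx2 and rx2 >= kx1 and ry1 <= ky2 and ry2 >= ky1:
            candidates.append(ch)
    return candidates

# A fingertip detected on the border between two keys keeps both as
# candidates, so a small detection deviation does not discard the true key.
keys = {'F': (0, 0, 40, 40), 'G': (42, 0, 82, 40)}
print(key_under_fingertip((41, 20), keys))  # → ['F', 'G']
```

A candidate that survives this test would still need the other keystroke features (duration, obstructed key area) to be confirmed as a real keystroke.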
other resources like audio signals, CamK should detect keystrokes only with images. To address this challenge, CamK combines keystroke detection with keystroke localization. For a potential keystroke, if there is no valid key pressed by the fingertip, CamK will discard the keystroke and recognize it as a non-keystroke. Additionally, CamK introduces online calibration, i.e., using the movement features of the fingertip after a keystroke, to further decrease the false positive rate.

(3) Processing Latency: To serve as a text-entry method, when the user presses a key on the paper keyboard, CamK should output the character of the key without any noticeable latency. However, due to the limited computing resources of small mobile devices, the heavy computation overhead of image processing will lead to a large latency. To address this challenge, CamK optimizes the computation-intensive modules by adaptively changing image sizes, focusing on the target area in the large-size image, adopting multiple threads, and removing the operations of writing/reading images.

We make the following contributions in this paper (a preliminary version of this work appeared in [10]).

- We design a practical framework for CamK, which operates using a smart mobile device camera and a portable paper keyboard. Based on image processing, CamK can detect and locate keystrokes with high accuracy and a low false positive rate.
- We realize real-time text entry for small mobile devices with limited resources, by optimizing the computation-intensive modules. Additionally, we introduce word prediction to further improve the input speed and reduce the error rate.
- We implement CamK on smartphones running Android. We first evaluate each module in CamK. Then, we conduct extensive experiments to test the performance of CamK. After that, we compare CamK with other methods in input speed and error rate.
2 RELATED WORK

Considering the small sizes of mobile devices, many virtual keyboards have been proposed for text entry, e.g., wearable keyboards, on-screen keyboards, projection keyboards, camera-based keyboards, etc.

Wearable Keyboards. Wearable keyboards sense and recognize the typing behavior based on sensors built into rings [1], [11], gloves [12], and so on. TypingRing [13] utilizes the embedded sensors of a ring to input text. Finger-Joint keypad [14] works with a glove equipped with pressure sensors. The Senseboard [2] consists of two rubber pads and senses the movements in the palm to get keystrokes. Funk et al. [15] utilize a touch-sensitive wristband to enter text based on the location of the touch. These wearable keyboards often need the user to wear devices around the hands or fingers, thus degrading the user experience.

On-Screen Keyboards. On-screen keyboards allow the user to enter characters on a touch screen. Considering the limited area of the keyboard on the screen, BigKey [3] and ZoomBoard [4] adaptively change the size of keys. ContextType [16] leverages hand postures to improve mobile touch screen text entry. Kwon et al. [17] introduce a regional error correction method to reduce the number of necessary touches. ShapeWriter [18] recognizes a word based on the trace over successive letters in the word. Sandwich keyboard [19] affords ten-finger touch typing by utilizing a touch sensor on the back side of a device. Usually, on-screen keyboards occupy the screen area and support only one finger for typing. Besides, the user often needs to switch between different screens to type letters, digits, punctuation, etc.

Projection Keyboards. Projection keyboards usually need a visible light projector or lasers to cast a keyboard, and then utilize image processing methods [5] or infrared light [6] to detect the typing events. Hu et al.
use a pico-projector to project the keyboard on the table, and then detect the touch interaction by the distortion of the keyboard projection [20]. Roeber et al. utilize a pattern projector to display the keyboard layout on a flat surface, and then detect keyboard events based on the intersection of fingers and infrared light [21]. Projection keyboards thus often require extra equipment, e.g., a visible light projector, infrared light modules, etc., which increases the cost and the inconvenience of text entry.

Camera-Based Keyboards. Camera-based virtual keyboards use captured images [22] or video [23] to infer the keystroke. Gesture keyboard [22] gets the input by recognizing the finger's gesture. It works without a keyboard layout, so the user needs to remember the mapping between the keys and the finger's gestures. Visual Panel [8] works with a printed keyboard on a piece of paper. It requires the user to use only one finger and to wait for one second before each keystroke. Malik et al. present the Visual Touchpad [24] to track the 3D positions of the fingertips based on two downward-pointing cameras in a stereo configuration. Adajania et al. [9] detect the keystroke based on shadow analysis with a standard web camera. Hagara et al. estimate the finger positions and detect clicking events based on edge detection, fingertip localization, etc. [25]. The iPhone app Paper Keyboard [26] only allows the user to use one finger to input letters. The above research work usually focuses on detecting and tracking the fingertips, instead of locating the fingertip in a key's area of the keyboard, which is studied in our paper.

In addition to the above text-entry solutions, MacKenzie et al. [27] describe text entry for mobile computing. Zhang et al. [28] propose Okuli to locate the user's finger based on visible light communication modules, LEDs, and light sensors. Wang et al.
[7] propose UbiK to locate the keystroke based on audio signals. The existing work usually needs extra equipment, allows only one finger to type, or requires changing the user's typing behavior, making it difficult to provide a PC-like text-entry experience. In this paper, we propose a text-entry method based on the built-in camera of the mobile device and a paper keyboard, to provide a user experience similar to that of physical keyboards.

Fig. 1. A typical use case of CamK.
3 FEASIBILITY STUDY AND OVERVIEW OF CAMK

In order to show the feasibility of locating keystrokes based on image processing techniques, we first present the observations of a keystroke from the camera's view. After that, we describe the system overview of CamK.

3.1 Observations of a Keystroke

In Fig. 2, we show the frames/images captured by the camera during two consecutive keystrokes. The origin of axes is located in the top left corner of the image, as shown in Fig. 2a. The hand located in the left area of the image is called the left hand, while the other is called the right hand, as shown in Fig. 2b. From left to right, the fingers are called finger i in sequence, i ∈ [1, 10], as shown in Fig. 2c. The fingertip pressing the key is called the StrokeTip, while the pressed key is called the StrokeKey, as shown in Fig. 2d.

When the user presses a key, i.e., a keystroke occurs, the StrokeTip and StrokeKey often have the following features, which can be used to track, detect, and locate the keystroke.

(1) Coordinate position: The StrokeTip usually has the largest vertical coordinate among the fingers on the same hand, because the user tends to stretch out one finger when typing a key. An example is finger 9 in Fig. 2a. Considering the particularity of thumbs, this feature may not be suitable for them. Therefore, we detect the StrokeTip separately for thumbs and the other fingertips.

(2) Moving state: The StrokeTip stays on the StrokeKey for a certain duration in a typing operation, as finger 2 shows in Figs. 2c and 2d. If the position of the fingertip keeps unchanged, a keystroke may happen.

(3) Correlated location: The StrokeTip is located in the StrokeKey, in order to press that key, such as finger 9 shown in Fig. 2a and finger 2 shown in Fig. 2d.

(4) Obstructed view: The StrokeTip obstructs the StrokeKey from the view of the camera, as shown in Fig. 2d.
The ratio of the visually obstructed area to the whole area of the key can be used to verify whether the key is really pressed.

(5) Relative distance: The StrokeTip usually achieves the largest vertical distance from the remaining fingertips of the same hand. This is because the user usually stretches out the finger to press a key. Thus the feature can be used to infer which hand generates the keystroke. In Fig. 2a, the vertical distance dr between the StrokeTip (i.e., finger 9) and the remaining fingertips of the right hand is larger than that (dl) of the left hand. Thus we choose finger 9 as the StrokeTip from the two hands, instead of finger 2.

3.2 System Overview

As shown in Fig. 1, CamK works with a mobile device and a paper keyboard. The device uses the front-facing camera to capture the typing process, while the paper keyboard is placed on a flat surface located in the camera's view. We take Fig. 1 as an example to describe the deployment. In Fig. 1, the mobile device is a Samsung N9109W smartphone, l denotes the distance between the device and the printed keyboard, and α denotes the angle between the plane of the device's screen and that of the keyboard. In Fig. 1, we set l = 13.5 cm and α = 90° to make the letter keys large enough in the camera's view. In fact, there are no strict requirements on the values of the above parameters, especially since the position of the camera varies across devices. In Fig. 1, when we fix the A4-sized paper keyboard, l can range in [13.5 cm, 18.0 cm], while α can range in [78.8°, 90.0°]. Even if some part of the keyboard is out of the camera's view, CamK still works.

The architecture of CamK is shown in Fig. 3. The input is the image taken by the camera and the output is the character of the pressed key. Before a user begins typing, CamK uses Key Extraction to detect the keyboard and extract each key from the image. When the user types, CamK uses Fingertip Detection to extract the user's hands and detect their fingertips.
Based on the movements of the fingertips, CamK uses Keystroke Detection and Localization to detect a possible keystroke and locate it. Finally, CamK uses Text Output to output the character of the pressed key.

4 SYSTEM DESIGN
According to Fig. 3, CamK consists of four components: key extraction, fingertip detection, keystroke detection and localization, and text output. Since text output is easy to implement, we mainly describe the first three components.

Fig. 2. Frames during two consecutive keystrokes.
Fig. 3. Architecture of CamK.
4.1 Key Extraction
Without loss of generality, CamK adopts the common QWERTY keyboard layout, which is printed in black and white on a piece of paper, as shown in Fig. 1. In order to eliminate the effects of the background, we first detect the boundary of the keyboard. Then, we extract each key from the keyboard. Therefore, key extraction contains three parts: keyboard detection, key segmentation, and mapping the characters to the keys, as shown in Fig. 3.

4.1.1 Keyboard Detection
We use the Canny edge detection algorithm [29] to obtain the edges of the keyboard. Fig. 4b shows the edge detection result of Fig. 4a. However, the interference edges (e.g., the paper's edge, the longest edge in Fig. 4b) should be removed. Based on Fig. 4b, the edges of the keyboard should be close to the edges of keys. We use this feature to remove pitfall edges; the result is shown in Fig. 4c. Additionally, we adopt the dilation operation [30] to join dispersed edge points which are close to each other, to get better edges/boundaries of the keyboard. After that, we use the Hough transform [8] to detect the lines in Fig. 4c. Then, we use the uppermost line and the bottom line to describe the position range of the keyboard, as shown in Fig. 4d. Similarly, we can use the Hough transform [8] to detect the left/right edge of the keyboard. If there are no suitable edges detected by the Hough transform, it is usually because the keyboard is not perfectly located in the camera's view. In this case, we simply use the left/right boundary of the image to represent the left/right edge of the keyboard. As shown in Fig. 4e, we extend the four edges (lines) to get four intersections B1(x1, y1), B2(x2, y2), B3(x3, y3), B4(x4, y4), which are used to describe the boundary of the keyboard.

4.1.2 Key Segmentation
Considering the short interference edges generated by the edge detection algorithm, it is difficult to accurately segment each key from the keyboard with the detected edges.
Consequently, we utilize the color difference between the white keys and the black background, together with the area of a key, for key segmentation, to reduce the effect of pitfall areas.
First, we introduce color segmentation to distinguish the white keys and the black background. For convenience of image processing, we represent the color in YCrCb space. In YCrCb space, the color coordinate (Y, Cr, Cb) of a white pixel is (255, 128, 128), while that of a black pixel is (0, 128, 128). Thus, we only compute the difference in the Y value between the pixels to distinguish the white keys from the black background. If a pixel is located in the keyboard while satisfying 255 − ε_y ≤ Y ≤ 255, the pixel belongs to a key. The offset ε_y ∈ N of Y is mainly caused by light conditions. ε_y can be estimated in the initial training (see Section 5.1). The initial/default value of ε_y is 50.
When we obtain the white pixels, we need to get the contours of the keys and separate the keys from one another. To avoid pitfall areas, such as small white areas which do not belong to any key, we introduce the area of a key. Based on Fig. 4e, we first use B1, B2, B3, B4 to calculate the area S_b of the keyboard as S_b = (1/2) · (|B1B2→ × B1B4→| + |B3B4→ × B3B2→|). Then, we calculate the area of each key. We use N to represent the number of keys in the keyboard. Considering the size difference between keys, we treat larger keys (e.g., the space key) as multiple regular keys (e.g., A-Z, 0-9). For example, the space key is treated as five regular keys. In this way, we change N to N_avg. Then, we can estimate the average area of a regular key as S_b / N_avg. In addition to the size difference between keys, the camera's view can also affect the area of a key in the image. Therefore, we introduce a_l, a_h to describe the range of a valid area S_k of a key as S_k ∈ [a_l · S_b / N_avg, a_h · S_b / N_avg]. We set a_l = 0.15, a_h = 5 in CamK, based on extensive experiments. The key segmentation result of Fig. 4e is shown in Fig. 4f.
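The keyboard-area computation and the valid-key-area test above can be sketched in a few lines. A minimal illustration in pure Python; the function names and the corner ordering are our own assumptions, not CamK's actual implementation:

```python
def cross(o, a, b):
    """2D cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def keyboard_area(b1, b2, b3, b4):
    """S_b = 1/2 * (|B1B2 x B1B4| + |B3B4 x B3B2|): the quadrangle
    B1B2B3B4 split into two triangles."""
    return 0.5 * (abs(cross(b1, b2, b4)) + abs(cross(b3, b4, b2)))

def is_valid_key_area(s_k, s_b, n_avg, a_l=0.15, a_h=5.0):
    """Keep a segmented blob only if its area lies in
    [a_l * S_b / N_avg, a_h * S_b / N_avg]."""
    avg = s_b / n_avg
    return a_l * avg <= s_k <= a_h * avg
```

For a 10 x 10 keyboard with 50 regular-key equivalents, the average key area is 2, so a blob of area 2 passes the test while a speck of area 0.1 is rejected.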
Then, we use the location of the space key (the biggest key) to locate the other keys, based on the relative locations between keys.

Fig. 4. Keyboard detection and key extraction.

4.2 Fingertip Detection
After extracting the keys, we need to track the fingertips to detect and locate the keystrokes. To achieve this goal, we should first detect the fingertips with hand segmentation and fingertip discovery, as shown below.

4.2.1 Hand Segmentation
Skin segmentation [30] is often used for hand segmentation. In the YCrCb color space, a pixel (Y, Cr, Cb) is determined to be a skin pixel if it satisfies Cr ∈ [133, 173] and Cb ∈ [77, 127]. However, the threshold values of Cr and Cb can be affected by the surroundings, such as lighting conditions. It is difficult to choose suitable threshold values for Cr and Cb. Therefore, we combine Otsu's method [31] and the red channel of the YCrCb color space for skin segmentation.
In the YCrCb color space, the red channel Cr is essential to human skin color. Therefore, with a captured image, we use the grayscale image split based on the Cr channel as the input for Otsu's method [31]. Otsu's method can automatically perform clustering-based image thresholding, i.e., calculate the optimal threshold to separate the foreground and background. The hand segmentation result of Fig. 5a is shown in Fig. 5b, where the white regions represent the hand regions with high values in the Cr channel, while the black regions represent the background. However, around the hands, there exist some interference regions, which may change the contours of fingers, resulting in detecting wrong
fingertips. Thus, CamK introduces the following erosion and dilation operations [32]. We first use the erosion operation to isolate the hands from the keys and separate each finger. Then, we use the dilation operation to smooth the edges of the fingers. Fig. 5c shows the optimized result of hand segmentation. After that, we select the top two segmented areas as hand regions, i.e., the left hand and the right hand, to further reduce the effect of interference regions, such as the red areas in Fig. 5c.

4.2.2 Fingertip Discovery
After we extract the fingers, we will detect the fingertips. We can differentiate between the thumbs (i.e., fingers 5-6 in Fig. 2c) and non-thumbs (i.e., fingers 1-4, 7-10 in Fig. 2c) in shape and typing movement, as shown in Fig. 6.
In a non-thumb, the fingertip is usually a convex vertex, as shown in Fig. 6a. For a point P_i(x_i, y_i) located in the contour of a hand, by tracing the contour, we can select the point P_{i−q}(x_{i−q}, y_{i−q}) before P_i and the point P_{i+q}(x_{i+q}, y_{i+q}) after P_i. Here, i, q ∈ N. We calculate the angle θ_i between the two vectors P_iP_{i−q}→ and P_iP_{i+q}→, according to Eq. (1). In order to simplify the calculation of θ_i, we map θ_i into the range θ_i ∈ [0°, 180°]. If θ_i ∈ [θ_l, θ_h], θ_l < θ_h, P_i may be located in a fingertip. Besides, as the fingertip is a convex vertex, P_i should satisfy y_i > y_{i−q} and y_i > y_{i+q}. Otherwise, P_i will not be a candidate vertex. If there are multiple candidate vertexes, such as P′_i in Fig. 6a, we will choose the vertex having the largest vertical coordinate, because it has a higher probability of being a fingertip, as P_i shown in Fig. 6a. Here, the largest vertical coordinate means the local maximum in a finger's contour, such as the red circle shown in Fig. 5e. The range of a finger's contour can be limited by Eq. (1), i.e., the angle feature of a finger. Based on extensive experiments, we set θ_l = 60°, θ_h = 150°, q = 20 in this paper:

θ_i = arccos( (P_iP_{i−q}→ · P_iP_{i+q}→) / (|P_iP_{i−q}→| · |P_iP_{i+q}→|) ).  (1)

In a thumb, the "fingertip" also means a convex vertex of the finger. Thus we still use Eq. (1) to represent the shape of the fingertip in a thumb.
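Eq. (1) together with the convex-vertex conditions can be sketched as follows. This is a simplified illustration with our own function names, using θ_l = 60° and θ_h = 150° as above; image coordinates are assumed, so a larger y means a lower point:

```python
import math

def angle_deg(p_prev, p, p_next):
    """theta_i = arccos((v1 . v2) / (|v1||v2|)), Eq. (1), with
    v1 = P_i P_{i-q} and v2 = P_i P_{i+q}; result in [0, 180] degrees."""
    v1 = (p_prev[0] - p[0], p_prev[1] - p[1])
    v2 = (p_next[0] - p[0], p_next[1] - p[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def is_fingertip_candidate(p_prev, p, p_next, th_l=60.0, th_h=150.0):
    """A non-thumb fingertip candidate: the angle lies in [th_l, th_h]
    and P_i lies below both neighbours (y grows downward in the image)."""
    th = angle_deg(p_prev, p, p_next)
    return th_l <= th <= th_h and p[1] > p_prev[1] and p[1] > p_next[1]
```

For example, a right-angle vertex at (5, 5) between (0, 0) and (10, 0) gives θ = 90° and passes both tests.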
However, the position of the convex vertex can be different from that of a non-thumb. As shown in Fig. 6b, the relative positions of P_{i−q}, P_i, P_{i+q} are different from those in Fig. 6a. In Fig. 6b, we show the thumb of the left hand. Obviously, P_{i−q}, P_i, P_{i+q} do not satisfy y_i > y_{i−q} and y_i > y_{i+q}. Therefore, we use (x_i − x_{i−q}) · (x_i − x_{i+q}) > 0 to describe the relative locations of P_{i−q}, P_i, P_{i+q} in thumbs. Then, we choose the vertex with the largest vertical coordinate in a finger's contour as the fingertip, as mentioned in the last paragraph.
In fingertip detection, we only need to detect the points located on the bottom edge (from the leftmost point to the rightmost point) of the hand, such as the blue contour of the right hand in Fig. 5d. The shape feature θ_i and the positions in vertical coordinates y_i along the bottom edge are shown in Fig. 5e. If we can detect five fingertips in a hand with θ_i and y_{i−q}, y_i, y_{i+q}, we assume that we have also found the thumb. At this time, the thumb presses a key like a non-thumb. Otherwise, we detect the fingertip of the thumb in the rightmost area of the left hand or the leftmost area of the right hand, according to θ_i and x_{i−q}, x_i, x_{i+q}. The detected fingertips of Fig. 5a are marked in Fig. 5f.

4.3 Keystroke Detection and Localization
After detecting the fingertips, we will track them to detect a possible keystroke and locate it for text entry. The keystroke is usually correlated with one or two fingertips; therefore we first select the candidate fingertip having a high probability of pressing a key, instead of detecting all fingertips, to reduce the computation overhead. Then, we track the candidate fingertip to detect the possible keystroke. Finally, we correlate the candidate fingertip with the pressed key to locate the keystroke.

4.3.1 Candidate Fingertip Selection in Each Hand
CamK allows the user to use all of their fingers for text entry, thus the keystroke may come from the left or right hand.
Based on the observations (see Section 3.1), the fingertip (i.e., the StrokeTip) pressing the key usually has the largest vertical coordinate in that hand, such as finger 9 shown in Fig. 2a. Therefore, we first select the candidate fingertip with the largest vertical coordinate in each hand. We respectively use C_l and C_r to represent the points located in the contours of the left hand and the right hand. For a point P_l(x_l, y_l) ∈ C_l, if P_l satisfies y_l ≥ y_j (∀ P_j(x_j, y_j) ∈ C_l, j ≠ l), then P_l will be selected as the candidate fingertip in the left hand. Similarly, we can get the candidate fingertip P_r(x_r, y_r) in the right hand. In this step, we only need to get P_l and P_r, instead of detecting all fingertips.

4.3.2 Keystroke Detection Based on Fingertip Tracking
As described in the observations, when the user presses a key, the fingertip will stay at that key for a certain duration.

Fig. 5. Fingertip detection.
Fig. 6. Features of a fingertip.
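The per-hand candidate selection then reduces to a maximum over the hand's contour points. A minimal sketch, with the contour represented as a plain list of (x, y) points (a simplification of ours):

```python
def candidate_fingertip(contour):
    """Return the contour point with the largest vertical coordinate.
    Since image y grows downward, this is the lowest point of the hand,
    i.e., the likely StrokeTip of that hand."""
    return max(contour, key=lambda p: p[1])
```

For instance, among the points (1, 4), (3, 9), (6, 2), the point (3, 9) is selected.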
Therefore, we can use the location variation of the candidate fingertip to detect a possible keystroke. In frame i, we use P_{l_i}(x_{l_i}, y_{l_i}) and P_{r_i}(x_{r_i}, y_{r_i}) to represent the candidate fingertips in the left hand and the right hand, respectively. If the candidate fingertips in frames [i−1, i] satisfy Eq. (2) for the left hand or Eq. (3) for the right hand, the corresponding fingertip will be treated as static, i.e., a keystroke probably happens. Based on extensive experiments, we set Δr = 5 empirically:

√((x_{l_i} − x_{l_{i−1}})² + (y_{l_i} − y_{l_{i−1}})²) ≤ Δr,  (2)

√((x_{r_i} − x_{r_{i−1}})² + (y_{r_i} − y_{r_{i−1}})²) ≤ Δr.  (3)

4.3.3 Keystroke Localization by Correlating the Fingertip with the Pressed Key
After detecting a possible keystroke, we correlate the candidate fingertip and the pressed key to locate the keystroke, based on the observations of Section 3.1. In regard to the candidate fingertips, we treat the thumb as a special case, and also select it as a candidate fingertip at first. Then, we get the candidate fingertip set C_tip = {P_{l_i}, P_{r_i}, left thumb in frame i, right thumb in frame i}. After that, we can locate the keystroke by using Algorithm 1.

Algorithm 1. Keystroke Localization
Input: Candidate fingertip set C_tip in frame i.
  Remove fingertips out of the keyboard from C_tip.
  for P_i ∈ C_tip do
    Obtain the candidate key set C_key around P_i.
    for K_j ∈ C_key do
      if P_i is located in K_j then
        Calculate the coverage ratio ρ_{k_j} of K_j.
        if ρ_{k_j} > ρ_L then ⟨P_i, K_j⟩ forms a possible keystroke.
    if no key in C_key forms a possible keystroke with P_i then
      Remove P_i from C_tip.
  if C_tip = ∅ then No keystroke occurs, return.
  if |C_tip| = 1 then Return the pressed key.
  Select ⟨P_i, K_j⟩ with the largest ratio ρ_{k_j} in each hand.
  Obtain ⟨P_l, K_l⟩ (⟨P_r, K_r⟩) in the left (right) hand.
  Calculate the relative distance d_l (d_r) in the left (right) hand.
  if d_l > d_r then Return K_l. else Return K_r.
Output: The pressed key.

Eliminating Impossible Fingertips. For convenience, we use P_i to represent a fingertip in C_tip, i.e., P_i ∈ C_tip, i ∈ [1, 4]. If a fingertip P_i is not located in the keyboard region, CamK eliminates it from the candidate fingertips C_tip.
Selecting the Nearest Candidate Keys. For each candidate fingertip P_i, we first search for the candidate keys which are probably pressed by P_i. As shown in Fig. 7a, although the real fingertip is P_i, the detected fingertip is P̂_i. We use P̂_i to search for the candidate keys. We use K_{c_j}(x_{c_j}, y_{c_j}) to represent the centroid of key K_j. Then we get the two rows of keys nearest to the location P̂_i(x̂_i, ŷ_i) (i.e., the rows with the two smallest |y_{c_j} − ŷ_i|). For each row, we select the two nearest keys (i.e., the keys with the two smallest |x_{c_j} − x̂_i|). In Fig. 7a, the candidate key set C_key consists of K1, K2, K3, K4. Fig. 8a shows the candidate keys of each fingertip.
Retaining Candidate Keys Containing the Candidate Fingertip. If a key is pressed by the user, the fingertip will be located in that key. Thus we use the location of the fingertip P̂_i(x̂_i, ŷ_i) to verify whether a candidate key contains the fingertip, to remove the invalid candidate keys. As shown in Fig. 7a, there exists a small deviation between the real fingertip and the detected fingertip. Therefore, we extend the range of the detected fingertip to R_i, as shown in Fig. 7a. If any point P_k(x_k, y_k) in the range R_i is located in a candidate key K_j, P̂_i is considered to be located in K_j. R_i is calculated as {P_k | √((x̂_i − x_k)² + (ŷ_i − y_k)²) ≤ Δr}. We set Δr = 5 empirically. As shown in Fig. 7b, a key is represented as a quadrangle ABCD.
If a point is located in ABCD, when we traverse ABCD clockwise, the point will be located on the right side of each edge of ABCD. As shown in Fig. 2a, the origin of coordinates is located in the top left corner of the image. Therefore, if a fingertip P ∈ R_i satisfies Eq. (4), it is located in the key. CamK will keep the key as a candidate key. Otherwise, CamK removes the key from the candidate key set C_key. In Fig. 7a, K1, K2 are the remaining candidate keys. The candidate keys containing the fingertips in Fig. 8a are shown in Fig. 8b:

AB→ × AP→ ≥ 0, BC→ × BP→ ≥ 0, CD→ × CP→ ≥ 0, DA→ × DP→ ≥ 0.  (4)

Fig. 7. Candidate keys and candidate fingertips: (a) candidate keys, (b) locating a fingertip.
Fig. 8. Candidate fingertips/keys in each step: (a) keys around the fingertip, (b) keys containing the fingertip, (c) visually obstructed key, (d) vertical distance with the remaining fingertips.

Calculating the Coverage Ratios of Candidate Keys. When a key is pressed, it is visually obstructed by the fingertip, as
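Eq. (4) is the standard same-side test with 2D cross products. A sketch in pure Python, assuming the corners are passed in traversal order (image origin at the top left, as in the text); function names are our own:

```python
def cross2(v, w):
    """2D cross product v x w."""
    return v[0] * w[1] - v[1] * w[0]

def point_in_key(p, a, b, c, d):
    """Eq. (4): P lies inside quadrangle ABCD iff it is on the same
    side of every edge when the corners are traversed in order."""
    corners = [a, b, c, d]
    for i in range(4):
        p0, p1 = corners[i], corners[(i + 1) % 4]
        edge = (p1[0] - p0[0], p1[1] - p0[1])
        to_p = (p[0] - p0[0], p[1] - p0[1])
        if cross2(edge, to_p) < 0:   # wrong side of one edge: outside
            return False
    return True
```

For a square key with corners (0, 0), (10, 0), (10, 10), (0, 10), the point (5, 5) is inside while (15, 5) is outside.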
the dashed area of key K1 shown in Fig. 7a. We use the coverage ratio to measure the visually obstructed area of a candidate key, in order to remove wrong candidate keys. For a candidate key Kj, whose area is S_kj, the visually obstructed area is D_kj, and its coverage ratio is r_kj = D_kj / S_kj. For a larger key (e.g., the space key), we update r_kj by multiplying a key size factor fj, i.e., r_kj = min(D_kj / S_kj · fj, 1), where fj = S_kj / S̄_k. Here, S̄_k means the average area of a key, i.e., S̄_k = S_b / N_avg. If r_kj ≥ r_l, the key Kj is still a candidate key. Otherwise, CamK removes it from the candidate key set C_key. We set r_l = 0.25 by default. For each hand, if there is more than one candidate key, we keep the key with the largest coverage ratio as the final candidate key. For a candidate fingertip, if there is no candidate key associated with it, the fingertip will be eliminated. Fig. 8c shows each candidate fingertip and its associated key.

4.3.4 Vertical Distance with Remaining Fingertips

Until now, there is at most one candidate fingertip in each hand. If there are no candidate fingertips, then no keystroke is detected. If there is only one candidate fingertip, then that fingertip is the StrokeTip and its associated key is the StrokeKey; together they represent the keystroke. However, if there are two candidate fingertips, we utilize the vertical distance between each candidate fingertip and the remaining fingertips of the same hand to choose the most probable StrokeTip, as shown in Fig. 2a. We use P_l(x_l, y_l) and P_r(x_r, y_r) to represent the candidate fingertips in the left hand and right hand, respectively. Then, we calculate the distance d_l between P_l and the remaining fingertips in the left hand, and the distance d_r between P_r and the remaining fingertips in the right hand. Here, d_l = |y_l − (1/4) · Σ_{j=1..5, j≠l} y_j|, while d_r = |y_r − (1/4) · Σ_{j=6..10, j≠r} y_j|. Here, y_j represents the vertical coordinate of fingertip j. If d_l > d_r, we choose P_l as the StrokeTip.
Otherwise, we choose P_r as the StrokeTip. The key associated with the StrokeTip is the pressed key, the StrokeKey. In Fig. 8d, we choose fingertip 3 in the left hand as the StrokeTip. However, considering the effect of the camera's view, sometimes d_l (d_r) may fail to locate the keystroke accurately. Therefore, for the unselected candidate fingertip (e.g., fingertip 8 in Fig. 8d), we do not discard its associated key directly. Specifically, we sort the previous candidate keys which contain the candidate fingertip by coverage ratio in descending order. Finally, we select the top four candidate keys and show them on the screen. The user can press a candidate key for text input (see Fig. 1), to tolerate the localization error.

5 OPTIMIZATIONS FOR KEYSTROKE LOCALIZATION AND IMAGE PROCESSING

Considering the deviation caused by image processing, the influence of light conditions, and other factors, we introduce initial training to select suitable values of the parameters used in image processing, and utilize online calibration to improve the performance of keystroke detection and localization. In addition, considering the limited resources of small mobile devices, we also introduce multiple optimization techniques to reduce the time latency and energy cost of CamK.

5.1 Initial Training

Optimal Parameters for Image Processing. For key segmentation (see Section 4.1.2), εy is used for tolerating the change of Y caused by the environment. Initially, εy = 50. CamK then updates εy_i = εy_{i-1} + 1; when the number of extracted keys decreases, it stops. Then, CamK resets εy to 50 and updates εy_i = εy_{i-1} − 1; when the number of extracted keys decreases, it stops again. The value εy_i at which CamK extracts the maximum number of keys during this process is selected as the optimal value for εy.

In hand segmentation, CamK uses erosion and dilation operations, which respectively use a kernel B [32] to process images.
To get a suitable size of B, the user first puts his/her hands on the home row of the keyboard (see Fig. 5a). For simplicity, we set the kernel sizes for erosion and dilation to be equal. The initial kernel size is z_0 = 0. Then, CamK updates z_i = z_{i-1} + 1. When CamK can localize each fingertip on the correct key with z_i, CamK sets the kernel size as z = z_i. In initial training, the user puts his/her hands on the keyboard following the on-screen instructions; the process usually takes less than 10 s.

Frame Rate Selection. CamK sets the initial/default frame rate of the camera to f_0 = 30 fps (frames per second), which is usually the maximal possible frame rate. We use n_0i to represent the number of frames containing the ith keystroke. When the user has pressed u keys, we can get the average number of frames during a keystroke as n̄_0 = (1/u) · Σ_{i=1..u} n_0i. In fact, n̄_0 reflects the duration of a keystroke. When the frame rate f changes, the number of frames in a keystroke, n_f, changes accordingly. Intuitively, a smaller value of n_f can reduce the image processing time, while a larger value of n_f can improve the accuracy of keystroke localization. Based on extensive experiments (see Section 7.3), we set n_f = 3; thus f = ⌈f_0 · n_f / n̄_0⌉.

5.2 Online Calibration

Removing False Positive Keystrokes. Under certain conditions, e.g., when the user does not type any key while keeping the fingers stationary, CamK may misclassify a non-keystroke as a keystroke. Thus we introduce a temporary character to mitigate this problem.

In the process of pressing a key, the StrokeTip moves towards the key, stays at that key, and then moves away. The vertical coordinate of the StrokeTip first increases, then pauses, then decreases. If CamK has detected a keystroke in n_f consecutive frames, it displays the current character on the screen as a temporary character. In the next frame(s), if the position of the StrokeTip does not satisfy the features of a keystroke, CamK will cancel the temporary character.
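The temporary-character mechanism just described can be sketched as a small state machine. This is a minimal sketch, not CamK's code: `looks_like_keystroke`, `char_of`, and the `moved_away` field are hypothetical stand-ins for CamK's keystroke-feature test and key localization.

```python
NF = 3  # n_f: consecutive keystroke frames needed before a character is shown

def type_with_temporary_chars(frames, looks_like_keystroke, char_of):
    """Sketch of the temporary-character calibration (Section 5.2).

    looks_like_keystroke(frame) and char_of(frame) are hypothetical
    stand-ins for CamK's keystroke test and key localization.
    Returns the characters left on the screen after all frames are seen.
    """
    screen = []        # characters currently shown on the screen
    streak = 0         # consecutive frames that look like a keystroke
    tentative = False  # is the newest character still only temporary?
    for f in frames:
        if looks_like_keystroke(f):
            streak += 1
            if streak == NF:               # seen in n_f consecutive frames:
                screen.append(char_of(f))  # display a temporary character
                tentative = True
        else:
            # keystroke over: if the fingertip never moved away from the
            # key, it was a false positive, so cancel the character
            if tentative and not f.get("moved_away", True):
                screen.pop()
            tentative = False
            streak = 0
    return screen
```

A real keystroke (pressed for n_f frames, then the fingertip retreats) keeps its character, while a finger that merely rests on a key has its temporary character cancelled.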
This does not have much impact on the user's experience, because of the short time between two consecutive frames. Besides, CamK also displays the candidate keys around the StrokeTip; the user can choose them for text input.

Movement of Smartphone or Keyboard. CamK presumes that the smartphone and the keyboard do not move while in use. For best results, we recommend that the user tape the paper keyboard on a flat surface. Nevertheless, to alleviate the effect caused by movements of the mobile device or the keyboard, we offer a simple solution. If the user continuously uses the Delete key on the screen multiple times (e.g., more than 3 times), CamK will inform the user to move
his/her hands away from the keyboard for relocation. After that, the user can continue the typing process. The relocation process only requires the user to move the hands away, and it usually takes less than 10 s.

5.3 Real Time Image Processing

As a text-entry method, CamK needs to output each character without noticeable latency. According to Section 4, in order to output a character, we need to capture images, track fingertips, and finally detect and locate the keystroke. A large time cost in image processing leads to a large time latency for text output. To solve this problem, we first profile the time cost of each stage in CamK, and then introduce four optimization techniques to reduce the time cost. Unless otherwise specified, the frame rate is set to 30 fps by default.

5.3.1 Time Cost in Different Stages

There are three main stages in CamK, i.e., capturing the images, tracking the fingertips, and locating the keystroke. The stages are respectively called 'Cap-img', 'Tra-tip' and 'Loc-key' for short. We first set the image size to 640*480 pixels, which is supported by many smartphones, and then measure the time cost of producing or processing one image in each stage with a Samsung GT-I9100 smartphone (GT-I9100 for short). According to the measurement, the time cost in 'Cap-img', 'Tra-tip', 'Loc-key' is 99, 118, 787 ms, respectively. Here, 'Cap-img' means the time for capturing an image, 'Tra-tip' means the time for processing the image to select the candidate fingertips, and 'Loc-key' means the time for processing an image to locate the keystroke. We repeat the measurement 500 times to get the average time cost.

According to Section 5.1, we need to capture/process three images during a keystroke to guarantee the localization accuracy. Thus we can estimate the minimal time T_k1 of detecting and locating a keystroke with Eq. (5).
Obviously, 1,320 ms is a very large latency; thus more optimizations are needed for CamK to realize real time processing:

T_k1 = (99 + 118 + 99 + 118 + 99 + 787) ms = 1320 ms. (5)

5.3.2 Adaptively Changing Image Sizes

As described before, it is rather time-consuming to process an image with 640*480 pixels. Intuitively, if CamK operates on smaller images, it reduces the time cost. However, to guarantee the keystroke localization accuracy, we cannot use a very small image. Therefore, we adaptively adopt different sizes of images for processing, as shown in Fig. 9. We use smaller images to track the fingertips between two keystrokes and larger images to locate the detected keystroke. Based on extensive experiments, we set the size of the small image to 120*90 pixels, while the large image size is 480*360 pixels. As shown in Fig. 9, when a keystroke is about to happen, CamK adaptively changes frame i to a large image (i.e., 480*360 pixels). After that, CamK changes the following frames to small images (i.e., 120*90 pixels) until the next keystroke is detected. The time cost in 'Cap-img (120*90 pixels)', 'Cap-img (480*360 pixels)', 'Tra-tip (120*90 pixels)', 'Loc-key (480*360 pixels)' is 32, 75, 9, 631 ms, respectively. Then, we can estimate the minimal time cost T_k2 to detect and locate a keystroke with Eq. (6). Here, T_k2 is 59.7 percent of T_k1. However, more optimizations are needed for large-size image processing:

T_k2 = (32 + 9 + 32 + 9 + 75 + 631) ms = 788 ms. (6)

5.3.3 Optimizing Large-Size Image Processing

Based on Fig. 8, the keystroke is only related to a small area of the large-size image. Thus we optimize CamK by only processing the small area around the StrokeTip: the red region of frame i shown in Fig. 9.
Suppose the position of the candidate fingertip in frame i−1 (small image) is P_{i-1}(x_c, y_c); then we scale the position to P'_{i-1}(x'_c, y'_c) (the corresponding position in the large image), according to the ratio r_sl of the small image to the large image in width, i.e., x_c / x'_c = y_c / y'_c = r_sl. Here, r_sl = 120/480. Then, CamK only processes the area S'_c in frame i (large image) to reduce the time cost, as shown in Eq. (7). We set Δx = 40, Δy = 20 by default:

S'_c = {P_i(x_i, y_i) | |x_i − x'_c| ≤ Δx, |y_i − y'_c| ≤ Δy}. (7)

Currently, the processing time for the large-size image is 339 ms. As the 'Image' row of Table 1 shows, we estimate the minimal time to detect and locate a keystroke with Eq. (8), which is 37.6 percent of that in Eq. (5). However, the processing time of 339 ms for the large-size image is still a little high. If CamK works with a single thread, it may miss the next keystroke, due to the large processing time. Thus, CamK is expected to work with multiple threads in parallel:

T_k3 = (32 + 9 + 32 + 9 + 75 + 339) ms = 496 ms. (8)

5.3.4 Multi-Thread Processing

According to the above conclusion, CamK uses three threads to capture, detect and locate the keystrokes in parallel. As shown in Fig. 10, the 'Cap-img' thread captures the images, the 'Tra-tip' thread processes the small images for keystroke detection, and the 'Loc-key' thread processes the large image to locate the keystroke. In this way, CamK will not miss the frames of the next keystroke, because the 'Cap-img' thread does not stop taking images. As shown in Fig. 10, CamK utilizes consecutive frames to determine the keystroke and also introduces the online calibration to improve the performance.

TABLE 1
Time Cost on the GT-I9100 Smartphone (Image Sizes in Pixels)

            Cap-img    Cap-img     Tra-tip    Loc-key
            (120*90)   (480*360)   (120*90)   (480*360)
Image       32 ms      75 ms       9 ms       339 ms
Reference   0.02 ms    0.02 ms     13 ms      106 ms

Fig. 9. Changing image sizes and focusing on the target area.
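The position scaling and window cropping of Eq. (7) can be sketched as follows. The window half-sizes Δx = 40, Δy = 20 and the 120*90 → 480*360 sizes come from the text above; the function name and clipping to image bounds are our own illustrative choices.

```python
SMALL_W, LARGE_W = 120, 480   # frame widths in pixels (small / large image)
R_SL = SMALL_W / LARGE_W      # r_sl = 120/480: small-to-large width ratio
DX, DY = 40, 20               # default window half-sizes from Eq. (7)

def roi_around_fingertip(xc, yc, large_w=480, large_h=360):
    """Scale the fingertip position P_{i-1}(xc, yc) from the small frame
    to the large frame, and return the clipped window S'_c that is
    processed in the large image (Eq. (7))."""
    xc2, yc2 = xc / R_SL, yc / R_SL        # P'_{i-1} in the large image
    x0, x1 = max(0, xc2 - DX), min(large_w, xc2 + DX)
    y0, y1 = max(0, yc2 - DY), min(large_h, yc2 + DY)
    return x0, x1, y0, y1

# A fingertip at (60, 45) in the 120*90 frame maps to (240, 180) in the
# 480*360 frame, so only the window x in [200, 280], y in [160, 200] is
# processed instead of the full 480*360 image.
```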
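The three-thread pipeline of Section 5.3.4 can be sketched as a queue-based producer/consumer chain. This is a minimal sketch, not CamK's implementation: the frame dictionaries and the keystroke/key fields are placeholders for the real detection and localization routines.

```python
import queue
import threading

small_q = queue.Queue()   # Cap-img -> Tra-tip: every captured frame
large_q = queue.Queue()   # Tra-tip -> Loc-key: frames around a keystroke
typed = []                # characters output by the Loc-key thread

def cap_img(frames):
    """Capture thread: never stops taking images, so no keystroke
    frames are missed while the other threads are busy."""
    for f in frames:
        small_q.put(f)
    small_q.put(None)  # end-of-stream sentinel

def tra_tip():
    """Detection thread: track fingertips on small images and forward
    suspected keystroke frames for localization."""
    while (f := small_q.get()) is not None:
        if f.get("keystroke"):      # placeholder keystroke test
            large_q.put(f)
    large_q.put(None)

def loc_key():
    """Localization thread: process the (large) keystroke frames."""
    while (f := large_q.get()) is not None:
        typed.append(f["key"])      # placeholder key localization

frames = [{"keystroke": False}, {"keystroke": True, "key": "h"},
          {"keystroke": False}, {"keystroke": True, "key": "i"}]
threads = [threading.Thread(target=cap_img, args=(frames,)),
           threading.Thread(target=tra_tip),
           threading.Thread(target=loc_key)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# typed is now ['h', 'i']
```

The queues decouple the stages, which is what lets the slow 'Loc-key' stage (339 ms) run while 'Cap-img' keeps capturing at the frame interval of 33 ms.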
By adopting multiple threads, we can estimate the minimal time to detect and locate the keystroke with Eq. (9), which is 36.4 percent of that in Eq. (5). Because the frame rate is 30 fps, the interval between two frames is 33 ms. Therefore, we use 33 ms to replace 32 ms (the 'Cap-img' time). The 'Tra-tip' time (9 ms) is simultaneous with the 'Cap-img' time of the next frame, and is thus not added in Eq. (9). Compared with Eq. (8), Eq. (9) does not reduce much processing time. This is mainly caused by the time-consuming operations of writing and reading each image. Therefore, it is better to eliminate the operations of writing/reading images:

T_k4 = (33 + 33 + 75 + 339) ms = 480 ms. (9)

5.3.5 Processing without Writing and Reading Images

To remove the operations of writing and reading images, we store the image data captured by the camera in RAM. Then, CamK accesses the data by pass-by-reference, meaning that different functions access the same image data. In this way, we remove the operations of reading and writing images. The corresponding time cost in each stage is shown in the 'Reference' row of Table 1. Here, the time cost for capturing/storing the source data of a small image and a large image is the same. This is because of a hardware limitation: in preview mode, the size of the source data in each frame is the same (e.g., 480*360 pixels). If we want to get an image with size 120*90 pixels, we resize the image during image processing.

In Table 1, the time cost of image processing includes the time of reading image data and processing the image data. When processing images with 120*90 pixels, CamK only needs to detect the candidate fingertips with hand segmentation and fingertip discovery.
When processing images with 480*360 pixels, CamK not only needs to detect the fingertip, but also needs to select the candidate keys, calculate the covered area of the pressed key, and correlate the candidate fingertip with the pressed key to locate the keystroke. Thus the time cost of processing an image with 120*90 pixels is smaller than that of an image with 480*360 pixels. According to Table 1, we can estimate the minimal time T_k to detect and locate a keystroke with Eq. (10), which is only 15.5 percent of that in Eq. (5). Here, 33 ms is the waiting time between two images, because the maximum frame rate is 30 fps. Until now, T_k is comparable to the duration of a keystroke, i.e., the output time latency is usually within 50 ms and below human response time [7]:

T_k = (33 + 33 + 33 + 106) ms = 205 ms. (10)

5.4 Reduction of Power Consumption

To make CamK practical for small mobile devices, we need to reduce its power consumption, especially in image processing. Based on the definition of Camera.Parameters [33] in the Android APIs, the adjustable parameters of the camera are picture size, preview frame rate, preview size, and camera view size (i.e., window size of the camera).

Among these parameters, the picture size has no effect on CamK. The preview frame rate and preview size (i.e., image sizes) have already been optimized in Sections 5.1 and 5.3. Therefore, we only observe how the camera view size affects the power consumption, using a Monsoon power monitor [34]. When we respectively set the camera view size to 120*90, 240*180, 360*270, 480*360, 720*540, 960*720, 1200*900 pixels, the corresponding power consumption is 1204.6, 1228.4, 1219.8, 1221.8, 1222.9, 1213.5, 1222.2 mW. The camera view size has little effect on power consumption. In this paper, the camera view size is set to 480*360 pixels by default.
6 EXTENSION: MULTI-TOUCH AND WORD PREDICTION

To provide a better user experience, we add multi-touch to allow the user to use a key combination, e.g., pressing 'shift' and 'a' at the same time. Besides, we introduce word prediction, which guesses the word the user is likely typing, to improve the text-input speed.

6.1 Multi-Touch Function

In Section 4.3, CamK determines one effective keystroke. However, considering actual typing behavior, the user may use key combinations for special purposes. Take the Apple Wireless Keyboard as an example: we can press 'command' and 'c' at the same time to copy. Therefore, we introduce multi-touch to CamK, to allow the user to use special key combinations.

In CamK, we consider the case in which the user presses a key combination with the left hand and right hand at the same time, in order to eliminate ambiguity. Fig. 11a shows an example of such ambiguity: the 7th finger (right hand) presses the key 'f', while the 10th fingertip in the image is located on 'shift', so it seems that the user aims to type the capital letter 'F'. In fact, the 3rd finger (left hand) presses the key 'command', i.e., the user wants to call the search function instead of typing the letter 'F', as shown in Fig. 11b. Therefore, CamK assumes that the user presses a key combination with the two hands at the same time, i.e., each hand correlates with one keystroke. With the located keystroke in each hand, we verify whether the two keystrokes form a special key combination. If so, CamK calls the corresponding function. Otherwise, CamK determines the only effective keystroke based on Section 4.3.

Fig. 10. Multi-thread processing.
Fig. 11. Multi-touch function in CamK.

6.2 Word Prediction

When the user has input one or more characters, CamK will predict the word the user is probably going to type, by
using the common-word set and word frequencies [35] of the typed words. In regard to the common-word set, we introduce the Nc most common words [36], which are sorted by frequency of use in descending order. We use Ei, i ∈ [1, Nc], to represent the priority level of the common word Wi, where Ei = (Nc - i + 1)/(Nc + 1), Ei ∈ (0, 1). In regard to the word frequencies of the typed words, we use Fj to represent the frequency of word Wj typed by the user in CamK. Initially, Fj is set to zero. Every time the user types a word Wj, the frequency of Wj increases by one. Either a larger priority level or a larger word frequency indicates that the word has a higher probability of being typed. By matching the prefix Spk of word Wk with that of the common words, we get the top-mc candidate words lc1 with the highest Ei. Similarly, we get the top-mu candidate words lc2 with the highest Fj. After that, we merge lc1 and lc2 to get the candidate words in {lc1 ∪ lc2}. We set Nc = 1000, mc = mu = 5 by default.

As shown in Fig. 12, when the user types 'w', we get lc1 as {with, we, what, who, would}, while lc2 is {word, words, we}. Then, we merge the candidate words as {with, we, what, who, would, word, words}. By pressing the button corresponding to the candidate word (e.g., "word"), the user can omit the following keystrokes (e.g., 'o', 'r' and 'd'). In this way, CamK can improve the input speed and reduce the computation/energy overhead.

7 PERFORMANCE EVALUATION

We implement CamK on smartphones running Android. We use the layout of the Apple Wireless Keyboard (AWK) as the default keyboard layout, which is printed on a piece of US Letter sized paper. Unless otherwise specified, we use the Samsung GT-I9100 smartphone, whose Android version is 4.4.4, the frame rate is 15 fps, the image size is 480×360 pixels, and CamK works in the office. We first evaluate each component of CamK. Then, we test the performance of CamK in different environments.
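The candidate-word selection of Section 6.2 can be sketched as below. This is a simplified illustration, not CamK's code: the function name, the small common-word list, and the user frequency table are hypothetical, and ties are broken arbitrarily.

```python
# Sketch of CamK's word prediction (Section 6.2).
# Assumptions: `common_words` is sorted by frequency of use in
# descending order, so the priority level is Ei = (Nc - i + 1)/(Nc + 1);
# `user_freq` maps words to how often this user has typed them (Fj).

def predict(prefix, common_words, user_freq, mc=5, mu=5):
    Nc = len(common_words)
    # Priority level Ei for each common word matching the prefix.
    matches = [(w, (Nc - i + 1) / (Nc + 1))
               for i, w in enumerate(common_words, start=1)
               if w.startswith(prefix)]
    # Top-mc candidates lc1 with the highest Ei.
    lc1 = [w for w, _ in sorted(matches, key=lambda x: -x[1])[:mc]]
    # Top-mu candidates lc2 with the highest user frequency Fj.
    freq_matches = [(w, f) for w, f in user_freq.items()
                    if w.startswith(prefix) and f > 0]
    lc2 = [w for w, _ in sorted(freq_matches, key=lambda x: -x[1])[:mu]]
    # Merge lc1 and lc2, keeping lc1's order and dropping duplicates.
    return lc1 + [w for w in lc2 if w not in lc1]

common = ["with", "we", "what", "who", "would", "word", "words"]
freq = {"word": 3, "words": 2, "we": 1}
print(predict("w", common, freq))
# ['with', 'we', 'what', 'who', 'would', 'word', 'words']
```

With this toy input the result reproduces the merged candidate list from the Fig. 12 example.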
Finally, we recruit 9 participants to use CamK and compare the performance of CamK with other text-entry methods.

7.1 Localization Accuracy for Known Keystrokes

To verify whether CamK has obtained the optimal parameters for image processing, we first measure the accuracy of keystroke localization with a Samsung SM-N9109W smartphone, when CamK knows a keystroke is happening. The user presses 59 keys (excluding the PC function keys: the first row, and five keys in the last row) on the paper keyboard. Specifically, we let the user type the sentences/words from the standard MacKenzie set [37]. Besides, we introduce some random characters to guarantee that each key is pressed fifty times. We record the typing process with a camera. When the occurrence of a keystroke is known, the localization accuracy is close to 100 percent, as shown in Fig. 13. This indicates that CamK can adaptively select suitable values of the parameters used in image processing.

7.2 Accuracy in Different Environments

To verify whether CamK can detect and locate keystrokes accurately, we conduct experiments in four typical scenarios: an office environment (the light's color is close to white), outdoors (basic/pure light), a coffee shop (the light's color is a little closer to that of human skin), and a restaurant (the light is a bit dim). In each test, the user types words from the MacKenzie set [37] and makes Nk = 500 keystrokes. Suppose CamK locates Na keystrokes correctly and wrongly treats Nf non-keystrokes as keystrokes. We define the localization accuracy as pa = Na/Nk, and the false positive rate as pf = min(Nf/Nk, 1). As shown in Fig. 14, CamK can achieve high accuracy (close to or larger than 85 percent) with a low false positive rate (about 5 percent). In the office, the localization accuracy is above 95 percent.

7.3 Effect of Frame Rate

As described in Section 5.1, the frame rate affects the number of images nf during a keystroke.
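This relationship can be sketched with a back-of-the-envelope estimate: nf is roughly the frame rate times the keystroke duration. The function name and the 200 ms duration below are illustrative assumptions, not values measured by the paper.

```python
# Sketch: the number of images nf captured during one keystroke is
# roughly the frame rate multiplied by the keystroke duration.
# Assumption: the 200 ms keystroke duration is illustrative only.

def frames_per_keystroke(frame_rate_fps, keystroke_duration_ms):
    return int(frame_rate_fps * keystroke_duration_ms / 1000)

# At CamK's default 15 fps, a 200 ms keystroke spans about 3 frames;
# at a very low frame rate, nf can drop to 0, i.e., a missed keystroke.
print(frames_per_keystroke(15, 200))  # 3
print(frames_per_keystroke(2, 200))   # 0
```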
If the value of nf is too small, CamK may miss the keystrokes. On the contrary, more frames

Fig. 12. Word prediction.
Fig. 13. Confusion matrix of 59 keys.
Fig. 14. Four scenarios.