IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. X, NO. X, XXXX 2018

CamK: Camera-based Keystroke Detection and Localization for Small Mobile Devices

Yafeng Yin, Member, IEEE, Qun Li, Fellow, IEEE, Lei Xie, Member, IEEE, Shanhe Yi, Ed Novak, and Sanglu Lu, Member, IEEE

Abstract—Because of the small size of mobile devices, text entry with on-screen keyboards becomes inefficient. Therefore, we present CamK, a camera-based text-entry method, which can use a panel (e.g., a piece of paper) with a keyboard layout to input text into small devices. With the built-in camera of the mobile device, CamK captures images during the typing process and utilizes image processing techniques to recognize the typing behavior, i.e., extract the keys, track the user's fingertips, and detect and locate keystrokes. To achieve high accuracy of keystroke localization and a low false positive rate of keystroke detection, CamK introduces initial training and online calibration. To reduce the time latency, CamK optimizes computation-intensive modules by changing image sizes, focusing on target areas, introducing multiple threads, and removing the operations of writing or reading images. Finally, we implement CamK on mobile devices running Android. Our experimental results show that CamK can achieve above 95% accuracy in keystroke localization, with only a 4.8% false positive rate. When compared with on-screen keyboards, CamK can achieve a 1.25X typing speedup for regular text input and a 2.5X speedup for random character input. In addition, we introduce word prediction to further improve the input speed for regular text by 13.4%.

Index Terms—Mobile text-entry, camera, keystroke detection and localization, small mobile devices.

1 INTRODUCTION

In recent years, we have witnessed a rapid development of electronic devices and mobile technology.
Mobile devices (e.g., smartphones, the Apple Watch) have become smaller and smaller, in order to be carried everywhere easily, while avoiding carrying bulky laptops all the time. However, the small size of the mobile device brings many new challenges; a typical example is inputting text into the small mobile device without a physical keyboard.

In order to get rid of the constraint of bulky physical keyboards, many virtual keyboards have been proposed, e.g., wearable keyboards, on-screen keyboards, projection keyboards, etc. However, wearable keyboards introduce additional equipment like rings [1] and gloves [2]. On-screen keyboards [3], [4] usually take up a large area on the screen and support only a single finger for text entry. Typing on a small screen becomes inefficient. Projection keyboards [5], [6] often need a visible light projector or lasers to display the keyboard. To remove the additional hardware, audio-signal-based [7] and camera-based virtual keyboards [8], [9] have been proposed. However, UbiK [7] requires the user to click keys with their fingertips and nails, while the existing camera-based keyboards either slow the typing speed [8] or must be used in controlled environments [9]. The existing schemes can hardly provide a user experience similar to that of physical keyboards.

• Y. Yin, L. Xie and S. Lu are with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. E-mail: {yafeng, lxie, sanglu}@nju.edu.cn
• Q. Li and S. Yi are with the Department of Computer Science, College of William and Mary, Williamsburg, Virginia 23187. E-mail: {liqun, syi}@cs.wm.edu.
• E. Novak is with the Computer Science Department, Franklin and Marshall College, Lancaster, PA 17604. E-mail: enovak@fandm.edu.
• Lei Xie is the corresponding author.

Manuscript received 0.0000; revised 0.0000.

Fig. 1. A typical use case of CamK.
To provide a PC-like text-entry experience, we propose CamK, a camera-based keyboard and a more natural and intuitive text-entry method. As shown in Fig. 1, CamK works with the front-facing camera of the mobile device and a paper keyboard. CamK takes pictures as the user types on the paper keyboard, and uses image processing techniques to detect and locate keystrokes. Then, CamK outputs the corresponding character of the pressed key. CamK can be used in a wide variety of scenarios, e.g., the office, coffee shops, outdoors, etc. However, to make CamK work well, we need to solve the following key technical challenges.

(1) Location deviation: On a paper keyboard, the inter-key distance is only about two centimeters [7]. With image processing techniques, there may be a position deviation between the real fingertip and the detected fingertip. This deviation may lead to keystroke localization errors. To address this challenge, CamK introduces initial training to get the optimal parameters for image processing. Then, CamK uses an extended region to represent the detected fingertip, to tolerate the position deviation. Besides, CamK utilizes the features of a keystroke (e.g., the fingertip is located in the key for a certain duration, the pressed key is partially obstructed by the fingertip, etc.) to verify the validity of a keystroke.
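The extended-region idea can be sketched as follows. This is a minimal illustration, not CamK's implementation: the key rectangles, the fingertip coordinate, and the tolerance radius r are hypothetical values chosen for the example.

```python
# Sketch: tolerate fingertip-detection deviation with an extended region.
# Keys are axis-aligned rectangles (x0, y0, x1, y1); a detected fingertip
# (x, y) is grown into a square region of half-width r before matching,
# so a small position error can still reach the actually pressed key.

def candidate_keys(tip, keys, r=5):
    """Return names of keys intersecting the extended fingertip region."""
    x, y = tip
    rx0, ry0, rx1, ry1 = x - r, y - r, x + r, y + r
    hits = []
    for name, (x0, y0, x1, y1) in keys.items():
        # Axis-aligned rectangle intersection test.
        if rx0 <= x1 and rx1 >= x0 and ry0 <= y1 and ry1 >= y0:
            hits.append(name)
    return hits

keys = {"J": (100, 50, 120, 80), "K": (122, 50, 142, 80)}
print(candidate_keys((121, 60), keys))  # deviation near the J/K border -> ['J', 'K']
```

When the extended region overlaps several keys, the keystroke features listed above (duration, obstructed key area) would then decide which candidate is the valid StrokeKey.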
(2) False positives: A false positive occurs when a non-keystroke (i.e., a period in which no fingertip is pressing any key) is recognized as a keystroke. Without the assistance of other resources like audio signals, CamK has to detect keystrokes with images alone. To address this challenge, CamK combines keystroke detection with keystroke localization. For a potential keystroke, if there is no valid key pressed by the fingertip, CamK will discard the keystroke and recognize it as a non-keystroke. Additionally, CamK introduces online calibration, i.e., using the movement features of the fingertip after a keystroke, to further decrease the false positive rate.

(3) Processing latency: To serve as a text-entry method, when the user presses a key on the paper keyboard, CamK should output the character of the key without any noticeable latency. However, due to the limited computing resources of small mobile devices, the heavy computation overhead of image processing will lead to a large latency. To address this challenge, CamK optimizes the computation-intensive modules by adaptively changing image sizes, focusing on the target area in the large-size image, adopting multiple threads, and removing the operations of writing/reading images.

We make the following contributions in this paper (a preliminary version of this work appeared in [10]).

• We design a practical framework for CamK, which operates using a smart mobile device camera and a portable paper keyboard. Based on image processing, CamK can detect and locate keystrokes with high accuracy and a low false positive rate.
• We realize real-time text entry for small mobile devices with limited resources, by optimizing the computation-intensive modules. Additionally, we introduce word prediction to further improve the input speed and reduce the error rate.
• We implement CamK on smartphones running Android. We first evaluate each module in CamK.
Then, we conduct extensive experiments to test the performance of CamK. After that, we compare CamK with other methods in input speed and error rate.

2 RELATED WORK

Considering the small sizes of mobile devices, many virtual keyboards have been proposed for text entry, e.g., wearable keyboards, on-screen keyboards, projection keyboards, camera-based keyboards, etc.

Wearable keyboards: Wearable keyboards sense and recognize the typing behavior based on the sensors built into rings [1], [11], gloves [12], and so on. TypingRing [13] utilizes the embedded sensors of the ring to input text. Finger-Joint keypad [14] works with a glove equipped with pressure sensors. The Senseboard [2] consists of two rubber pads and senses the movements in the palm to get keystrokes. Funk et al. [15] utilize a touch-sensitive wristband to enter text based on the location of the touch. These wearable keyboards often need the user to wear devices around the hands or fingers, thus degrading the user experience.

On-screen keyboards: On-screen keyboards allow the user to enter characters on a touch screen. Considering the limited area of the keyboard on the screen, BigKey [3] and ZoomBoard [4] adaptively change the size of keys. ContextType [16] leverages hand postures to improve mobile touch screen text entry. Kwon et al. [17] introduce a regional error correction method to reduce the number of necessary touches. ShapeWriter [18] recognizes a word based on the trace over successive letters in the word. Sandwich keyboard [19] affords ten-finger touch typing by utilizing a touch sensor on the back side of a device. Usually, on-screen keyboards occupy the screen area and support only one finger for typing. Besides, the user often needs to switch between different screens to type letters, digits, punctuation, etc.
Projection keyboards: Projection keyboards usually need a visible light projector or lasers to cast a keyboard, and then utilize image processing methods [5] or infrared light [6] to detect the typing events. Hu et al. use a pico-projector to project the keyboard on the table, and then detect the touch interaction by the distortion of the keyboard projection [20]. Roeber et al. utilize a pattern projector to display the keyboard layout on a flat surface, and then detect the keyboard events based on the intersection of fingers and infrared light [21]. Projection keyboards often require extra equipment, e.g., a visible light projector, infrared light modules, etc., which increases the cost and the inconvenience of text entry.

Camera-based keyboards: Camera-based virtual keyboards use captured images [22] or video [23] to infer the keystroke. Gesture keyboard [22] gets the input by recognizing the finger's gesture. It works without a keyboard layout, thus the user needs to remember the mapping between the keys and the finger's gestures. Visual Panel [8] works with a printed keyboard on a piece of paper. It requires the user to use only one finger and wait for one second before each keystroke. Malik et al. present the Visual Touchpad [24] to track the 3D positions of the fingertips based on two downward-pointing cameras with stereo vision. Adajania et al. [9] detect the keystroke based on shadow analysis with a standard web camera. Hagara et al. estimate the finger positions and detect clicking events based on edge detection, fingertip localization, etc. [25]. The iPhone app Paper Keyboard [26] only allows the user to use one finger to input letters. The above research usually focuses on detecting and tracking the fingertips, instead of locating the fingertip in a key's area of the keyboard, which is the focus of our paper.

In addition to the above text-entry solutions, MacKenzie et al.
[27] describe text entry for mobile computing. Zhang et al. [28] propose Okuli to locate the user's finger based on visible light communication modules, an LED, and light sensors. Wang et al. [7] propose UbiK to locate the keystroke based on audio signals. The existing work usually needs extra equipment, allows only one finger to type, or needs to change the user's typing behavior, and thus can hardly provide a PC-like text-entry experience. In this paper, we propose a text-entry method based on the built-in camera of the mobile device and a paper keyboard, to provide a user experience similar to that of physical keyboards.

3 FEASIBILITY STUDY AND OVERVIEW OF CAMK

In order to show the feasibility of locating keystrokes based on image processing techniques, we first show the observations of a keystroke from the camera's view. After that, we describe the system overview of CamK.
Fig. 2. Frames during two consecutive keystrokes.

3.1 Observations of a Keystroke

In Fig. 2, we show the frames/images captured by the camera during two consecutive keystrokes. The origin of the axes is located in the top left corner of the image, as shown in Fig. 2(a). The hand located in the left area of the image is called the left hand, while the other is called the right hand, as shown in Fig. 2(b). From left to right, the fingers are called finger i in sequence, i ∈ [1, 10], as shown in Fig. 2(c). The fingertip pressing the key is called the StrokeTip, while the pressed key is called the StrokeKey, as shown in Fig. 2(d). When the user presses a key, i.e., a keystroke occurs, the StrokeTip and StrokeKey often have the following features, which can be used to track, detect and locate the keystroke.

(1) Coordinate position: The StrokeTip usually has the largest vertical coordinate among the fingers on the same hand, because the user tends to stretch out one finger when typing a key. An example is finger 9 in Fig. 2(a). However, considering the particularity of thumbs, this feature may not be suitable for thumbs. Therefore, we separately detect the StrokeTip among thumbs and among the other fingertips.

(2) Moving state: The StrokeTip stays on the StrokeKey for a certain duration in a typing operation, as finger 2 shows in Fig. 2(c) - Fig. 2(d). If the position of the fingertip keeps unchanged, a keystroke may happen.

(3) Correlated location: The StrokeTip is located in the StrokeKey, in order to press that key, such as finger 9 shown in Fig. 2(a) and finger 2 shown in Fig. 2(d).

(4) Obstructed view: The StrokeTip obstructs the StrokeKey from the view of the camera, as shown in Fig. 2(d).
The ratio of the visually obstructed area to the whole area of the key can be used to verify whether the key is really pressed.

(5) Relative distance: The StrokeTip usually achieves the largest vertical distance between the fingertip and the remaining fingertips of the same hand. This is because the user usually stretches out the finger to press a key. Thus this feature can be used to infer which hand generates the keystroke. In Fig. 2(a), the vertical distance dr between the StrokeTip (i.e., finger 9) and the remaining fingertips of the right hand is larger than that (dl) of the left hand. Thus we choose finger 9 as the StrokeTip from the two hands, instead of finger 2.

3.2 System Overview

As shown in Fig. 1, CamK works with a mobile device and a paper keyboard. The device uses the front-facing camera to capture the typing process, while the paper keyboard is placed on a flat surface and located in the camera's view. We take Fig. 1 as an example to describe the deployment. In Fig. 1, the mobile device is a Samsung N9109W smartphone, l means the distance between the device and the printed keyboard, and α means the angle between the plane of the device's screen and that of the keyboard. In Fig. 1, we set l = 13.5cm, α = 90°, to make the letter keys large enough in the camera's view. In fact, there are no strict requirements on the values of the above parameters, especially since the position of the camera varies across devices. In Fig. 1, when we fix the A4-sized paper keyboard, l can range in [13.5cm, 18.0cm], while α can range in [78.8°, 90.0°]. Even if some part of the keyboard is out of the camera's view, CamK still works.

The architecture of CamK is shown in Fig. 3. The input is the image taken by the camera and the output is the character of the pressed key. Before a user begins typing, CamK uses Key Extraction to detect the keyboard and extract each key from the image.
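The relative-distance feature (5) above can be sketched as follows; a minimal illustration in which the fingertip coordinates are hypothetical, with y growing downward as in the image coordinate system of Fig. 2(a).

```python
# Sketch: pick the hand generating a keystroke via the relative-distance
# feature: the hand whose lowest fingertip has the largest vertical gap
# to its remaining fingertips wins, and that fingertip is the StrokeTip.

def stroke_tip(left_tips, right_tips):
    """Each argument: list of (x, y) fingertip positions for one hand."""
    def gap(tips):
        ys = sorted(t[1] for t in tips)
        rest = ys[:-1]
        # vertical distance between the lowest fingertip and the rest
        return ys[-1] - sum(rest) / len(rest)
    candidates = []
    for tips in (left_tips, right_tips):
        tip = max(tips, key=lambda t: t[1])   # largest vertical coordinate
        candidates.append((gap(tips), tip))
    return max(candidates)[1]                 # hand with the larger gap

left = [(10, 50), (30, 52), (50, 55), (70, 51), (90, 50)]
right = [(110, 50), (130, 49), (150, 85), (170, 52), (190, 50)]
print(stroke_tip(left, right))  # -> (150, 85): right hand's stretched finger
```

This mirrors the dr > dl comparison in Fig. 2(a): the slightly lowered finger on the left hand loses to the clearly stretched finger on the right hand.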
When the user types, CamK uses Fingertip Detection to extract the user's hands and detect their fingertips. Based on the movements of the fingertips, CamK uses Keystroke Detection and Localization to detect a possible keystroke and locate it. Finally, CamK uses Text Output to output the character of the pressed key.

Fig. 3. Architecture of CamK.

4 SYSTEM DESIGN
According to Fig. 3, CamK consists of four components: key extraction, fingertip detection, keystroke detection and localization, and text output. Since text output is straightforward to implement, we mainly describe the first three components.

4.1 Key Extraction
Without loss of generality, CamK adopts the common QWERTY keyboard layout, which is printed in black and white on a piece of paper, as shown in Fig. 1. In order to eliminate the effects of the background, we first detect the boundary of the keyboard. Then, we extract each key from the keyboard. Therefore, key extraction contains three parts: keyboard detection, key segmentation, and mapping the characters to the keys, as shown in Fig. 3.
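The per-frame flow of these components can be sketched as below. This is a hypothetical outline only, assuming placeholder callables (`extract_keys`, `detect_fingertips`, `locate_keystroke`) that stand in for the Key Extraction, Fingertip Detection, and Keystroke Detection and Localization components of Fig. 3; the names are ours, not CamK's actual API.

```python
def camk_pipeline(frames, extract_keys, detect_fingertips, locate_keystroke):
    """Sketch of CamK's control flow: key extraction runs once before typing;
    fingertip detection and keystroke localization run on every later frame."""
    keys = extract_keys(frames[0])      # keyboard detection + key segmentation
    typed = []
    prev_tips = None
    for frame in frames[1:]:
        tips = detect_fingertips(frame)             # hand segmentation + fingertip discovery
        key = locate_keystroke(tips, prev_tips, keys)  # detect + locate a keystroke
        if key is not None:
            typed.append(key)                       # Text Output: emit the pressed key
        prev_tips = tips
    return "".join(typed)
```

With trivial stand-in callables, the skeleton emits one character per frame in which a keystroke is located and skips frames without one.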
Fig. 4. Keyboard detection and key extraction.

4.1.1 Keyboard detection
We use the Canny edge detection algorithm [29] to obtain the edges of the keyboard. Fig. 4(b) shows the edge detection result of Fig. 4(a). However, the interference edges (e.g., the paper's edge/longest edge in Fig. 4(b)) should be removed. Based on Fig. 4(b), the edges of the keyboard should be close to the edges of keys. We use this feature to remove pitfall edges; the result is shown in Fig. 4(c). Additionally, we adopt the dilation operation [30] to join the dispersed edge points which are close to each other, to get better edges/boundaries of the keyboard. After that, we use the Hough transform [8] to detect the lines in Fig. 4(c). Then, we use the uppermost line and the bottom line to describe the position range of the keyboard, as shown in Fig. 4(d). Similarly, we can use the Hough transform [8] to detect the left/right edge of the keyboard. If there are no suitable edges detected by the Hough transform, it is usually because the keyboard is not perfectly located in the camera's view. In this case, we simply use the left/right boundary of the image to represent the left/right edge of the keyboard. As shown in Fig. 4(e), we extend the four edges (lines) to get four intersections B1(x1, y1), B2(x2, y2), B3(x3, y3), B4(x4, y4), which are used to describe the boundary of the keyboard.

4.1.2 Key segmentation
Considering the short interference edges generated by the edge detection algorithm, it is difficult to accurately segment each key from the keyboard with the detected edges. Consequently, we utilize the color difference between the white keys and the black background, as well as the area of a key, for key segmentation, to reduce the effect of pitfall areas.
Firstly, we introduce color segmentation to distinguish the white keys and the black background. For the convenience of image processing, we represent the color in YCrCb space. In YCrCb space, the color coordinate (Y, Cr, Cb) of a white pixel is (255, 128, 128), while that of a black pixel is (0, 128, 128). Thus, we only compute the difference in the Y value between the pixels to distinguish the white keys from the black background. If a pixel is located in the keyboard while satisfying 255 − εy ≤ Y ≤ 255, the pixel belongs to a key. The offset εy ∈ ℕ of Y is mainly caused by lighting conditions. εy can be estimated in the initial training (see Section 5.1). The initial/default value of εy is 50.
When we obtain the white pixels, we need to get the contours of keys and separate the keys from one another. To avoid pitfall areas, such as small white areas which do not belong to any key, we introduce the area of a key. Based on Fig. 4(e), we first use B1, B2, B3, B4 to calculate the area Sb of the keyboard as Sb = 1/2 · (|B1B2 × B1B4| + |B3B4 × B3B2|), where B1B2 etc. denote vectors. Then, we calculate the area of each key. We use N to represent the number of keys in the keyboard. Considering the size difference between keys, we treat larger keys (e.g., the space key) as multiple regular keys (e.g., A-Z, 0-9). For example, the space key is treated as five regular keys. In this way, we change N to Navg. Then, we can estimate the average area of a regular key as Sb/Navg. In addition to the size difference between keys, the camera's view can also affect the area of a key in the image. Therefore, we introduce αl, αh to describe the range of a valid area Sk of a key as Sk ∈ [αl · Sb/Navg, αh · Sb/Navg]. We set αl = 0.15, αh = 5 in CamK, based on extensive experiments. The key segmentation result of Fig. 4(e) is shown in Fig. 4(f). Then, we use the location of the space key (the biggest key) to locate the other keys, based on the relative locations between keys.
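The area computations above can be sketched as follows. This is an illustrative sketch under our own variable names, not the implementation: the keyboard area Sb is the sum of the two triangle areas given by the cross products, and a key blob is kept only if its area falls in [αl · Sb/Navg, αh · Sb/Navg].

```python
def cross(o, a, b):
    # 2-D cross product (z-component) of vectors o->a and o->b
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def keyboard_area(b1, b2, b3, b4):
    # Sb = 1/2 (|B1B2 x B1B4| + |B3B4 x B3B2|): split the quadrangle
    # B1B2B3B4 into two triangles and sum their areas
    return 0.5 * (abs(cross(b1, b2, b4)) + abs(cross(b3, b4, b2)))

def valid_key_area(s_b, n_avg, alpha_l=0.15, alpha_h=5.0):
    # Valid range [alpha_l * Sb/Navg, alpha_h * Sb/Navg] for one key blob
    avg = s_b / n_avg
    return (alpha_l * avg, alpha_h * avg)
```

For an axis-aligned 2×1 keyboard quadrangle, `keyboard_area` returns 2.0, and with Navg = 4 the valid per-key area range is (0.075, 2.5).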
4.2 Fingertip Detection
After extracting the keys, we need to track the fingertips to detect and locate the keystrokes. To achieve this goal, we first detect the fingertips with hand segmentation and fingertip discovery, as shown below.

4.2.1 Hand segmentation
Skin segmentation [30] is often used for hand segmentation. In the YCrCb color space, a pixel (Y, Cr, Cb) is determined to be a skin pixel if it satisfies Cr ∈ [133, 173] and Cb ∈ [77, 127]. However, the threshold values of Cr and Cb can be affected by the surroundings, such as lighting conditions, and it is difficult to choose suitable threshold values for Cr and Cb. Therefore, we combine Otsu's method [31] and the red channel in the YCrCb color space for skin segmentation.
In the YCrCb color space, the red channel Cr is essential to human skin color. Therefore, given a captured image, we use the grayscale image that is split based on the Cr channel as the input for Otsu's method [31]. Otsu's method can automatically perform clustering-based image thresholding, i.e., calculate the optimal threshold to separate the foreground and background. The hand segmentation result of Fig. 5(a) is shown in Fig. 5(b), where the white regions represent the hand regions with high values in the Cr channel, while the black regions represent the background. However, around the hands, there exist some interference regions, which may change the contours of fingers, resulting in wrongly detected fingertips. Thus, CamK introduces the following erosion and dilation operations [32]. We first use the erosion operation to isolate the hands from the keys and separate the fingers from one another. Then, we use the dilation operation to smooth the edges of the fingers. Fig. 5(c) shows the optimized result of hand segmentation. After that, we select the top two segmented areas as the hand regions, i.e., the left hand and the right hand, to further reduce the effect of interference regions, such as the red areas in Fig. 5(c).
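For reference, a textbook implementation of Otsu's method over an 8-bit histogram is sketched below; this is not the authors' code. In CamK's setting, the `gray` input would be the flattened Cr channel of the captured image, and pixels above the returned threshold would be treated as skin.

```python
def otsu_threshold(gray):
    """Otsu's method on a list of 8-bit intensities: pick the threshold
    that maximizes the between-class (background vs. foreground) variance."""
    hist = [0] * 256
    for v in gray:
        hist[v] += 1
    total = len(gray)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w_b, sum_b = 0, -1.0, 0, 0.0
    for t in range(256):
        w_b += hist[t]                  # background pixel count
        if w_b == 0:
            continue
        w_f = total - w_b               # foreground pixel count
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b               # background mean
        m_f = (sum_all - sum_b) / w_f   # foreground mean
        var_between = w_b * w_f * (m_b - m_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

On a clearly bimodal input (e.g., half the pixels at 10 and half at 200), the returned threshold cleanly separates the two modes.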
4.2.2 Fingertip discovery
After we extract the fingers, we will detect the fingertips. We can differentiate between the thumbs (i.e., fingers 5-6 in Fig. 2(c)) and non-thumbs (i.e., fingers 1-4, 7-10 in Fig. 2(c)) in shape and typing movement, as shown in Fig. 6.
Fig. 5. Fingertip detection.

In a non-thumb, the fingertip is usually a convex vertex, as shown in Fig. 6(a). For a point Pi(xi, yi) located in the contour of a hand, by tracing the contour, we can select the point Pi−q(xi−q, yi−q) before Pi and the point Pi+q(xi+q, yi+q) after Pi. Here, i, q ∈ ℕ. We calculate the angle θi between the two vectors PiPi−q and PiPi+q, according to Eq. (1). In order to simplify the calculation of θi, we map θi into the range θi ∈ [0°, 180°]. If θi ∈ [θl, θh], θl < θh, and Pi satisfies yi > yi−q and yi > yi+q, then Pi may be a candidate vertex of a fingertip. Otherwise, Pi will not be a candidate vertex. If there are multiple candidate vertexes, such as P′i in Fig. 6(a), we will choose the vertex having the largest vertical coordinate, because it has a higher probability of being a fingertip, as Pi shown in Fig. 6(a). Here, the largest vertical coordinate means the local maximum in a finger's contour, such as the red circle shown in Fig. 5(e). The range of a finger's contour can be limited by Eq. (1), i.e., the angle feature of a finger. Based on extensive experiments, we set θl = 60°, θh = 150°, q = 20 in this paper.

θi = arccos( (PiPi−q · PiPi+q) / (|PiPi−q| · |PiPi+q|) )    (1)

Fig. 6. Features of a fingertip.

In a thumb, the "fingertip" also means a convex vertex of the finger. Thus we still use Eq. (1) to represent the shape of the fingertip in a thumb.
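The convex-vertex test of Eq. (1) for non-thumbs can be sketched as below. This is a simplified illustration under the parameter values stated above (θl = 60°, θh = 150°); the contour tracing and the choice of q are assumed to happen elsewhere.

```python
import math

def vertex_angle(p_prev, p, p_next):
    """Angle theta_i of Eq. (1) at contour point p, between the vectors
    p->p_prev and p->p_next, mapped into [0, 180] degrees."""
    v1 = (p_prev[0] - p[0], p_prev[1] - p[1])
    v2 = (p_next[0] - p[0], p_next[1] - p[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos_t = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def is_candidate_vertex(p_prev, p, p_next, th_l=60.0, th_h=150.0):
    # Non-thumb test: angle within [theta_l, theta_h] and p below both
    # neighbors (image y grows downward, so a fingertip has the larger y)
    theta = vertex_angle(p_prev, p, p_next)
    return th_l <= theta <= th_h and p[1] > p_prev[1] and p[1] > p_next[1]
```

For instance, the point (5, 5) between neighbors (0, 0) and (10, 0) forms a 90° vertex below both neighbors and passes the test, while a collinear point fails it.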
However, the position of the convex vertex can be different from that of a non-thumb. As shown in Fig. 6(b), the relative positions of Pi−q, Pi, Pi+q are different from those in Fig. 6(a). In Fig. 6(b), we show the thumb of the left hand. Obviously, Pi−q, Pi, Pi+q do not satisfy yi > yi−q and yi > yi+q. Therefore, we use (xi − xi−q) · (xi − xi+q) > 0 to describe the relative locations of Pi−q, Pi, Pi+q in thumbs. Then, we choose the vertex with the largest vertical coordinate in a finger's contour as the fingertip, as mentioned in the last paragraph.
In fingertip detection, we only need to detect the points located on the bottom edge (from the leftmost point to the rightmost point) of the hand, such as the blue contour of the right hand in Fig. 5(d). The shape feature θi and the vertical coordinates yi along the bottom edge are shown in Fig. 5(e). If we can detect five fingertips in a hand with θi and yi−q, yi, yi+q, we assume that we have also found the thumb. In this case, the thumb presses a key like a non-thumb. Otherwise, we detect the fingertip of the thumb in the rightmost area of the left hand or the leftmost area of the right hand, according to θi and xi−q, xi, xi+q. The detected fingertips of Fig. 5(a) are marked in Fig. 5(f).

4.3 Keystroke Detection and Localization
After detecting the fingertips, we track them to detect a possible keystroke and locate it for text entry. The keystroke is usually correlated with one or two fingertips, therefore we first select the candidate fingertip having a high probability of pressing a key, instead of detecting all fingertips, to reduce the computation overhead. Then, we track the candidate fingertip to detect the possible keystroke. Finally, we correlate the candidate fingertip with the pressed key to locate the keystroke.

4.3.1 Candidate fingertip selection in each hand
CamK allows the user to use all of their fingers for text entry, thus the keystroke may come from the left or right hand.
Based on the observations (see Section 3.1), the fingertip (i.e., StrokeTip) pressing the key usually has the largest vertical coordinate in that hand, such as finger 9 shown in Fig. 2(a). Therefore, we first select the candidate fingertip with the largest vertical coordinate in each hand. We use Cl and Cr to represent the points located in the contours of the left hand and right hand, respectively. For a point Pl(xl, yl) ∈ Cl, if Pl satisfies yl ≥ yj (∀Pj(xj, yj) ∈ Cl, j ≠ l), then Pl will be selected as the candidate fingertip in the left hand. Similarly, we can get the candidate fingertip Pr(xr, yr) in the right hand. In this step, we only need to get Pl and Pr, instead of detecting all fingertips.

4.3.2 Keystroke detection based on fingertip tracking
As described in the observations, when the user presses a key, the fingertip stays at that key for a certain duration. Therefore, we can use the location variation of the candidate fingertip to detect a possible keystroke. In frame i, we use Pli(xli, yli) and Pri(xri, yri) to represent the candidate fingertips in the left hand and right hand, respectively. If the candidate fingertips in frames [i − 1, i] satisfy Eq. (2) for the left hand or Eq. (3) for the right hand, the corresponding fingertip will be treated as static, i.e., a keystroke probably happens. Based on extensive experiments, we set Δr = 5 empirically.

√((xli − xli−1)² + (yli − yli−1)²) ≤ Δr,    (2)

√((xri − xri−1)² + (yri − yri−1)²) ≤ Δr.    (3)

4.3.3 Keystroke localization by correlating the fingertip with the pressed key
After detecting a possible keystroke, we correlate the candidate fingertip and the pressed key to locate the keystroke, based on the observations of Section 3.1. In regard to the candidate fingertips, we treat the thumb as a special case, and also select it as a candidate fingertip at
first. Then, we get the candidate fingertip set Ctip = {Pli, Pri, left thumb in frame i, right thumb in frame i}. After that, we can locate the keystroke by using Alg. 1.

Algorithm 1: Keystroke localization
Input: Candidate fingertip set Ctip in frame i.
Remove fingertips out of the keyboard from Ctip.
for Pi ∈ Ctip do
    Obtain candidate key set Ckey around Pi.
    for Kj ∈ Ckey do
        if Pi is located in Kj then
            Calculate the coverage ratio ρkj of Kj.
            if ρkj < ρl then Remove Kj from Ckey.
        else Remove Kj from Ckey.
    if Ckey ≠ ∅ then
        Select Kj with the largest ρkj from Ckey.
        ⟨Pi, Kj⟩ forms a possible keystroke.
    else Remove Pi from Ctip.
if Ctip = ∅ then No keystroke occurs, return.
if |Ctip| = 1 then Return the pressed key.
Select ⟨Pi, Kj⟩ with the largest ratio ρkj in each hand.
Obtain ⟨Pl, Kl⟩ (⟨Pr, Kr⟩) in the left (right) hand.
Calculate the relative distance dl (dr) in the left (right) hand.
if dl > dr then Return Kl. else Return Kr.
Output: The pressed key.

Eliminating impossible fingertips: For convenience, we use Pi to represent a fingertip in Ctip, i.e., Pi ∈ Ctip, i ∈ [1, 4]. If a fingertip Pi is not located in the keyboard region, CamK eliminates it from the candidate fingertips Ctip.
Selecting the nearest candidate keys: For each candidate fingertip Pi, we first search the candidate keys which are probably pressed by Pi. As shown in Fig. 7(a), although the real fingertip is Pi, the detected fingertip is P̂i. We use P̂i to search the candidate keys. We use Kcj(xcj, ycj) to represent the centroid of key Kj. Then we get the two rows of keys nearest the location P̂i(x̂i, ŷi) (i.e., the rows with the two smallest |ycj − ŷi|). For each row, we select the two nearest keys (i.e., the keys with the two smallest |xcj − x̂i|). In Fig. 7(a), the candidate key set Ckey consists of K1, K2, K3, K4. Fig. 8(a) shows the candidate keys of each fingertip.

Fig. 7. Candidate keys and candidate fingertips.
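The nearest-candidate-key selection step (two nearest rows by |ycj − ŷi|, then the two nearest keys per row by |xcj − x̂i|) could be sketched as below. This is our simplified illustration, which assumes keys of the same row share a centroid y value; in a real image, row membership would have to be determined more robustly.

```python
def nearest_candidate_keys(tip, key_centroids):
    """tip: detected fingertip (x, y); key_centroids: {key_name: (xc, yc)}.
    Returns up to four candidate keys: the two keys nearest in x from each
    of the two rows nearest in y (sketch of CamK's selection step)."""
    x, y = tip
    rows = {}
    for name, (xc, yc) in key_centroids.items():
        rows.setdefault(yc, []).append((name, xc))   # group keys by row
    # the two rows whose y is closest to the fingertip's y
    nearest_rows = sorted(rows, key=lambda yc: abs(yc - y))[:2]
    candidates = []
    for yc in nearest_rows:
        row = sorted(rows[yc], key=lambda kx: abs(kx[1] - x))[:2]
        candidates.extend(name for name, _ in row)
    return candidates
```

For a fingertip at (11, 9) over two rows of keys at y = 0 and y = 10, the sketch returns the two horizontally nearest keys from each of those rows.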
Retaining candidate keys containing the candidate fingertip: If a key is pressed by the user, the fingertip will be located in that key. Thus we use the location of the fingertip P̂i(x̂i, ŷi) to verify whether a candidate key contains the fingertip, to remove the invalid candidate keys. As shown in Fig. 7(a), there exists a small deviation between the real fingertip and the detected fingertip. Therefore, we extend the range of the detected fingertip to Ri, as shown in Fig. 7(a). If any point Pk(xk, yk) in the range Ri is located in a candidate key Kj, P̂i is considered to be located in Kj. Ri is calculated as {Pk ∈ Ri | √((x̂i − xk)² + (ŷi − yk)²) ≤ Δr}. We set Δr = 5 empirically.
As shown in Fig. 7(b), a key is represented as a quadrangle ABCD. If a point is located in ABCD, when we traverse ABCD clockwise, the point will be located on the right side of each edge of ABCD. As shown in Fig. 2(a), the origin of coordinates is located in the top left corner of the image. Therefore, if a fingertip P ∈ Ri satisfies Eq. (4), it is located in the key, and CamK keeps that key as a candidate key. Otherwise, CamK removes the key from the candidate key set Ckey. In Fig. 7(a), K1, K2 are the remaining candidate keys. The candidate keys containing the fingertips in Fig. 8(a) are shown in Fig. 8(b).

AB × AP ≥ 0,  BC × BP ≥ 0,  CD × CP ≥ 0,  DA × DP ≥ 0.    (4)

Calculating the coverage ratios of candidate keys: When a key is pressed, it is visually obstructed by the fingertip, as the dashed area of key K1 shown in Fig. 7(a). We use the coverage ratio to measure the visually obstructed area of a candidate key, in order to remove wrong candidate keys. For a candidate key Kj, whose area is Skj, the visually obstructed area is Dkj, and its coverage ratio is ρkj = Dkj/Skj. For a larger key (e.g., the space key), we update ρkj by multiplying a key size factor fj, i.e., ρkj = min(Dkj/Skj · fj, 1), where fj = Skj/S̄k.
Here, S̄k means the average area of a key, i.e., S̄k = Sb/Navg. If ρkj ≥ ρl, the key Kj remains a candidate key; otherwise, CamK removes it from the candidate key set Ckey. We set ρl = 0.25 by default. For each hand, if there is more than one candidate key, we keep the key with the largest coverage ratio as the final candidate key. For a candidate fingertip, if there is no candidate key associated with it, the fingertip is eliminated. Fig. 8(c) shows each candidate fingertip and its associated key.

4.3.4 Vertical distance with remaining fingertips
So far, there is at most one candidate fingertip in each hand. If there are no candidate fingertips, then no keystroke is detected. If there is only one candidate fingertip, then that fingertip is the StrokeTip and its associated key is the StrokeKey; together they represent the keystroke. However, if there are two candidate fingertips, we utilize the vertical distance between each candidate fingertip and the remaining fingertips of the same hand to choose the most probable StrokeTip, as shown in Fig. 2(a). We use Pl(xl, yl) and Pr(xr, yr) to represent the candidate fingertips of the left hand and right hand, respectively. Then, we calculate the distance dl between Pl and the remaining fingertips of the left hand, and the distance dr between Pr and the remaining fingertips of the right hand. Here, dl = |yl − (1/4)·Σ_{j=1, j≠l}^{5} yj|, while dr = |yr − (1/4)·Σ_{j=6, j≠r}^{10} yj|, where yj represents the vertical coordinate of fingertip j. If dl > dr, we choose Pl as the StrokeTip; otherwise, we choose Pr. The associated key of the StrokeTip is the pressed key, i.e., the StrokeKey. In Fig. 8(d), we choose fingertip 3 of the left hand as the StrokeTip. However, considering the effect of the camera's view, sometimes dl (dr) may fail to locate the keystroke accurately. Therefore, for the unselected candidate fingertip (e.g., fingertip 8 in Fig. 8(d)), we do not discard its associated key directly. Specifically,
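The per-hand tie-break by vertical distance can be sketched as follows. The finger indexing (1-5 left hand, 6-10 right hand) follows the formulas above; the data layout is an assumption for illustration.

```python
def select_stroke_tip(cand_left, cand_right, fingertips):
    """Choose the StrokeTip between the per-hand candidates (Sec. 4.3.4).

    fingertips: dict mapping finger index (1..10) -> (x, y); indices 1-5
    are the left hand, 6-10 the right hand. cand_left / cand_right are
    candidate fingertip indices, or None if that hand has no candidate.
    Returns the index chosen as the StrokeTip.
    """
    def distance(cand, hand):
        # |y_cand - mean of the other four fingertips' y| (dl or dr)
        others = [fingertips[j][1] for j in hand if j != cand]
        return abs(fingertips[cand][1] - sum(others) / len(others))

    if cand_left is None:
        return cand_right
    if cand_right is None:
        return cand_left
    dl = distance(cand_left, range(1, 6))
    dr = distance(cand_right, range(6, 11))
    # The hand whose candidate sticks out farther vertically pressed the key.
    return cand_left if dl > dr else cand_right
```

A fingertip that moved noticeably below its hand's other fingertips (larger y, since the image origin is at the top left) wins the tie-break, matching the fingertip-3 example of Fig. 8(d).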
YIN et al.: CAMK: CAMERA-BASED KEYSTROKE DETECTION AND LOCALIZATION FOR SMALL MOBILE DEVICES 7

we sort the previous candidate keys which contain the candidate fingertip by coverage ratio in descending order. Finally, we select the top four candidate keys and show them on the screen. The user can press a candidate key for text input (see Fig. 1), to tolerate the localization error.

Fig. 8. Candidate fingertips/keys in each step: (a) keys around the fingertip; (b) keys containing the fingertip; (c) visually obstructed key; (d) vertical distance with remaining fingertips.

5 OPTIMIZATIONS FOR KEYSTROKE LOCALIZATION AND IMAGE PROCESSING
Considering the deviation caused by image processing, the influence of light conditions, and other factors, we introduce initial training to select suitable parameter values for image processing, and utilize online calibration to improve the performance of keystroke detection and localization. In addition, considering the limited resources of small mobile devices, we introduce multiple optimization techniques to reduce the time latency and energy cost of CamK.

5.1 Initial Training
Optimal parameters for image processing: For key segmentation (see Section 4.1.2), εy is used to tolerate changes of Y caused by the environment. Initially, εy = 50. CamK repeatedly updates εyi = εyi−1 + 1 and stops when the number of extracted keys decreases. Then, CamK resets εy to 50 and repeatedly updates εyi = εyi−1 − 1, again stopping when the number of extracted keys decreases. The value εyi that yields the maximum number of extracted keys during this process is selected as the optimal value of εy.

In hand segmentation, CamK uses erosion and dilation operations, which use a kernel B [32] to process images. To get a suitable size of B, the user first puts his/her hands on the home row of the keyboard (see Fig. 5(a)). For simplicity, we set the kernel sizes for erosion and dilation to be equal. The initial kernel size is z0 = 0.
Then, CamK updates zi = zi−1 + 1. When CamK can localize each fingertip in the correct key with zi, CamK sets the kernel size as z = zi. In initial training, the user places his/her hands according to the on-screen instructions; this usually takes less than 10 s.

Frame rate selection: CamK sets the initial/default frame rate of the camera to f0 = 30 fps (frames per second), which is usually the maximal possible frame rate. We use n0i to represent the number of frames containing the ith keystroke. When the user has pressed u keys, we can get the average number of frames during a keystroke as n̄0 = (1/u)·Σ_{i=1}^{u} n0i. In fact, n̄0 reflects the duration of a keystroke. When the frame rate f changes, the number of frames in a keystroke n̄f changes accordingly. Intuitively, a smaller value of n̄f reduces the image processing time, while a larger value of n̄f improves the accuracy of keystroke localization. Based on extensive experiments (see Section 7.3), we set n̄f = 3, thus f = ⌈f0 · n̄f / n̄0⌉.

5.2 Online Calibration
Removing false positive keystrokes: Under certain conditions, the user moves the fingers without typing any key, and CamK may misclassify such a non-keystroke as a keystroke. Thus we introduce a temporary character to mitigate this problem. In the process of pressing a key, the StrokeTip moves towards the key, stays at that key, and then moves away; the vertical coordinate of the StrokeTip first increases, then pauses, then decreases. If CamK has detected a keystroke in n̄f consecutive frames, it displays the current character on the screen as a temporary character. In the next frame(s), if the position of the StrokeTip does not satisfy the features of a keystroke, CamK cancels the temporary character. This does not have much impact on the user's experience, because of the short time between two consecutive frames. Besides, CamK also displays the candidate keys around the StrokeTip, and the user can choose among them for text input.
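The two parameter-selection procedures of Section 5.1 — the εy sweep and the frame-rate formula f = ⌈f0 · n̄f / n̄0⌉ — can be sketched as follows. Here count_keys is a hypothetical callback standing in for CamK's key-extraction step, not part of the paper's API.

```python
import math

def search_epsilon_y(count_keys, eps0=50):
    """Initial-training search for the key-segmentation tolerance.

    count_keys(eps): hypothetical callback returning how many keys are
    extracted with tolerance eps (assumed deterministic). Starting from
    eps0, sweep upward until the key count decreases, then sweep
    downward from eps0 until it decreases again; return the eps that
    yielded the maximum number of keys.
    """
    best_eps, best_count = eps0, count_keys(eps0)
    for step in (+1, -1):               # upward sweep, then downward sweep
        eps, prev = eps0, count_keys(eps0)
        while True:
            eps += step
            cur = count_keys(eps)
            if cur < prev:              # key count decreased: stop this sweep
                break
            if cur > best_count:
                best_eps, best_count = eps, cur
            prev = cur
    return best_eps

def select_frame_rate(keystroke_frame_counts, f0=30, n_f=3):
    """Frame-rate selection: n0_bar is the average number of frames per
    keystroke at rate f0; scale f0 so a keystroke spans about n_f frames."""
    n0_bar = sum(keystroke_frame_counts) / len(keystroke_frame_counts)
    return math.ceil(f0 * n_f / n0_bar)
```

For example, if keystrokes span 6 frames on average at 30 fps, targeting n̄f = 3 halves the frame rate to 15 fps.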
Movement of smartphone or keyboard: CamK presumes that the smartphone and the keyboard do not move while in use. For best results, we recommend that the user tape the paper keyboard on a flat surface. Nevertheless, to alleviate the effect caused by movements of the mobile device or the keyboard, we offer a simple solution. If the user continuously presses the Delete key on the screen multiple times (e.g., more than 3 times), CamK informs the user to move his/her hands away from the keyboard for relocation. After that, the user can continue typing. The relocation process only requires the user to move the hands away, and it usually takes less than 10 s.

5.3 Real Time Image Processing
As a text-entry method, CamK needs to output each character without noticeable latency. According to Section 4, in order to output a character, we need to capture images, track fingertips, and finally detect and locate the keystroke. A large time cost in image processing leads to a large time latency for text output. To solve this problem, we first profile the time cost of each stage in CamK, and then introduce four optimization techniques to reduce the time cost. Unless otherwise specified, the frame rate is set to 30 fps by default.

5.3.1 Time cost in different stages
There are three main stages in CamK, i.e., capturing the images, tracking the fingertips, and locating the keystroke. The stages are called 'Cap-img', 'Tra-tip' and 'Loc-key' for short, respectively. We first set the image size to 640*480 pixels, which is supported by many smartphones, and then measure the time cost of producing or processing one image in each stage with a Samsung GT-I9100 smartphone (GT-I9100 for short). According to the measurement, the time
cost in 'Cap-img', 'Tra-tip', 'Loc-key' is 99 ms, 118 ms, 787 ms, respectively. Here, 'Cap-img' means the time for capturing an image, 'Tra-tip' means the time for processing an image to select the candidate fingertips, and 'Loc-key' means the time for processing an image to locate the keystroke. We repeat the measurement 500 times to get the average time cost. According to Section 5.1, we need to capture/process three images during a keystroke to guarantee the localization accuracy. Thus we can estimate the minimal time Tk1 of detecting and locating a keystroke with Eq. (5). Obviously, 1320 ms is a very large latency, thus more optimizations are needed for CamK to realize real-time processing.

Tk1 = (99 + 118 + 99 + 118 + 99 + 787) ms = 1320 ms (5)

5.3.2 Adaptively changing image sizes
As described before, it is rather time-consuming to process an image with 640*480 pixels. Intuitively, if CamK operates on smaller images, the time cost is reduced. However, to guarantee the keystroke localization accuracy, we cannot use a very small image. Therefore, we adaptively adopt different image sizes for processing, as shown in Fig. 9. We use smaller images to track the fingertips between two keystrokes, and use larger images to locate a detected keystroke. Based on extensive experiments, we set the size of the small image to 120*90 pixels, and the size of the large image to 480*360 pixels. As shown in Fig. 9, when a keystroke is about to happen, CamK adaptively changes frame i to a large image (i.e., 480*360 pixels). After that, CamK changes the following frames to small images (i.e., 120*90 pixels) until the next keystroke is detected. The time cost in 'Cap-img (120*90 pixels)', 'Cap-img (480*360 pixels)', 'Tra-tip (120*90 pixels)', 'Loc-key (480*360 pixels)' is 32 ms, 75 ms, 9 ms, 631 ms, respectively. Then, we can estimate the minimal time cost Tk2 to detect and locate a keystroke with Eq. (6).
Here, Tk2 is 59.7% of Tk1. However, more optimizations are needed for large-size image processing.

Fig. 9. Changing image sizes and focusing on the target area

Tk2 = (32 + 9 + 32 + 9 + 75 + 631) ms = 788 ms (6)

5.3.3 Optimizing large-size image processing
Based on Fig. 8, the keystroke is only related to a small area of the large-size image. Thus we optimize CamK by only processing the small area around the StrokeTip: the red region of frame i shown in Fig. 9. Suppose the position of the candidate fingertip in frame i − 1 (small image) is Pi−1(xc, yc); then we scale the position to P′i−1(x′c, y′c) (the corresponding position in the large image), according to the ratio ρsl of the small image to the large image in width, i.e., xc/x′c = yc/y′c = ρsl. Here, ρsl = 120/480. Then, CamK only processes the area S′c in frame i (large image) to reduce the time cost, as shown in Eq. (7). We set ∆x = 40, ∆y = 20 by default.

S′c = {P(xi, yi) | |xi − x′c| ≤ ∆x, |yi − y′c| ≤ ∆y} (7)

Currently, the processing time for the large-size image is 339 ms. We can estimate the minimal time to detect and locate a keystroke with Eq. (8), which is 37.6% of that in Eq. (5). However, the processing time of 339 ms for the large-size image is still a little high. If CamK works with a single thread, it may miss the next keystroke, due to the large processing time. Thus, CamK is expected to work with multiple threads in parallel.

Tk3 = (32 + 9 + 32 + 9 + 75 + 339) ms = 496 ms (8)

5.3.4 Multi-thread processing
According to the above conclusion, CamK uses three threads to capture, detect and locate the keystrokes in parallel. As shown in Fig. 10, the 'Cap-img' thread captures the images, the 'Tra-tip' thread processes the small images for keystroke detection, and the 'Loc-key' thread processes the large image to locate the keystroke.
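The scaling and cropping of Section 5.3.3 amount to a few lines. This sketch (our own naming) maps the fingertip found in the small frame to the large frame and returns the crop rectangle S′c of Eq. (7), rather than processing any pixels.

```python
def roi_in_large_image(p_small, rho_sl=120 / 480, dx=40, dy=20):
    """Map the candidate fingertip found in the small (120*90) frame i-1
    to the large (480*360) frame i, and return the crop rectangle S'_c
    (x_min, y_min, x_max, y_max) that CamK processes instead of the
    whole large image (Eq. (7))."""
    xc, yc = p_small
    x_lg, y_lg = xc / rho_sl, yc / rho_sl    # scale small -> large coordinates
    return (x_lg - dx, y_lg - dy, x_lg + dx, y_lg + dy)
```

With the default ∆x = 40 and ∆y = 20, the crop is an 80*40-pixel window, a small fraction of the 480*360 frame, which is where the drop from 631 ms to 339 ms comes from.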
In this way, CamK will not miss the frames of the next keystroke, because the 'Cap-img' thread never stops taking images. As shown in Fig. 10, CamK utilizes consecutive frames to determine the keystroke and also introduces online calibration to improve the performance.

Fig. 10. Multi-thread processing

By adopting multiple threads, we can estimate the minimal time to detect and locate the keystroke with Eq. (9), which is 36.4% of that in Eq. (5). Because the frame rate is 30 fps, the interval between two frames is 33 ms; therefore, we use 33 ms to replace the 32 ms 'Cap-img' time. The 'Tra-tip' time (9 ms) overlaps with the 'Cap-img' time of the next frame, and thus is not added in Eq. (9). Compared with Eq. (8), Eq. (9) does not reduce much processing time. This is mainly caused by the time-consuming operations of writing and reading each image. Therefore, it is better to eliminate the operations of writing/reading images.

Tk4 = (33 + 33 + 75 + 339) ms = 480 ms (9)

5.3.5 Processing without writing and reading images
To remove the operations of writing and reading images, we store the image data captured by the camera in RAM. Then, CamK accesses the data by reference, i.e., different functions access the same image data. In this way, we remove the operations of reading and writing images. The corresponding time cost in each stage is shown in the 'Reference' row of Table 1. Here, the time cost for capturing/storing the source data of a small image and a large image is the same. This is because of hardware limitations: in preview mode, the size of the source data in each frame is the same (e.g., 480*360 pixels).
If we want to get an image of size 120*90 pixels, we resize the image during image processing. In Table 1, the time cost includes the time of reading image data from the RAM and processing the image data.
When processing images with 120*90 pixels, CamK only needs to detect the candidate fingertips with hand segmentation and fingertip discovery. When processing images with 480*360 pixels, CamK not only needs to detect the fingertip, but also needs to select the candidate keys, calculate the covered area of the pressed key, and correlate the candidate fingertip with the pressed key to locate the keystroke. Thus the time cost of processing an image with 120*90 pixels is smaller than that of an image with 480*360 pixels. According to Table 1, we can estimate the minimal time to detect and locate a keystroke with Eq. (10), which is only 15.5% of that in Eq. (5). Here, 33 ms means the waiting time between two images, because the maximum frame rate is 30 fps. Until now, Tk is comparable to the duration of a keystroke, i.e., the output time latency is usually within 50 ms and below human response time [7].

Tk = (33 + 33 + 33 + 106) ms = 205 ms (10)

TABLE 1
Time cost in GT-I9100 smartphone (image sizes in pixels)

            Cap-img    Cap-img     Tra-tip    Loc-key
            (120*90)   (480*360)   (120*90)   (480*360)
Reference   0.02 ms    0.02 ms     13 ms      106 ms

5.4 Reduction of Power Consumption
To make CamK practical for small mobile devices, we need to reduce its power consumption, especially in image processing. Based on the definition of Camera.Parameters [33] in the Android APIs, the adjustable parameters of the camera are picture size, preview frame rate, preview size, and camera view size (i.e., the window size of the camera). Among these parameters, the picture size has no effect on CamK. The preview frame rate and preview size (i.e., image sizes) have already been optimized in Sections 5.1 and 5.3. Therefore, we only observe how the camera view size affects the power consumption, using a Monsoon power monitor [34].
When we set the camera view size to 120*90, 240*180, 360*270, 480*360, 720*540, 960*720, and 1200*900 pixels, the corresponding power consumption is 1204.6, 1228.4, 1219.8, 1221.8, 1222.9, 1213.5, and 1222.2 mW, respectively. Thus the camera view size has little effect on the power consumption. In this paper, the camera view size is set to 480*360 pixels by default.

6 EXTENSION: MULTI-TOUCH AND WORD PREDICTION
To provide a better user experience, we add multi-touch support to allow the user to press a key combination, e.g., pressing 'shift' and 'a' at the same time. Besides, we introduce word prediction, which guesses the word being typed, to improve the text-input speed.

6.1 Multi-touch Function
In Section 4.3, CamK determines a single effective keystroke. However, considering actual typing behavior, we may use key combinations for special purposes. Taking the Apple Wireless Keyboard as an example, we can press 'command' and 'c' at the same time to copy. Therefore, we introduce multi-touch support for CamK, to allow the user to use special key combinations. In CamK, we consider the case in which the user presses a key combination with the left hand and the right hand at the same time, in order to eliminate ambiguity. Fig. 11(a) shows an example of such ambiguity: the 7th fingertip (right hand) presses the key 'f', while the 10th fingertip in the image is located on 'shift', so it seems that the user aims to type the capital letter 'F'. In fact, the 3rd fingertip (left hand) presses the key 'command', i.e., the user wants to invoke the search function instead of typing 'F', as shown in Fig. 11(b). Therefore, CamK assumes that the user presses a key combination with two hands at the same time, i.e., each hand correlates with one keystroke. With the located keystroke of each hand, we verify whether the two keystrokes form a special key combination. If so, CamK invokes the corresponding function. Otherwise, CamK determines the single effective keystroke based on Section 4.3.
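The two-hand resolution logic above can be sketched as follows. The combination table and the function/variable names are illustrative assumptions for this sketch, not CamK's actual implementation; the fallback simply returns whichever hand pressed a key, standing in for the single-keystroke decision of Section 4.3.

```python
# Sketch of CamK's two-hand multi-touch resolution (names and the
# combination table are illustrative assumptions, not the real code).
COMBOS = {
    frozenset(["command", "c"]): "copy",
    frozenset(["command", "f"]): "search",
    frozenset(["shift", "f"]): "F",
}

def resolve_keystrokes(left_key, right_key):
    """Each hand contributes at most one located keystroke.

    If both hands pressed keys that form a known combination, invoke it;
    otherwise fall back to the single effective keystroke (Section 4.3),
    approximated here by whichever hand pressed a key.
    """
    if left_key and right_key:
        combo = COMBOS.get(frozenset([left_key, right_key]))
        if combo is not None:
            return ("combo", combo)
    return ("single", left_key or right_key)

print(resolve_keystrokes("command", "f"))  # ('combo', 'search')
print(resolve_keystrokes(None, "a"))       # ('single', 'a')
```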
Fig. 11. Multi-touch function in CamK: (a) an input image; (b) multi-touch.

6.2 Word Prediction
When the user has input one or more characters, CamK predicts the word the user is probably going to type, by using a common-word set and the word frequencies [35] of previously typed words. In regard to the common-word set, we introduce the Nc most common words [36], which are sorted by frequency of use in descending order. We use Ei, i ∈ [1, Nc], to represent the priority level of the common word Wi, where Ei = (Nc − i + 1)/(Nc + 1), Ei ∈ (0, 1). In regard to the word frequencies of typed words, we use Fj to represent the frequency of the word Wj typed by the user in CamK. Initially, Fj is set to zero. Every time the user types a word Wj, the frequency of Wj increases by one. Either a larger priority level or a larger word frequency indicates that the word has a higher probability of being typed. By matching the prefix Spk of the word Wk with that of the common words, we get the top-mc candidate words lc1 with the highest Ei. Similarly, we get the top-mu candidate words lc2 with the highest Fj. After that, we merge lc1 and lc2 to get the candidate words in {lc1 ∪ lc2}. We set Nc = 1000 and mc = mu = 5 by default. As shown in Fig. 12, when the user types 'w', we get lc1 as {with, we, what, who, would}, while lc2 is {word, words, we}. Then, we merge the candidate words as {with, we, what, who, would, word, words}. By pressing the button corresponding to a candidate word (e.g., "word"), the user can omit the following keystrokes (e.g., 'o', 'r' and 'd'). In this way, CamK improves the input speed and reduces the computation/energy overhead.

Fig. 12. Word prediction
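The candidate generation and merging described above can be sketched in a few lines. The common-word list and user frequencies below are small illustrative stand-ins for the real Nc = 1000 set, chosen so that the sketch reproduces the Fig. 12 example.

```python
# Sketch of CamK's word prediction (data are illustrative stand-ins).
# Priority of the i-th most common word: E_i = (Nc - i + 1) / (Nc + 1),
# so earlier positions in COMMON_WORDS imply higher E_i.
COMMON_WORDS = ["with", "we", "what", "who", "would", "word", "work"]
USER_FREQ = {"word": 3, "words": 2, "we": 1}  # F_j, incremented per typed word

def predict(prefix, mc=5, mu=5):
    # Top-mc common words with the highest E_i (list order = descending E_i).
    lc1 = [w for w in COMMON_WORDS if w.startswith(prefix)][:mc]
    # Top-mu user-typed words with the highest frequency F_j.
    typed = [w for w in USER_FREQ if w.startswith(prefix)]
    lc2 = sorted(typed, key=USER_FREQ.get, reverse=True)[:mu]
    # Merge lc1 and lc2, keeping each candidate once (insertion order kept).
    return list(dict.fromkeys(lc1 + lc2))

print(predict("w"))  # ['with', 'we', 'what', 'who', 'would', 'word', 'words']
```

With the prefix 'w', the merge yields exactly the candidate set of Fig. 12: the five highest-priority common words followed by the user's frequent words not already listed.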
7 PERFORMANCE EVALUATION
We implement CamK on smartphones running Android. We use the layout of the AWK (Apple Wireless Keyboard) as the default keyboard layout, which is printed on a piece of US Letter sized paper. Unless otherwise specified, we use the Samsung GT-I9100 smartphone (Android version 4.4.4), the frame rate is 15fps, the image size is 480 ∗ 360 pixels, and CamK works in the office. We first evaluate each component of CamK. Then, we test the performance of CamK in different environments. Finally, we recruit 9 participants to use CamK and compare the performance of CamK with other text-entry methods.

7.1 Localization accuracy for known keystrokes
To verify whether CamK has obtained the optimal parameters for image processing, we first measure the accuracy of keystroke localization with a Samsung SM-N9109W smartphone, when CamK knows a keystroke is happening. The user presses 59 keys (excluding the PC function keys: the first row, and five keys in the last row) on the paper keyboard. Specifically, we let the user press the sentences/words from the standard MacKenzie set [37]. Besides, we introduce some random characters to guarantee that each key is pressed fifty times. We record the typing process with a camera. When the occurrence of a keystroke is known, the localization accuracy is close to 100%, as shown in Fig. 13. This indicates that CamK can adaptively select suitable values of the parameters used in image processing.

7.2 Accuracy in different environments
To verify whether CamK can detect and locate keystrokes accurately, we conduct experiments in four typical scenarios: an office environment (the light's color is close to white), outdoors (basic/pure light), a coffee shop (the light's color is a little closer to that of human skin), and a restaurant (the light is a bit dim). In each test, the user types words from the MacKenzie set [37] and makes Nk = 500 keystrokes.
Suppose CamK locates Na keystrokes correctly and wrongly treats Nf non-keystrokes as keystrokes. We define the localization accuracy as p_a = N_a/N_k, and the false positive rate as p_f = min(N_f/N_k, 1). As shown in Fig. 14, CamK can achieve high accuracy (close to or larger than 85%) with a low false positive rate (about 5%). In the office, the localization accuracy can reach above 95%.

7.3 Effect of frame rate
As described in Section 5.1, the frame rate affects the number of images n̄f captured during a keystroke. If the value of n̄f is too small, CamK may miss keystrokes. On the contrary, more frames will increase the time latency. Based on Fig. 15, when n̄f ≥ 3, CamK has good performance. When n̄f > 3, there is no obvious performance improvement. Considering the accuracy, false positive rate, and time latency, we set n̄f = 3. Besides, we invited 5 users to test the duration ∆t of a keystroke. ∆t represents the time during which the StrokeTip stays on the StrokeKey from the view of the camera. Based on Fig. 16, ∆t is usually larger than 150ms. When n̄f = 3, the required frame rate is less than the maximum frame rate (30fps), i.e., CamK can work under the frame rate limitation of the smartphone. Therefore, n̄f = 3 is a suitable choice.
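The frame-rate argument above amounts to a simple check: capturing n̄f = 3 images during a keystroke lasting at least 150ms requires 3/0.15 = 20fps, below the 30fps maximum. A minimal sketch of this check (constants taken from the measurements above):

```python
# Sketch: verifying that n̄f = 3 frames per keystroke fits the camera limit.
# The duration bound comes from Fig. 16 (Δt is usually larger than 150 ms).
KEYSTROKE_DURATION_MS = 150   # lower bound on keystroke duration Δt
MAX_FRAME_RATE_FPS = 30       # maximum frame rate of the smartphone camera

def required_frame_rate(frames_per_keystroke: int) -> float:
    """Frame rate needed to capture the given number of frames within Δt."""
    return frames_per_keystroke * 1000 / KEYSTROKE_DURATION_MS

rate = required_frame_rate(3)
print(rate, rate <= MAX_FRAME_RATE_FPS)  # 20.0 True
```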
7.4 Effect of image size
At first, we choose a constant image size. Based on Fig. 17, as the image size increases, the performance of CamK becomes better. When the size is smaller than 480 ∗ 360 pixels, CamK cannot extract the keys correctly, and the performance is rather bad. When the image size is 480 ∗ 360 pixels, the performance is good, and continuing to increase the size brings no obvious improvement. However, increasing the image size increases the time cost and power consumption (measured by a Monsoon power monitor [34]) for processing an image, as shown in Fig. 18. Based on Section 5.3, CamK adopts both large images and small images. To guarantee high accuracy and a low false positive rate while reducing the time latency and power consumption, the size of the large image is set to 480 ∗ 360 pixels. In regard to small images, when the image size decreases from 480 ∗ 360 to 120 ∗ 90 pixels, CamK keeps high accuracy and a low false positive rate, as shown in Fig. 19. If the size of the small images decreases further, the accuracy decreases a lot and the false positive rate increases a lot. However, decreasing the image size also decreases the time cost and power consumption, as shown in Fig. 20. Combining Fig. 19 and Fig. 20, the size of the small image is set to 120 ∗ 90 pixels.

7.5 Time latency and power consumption
Based on Fig. 20, the time cost for locating a keystroke is about 200ms, which is comparable to the duration of a keystroke, as shown in Fig. 16. Thus CamK can output the text without noticeable time latency. The time latency is usually within 50ms, which is below the human response time [7]. In addition, we measure the power consumption of a Samsung GT-I9100 smartphone in the following states: (1) idle with the screen on; (2) writing an email; (3) keeping the camera in the preview mode (frame rate is 15fps); (4) running CamK (frame rate is 15fps) for text entry. The power consumption in each state is 505mW, 1118mW, 1189mW, and 2159mW, respectively. The power consumption of CamK is a little high. However, on a smartphone with better hardware, the multiple threads can be merged into one thread to save power.

7.6 Effect of keyboards with different layouts, colors and textures
Different layouts: We use three common keyboard layouts, i.e., AWK [38], US ANSI [39], and UK ISO [39], to verify the scalability of CamK. Each keyboard layout is scaled on a piece of US Letter sized paper with similar inter-key distances. Based on Fig. 21, whatever the keyboard layout is, CamK has good performance. It can achieve above 93% accuracy of keystroke localization, while the false positive rate is usually less than 7%.
Different colors/textures: We use keyboards with different colors and textures to test CamK. As shown in Fig. 22, the background of the keyboard is set to black, blue, green, red, and brown wood texture. When there is a large difference between the color of the keyboard and that of the skin, e.g., the keyboards with black, blue, and green backgrounds, the localization accuracy is usually larger than 90%, while the false positive rate is lower than 5%. In regard to the keyboard with the red color or the brown wood texture, the color of the keyboard is close to that of human skin, so it is harder to extract the contour of the fingertip from the keyboard, and thus the localization accuracy of CamK decreases.