Video Stabilization for Camera Shoot in Mobile Devices via Inertial-Visual State Tracking

Fei Han, Student Member, IEEE, Lei Xie, Member, IEEE, Yafeng Yin, Member, IEEE, Hao Zhang, Student Member, IEEE, Guihai Chen, Member, IEEE, and Sanglu Lu, Member, IEEE

Fei Han, Lei Xie, Yafeng Yin, Hao Zhang, Guihai Chen and Sanglu Lu are with the State Key Laboratory for Novel Software Technology, Nanjing University, China. E-mail: feihan@smail.nju.edu.cn, {lxie,yafeng}@nju.edu.cn, H.Zhang@smail.nju.edu.cn, {gchen,sanglu}@nju.edu.cn. Lei Xie is the corresponding author.

Abstract—Due to the sudden movement during the camera shoot, the videos retrieved from hand-held mobile devices often suffer from undesired frame jitters, leading to the loss of video quality. In this paper, we present a video stabilization solution for mobile devices via inertial-visual state tracking. Specifically, during the video shoot, we use the gyroscope to estimate the rotation of the camera, and use the structure-from-motion among the image frames to estimate the translation of the camera. We build a camera projection model that accounts for the rotation and translation of the camera, and a camera motion model that depicts the relationship between the inertial-visual state and the camera's 3D motion. By fusing the inertial measurement unit (IMU)-based method and the computer vision (CV)-based method, our solution is robust to fast movement and violent jitters; moreover, it greatly reduces the computation overhead of video stabilization. In comparison to the IMU-based solution, our solution estimates the translation more accurately, since we use the feature point pairs in adjacent image frames, rather than the error-prone accelerometer, to estimate the translation. In comparison to the CV-based solution, our solution estimates the translation with fewer feature point pairs, since the number of undetermined degrees of freedom in the 3D motion is directly reduced from 6 to 3. We implemented a prototype system on smart glasses and smart phones, and evaluated the performance under real scenarios, i.e., the human subjects used mobile devices to shoot videos while they were walking, climbing or riding. The experiment results show that our solution achieves 32% better performance than the state-of-the-art solutions in regard to video stabilization. Moreover, the average processing latency is 32.6ms, which is lower than the conventional inter-frame time interval, i.e., 33ms, and thus meets the real-time requirement for online processing.

Index Terms—Video Stabilization, Mobile Device, 3D Motion Sensing, Inertial-Visual State Tracking

1 INTRODUCTION

Due to the proliferation of mobile devices, nowadays more and more people tend to use their mobile devices, such as smart phones and smart glasses, to take videos. However, due to the sudden movement from the users during the camera shoot, the videos retrieved from such mobile devices often suffer from undesired frame jitters, which usually leads to the loss of video quality. Therefore, a number of video stabilization techniques have been proposed to remove the undesired jitters and obtain stable videos [1], [2], [3], [4], [5], [6], [7]. Recently, by leveraging the embedded sensors, new opportunities have been raised to perform video stabilization in mobile devices.
For mobile devices, conventional video stabilization schemes involve estimating the motion of the camera, smoothing the camera's motion to remove the undesired jitters, and warping the frames to stabilize the videos. Among these procedures, it is especially important to accurately estimate the camera's motion during the camera shoot, since it is a key precondition for the subsequent jitter removal and frame warping.

Conventionally, the motion estimation of the camera in 3D space is based on either inertial measurement-based techniques [8], [9], [10] or computer vision-based techniques [3], [4]. The inertial measurement-based approaches mainly use the built-in inertial measurement unit (IMU) to continuously track the 3D motion of the mobile device. However, they mainly focus on the rotation while ignoring the translation of the camera. The reason is twofold. First, the gyroscope in the IMU is usually able to accurately track the rotation, whereas the accelerometer in the IMU usually fails to accurately track the translation due to large cumulative tracking errors. The computer vision (CV)-based approaches mainly use the structure-from-motion [11] among the image frames to estimate both the rotation and translation of the camera. Although they achieve sufficient accuracy for camera motion estimation, they require plenty of feature point pairs and long feature point tracks. The requirement of massive feature points for motion estimation increases the computational overhead on resource-constrained mobile devices, which makes real-time processing impractical. Hence, to achieve a tradeoff between performance and computation overhead, only rotation estimation is considered in the state-of-the-art solutions. Second, according to our empirical studies, when the target is at a distance greater than 100cm, the rotation usually brings greater pixel jitters than the translation; hence, most previous works consider that the rotation has a greater impact on performance than the translation. However, when the target is within a close range, e.g., at a distance less than 100cm, the translation usually brings greater pixel jitters than the rotation, thus the translation tracking is also essential for real applications of camera shooting. Therefore, to efficiently perform video stabilization in mobile devices, it is essential to fuse the CV-based and IMU-based approaches to accurately estimate the camera's 3D motion, including the rotation and translation.
Fig. 1. Video Stabilization in Mobile Devices. Videos captured with mobile devices often suffer from undesired frame jitters due to the sudden movement from the users. We first estimate the original camera path (red) via inertial-visual state tracking, then smooth the original camera path to obtain the smoothed camera path (blue), and finally obtain the stabilized frames by warping the original frames.

In this paper, we propose a video stabilization scheme for camera shoot in mobile devices, based on visual and inertial state tracking. Our approach is able to accurately estimate the camera's 3D motion by sufficiently fusing the CV-based and IMU-based methods. Specifically, during the process of the video shoot, we use the gyroscope to estimate the rotation of the camera, and use the structure-from-motion among the image frames to estimate the translation of the camera. Different from the pure CV-based approaches, which estimate the rotation and translation simultaneously according to the camera projection model, our solution first estimates the rotation based on the gyroscope measurement, plugs the estimated rotation into the camera projection model, and then estimates the translation according to the camera projection model. In comparison to the CV-based solution, our solution can estimate the translation more accurately with fewer feature point pairs, since the number of undetermined degrees of freedom in the 3D motion is directly reduced from 6 to 3. After that, we further smooth the camera's motion to remove the undesired jitters during the moving process. As shown in Fig. 1, according to the mapping relationship between the original moving path and the smoothed moving path, we warp each pixel from the original frame into a corresponding pixel in the stabilized frame. In this way, the stabilized video appears to have been captured along the smoothed moving path of the camera. In comparison with recent visual-inertial video stabilization methods [12], [13], our solution estimates the translation and rotation more accurately, and meets the real-time requirement for online processing, by directly reducing the number of undetermined degrees of freedom from 6 to 3 for CV-based processing.
There are two key challenges to address in this paper. The first challenge is to accurately estimate and effectively smooth the camera's 3D motion in the situation of fast movement and violent jitters, due to the sudden movement during the video shoot. To address this challenge, firstly, we use the gyroscope to perform the rotation estimation and figure out a 3×3 rotation matrix, since the gyroscope can accurately estimate the rotation even if fast movement and violent jitters occur. Then, to smooth the rotation, instead of smoothing the 9 dependent parameters separately, we further transform the 3×3 rotation matrix into the 1×3 Euler angles, and apply a low-pass filter over the 3 independent Euler angles separately. In this way, we are able to effectively smooth the rotation while maintaining the consistency among the multiple parameters. Secondly, we build a camera projection model by considering the rotation and translation of the camera. Then, by substituting the estimated rotation into the camera projection model, we directly estimate the translation according to the matched feature point pairs in adjacent image frames. In the situation of fast movement and violent jitters, it is usually difficult to find enough feature point pairs between adjacent image frames to estimate the camera's 3D motion. In comparison to the traditional CV-based approaches, our solution requires fewer feature point pairs, as we directly reduce the number of undetermined degrees of freedom in the 3D motion from 6 to 3. The second challenge is to sufficiently reduce the computation overhead of video stabilization, so as to make real-time processing practical in the resource-constrained mobile devices. Traditional CV-based approaches usually require at least 5~8 pairs of feature points to estimate the rotation and translation. They involve 6 degrees of freedom, thus they usually incur large computation overhead, failing to perform video stabilization in a real-time manner. To address this challenge, our solution reduces the computation overhead by directly reducing the undetermined degrees of freedom from 6 to 3. Specifically, we use the inertial measurements to estimate the rotation. Our solution then requires as few as 3 pairs of feature points to estimate the translation, which reduces the burden of the CV-based processing by over 50%. This makes real-time processing possible in mobile devices.
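As an illustration of the rotation-smoothing step in the first challenge, the sketch below converts each 3×3 rotation matrix into Euler angles, low-pass filters the three angles independently, and converts the result back into a rotation matrix. It is only a minimal sketch: the ZYX Euler convention and the first-order IIR filter with coefficient alpha are assumptions made for the example, not necessarily the exact choices of the system.

```python
import numpy as np

def mat_to_euler(R):
    """3x3 rotation matrix -> [roll, pitch, yaw] for R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    pitch = -np.arcsin(np.clip(R[2, 0], -1.0, 1.0))
    roll = np.arctan2(R[2, 1], R[2, 2])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return np.array([roll, pitch, yaw])

def euler_to_mat(angles):
    """[roll, pitch, yaw] -> 3x3 rotation matrix (same ZYX convention as above)."""
    r, p, y = angles
    Rx = np.array([[1, 0, 0], [0, np.cos(r), -np.sin(r)], [0, np.sin(r), np.cos(r)]])
    Ry = np.array([[np.cos(p), 0, np.sin(p)], [0, 1, 0], [-np.sin(p), 0, np.cos(p)]])
    Rz = np.array([[np.cos(y), -np.sin(y), 0], [np.sin(y), np.cos(y), 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def smooth_rotations(rotations, alpha=0.1):
    """Low-pass filter a sequence of per-frame rotation matrices through their Euler angles.
    A first-order IIR filter stands in for the low-pass filter; it assumes the angles stay
    away from the +/-pi wrap-around, which holds for small frame-to-frame jitters."""
    smoothed, state = [], None
    for R in rotations:
        angles = mat_to_euler(R)
        state = angles if state is None else (1 - alpha) * state + alpha * angles
        smoothed.append(euler_to_mat(state))
    return smoothed
```

Filtering the three Euler angles rather than the nine matrix entries keeps the smoothed result a valid rotation, which is exactly the consistency property discussed above.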
We make three key contributions in this paper. 1) We investigate video stabilization for camera shoot in mobile devices. By fusing the IMU-based method and the CV-based method, our solution is robust to fast movement and violent jitters, and greatly reduces the computation overhead of video stabilization. 2) We conduct empirical studies to investigate the impact of movement jitters and the measurement errors of IMU-based approaches. We build a camera projection model by considering the rotation and translation of the camera, and further build a camera motion model to depict the relationship between the inertial-visual state and the camera's 3D motion. 3) We implemented a prototype system on smart glasses and smart phones, and evaluated the performance under real scenarios, i.e., the human subjects used mobile devices to shoot videos while they were walking, climbing or riding. The experiment results show that our solution achieves 32% better performance than the state-of-the-art solutions in regard to video stabilization. Moreover, the average processing latency is 32.6ms, which is lower than the conventional inter-frame time interval, i.e., 33ms, and thus meets the real-time requirement for online processing.

2 RELATED WORK

CV-based Solution: Traditional CV-based solutions for video stabilization can be roughly divided into 2D stabilization and 3D stabilization. 2D video stabilization solutions use a series of 2D transformations between adjacent frames to represent the camera motion, and smooth these transformations to stabilize the video [1], [2], [14]. However, these methods cannot figure out the camera's 3D motion, thus
they usually fail to compute the changes of projection for the target scene when there exist significant depth changes. Recent 3D video stabilization solutions [3], [4], [15] all seek to stabilize the videos based on the 3D camera motion model. They use the structure-from-motion among the image frames to estimate the 3D camera motion, thus they can deal with parallax distortions caused by depth variations. Hence, they are usually more effective and robust in video stabilization, at the cost of large computation overhead. Therefore, they are usually performed in an offline manner for video stabilization. Moreover, when the camera moves fast or experiences violent jitters, they may not find a sufficient number of feature points to estimate the motion.

IMU-based Solution: For mobile devices, since the built-in gyroscopes and accelerometers can be directly used to estimate the camera's motion, IMU-based solutions [8], [9], [16], [17] have been proposed for video stabilization recently. Karpenko et al. calculate the camera's rotation by integrating the gyroscope readings directly [8], whereas Hanning et al. take into account the noise of the gyroscope readings and estimate the camera's rotation with an extended Kalman filter that fuses the readings of the gyroscope and accelerometer [9]. These IMU-based solutions are much faster than the CV-based solutions, but they only consider the rotation in modeling the camera motion without the translation, since the gyroscope can accurately track the rotation, whereas the accelerometer usually fails to accurately track the translation due to large cumulative tracking errors.

Hybrid Solution: Recent works seek to fuse the inertial and visual-based methods to track the camera's motion [18], [19], [20]. Yang et al. fuse the visual and inertial measurements to track the camera state for augmented reality [19]. In video stabilization, Jia et al. propose an EKF-based method to estimate the 3D camera rotation by using both the video and inertial measurements [20]. Still, they only use the pure rotation to depict the camera motion and ignore the camera's translation. In this paper, we investigate video stabilization in mobile devices, by accurately estimating and smoothing the camera's 3D motion, i.e., the camera rotation and translation. By fusing the IMU-based method and the CV-based method, our solution is robust to fast movement and violent jitters; moreover, it greatly reduces the computation overhead of video stabilization. In comparison with recent visual-inertial video stabilization methods [12], [13], our solution estimates the translation and rotation more accurately, and meets the real-time requirement for online processing, by directly reducing the number of undetermined degrees of freedom from 6 to 3 for CV-based processing.

3 PRELIMINARY

To illustrate the principle of camera shoot in mobile devices, we use the pinhole camera model [11] to depict the camera projection. As illustrated in Fig. 2, for an arbitrary point P from the specified object in the scene, a ray from this 3D point P to the camera optical center O_c intersects the image plane at a point P'. Then, the relationship between the point P = [X, Y, Z]^T in the 3D camera coordinate system and its image projection pixel P' = [u, v]^T in the 2D image plane can be represented as:

$$ Z \cdot \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha f & 0 & c_x \\ 0 & \beta f & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = KP, \qquad (1) $$

where K is the camera intrinsic matrix [11], which contains the camera's intrinsic parameters [c_x, c_y]^T, α, β and f. Here, [c_x, c_y]^T is the pixel coordinate of the principal point C in the image plane. f is the camera focal length, which is represented in physical measurements, i.e., meters, and is equal to the distance from the camera center O_c to the image plane, i.e., O_cC. Considering that the projected points in the image plane are described in pixels, while 3D points in the camera coordinate system are represented in physical measurements, i.e., meters, we introduce the parameters α and β to correlate the same points in different coordinate systems using different units. Thus the parameters α and β are the numbers of pixels per meter (i.e., per unit distance in physical measurements) along the x_i-axis and y_i-axis, as shown in Fig. 2. Note that α and β may be different because the aspect ratio of the unit pixel is not guaranteed to be one. We can obtain these intrinsic parameters of the camera in advance from prior calibration [21]. Then, the coordinate of the projection P' in the 2D image plane, i.e., [u, v]^T, can be computed according to Eq. (1).

Fig. 2. Pinhole camera model.
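As a small illustration of Eq. (1), the sketch below projects a 3D point given in the camera coordinate system onto the image plane. The intrinsic values used here (αf = βf = 1300 pixels, principal point at the center of a 1080p frame) are assumptions for the example only; in practice they come from calibration [21].

```python
import numpy as np

# Assumed intrinsic values for the illustration; the real ones come from calibration.
alpha_f, beta_f = 1300.0, 1300.0     # alpha*f and beta*f, in pixels
c_x, c_y = 960.0, 540.0              # principal point of a 1080p frame

K = np.array([[alpha_f, 0.0,    c_x],
              [0.0,     beta_f, c_y],
              [0.0,     0.0,    1.0]])

def project(P, K):
    """Project a 3D point P = [X, Y, Z] (camera coordinates, meters) to pixels via Eq. (1)."""
    p = K @ np.asarray(P, dtype=float)   # p = [Z*u, Z*v, Z]
    return p[:2] / p[2]                  # divide by Z to obtain [u, v]

u, v = project([0.2, -0.1, 1.5], K)      # a point at X = 0.2m, Y = -0.1m, Z = 1.5m
```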
4 EMPIRICAL STUDY

During the camera shoot, the camera usually experiences back-and-forth movement jitters of fairly high frequency and fast speed, with small rotations and translations. In this section, we perform empirical studies on a real-world testbed in regard to the movement jitters and measurement errors, so as to investigate the following issues: 1) To what extent do the movement jitters in the 3D space affect the pixel jitters in the image plane of the camera? 2) What are the average measurement errors in measuring the rotation and translation of the camera with the inertial sensors?

Without loss of generality, we use the smart phone Lenovo PHAB2 Pro as the testing platform. This platform has a 16-megapixel camera, which we use to capture 1080p videos at 30 frames per second. Moreover, this platform has an inertial measurement unit (BOSCH BMI160) consisting of a 3-axis accelerometer and a 3-axis gyroscope, which we use to capture the linear acceleration and the angular rate of the body frame at a frequency of 200Hz, respectively. To capture the ground truth of the 3D motion of the mobile device, including the rotation and translation, we use the OptiTrack system [22] to collect the experiment data.
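To relate the 200Hz gyroscope stream to the 30fps video, the angular-rate samples between two frame timestamps can be accumulated into one frame-to-frame rotation matrix. The sketch below shows one common way to do this (cf. [8]), using Rodrigues' formula for each sample interval; it is an illustrative integration scheme, not necessarily the exact one used in our prototype.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def delta_rotation(omega, dt):
    """Rotation matrix for angular rate omega (rad/s, 3-vector) applied for dt seconds."""
    theta = np.linalg.norm(omega) * dt
    if theta < 1e-12:
        return np.eye(3)
    k = skew(omega / np.linalg.norm(omega))
    return np.eye(3) + np.sin(theta) * k + (1.0 - np.cos(theta)) * (k @ k)

def integrate_gyro(samples, dt=1.0 / 200.0):
    """Accumulate gyroscope samples (N x 3, rad/s) between two frame timestamps
    into a single frame-to-frame rotation matrix."""
    R = np.eye(3)
    for omega in samples:
        R = R @ delta_rotation(np.asarray(omega, dtype=float), dt)
    return R
```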
Fig. 3. The experiment results of the empirical study: (a) the pixel jitters (pixel) vs. the distance of the target points (cm) due to the rotation-based jitter, δθ = 10°; (b) the pixel jitters (pixel) vs. the distance of the target points (cm) due to the translation-based jitter, δt = 5cm; (c) the rotation measurement error (deg) from the gyroscope at slow, medium and fast speeds; (d) the translation measurement error (cm) from the accelerometer at slow, medium and fast speeds.

4.1 Observations

Observation 1. When the camera is subject to the same rotation-based jitters, the stationary target points closer to the image plane suffer from stronger pixel jitters in the image plane. To evaluate how the rotation-based jitters affect the pixel jitters in the image plane, we deployed the stationary target points in a line parallel to the optical axis of the camera and with different distances to the image plane. Without loss of generality, we performed the rotation-based jitters around the y-axis of the camera coordinate system, which leads to the coordinate change of the projection along the x-axis. The maximum rotation angle δθ is set to 10 degrees by default. Then, we measured the pixel jitter for the target points with different depths, i.e., the coordinate difference in pixels between the projections before and after the rotation-based jitter. As shown in Fig. 3(a), we use the pinhole camera model to predict the pixel jitter of an object at a given distance, and plot it as the curve shown in green; then we plot the corresponding experiment results for the pixel jitter of an object at a given distance. The comparison between the theoretical results and the experiment results shows that the observations from the experiments are consistent with the theoretical hypothesis from the pinhole camera model. According to the experiment results, we found that as the depth of the target point increases from 10cm to 50cm, the pixel jitter decreases rapidly from 314 pixels to 235 pixels. Then, as the depth further increases from 50cm to 150cm, the pixel jitter decreases very slowly from 235 pixels to 230 pixels.

Observation 2. When the camera is subject to the same translation-based jitters, the stationary target points closer to the image plane suffer from stronger pixel jitters in the image plane. To evaluate how the translation-based jitters affect the pixel jitters in the image plane of the camera, we deployed the target points on the optical axis of the camera and with different distances to the image plane. Without loss of generality, we performed the translation-based jitters along the x-axis of the camera coordinate system; the maximum displacement δt is set to 5cm by default.
Then, we also measured the pixel jitter for the target points with different depths. As shown in Fig. 3(b), we use the pinhole camera model to predict the pixel jitter of an object at a given distance, and plot it as the curve shown in green; then we plot the corresponding experiment results for the pixel jitter of an object at a given distance. We found that as the depth of the target point increases from 10cm to 50cm, the pixel jitter decreases rapidly from 650 pixels to 130 pixels. Then, as the depth further increases from 50cm to 150cm, the pixel jitter decreases very slowly from 130 pixels to 43 pixels.

Observation 3. When the mobile device is rotating, the gyroscope is able to accurately measure the rotation in the low, medium and high speed modes. To evaluate the average measurement errors in measuring the rotation with the gyroscope, without loss of generality, we rotated the mobile device around the z-axis of the local coordinate system by 45°, 90° and 180°, respectively. Besides, for each rotation angle, we evaluate the measurement errors in the low speed (10°/s), medium speed (40°/s) and high speed (100°/s) modes, respectively. Specifically, the measurement errors are calculated by comparing the gyroscope measurement with the ground truth. According to the experiment results in Fig. 3(c), we found that, as the rotation angle increases from 45° to 180°, the measurement error increases slightly, but it is always less than 2° in all cases.

Observation 4. When the mobile device is moving back and forth, the accelerometer usually fails to accurately measure the translation in the low, medium and high speed modes. To evaluate the average measurement errors in measuring the translation with the accelerometer, without loss of generality, we move the mobile device back and forth in the range of [-5cm, +5cm] along the z-axis of the local coordinate system, varying the overall moving distance from 10~15cm to 40~45cm. Besides, for each moving distance, we evaluate the measurement errors in the low speed (3cm/s), medium speed (30cm/s) and high speed (100cm/s) modes, respectively. Specifically, the measurement errors are calculated by comparing the accelerometer measurement with the ground truth. As shown in Fig. 3(d), we found that, for all three speed modes, as the moving distance increases from 10~15cm to 40~45cm, the corresponding measurement errors increase linearly. Moreover, the measurement errors for all speed modes and all moving distances are greater than 10cm. Since the actual translation ranges within [-5cm, +5cm] and the maximum moving distance is less than 45cm, an average measurement error (whether displacement error or distance error) greater than 10cm is not acceptable at all.

4.2 Summary

Both the rotation-based jitters and the translation-based jitters cause non-negligible pixel jitters in the image plane during the video shoot. With the inertial measurement units, the rotation can usually be accurately measured by the gyroscope, whereas the translation fails to be accurately measured by the accelerometer. Therefore, it is essential to estimate the translation in an accurate and lightweight manner, such that the video stabilization can be effectively performed.
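As a rough cross-check of Observations 1 and 2, the sketch below uses the pinhole camera model to predict the pixel jitter caused by a rotation-based jitter (δθ = 10°) and a translation-based jitter (δt = 5cm) for target points at different depths. The intrinsic value αf and the lateral offset of the target points are assumptions made for the illustration, so the numbers are only meant to reproduce the qualitative trend of Fig. 3(a) and Fig. 3(b): the pixel jitter drops quickly with depth at close range and flattens out at larger depths.

```python
import numpy as np

ALPHA_F = 1300.0    # assumed alpha*f of the camera, in pixels
X_OFFSET = 0.05     # assumed lateral offset (m) of the targets in the rotation test

def pixel_u(X, Z):
    """Horizontal pixel coordinate (relative to c_x) of the point (X, 0, Z), from Eq. (1)."""
    return ALPHA_F * X / Z

def rotation_jitter(Z, delta_theta_deg=10.0, X=X_OFFSET):
    """Pixel jitter of a static point at depth Z under a rotation jitter about the y-axis."""
    th = np.radians(delta_theta_deg)
    # Static point expressed in the jittered camera frame (one choice of jitter direction).
    Xr = X * np.cos(th) + Z * np.sin(th)
    Zr = -X * np.sin(th) + Z * np.cos(th)
    return abs(pixel_u(Xr, Zr) - pixel_u(X, Z))

def translation_jitter(Z, delta_t=0.05):
    """Pixel jitter of a static point on the optical axis under a translation jitter along x."""
    return abs(pixel_u(-delta_t, Z))

for Z in (0.10, 0.50, 1.50):   # depths of 10cm, 50cm and 150cm
    print(f"Z={Z:.2f}m  rotation: {rotation_jitter(Z):6.1f}px  "
          f"translation: {translation_jitter(Z):6.1f}px")
```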
5 PROBLEM FORMULATION AND MODELING

5.1 Problem Formulation

According to the observations in the empirical study, in order to achieve video stabilization, we need to accurately track the rotation and translation during the video shoot, so as to effectively remove the jitters caused by the rotation and translation. Meanwhile, we need to perform the rotation/translation estimation in a lightweight manner, so as to make the computation overhead suitable for real-time processing. Therefore, based on the above understanding, it is essential to statistically minimize the expectation of both the rotation estimation error and the translation estimation error during the video shoot. Meanwhile, we need to effectively limit the expected computation overhead within a certain threshold, say \tau. Specifically, let the rotation estimation error and the translation estimation error be \delta_r and \delta_t, respectively, and let the computation overhead for rotation estimation and translation estimation be c_r and c_t, respectively. We use the function \mathrm{exp}(\cdot) to denote the expectation. Then, the objective of our solution is to

\min \; \mathrm{exp}(\delta_r) + \mathrm{exp}(\delta_t), \quad \text{subject to: } \mathrm{exp}(c_r) + \mathrm{exp}(c_t) \le \tau. \qquad (2)

To achieve the above objective, we first analyze the pros and cons of the IMU-based and CV-based approaches, as shown in Table 1. To track the translation, considering that only the CV-based approach is able to track the translation with high accuracy, we use the CV-based approach to estimate the translation. To track the rotation, on the one hand, both the IMU-based and CV-based approaches are able to track the rotation with high accuracy; on the other hand, the computational complexity of the CV-based approach is relatively high, especially when all 6 degrees of freedom (DoF) are undetermined. Hence, we use the IMU-based approach to estimate the rotation, due to its low computational complexity. In this way, the computation overhead of the CV-based approach is greatly reduced, since the number of undetermined DoF for the CV-based processing drops from 6 to 3.

              | Rotation Tracking     | Translation Tracking   | Compute Complexity
    IMU-based | High accuracy (3 DoF) | Low accuracy (3 DoF)   | Low
    CV-based  | High accuracy (3 DoF) | High accuracy (3 DoF)  | High

TABLE 1. Pros and cons of the IMU-based and CV-based approaches for video stabilization.

Therefore, after formulating the video stabilization problem in an expectation-minimization framework, we can decompose and solve this complex optimization problem by breaking it down into two subproblems, i.e., using the IMU-based approach to estimate the rotation and using the CV-based approach to estimate the translation.

5.2 Camera Projection Model

According to the pinhole camera model, for any arbitrary 3D point P from a stationary object in the scene, the corresponding 2D projection P' in the image plane keeps unchanged as long as the camera is stationary. However, when the body frame of the camera is dynamically moving in the 3D space, the camera coordinate system as well as the image plane is also continuously moving, which involves rotation and translation.
In this way, even if the point P keeps still in the 3D space, the corresponding projection P' is dynamically changing in the 2D plane, thus further leading to video shaking in the image plane.

As any 3D motion can be decomposed into the combination of rotation and translation, we can use the rotation matrix R_{t_0,t} and a vector T_{t_0,t} to represent the rotation and translation of the camera coordinate system, respectively, from the time t_0 to the time t. Then, for a target point P_i in the camera coordinate system, if its coordinate at time t_0 is denoted as P_{i,t_0}, then, after the rotation and translation of the camera coordinate system, its coordinate P_{i,t} at time t can be computed by

P_{i,t} = R_{t_0,t} P_{i,t_0} + T_{t_0,t}. \qquad (3)

Therefore, according to Eq. (1), for the point P_{i,t} at time t, the corresponding projection in the image plane, i.e., P'_{i,t} = [u_{i,t}, v_{i,t}]^T, can be computed by

Z_{i,t} \cdot [u_{i,t}, v_{i,t}, 1]^T = K P_{i,t} = K (R_{t_0,t} P_{i,t_0} + T_{t_0,t}), \qquad (4)

where Z_{i,t} is the coordinate of P_{i,t} along the z-axis of the camera coordinate system at time t, and K is the camera intrinsic matrix.

5.3 Camera Motion Model

5.3.1 Coordinate Transformation

Since mobile devices are usually equipped with an Inertial Measurement Unit (IMU), the motion of the camera can be measured by the IMU in the local coordinate system of the body frame, as shown in Fig. 4. As aforementioned in Section 5.2, the camera projection is measured in the camera coordinate system. Therefore, once we figure out the camera's motion from the inertial measurements in the local coordinate system, it is essential to transform this motion into the camera coordinate system.

Fig. 4. The local coordinate system and the camera coordinate system of the rear camera.

For the embedded camera of the mobile device, we take the most commonly used rear camera as an example. Fig. 4 shows the camera coordinate system and the local coordinate system. According to the relationship between the camera coordinate system and the local coordinate system, we can use a 3x3 rotation matrix

M = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & -1 \end{bmatrix}

to denote the coordinate transformation between the two coordinate systems. For any other camera, we can also use a similar rotation matrix M' to denote the corresponding coordinate transformation.
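To make the coordinate handling concrete, the following is a minimal sketch (Python/NumPy, with hypothetical function and variable names) of how a rotation accumulated from gyroscope readings in the local coordinate system can be carried into the camera coordinate system with the matrix M. The first-order small-angle update anticipates the rotation estimation of Section 5.3.2 (Eq. (5)), and the conjugation by M anticipates Eq. (6); it is an illustrative sketch, not the exact implementation.

    import numpy as np

    # Fixed transform between the local (IMU) frame and the rear-camera frame (Section 5.3.1).
    M = np.array([[0.0, -1.0, 0.0],
                  [-1.0, 0.0, 0.0],
                  [0.0, 0.0, -1.0]])

    def small_angle_rotation(omega, dt):
        """First-order rotation A_{t,t+dt} for angular rate omega [rad/s] over dt [s]."""
        wx, wy, wz = omega * dt
        W = np.array([[0.0, -wz,  wy],          # skew-symmetric matrix of the rotation vector
                      [ wz, 0.0, -wx],
                      [-wy,  wx, 0.0]])
        return np.eye(3) + W

    def integrate_gyro(gyro_samples, dt):
        """Accumulate R'_{t0,t} in the local frame from a sequence of angular-rate samples."""
        R_local = np.eye(3)
        for omega in gyro_samples:
            R_local = small_angle_rotation(np.asarray(omega), dt) @ R_local
        return R_local

    def to_camera_frame(R_local):
        """Express the accumulated rotation in the camera coordinate system: M R' M^{-1}."""
        return M @ R_local @ np.linalg.inv(M)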
5.3.2 Rotation Estimation

According to Eq. (4), it is essential to accurately estimate the rotation matrix R_{t_0,t} and the translation vector T_{t_0,t}, such that the projection of P_i at time t in the image plane, i.e., P'_{i,t} = [u_{i,t}, v_{i,t}]^T, can be figured out. To estimate the camera's rotation R_{t_0,t} from the time t_0 to the time t, we first use the gyroscope to measure the angular speed along each axis of the local coordinate system. Then, according to the small angle approximation [23], we can compute the rotation matrix A_{t,t+\delta t} relating the local coordinate system at time t to the one at time t+\delta t. After obtaining A_{t,t+\delta t}, we can further update the rotation matrix R'_{t_0,t} for the local coordinate system as follows:

R'_{t_0,t+\delta t} = A_{t,t+\delta t} R'_{t_0,t}. \qquad (5)

Hence, considering the coordinate transformation between the local coordinate system and the camera coordinate system, we further compute the camera's rotation in the camera coordinate system as follows:

R_{t_0,t} = M R'_{t_0,t} M^{-1}. \qquad (6)

5.3.3 Translation Estimation

Considering the nonnegligible error accumulation of using the linear acceleration to calculate the translation, we introduce the computer vision (CV)-based method, which utilizes the feature point pairs to estimate the motion between two frames. However, different from traditional CV-based methods, which calculate both the rotation and the translation along each axis, i.e., 6 degrees of freedom (DoF), we have already calculated the rotation from the gyroscope and only need to calculate the unknown translation, i.e., 3 DoF. Therefore, we reduce the number of undetermined DoF in the 3D motion from 6 to 3. Specifically, we first detect the feature point pairs to estimate the 3D motion of the camera. After that, we subtract the rotation measured by the IMU from the estimated 3D motion to obtain the translation. However, due to the continuous change of the camera coordinate system and the non-unified unit of the estimated translation, we introduce an initialization step to define a unified translation unit and represent the fixed 3D points in this unit, so that the following translations are estimated in the same unified unit.

Feature Point Extraction. For each image frame of the video, we first utilize the FAST (Features from Accelerated Segment Test) keypoint detector to detect the feature points, and then calculate the binary BRIEF (Binary Robust Independent Elementary Features) descriptor [24] of each feature point. A feature point and its descriptor together form an ORB feature [25]. Specifically, for the feature point P'_{i,t_0} in the image frame I_{t_0} and the feature point P'_{j,t_1} in the image frame I_{t_1}, we use D_i and D_j to represent their descriptors, respectively. Then, the similarity between P'_{i,t_0} and P'_{j,t_1} can be measured by the Hamming distance [26] between their descriptors D_i and D_j. Given P'_{i,t_0}, we choose the feature point with the nearest Hamming distance in the image I_{t_1} as the matching feature point for P'_{i,t_0}, and they form a feature point pair. The coordinate difference between the feature point pair can be used to estimate the camera's motion.
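As an illustration of the feature extraction and matching step described above, the sketch below uses OpenCV's ORB implementation (FAST keypoints with BRIEF descriptors) and brute-force Hamming matching; the frame variables and the feature budget are hypothetical, and the actual system may tune them differently.

    import cv2

    def match_orb_features(frame_t0, frame_t1, n_features=500):
        """Detect ORB features in two frames and match them by Hamming distance."""
        orb = cv2.ORB_create(nfeatures=n_features)
        kp0, des0 = orb.detectAndCompute(frame_t0, None)
        kp1, des1 = orb.detectAndCompute(frame_t1, None)
        # Brute-force matcher with Hamming distance; cross-check keeps mutual nearest neighbours.
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = sorted(matcher.match(des0, des1), key=lambda m: m.distance)
        # Matched pixel coordinates (P'_{i,t0}, P'_{i,t1}) for each feature point pair.
        return [(kp0[m.queryIdx].pt, kp1[m.trainIdx].pt) for m in matches]

    # Example usage with two hypothetical consecutive frames:
    # f0 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
    # f1 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)
    # point_pairs = match_orb_features(f0, f1)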
Initialization for Translation Unit. As shown in Fig. 5, for each feature point pair (P'_{i,t_0}, P'_{i,t_1}), given the projection point P'_{i,t_0} and the camera optical center O_{t_0}, the target 3D point P_i must lie on the ray O_{t_0}P'_{i,t_0}. Similarly, the target point P_i should also lie on the ray O_{t_1}P'_{i,t_1}. Thus P_i is the intersection point of O_{t_0}P'_{i,t_0} and O_{t_1}P'_{i,t_1}. In computer vision, this is referred to as the epipolar constraint [11]. Then, we can use the fundamental matrix F_{t_0,t_1} [11] to describe the epipolar constraint, i.e.,

P'^{T}_{i,t_1} F_{t_0,t_1} P'_{i,t_0} = 0. \qquad (7)

Fig. 5. Epipolar geometry.

Here, the fundamental matrix F_{t_0,t_1} can be generated from the relative rotation R_{t_0,t_1} and translation T_{t_0,t_1} from the image I_{t_0} to the image I_{t_1}, i.e.,

F_{t_0,t_1} = K^{-T} [T_{t_0,t_1}]_\times R_{t_0,t_1} K^{-1}, \qquad (8)

where K is the camera intrinsic matrix and [T_{t_0,t_1}]_\times is a 3x3 matrix. Specifically, let T_{t_0,t_1} = [T^x_{t_0,t_1}, T^y_{t_0,t_1}, T^z_{t_0,t_1}]^T; then

[T_{t_0,t_1}]_\times = \begin{bmatrix} 0 & -T^z_{t_0,t_1} & T^y_{t_0,t_1} \\ T^z_{t_0,t_1} & 0 & -T^x_{t_0,t_1} \\ -T^y_{t_0,t_1} & T^x_{t_0,t_1} & 0 \end{bmatrix}.

Therefore, by substituting Eq. (8) into Eq. (7), we have

(P'^{T}_{i,t_1} K^{-T}) [T_{t_0,t_1}]_\times (R_{t_0,t_1} K^{-1} P'_{i,t_0}) = 0. \qquad (9)

Here, the rotation matrix R_{t_0,t_1} can be estimated from the gyroscope measurements. Then, the only unknown factor in Eq. (9) is [T_{t_0,t_1}]_\times, which has three unknown parameters T^x_{t_0,t_1}, T^y_{t_0,t_1}, T^z_{t_0,t_1}. Therefore, as long as we can obtain more than three pairs of matching feature points, we can solve [T_{t_0,t_1}]_\times based on the Least Square Error (LSE) method.

However, there could be multiple solutions for [T_{t_0,t_1}]_\times, as we can multiply both sides of Eq. (9), whose right side is 0, by any nonzero coefficient. For convenience, we figure out one of the solutions for T^x_{t_0,t_1}, T^y_{t_0,t_1}, and T^z_{t_0,t_1} in the camera coordinate system, based on Eq. (9). It is noteworthy that the translation T_{t_0,t_1} calculated from Eq. (9) is represented in a relative manner, rather than in an absolute unit. Therefore, there is a scale factor \alpha between the calculated translation and the actual translation T^{*}_{t_0,t_1} in the absolute unit, as shown in Eq. (10), where T^{x*}_{t_0,t_1}, T^{y*}_{t_0,t_1} and T^{z*}_{t_0,t_1} denote the actual translation along the x-axis, y-axis and z-axis, respectively:

T^x_{t_0,t_1} = T^{x*}_{t_0,t_1} \cdot \alpha, \quad T^y_{t_0,t_1} = T^{y*}_{t_0,t_1} \cdot \alpha, \quad T^z_{t_0,t_1} = T^{z*}_{t_0,t_1} \cdot \alpha. \qquad (10)

However, it is actually difficult to transform the calculated translation T_{t_0,t_1} into the absolute unit, since the values of \alpha, T^{x*}_{t_0,t_1}, T^{y*}_{t_0,t_1}, and T^{z*}_{t_0,t_1} are all unknown. To tackle the above issue, we define

|T_{t_0,t_1}| = \sqrt{(T^x_{t_0,t_1})^2 + (T^y_{t_0,t_1})^2 + (T^z_{t_0,t_1})^2} \qquad (11)

as the translation unit. In the following frames, when we calculate a new translation, we represent it in this translation unit. Consequently, we can represent the calculated translation over all frames with a unified unit.
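Since the rotation R_{t_0,t_1} is already known from the gyroscope, each matched pair contributes one linear constraint on T_{t_0,t_1} through Eq. (9). A least-squares solution up to scale can then be obtained from the approximate null space of the stacked constraints and normalized to the unit length of Eq. (11). The sketch below is one plausible numerical realization (an SVD-based null-space solve), with hypothetical variable names; the paper only states that an LSE method is used.

    import numpy as np

    def estimate_translation_direction(point_pairs, R, K):
        """
        Solve the epipolar constraint x1^T [T]x (R x0) = 0 of Eq. (9) for T up to scale,
        given matched pixel pairs [(p0, p1), ...], the known rotation R and intrinsics K.
        """
        K_inv = np.linalg.inv(K)
        rows = []
        for (u0, v0), (u1, v1) in point_pairs:
            x0 = R @ K_inv @ np.array([u0, v0, 1.0])   # rotated normalized ray at t0
            x1 = K_inv @ np.array([u1, v1, 1.0])       # normalized ray at t1
            # x1 . (T x x0) = T . (x0 x x1): one linear equation in T per pair.
            rows.append(np.cross(x0, x1))
        A = np.vstack(rows)
        # Least-squares null vector: right singular vector with the smallest singular value.
        _, _, Vt = np.linalg.svd(A)
        T = Vt[-1]
        return T / np.linalg.norm(T)   # normalized to the unified translation unit, Eq. (11)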
Compute Coordinates of Fixed 3D Points in Unified Unit. According to Eq. (4), representing the translation T_{t_0,t} in the unified unit also means representing the coordinate of the 3D point P_{i,t} in the unified unit. Since the 3D points are stationary, we can use the coordinate of the 3D point P_{i,t_0} at time t_0 to represent the coordinate of P_i at any time. In regard to the coordinate of any fixed 3D point at time t_0, according to Eq. (4), we can use the rotation R_{t_0,t_1} and the translation T_{t_0,t_1} from t_0 to t_1 to calculate it in the unified unit, since T_{t_0,t_1} is expressed in the unified unit. Specifically, for an arbitrary target point P_i, suppose the 3D coordinates of P_i in the camera coordinate system are P_{i,t_0} = [X_{i,t_0}, Y_{i,t_0}, Z_{i,t_0}] and P_{i,t_1} = [X_{i,t_1}, Y_{i,t_1}, Z_{i,t_1}] at the times t_0 and t_1, respectively. Then, the corresponding 2D projections in the image plane are P'_{i,t_0} and P'_{i,t_1}, respectively. Hence, based on the camera projection model in Eq. (4), we have

Z_{i,t_0} P'_{i,t_0} = K P_{i,t_0} \Rightarrow Z_{i,t_0} K^{-1} P'_{i,t_0} = P_{i,t_0},
Z_{i,t_1} P'_{i,t_1} = K P_{i,t_1} \Rightarrow Z_{i,t_1} K^{-1} P'_{i,t_1} = P_{i,t_1}. \qquad (12)

After we calculate the rotation R_{t_0,t_1} and the translation T_{t_0,t_1} of the camera coordinate system between the two time points t_0 and t_1, we have P_{i,t_1} = R_{t_0,t_1} P_{i,t_0} + T_{t_0,t_1}, based on Eq. (3). Thus, according to Eq. (12), we further have

P_{i,t_1} = Z_{i,t_0} R_{t_0,t_1} K^{-1} P'_{i,t_0} + T_{t_0,t_1}. \qquad (13)

If we let X_{i,t_0} = K^{-1} P'_{i,t_0} and X_{i,t_1} = K^{-1} P'_{i,t_1}, then, according to Eq. (12) and Eq. (13), we have

P_{i,t_1} = Z_{i,t_1} X_{i,t_1} = Z_{i,t_0} R_{t_0,t_1} X_{i,t_0} + T_{t_0,t_1}. \qquad (14)

Thus, to compute the coordinate of P_{i,t_1}, we only need to solve Z_{i,t_0} or Z_{i,t_1}. By taking the cross product of both sides of Eq. (14) with the vector X_{i,t_1}, we can eliminate the unknown parameter Z_{i,t_1} and then calculate the unknown parameter Z_{i,t_0}. Specifically, the left side of Eq. (14) becomes Z_{i,t_1}(X_{i,t_1} \times X_{i,t_1}), which equals 0 since the cross product of any vector with itself is 0; the right side then gives

Z_{i,t_0} (R_{t_0,t_1} X_{i,t_0}) \times X_{i,t_1} + T_{t_0,t_1} \times X_{i,t_1} = 0. \qquad (15)

According to Eq. (15), we are able to solve Z_{i,t_0}. Then, based on Eq. (14), we can further calculate P_{i,t_1}. Similarly, we can also calculate P_{i,t_0} as well as the 3D coordinates of the other target points.

Translation Estimation in Unified Unit. According to the projection model in Eq. (4), at any time t during the camera shoot, we can depict the relationship between a 3D point and its corresponding projection in the image plane. Here, K is a known parameter; as aforementioned, the rotation R_{t_0,t} can be calculated with the gyroscope-based method, and the 3D coordinates of P_{i,t_0} can be calculated with the CV-based method. Thus, the only unknown parameters are T_{t_0,t} = [T^x_{t_0,t}, T^y_{t_0,t}, T^z_{t_0,t}] and Z_{i,t}. To solve the above four parameters, we need at least two pairs of feature points to set up four equations. We can use the Least Square Error (LSE) method to solve the overdetermined equation system. After that, we are able to depict the translation with the unified translation unit. Specifically, let u = |T_{t_0,t_1}|; then we can denote T^x_{t_0,t} = \gamma_x \cdot u, T^y_{t_0,t} = \gamma_y \cdot u, T^z_{t_0,t} = \gamma_z \cdot u. In this way, we can estimate the translation of the camera in the unified unit.
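To illustrate the initialization described above, the sketch below recovers the depth Z_{i,t_0} of a matched feature from Eq. (15) and then the 3D coordinates P_{i,t_0} and P_{i,t_1} in the unified translation unit (Eqs. (12) and (14)). Function and variable names are hypothetical.

    import numpy as np

    def triangulate_point(p0, p1, R, T, K):
        """
        Recover the depth Z_{i,t0} of a matched feature (Eq. (15)) and the 3D coordinates
        P_{i,t0}, P_{i,t1} in the unified translation unit, given R_{t0,t1}, T_{t0,t1} and K.
        """
        K_inv = np.linalg.inv(K)
        X0 = K_inv @ np.array([p0[0], p0[1], 1.0])
        X1 = K_inv @ np.array([p1[0], p1[1], 1.0])
        a = np.cross(R @ X0, X1)          # coefficient of Z_{i,t0} in Eq. (15)
        b = np.cross(T, X1)               # constant term in Eq. (15)
        # Least-squares solution of the over-determined scalar equation Z * a + b = 0.
        Z0 = -float(a @ b) / float(a @ a)
        P_t0 = Z0 * X0                    # Eq. (12)
        P_t1 = Z0 * (R @ X0) + T          # Eq. (14)
        return P_t0, P_t1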
6 SYSTEM DESIGN

6.1 System Overview

The system architecture is shown in Fig. 6.

Fig. 6. System framework.

We take as input the frames of the original video and the readings of the motion sensors. We first perform Preprocessing to estimate the 3D rotation of the camera based on the solution aforementioned in Section 5.3.2, and to extract features from the video frames. The estimated rotation and the video frames with feature points then serve two tasks, i.e., Camera Calibration and Video Stabilization. Camera Calibration performs feature tracking between consecutive video frames to obtain feature point pairs, and then uses these feature point pairs to calculate the camera intrinsic parameters. Video Stabilization is performed in three major steps. First, the 3D translation of the camera is estimated based on the solution aforementioned in Section 5.3.3. Second, the 3D motion of the camera is sufficiently smoothed to remove the undesired jitters, so that a smoothed moving path of the camera is generated. Finally, given the smoothed moving path of the camera in the 3D space, the stabilized video is created by frame warping, i.e., warping each pixel in the original frame to the corresponding stabilized frame, according to the mapping relationship between the original moving path and the smoothed moving path. After that, each frame of the stabilized video appears to be captured along the smoothed moving path.

6.2 Camera Calibration

According to the pinhole camera model aforementioned in Section 3, in order to depict the camera projection, we need to know the camera's intrinsic parameters, i.e., [c_x, c_y]^T, \alpha, \beta and f. Here, [c_x, c_y]^T is the pixel coordinate of the principal point in the image plane. Without loss of generality, if we set the image size to (w, h) in pixels, then [c_x, c_y]^T is ideally equal to [w/2, h/2]^T. However, due to sensor manufacturing errors, the principal point, which is the intersection point of the optical axis and the image plane, will be slightly offset from the center of the image, i.e., [c_x, c_y]^T will not be equal to [w/2, h/2]^T and needs to be estimated. f is the camera focal length, which is represented in physical measurements, i.e., meters. \alpha and \beta are the numbers of pixels per unit distance in physical measurements (i.e., meters) along the x_i-axis and y_i-axis of the image plane, and they are used to relate the image plane, measured in pixels, to the camera coordinate system, measured in meters. Given an arbitrary camera, we may not have access to these parameters. However, we can always access the images the camera takes. Thus, we can deduce these parameters from the images, which is referred to as camera calibration.
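For reference, a common pinhole parameterization assembles the intrinsic matrix from these quantities with the effective focal lengths f_x = \alpha f and f_y = \beta f expressed in pixels. The exact form of K used in Section 3 is not repeated here, so the sketch below is an assumption about that convention rather than a statement of it.

    import numpy as np

    def intrinsic_matrix(f, alpha, beta, cx, cy):
        """Pinhole intrinsic matrix K built from the focal length f (m), the pixel
        densities alpha, beta (pixels/m) and the principal point (cx, cy) (pixels)."""
        fx, fy = alpha * f, beta * f      # effective focal lengths in pixels
        return np.array([[fx, 0.0, cx],
                         [0.0, fy, cy],
                         [0.0, 0.0, 1.0]])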
There are many different approaches to calculate the intrinsic parameters of a camera, which can be divided into two main categories, i.e., traditional camera calibration, which uses reference objects with known geometry (e.g., spin table and checkerboard) [21], and automatic calibration, which does not use any known pattern [27]. Taking into account the convenience of operations for everyday use, we propose motion-assisted calibration, which calculates the intrinsic parameters of the camera from a structured movement.

We design a simple camera calibration process, i.e., the user only needs to use the camera to shoot a video for 5~10 seconds. In the process of shooting the video, the user keeps the camera's position unchanged, and only changes the camera's orientation by rotation. Then we perform feature tracking between consecutive frames to obtain feature point pairs, and use these feature point pairs to calculate the camera intrinsic parameters. Specifically, during the camera calibration, the motion of the camera only involves rotation. For a target 3D point P_i, if its coordinate at time t_0 is denoted as P_{i,t_0}, then at time t, after the camera rotation R_{t_0,t} from time t_0 to time t, its corresponding projection in the image plane, i.e., P'_{i,t}, can be computed by

P'_{i,t} = K R_{t_0,t} P_{i,t_0}, \qquad (16)

where K is the camera intrinsic matrix, which contains the camera's intrinsic parameters.

Fig. 7. The pure rotation motion model.

In order to calculate K, we use the feature point pairs of consecutive frames. As shown in Fig. 7, for each feature point pair (P'_{i,t_j}, P'_{i,t_{j+1}}) in the consecutive frames (I_{t_j}, I_{t_{j+1}}), we have P'_{i,t_j} = K R_{t_0,t_j} P_{i,t_0} and P'_{i,t_{j+1}} = K R_{t_0,t_{j+1}} P_{i,t_0}, based on Eq. (16). Thus the mapping relationship from the feature point P'_{i,t_j} to the feature point P'_{i,t_{j+1}} can be represented as

P'_{i,t_{j+1}} = K R_{t_0,t_{j+1}} R^{-1}_{t_0,t_j} K^{-1} P'_{i,t_j}, \qquad (17)

where the coordinates of the feature point pair (P'_{i,t_j}, P'_{i,t_{j+1}}) can be obtained by feature tracking, and the rotations R_{t_0,t_j} and R_{t_0,t_{j+1}} can be obtained by the rotation estimation aforementioned in Section 5.3.2. Then, the only unknown factor in Eq. (17) is K, which has five unknown parameters, i.e., [c_x, c_y]^T, \alpha, \beta and f. To solve the above five parameters, we formulate camera calibration as an optimization problem, where we want to minimize the reprojection error over all feature point pairs:

K^* = \arg\min_{K} \sum_{j=1}^{N-1} \sum_{i=1}^{N_j} \left\| P'_{i,t_{j+1}} - K R_{t_0,t_{j+1}} R^{-1}_{t_0,t_j} K^{-1} P'_{i,t_j} \right\|^2, \qquad (18)

where N is the number of frames and N_j is the number of feature point pairs of the j-th pair of consecutive frames. By solving this optimization problem based on the Least Square Error (LSE) method, we can calculate the camera intrinsic parameters. Note that the camera calibration only needs to be done once for each camera.
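The sketch below shows one plausible way to solve the optimization problem of Eq. (18): K is parameterized by the principal point and the products f_x = \alpha f and f_y = \beta f (which is what actually enters K), and a generic nonlinear least-squares routine minimizes the reprojection residuals. The data layout, initial guess and solver choice are assumptions for illustration; the paper only states that an LSE method is used.

    import numpy as np
    from scipy.optimize import least_squares

    def calibrate_from_rotation(pairs_per_frame, rotations, image_size):
        """
        Estimate (fx, fy, cx, cy) from a purely rotational video by minimizing Eq. (18).
        pairs_per_frame[j]: list of ((u, v) at t_j, (u, v) at t_{j+1}), for j = 0..N-2.
        rotations[j]: gyroscope-based R_{t0,tj}, for j = 0..N-1 (one more entry than pairs).
        """
        w, h = image_size

        def build_K(params):
            fx, fy, cx, cy = params
            return np.array([[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]])

        def residuals(params):
            K = build_K(params)
            K_inv = np.linalg.inv(K)
            errs = []
            for j, pairs in enumerate(pairs_per_frame):
                # Homography induced by pure rotation between frames j and j+1 (cf. Eq. (17)).
                H = K @ rotations[j + 1] @ np.linalg.inv(rotations[j]) @ K_inv
                for (u0, v0), (u1, v1) in pairs:
                    p = H @ np.array([u0, v0, 1.0])
                    errs.extend([p[0] / p[2] - u1, p[1] / p[2] - v1])
            return np.array(errs)

        # Initial guess: principal point at the image center, focal length about one image width.
        x0 = np.array([float(w), float(w), w / 2.0, h / 2.0])
        return build_K(least_squares(residuals, x0).x)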
6.3 Video Stabilization

6.3.1 Camera Translation Estimation

According to the method aforementioned in Section 5.3.3, we can estimate the camera's 3D translation {T_{t_0,t}} from the time t_0 to the time t. In particular, during the procedure of the translation estimation, we use the pairs of feature points as reference points. We expect the 3D target points of the feature points to be stationary with respect to the earth coordinate system, such that the camera motion estimation and the frame warping can be accurately performed based on the coordinate variation of the feature points. However, in real applications, these 3D target points can be dynamically moving instead of keeping stationary. For example, feature points can be extracted from moving human subjects in the scene. Such feature points should be regarded as outliers. Therefore, we use the Random Sample Consensus (RANSAC) algorithm [28] to detect the outliers. Specifically, at any time t, we calculate the translation T_{t_0,t} through multiple iterations. In each iteration, e.g., the k-th iteration, we first select two pairs of matching points randomly to calculate the translation T^k_{t_0,t}. Then, we use the translation T^k_{t_0,t} to calculate the reprojection error E_{i,k} for each matching point pair (P_{i,t_0}, P'_{i,t}), as shown in Eq. (19):

E_{i,k} = \left\| P'_{i,t} - K (R_{t_0,t} P_{i,t_0} + T^k_{t_0,t}) \right\|^2. \qquad (19)

If the average reprojection error of all matching pairs at the k-th iteration is below a certain threshold, we add the calculated T^k_{t_0,t} to the candidate translations. In addition, the matching point pairs whose reprojection errors are below this threshold are classified as inliers, while those whose reprojection errors are above it are classified as outliers. If we have enough inliers or the iteration has been repeated a fixed number of times, the iteration stops. Finally, we choose the candidate translation with the minimal reprojection error as the translation at the time t.

In addition, due to the motion of the camera, some 3D points will move out of view and cannot be used to estimate the following translations. Therefore, when the number of available 3D points is less than a threshold, we detect new 3D points based on the feature point pairs of the frames at time t-1 and time t, as mentioned before.
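A simplified skeleton of this RANSAC procedure is sketched below. The inner minimal solver sets up the projection model of Eq. (4) as a linear system in the translation and the per-point depths, which is one possible formulation of the LSE solve in Section 5.3.3 rather than the paper's exact one; the iteration count, threshold and random seed are hypothetical, and the early stop on the inlier count is omitted for brevity.

    import numpy as np

    def solve_translation(sample, R, K):
        """Linear LSE solve of Z_i * K^{-1} p'_i = R P_i + T (Eq. (4)) for T and the depths Z_i."""
        K_inv = np.linalg.inv(K)
        n = len(sample)
        A = np.zeros((3 * n, n + 3))
        b = np.zeros(3 * n)
        for i, (P0, (u, v)) in enumerate(sample):
            x = K_inv @ np.array([u, v, 1.0])
            A[3 * i:3 * i + 3, i] = x                 # column for the depth Z_i
            A[3 * i:3 * i + 3, n:] = -np.eye(3)       # columns for the translation T
            b[3 * i:3 * i + 3] = R @ P0
        sol, *_ = np.linalg.lstsq(A, b, rcond=None)
        return sol[n:]                                # T_{t0,t}

    def reprojection_error(P0, p_obs, R, T, K):
        """Pixel reprojection error of Eq. (19), with the usual perspective division."""
        p = K @ (R @ P0 + T)
        return np.linalg.norm(np.array(p_obs) - p[:2] / p[2])

    def ransac_translation(corrs, R, K, n_iter=100, thresh=2.0):
        """corrs = [(P_{i,t0}, (u, v) observed at time t), ...]; returns the best translation."""
        best_T, best_err = None, np.inf
        rng = np.random.default_rng(0)
        for _ in range(n_iter):
            sample = [corrs[i] for i in rng.choice(len(corrs), size=2, replace=False)]
            T = solve_translation(sample, R, K)
            mean_err = float(np.mean([reprojection_error(P0, p, R, T, K) for P0, p in corrs]))
            if mean_err < thresh and mean_err < best_err:
                best_T, best_err = T, mean_err
        return best_T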
6.3.2 Camera Motion Smoothing

Due to the sudden movement of the mobile device during the camera shoot, there might exist a number of jitters in the measurements related to the rotation and translation. Specifically, suppose that we set the time t_0 as the initial time; according to the method aforementioned in Section 5.3, we can calculate the rotation matrices {R_{t_0,t}} and the translation vectors {T_{t_0,t}} from the time t_0 to the time t. We set the sequences {R_{t_0,t}} and {T_{t_0,t}} as the original camera motion, as they represent the original moving path of the camera, which involves the unexpected jitters. Then, to smooth the unexpected jitters, which usually exist in the high-frequency band, we apply a low-pass filter on the sequences {R_{t_0,t}} and {T_{t_0,t}}, respectively, to obtain the smoothed camera motion.

To smooth the unexpected jitters in the translation, since the translation vector T_{t_0,t} has three degrees of freedom, we can apply the low-pass filter on each of the three dimensions directly. Without loss of generality, we describe our solution in regard to the x-dimension of the translation vector {T_{t_0,t}}, i.e., {T^x_{t_0,t}}. Specifically, we apply a weighted moving average filter on the sequence {T^x_{t_0,t}} using a sliding window. To calculate the smoothed translation, we provide different weights at different positions of the sliding window. Here, we use a Gaussian function to calculate the weights, so that for the time t, the weights given to the close positions are higher than those given to the distant positions. Assume the length of the sliding window is n; then, at the time t_i, the smoothed translation \hat{T}^x_{t_0,t_i} can be computed as

\hat{T}^x_{t_0,t_i} = \sum_{j=1}^{n} \frac{w_j}{\sum_{k=1}^{n} w_k} T^x_{t_0,t_{i-j+1}}, \qquad (20)

where w_j is the weight at the position j. In this way, we are able to smooth the translation-based jitters of the camera.

To smooth the unexpected jitters in the rotation, note that the 3x3 rotation matrix R_{t_0,t} involves 9 parameters. As the camera coordinate system is rotating, we obtain 9 streams of these parameters over time. However, these parameters are not mutually independent, since the rotation usually has three degrees of freedom in the 3D space. Therefore, in order to smooth the jitters in the rotation measurement, we first transform the 3x3 rotation matrix R_{t_0,t} into the corresponding Euler angles [\phi_{t_0,t}, \theta_{t_0,t}, \psi_{t_0,t}]. Specifically, suppose

R_{t_0,t} = \begin{bmatrix} R_{11} & R_{12} & R_{13} \\ R_{21} & R_{22} & R_{23} \\ R_{31} & R_{32} & R_{33} \end{bmatrix};

then the Euler angles (\phi_{t_0,t}, \theta_{t_0,t}, \psi_{t_0,t}) can be calculated by

\phi_{t_0,t} = \arctan\left(\frac{R_{32}}{R_{33}}\right), \quad \theta_{t_0,t} = \arctan\left(\frac{-R_{31}}{\sqrt{R_{32}^2 + R_{33}^2}}\right), \quad \psi_{t_0,t} = \arctan\left(\frac{R_{21}}{R_{11}}\right). \qquad (21)

Then we use a low-pass filter, i.e., the same weighted moving average filter, to smooth the jitters in the Euler angles. After that, we further transform the smoothed Euler angles back into the rotation matrix \hat{R}_{t_0,t} in a similar manner as follows:

\hat{R}_{t_0,t} = R^z_{t_0,t} R^y_{t_0,t} R^x_{t_0,t}. \qquad (22)

Here, R^x_{t_0,t}, R^y_{t_0,t}, R^z_{t_0,t} denote the rotation matrices transformed from the smoothed Euler angles \phi_{t_0,t}, \theta_{t_0,t}, \psi_{t_0,t}, respectively. They also correspond to the rotations around the x-axis, y-axis and z-axis, respectively. In this way, we are able to smooth the rotation-based jitters of the camera.
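A minimal sketch of this smoothing step is given below: a Gaussian-weighted moving average over a sliding window for one translation component (Eq. (20)) and the rotation-to-Euler conversion (Eq. (21)). The window length and the Gaussian width are hypothetical tuning parameters, and arctan2 is used instead of arctan for numerical robustness, a minor deviation from Eq. (21).

    import numpy as np

    def gaussian_weights(n, sigma):
        """Weights w_j of Eq. (20): closer (more recent) positions get larger weights."""
        idx = np.arange(n)                      # j = 0 is the current frame, j = n-1 the oldest
        w = np.exp(-(idx ** 2) / (2.0 * sigma ** 2))
        return w / w.sum()

    def smooth_translation(T_history, n=15, sigma=5.0):
        """Weighted moving average of one translation component over a sliding window (Eq. (20)).
        T_history holds the past values of, e.g., T^x_{t0,t}, newest last."""
        window = np.array(T_history[-n:][::-1])  # newest first
        w = gaussian_weights(len(window), sigma)
        return float(w @ window)

    def rotation_to_euler(R):
        """Euler angles (phi, theta, psi) of Eq. (21); the smoothed angle streams are then
        recomposed into a rotation matrix as in Eq. (22)."""
        phi = np.arctan2(R[2, 1], R[2, 2])
        theta = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))
        psi = np.arctan2(R[1, 0], R[0, 0])
        return phi, theta, psi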
Fig. 8 shows an example of the original moving path and the smoothed moving path in a global coordinate system; the moving path involves the rotation and translation of the camera in the 3D space. It is found that the original moving path is full of jitters, while the smoothed moving path is fairly flat and smooth in its contour. The red arrows show the mapping relationship between the positions in the original moving path and the positions in the smoothed moving path.

Fig. 8. The original moving path vs. the smoothed moving path.

6.3.3 Frame Warping for Video Stabilization

During a video shoot, if we do not introduce any stabilization operation, each image frame generated from the camera view is called an original frame. However, if we use the estimated camera motion to calibrate the original image frame, we can replace the original frame with a new image frame, which is called a stabilized frame. At first, we transform the pixels P'_{i,t} corresponding to the feature points in the original frame to the pixels \hat{P}'_{i,t} in the stabilized frame. For the feature point P'_{i,t} in the original frame, the coordinate of its corresponding 3D point at time t_0, i.e., P_{i,t_0}, can be calculated based on the original camera motion. With the known coordinate P_{i,t_0} at time t_0, and the known smoothed camera motion (\hat{R}_{t_0,t}, \hat{T}_{t_0,t}) at the time t referring to the status at the time t_0, we can transform P_{i,t_0} to \hat{P}_{i,t}, and then to \hat{P}'_{i,t}, based on the camera projection model, as shown in Eq. (23):

\hat{Z}_{i,t} \cdot \hat{P}'_{i,t} = K \hat{P}_{i,t} = K (\hat{R}_{t_0,t} P_{i,t_0} + \hat{T}_{t_0,t}). \qquad (23)

Here, \hat{P}_{i,t} is the coordinate of P_i in the smoothed camera coordinate system at the time t, and \hat{Z}_{i,t} is the coordinate of \hat{P}_{i,t} along the z-axis, i.e., the depth value of the pixel \hat{P}'_{i,t}. Among the parameters, P_{i,t_0} is obtained through the initialization, \hat{R}_{t_0,t} and \hat{T}_{t_0,t} are obtained through the smoothed camera motion, and K is a known parameter, thus we can calculate \hat{P}_{i,t}. After that, we can obtain \hat{Z}_{i,t}, which is the coordinate of \hat{P}_{i,t} along the z-axis. That is to say, there is only one unknown parameter, \hat{P}'_{i,t}, which can be solved based on Eq. (23). In this way, we can transform each feature point P'_{i,t} in the original frame to the 3D point P_{i,t_0}, and then transform P_{i,t_0} to the corresponding pixel \hat{P}'_{i,t} in the stabilized frame.

For the other pixels not belonging to feature points, it is difficult to transform them into the corresponding pixels in the stabilized frame, because the 3D points corresponding to these pixels are unknown. In this case, we combine the original frame and the projected feature points in the stabilized frame to stabilize the video. That is to say, we adopt a standard texture mapping algorithm based on a warped mesh [15], which proposed content-preserving warps, to transform each pixel in the original frame to the stabilized frame.
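The per-feature re-projection of Eq. (23) can be sketched as follows; the mesh-based content-preserving warp for the remaining pixels [15] is not reproduced here, and the variable names are hypothetical.

    import numpy as np

    def warp_feature_point(P_t0, R_smooth, T_smooth, K):
        """Project the fixed 3D point P_{i,t0} with the smoothed motion (Eq. (23)) to obtain
        its pixel position in the stabilized frame."""
        P_hat = R_smooth @ P_t0 + T_smooth        # \hat{P}_{i,t}
        p = K @ P_hat
        Z_hat = p[2]                              # depth of the stabilized pixel
        return p[:2] / Z_hat, Z_hat               # \hat{P}'_{i,t} = [\hat{u}_{i,t}, \hat{v}_{i,t}]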
This article has been accepted for publication in a future issue of this journal,but has not been fully edited.Content may change prior to final publication.Citation information:DOI 10.1109/TMC.2019.2961313.IEEE Transactions on Mobile Computing IEEE TRANSACTIONS ON MOBILE COMPUTING,2019 10 山 eature Extracton Motion Estmato and Tracking and Smoothing Writing Frame to Frame Warping Stabilized Video Single Thread (a)Single thread processing HUAWEI SAMSUNG ■3ta91Sta9e23a9e3■Sa24☐a9e☒ EssT室配 Thread Thread Fig.9.Time cost in different stages g latency.If the frame rate is 30fps,we want the time latency to be lower than the waiting time between two frames, i.e.,33ms.According to Section 6.1,in order to output a stabilized video,we need to capture original frame,extract (b)Multi-thread processing and track features,estimate and smooth camera motion, Fig.10.Multi-thread processing perform frame warping to obtain the stabilized frame,and finally write the stabilized frame to the stabilized video. The large time cost in image processing leads to large time latency for video stabilization.To solve this problem,we first profile the time cost of each stage in video stabilization, and then introduce multi-thread optimization to reduce the time cost. 6.4.1 Time Cost in Different Stages There are five main stages in our video stabilization sys- Threadl Thread2 Thread4 tem,i.e.,capturing original frame,feature extraction and tracking,camera motion estimation and smoothing,frame Fig.11.Multi-thread feature extraction warping,and writing the stabilized frame to the stabilized to the buffer.The Write Frmae thread reads the stabilized video.The stages are respectively called 'Stagel','Stage2', frames from the buffer and writes them to video.By adopt- 'Stage3','Stage4'and 'Stage5'for short.We first set the ing multiple threads,the time cost of processing one frame image size to 1920*1080 pixels,and then measure the time can be calculated by cost of processing one original frame in each stage.Without loss of generality,we use the smartphones Lenovo PHAB2 T=max(26.8,48.1+8.6+9.2,30.4)ms=65.9ms.(25) Pro,HUAWEI MATE20 Pro and SAMSUNG Galaxy Note8 as the testing platforms.We repeat the measurement for Compared to single-threaded processing,the time cost is 500 times to get the average time cost.According to the reduced from 123.1 ms to 65.9 ms.However,the time latency measurement shown in Fig.9,without loss of generality,we is still higher than the waiting time between two frames,i.e., use the time cost with the Lenovo PHAB2 Pro by default, 33ms.According to Eq.(25),we can find that the time latency and the time cost in five stages with the Lenovo PHAB2 is determined by the time cost of Process Frame thread,where Pro smartphone is 26.8,48.1,8.6,9.2,30.4 ms,respectively. the processing time of feature extraction and tracking takes Thus the time cost of processing one original frame can the largest proportion,i.e.,73%.Therefore,it is better to be calculated by Eq.(24).Obviously,123.1 ms is very large optimize the operators of feature extraction and tracking. latency,thus more optimizations are expected for our video As aforementioned in Section 5.3.3,feature extraction stabilization system to realize real time processing. 
6.4 Multi-Thread Optimization for Video Stabilization
In order to implement real-time video stabilization, our system needs to output the stabilized frame without noticeable latency. If the frame rate is 30 fps, we want the time latency to be lower than the waiting time between two frames, i.e., 33 ms. According to Section 6.1, in order to output a stabilized video, we need to capture the original frame, extract and track features, estimate and smooth the camera motion, perform frame warping to obtain the stabilized frame, and finally write the stabilized frame to the stabilized video. The large time cost of image processing leads to a large time latency for video stabilization. To solve this problem, we first profile the time cost of each stage in video stabilization, and then introduce multi-thread optimization to reduce the time cost.

6.4.1 Time Cost in Different Stages
There are five main stages in our video stabilization system, i.e., capturing the original frame, feature extraction and tracking, camera motion estimation and smoothing, frame warping, and writing the stabilized frame to the stabilized video. These stages are called 'Stage1', 'Stage2', 'Stage3', 'Stage4' and 'Stage5' for short, respectively. We first set the image size to 1920 × 1080 pixels, and then measure the time cost of processing one original frame in each stage. Without loss of generality, we use the smartphones Lenovo PHAB2 Pro, HUAWEI MATE20 Pro and SAMSUNG Galaxy Note8 as the testing platforms. We repeat the measurement 500 times to get the average time cost. The measurements are shown in Fig. 9; without loss of generality, we use the time cost of the Lenovo PHAB2 Pro by default, where the time cost of the five stages is 26.8, 48.1, 8.6, 9.2 and 30.4 ms, respectively. Thus the time cost of processing one original frame can be calculated by Eq. (24):

T = (26.8 + 48.1 + 8.6 + 9.2 + 30.4) ms = 123.1 ms.    (24)

Obviously, 123.1 ms is a very large latency, thus further optimizations are needed for our video stabilization system to realize real-time processing.

Fig. 9. Time cost in different stages (average time of Stage1-Stage5 in ms on the Lenovo, HUAWEI and SAMSUNG devices).
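A minimal sketch of this per-stage profiling (Python; the stage callables and the frame source are placeholders for the system's actual implementation, while the 500 repetitions and the five-stage split follow the description above):

import time

def profile_stages(stages, frames, repeats=500):
    # stages: list of (name, callable) pairs ordered as Stage1..Stage5; each
    #         callable consumes the output of the previous stage.
    # frames: iterable that yields at least `repeats` input frames.
    totals = {name: 0.0 for name, _ in stages}
    for _, frame in zip(range(repeats), frames):
        data = frame
        for name, fn in stages:
            start = time.perf_counter()
            data = fn(data)
            totals[name] += (time.perf_counter() - start) * 1000.0  # in ms
    averages = {name: total / repeats for name, total in totals.items()}
    averages['total'] = sum(averages.values())  # single-thread latency, cf. Eq. (24)
    return averages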
6.4.2 Multi-Thread Processing
As shown in Fig. 10(a), according to Eq. (24), if our system works with a single thread, it takes 123.1 ms to process one frame. Thus, we introduce multi-thread optimization to reduce the time cost. We use three threads to capture frames, process frames and write frames in parallel. As shown in Fig. 10(b), the Capture Frame thread captures frames from the original video and writes the original frames to a buffer. The Process Frame thread first reads an original frame from the buffer, then performs feature extraction and tracking, motion estimation and smoothing, and frame warping to obtain the stabilized frame, and finally writes the stabilized frame to another buffer. The Write Frame thread reads the stabilized frames from that buffer and writes them to the stabilized video.

Fig. 10. Multi-thread processing: (a) single-thread processing chains the five stages in sequence; (b) multi-thread processing splits them across the Capture Frame, Process Frame and Write Frame threads, connected by the original-frame buffer and the stabilized-frame buffer.
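The three-thread structure of Fig. 10(b) can be sketched as follows, with bounded queues standing in for the two frame buffers; the processing function is a placeholder for Stage2-Stage4, and this Python version only illustrates the structure (on CPython the GIL limits CPU-bound parallelism, whereas the actual system uses native threads on the mobile device):

import threading
import queue

original_buf = queue.Queue(maxsize=8)    # original-frame buffer
stabilized_buf = queue.Queue(maxsize=8)  # stabilized-frame buffer
STOP = object()                          # sentinel that shuts the pipeline down

def process_one_frame(frame):
    # Placeholder for Stage2-Stage4: feature extraction and tracking,
    # motion estimation and smoothing, and frame warping.
    return frame

def capture_thread(frame_source):
    # Capture Frame thread: push original frames into the buffer.
    for frame in frame_source:
        original_buf.put(frame)
    original_buf.put(STOP)

def process_thread():
    # Process Frame thread: consume original frames, produce stabilized frames.
    while True:
        frame = original_buf.get()
        if frame is STOP:
            stabilized_buf.put(STOP)
            break
        stabilized_buf.put(process_one_frame(frame))

def write_thread(write_frame):
    # Write Frame thread: append stabilized frames to the output video.
    while True:
        frame = stabilized_buf.get()
        if frame is STOP:
            break
        write_frame(frame)

if __name__ == '__main__':
    frames = range(100)     # stand-in frame source
    output = []             # stand-in video writer
    threads = [threading.Thread(target=capture_thread, args=(frames,)),
               threading.Thread(target=process_thread),
               threading.Thread(target=write_thread, args=(output.append,))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

With this structure, the steady-state per-frame cost is bounded by the slowest of the three threads, which is exactly what Eq. (25) below captures.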
By adopting multiple threads, the time cost of processing one frame can be calculated by Eq. (25):

T = max(26.8, 48.1 + 8.6 + 9.2, 30.4) ms = 65.9 ms.    (25)

Compared to single-threaded processing, the time cost is reduced from 123.1 ms to 65.9 ms. However, the time latency is still higher than the waiting time between two frames, i.e., 33 ms. According to Eq. (25), the time latency is determined by the time cost of the Process Frame thread, where feature extraction and tracking takes the largest proportion of the processing time, i.e., 73%. Therefore, it is better to optimize the feature extraction and tracking operations.

Fig. 11. Multi-thread feature extraction (each of the sixteen blocks is labeled with the thread, Thread1-Thread4, assigned to process it).

As aforementioned in Section 5.3.3, feature extraction only requires local information, i.e., for each pixel, the feature extractor only uses the pixels around it to decide whether it is a feature point. Therefore, we can split a frame into blocks without affecting the locality of feature extraction, and then use multiple threads to extract features from each block separately. Specifically, as shown in Fig. 11, without loss of generality, we first divide the frame into sixteen blocks, and for each block, we calculate the color variance inside the block to measure its salience. A larger color variance implies higher salience. Hence, according to the value of the color variance, we divide the blocks into three categories, i.e., low-salience blocks, medium-salience blocks and high-salience blocks. For low-, medium- and high-salience blocks, we extract 10, 20 and 40 features, respectively. Finally, these sixteen blocks are distributed to four threads sequentially from left to right and then from top to bottom. In order to achieve load balancing, we let each thread