process different numbers of blocks, and make the number of features that each thread needs to extract roughly the same. By adopting multiple threads, the time cost of feature extraction and tracking is reduced from 48.1 ms to 14.8 ms, and the time cost of processing one original frame is reduced from 65.9 ms to 32.6 ms, which is lower than the waiting time between two frames, i.e., 33 ms. The time latency thus meets the real-time requirement.
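The block-parallel scheme above can be sketched as follows, assuming OpenCV's goodFeaturesToTrack as the feature detector and equal horizontal strips as blocks (both are illustrative assumptions, not details fixed by our design):

```python
import cv2
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def extract_block_features(args):
    """Detect corners inside one strip and map them back to frame coordinates."""
    gray, y0, y1, n_features = args
    pts = cv2.goodFeaturesToTrack(gray[y0:y1], maxCorners=n_features,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return np.empty((0, 1, 2), dtype=np.float32)
    pts[:, 0, 1] += y0  # strip-local -> frame coordinates
    return pts

def parallel_features(gray, n_threads=4, total_features=400):
    """Split the frame into horizontal strips, one per thread, so that each
    thread extracts roughly the same number of features."""
    h = gray.shape[0]
    strip = h // n_threads
    tasks = [(gray, i * strip, h if i == n_threads - 1 else (i + 1) * strip,
              total_features // n_threads) for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return np.concatenate(list(pool.map(extract_block_features, tasks)))
```

Because OpenCV releases Python's GIL inside its native kernels, plain threads are sufficient here to realize a speed-up of the kind reported above.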
7 DISCUSSION

Translation Estimation: As described in Section 5.3.3, we are expected to select the feature point pairs generated from fixed 3D points to calculate the translation. If the camera captures fast-moving objects, our method may fail to estimate the camera translation, due to the failure of extracting and tracking enough feature points. This is also a common challenge faced by many CV-based methods [1], [15], [29], [30]. To mitigate this issue, we can introduce the Extended Kalman Filter (EKF) to further fuse the IMU-based method and the CV-based method. As we know, the IMU-based method suffers from measurement noise and error accumulation. However, by fusing it with the CV-based method through the EKF, we may quantify and calibrate the measurement noise in acceleration, and then estimate the translation with higher accuracy.
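A minimal per-axis sketch of this fusion idea is given below. It is a plain linear Kalman filter rather than a full EKF (no nonlinear models, hence no Jacobians), and all noise parameters are illustrative assumptions:

```python
import numpy as np

class TranslationKF:
    """Per-axis Kalman filter sketch of the IMU/CV fusion idea.
    State x = [position, velocity]; the accelerometer drives the prediction,
    and the CV-estimated translation serves as the measurement."""

    def __init__(self, acc_noise=0.5, cv_noise=0.02):
        self.x = np.zeros(2)                # [position (m), velocity (m/s)]
        self.P = np.eye(2)                  # state covariance
        self.acc_noise = acc_noise          # accelerometer noise (assumed)
        self.R = np.array([[cv_noise**2]])  # CV measurement noise (assumed)
        self.H = np.array([[1.0, 0.0]])     # we observe position only

    def predict(self, acc, dt):
        F = np.array([[1.0, dt], [0.0, 1.0]])
        B = np.array([0.5 * dt**2, dt])
        self.x = F @ self.x + B * acc
        Q = (self.acc_noise**2) * np.outer(B, B)  # noise enters via acceleration
        self.P = F @ self.P @ F.T + Q

    def update(self, cv_position):
        y = cv_position - self.H @ self.x         # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + (K @ y).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
```

In use, predict() would run at the 200 Hz accelerometer rate, while update() would run once per frame whenever a CV translation estimate is available.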
Frame Warping: As described in Section 6.3.3, when we perform frame warping to obtain stabilized frames, the position of each frame is changed from the original moving path to the smoothed moving path, so there are some missing areas within the stabilized frames. To hide these missing areas, we crop the stabilized frames, which causes the loss of information at video boundaries. To tackle this issue, we can use inpainting algorithms [14], [31], [32] to perform full-frame video stabilization. The idea of image inpainting is to use the pixel information from the surrounding areas and nearby frames to complete the missing pixels. For example, according to the method described in [31], we can search for the most similar patch in color space among the nearby frames, and use it to complete the missing areas.
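The method of [31] fills missing regions with similar patches searched in nearby frames. As a crude single-frame stand-in, the sketch below uses OpenCV's built-in Telea inpainting; the missing-area mask is derived by warping an all-white image with the same homography (an assumption about how the warp is implemented):

```python
import cv2
import numpy as np

def warp_and_inpaint(frame, H_smooth):
    """Warp a frame onto the smoothed camera path, then fill the missing
    border. Telea single-image inpainting is only a crude stand-in for the
    patch-based, multi-frame completion method of [31]."""
    h, w = frame.shape[:2]
    warped = cv2.warpPerspective(frame, H_smooth, (w, h))
    # Mark pixels the warp never wrote: warp an all-white image with the
    # same homography and take its zero region as the missing-area mask.
    coverage = cv2.warpPerspective(np.full((h, w), 255, np.uint8), H_smooth, (w, h))
    mask = (coverage == 0).astype(np.uint8) * 255
    return cv2.inpaint(warped, mask, 3, cv2.INPAINT_TELEA)
```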
8 PERFORMANCE EVALUATION

8.1 Experimental Setup

We have implemented a prototype system for video stabilization using an Android phone (Lenovo PHAB2 Pro), which is equipped with inertial sensors, including an accelerometer and a gyroscope. In the experiment, we used the Android phone to capture 1080p videos, where the sampling rate of the camera is 30 frames per second and the sampling rate of the inertial sensors is 200 Hz.

8.2 Evaluate the Performance of Motion Estimation

8.2.1 Accuracy

We first evaluated the accuracy of camera motion estimation. We compared our solution with three baseline solutions, i.e., the Gyro-based solution, which estimates the rotation via the gyroscope; the Acc-based solution, which estimates the translation via the accelerometer; and the CV-based solution, i.e., the eight-point algorithm [11]. We used the OptiTrack system [22] to capture the ground truth of the camera motion. For the rotation estimation, as shown in Fig. 12(a), the Gyro-based solution and our solution outperform the CV-based solution. Specifically, as the jitter's range increases, the estimation error of the Gyro-based solution and our solution increases from 0.7° to 1.7°, whereas the rotation error of the CV-based solution increases from 2.7° to 4.6°. For the translation estimation, as shown in Fig. 12(b), the estimation error of the Acc-based solution is rather large, whereas the estimation errors of the CV-based solution and our solution are rather small. Moreover, by fusing the IMU-based method and the CV-based method, our solution further outperforms the CV-based solution in the translation estimation. Specifically, as the jitter's range increases, the estimation error of the CV-based solution increases from 2.5 cm to 4.0 cm, whereas the estimation error of our solution is always less than 2.3 cm.
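For reference, such an eight-point baseline can be reproduced with OpenCV's standard epipolar-geometry routines (a sketch, not the baseline's exact implementation; the intrinsic matrix K and the matched point arrays are assumed inputs, and the recovered translation is only up to scale):

```python
import cv2
import numpy as np

def eight_point_motion(pts_prev, pts_curr, K):
    """Estimate inter-frame rotation R and unit-scale translation t from
    matched feature points via the eight-point epipolar constraint [11].
    pts_prev/pts_curr: Nx2 float32 arrays of matched pixels (N >= 8);
    K: 3x3 camera intrinsic matrix (assumed calibrated beforehand)."""
    F, inliers = cv2.findFundamentalMat(pts_prev, pts_curr, cv2.FM_8POINT)
    E = K.T @ F @ K  # essential matrix from fundamental matrix
    # The cheirality check picks the physically valid (R, t) decomposition.
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_curr, K)
    return R, t      # ||t|| = 1: translation is recovered only up to scale
```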
8.2.2 Time Efficiency

We then evaluated the time delay of camera motion estimation per frame (one frame lasts 33 ms in our setting). We compared our solution with a traditional CV-based solution, the eight-point algorithm [11], which estimates the rotation and translation simultaneously by using eight pairs of feature points. We applied our solution and the CV-based solution to 30 videos, which are classified into three categories based on scene type, i.e., simple, normal, and complex. For the simple, normal, and complex scenes, we extracted 50∼100, 200∼300, and 500∼600 pairs of feature points, respectively, for camera motion estimation. As shown in Fig. 12(c), our solution, using fewer feature point pairs, clearly outperforms the CV-based solution. Specifically, as the scene varies from simple to complex, the time delay of our solution increases from 1.5 ms to 5 ms per frame, whereas the time delay of the CV-based solution increases from 7 ms to 16 ms per frame.

8.3 Evaluate the Performance of Video Stabilization

To evaluate the performance of video stabilization, we used the metric of Inter-frame Transformation Fidelity (ITF) [29], i.e., $ITF = \frac{1}{N_F - 1} \sum_{k=1}^{N_F - 1} \mathrm{PSNR}(k)$. Here, $N_F$ is the number of video frames, while $\mathrm{PSNR}(k)$ is the Peak Signal-to-Noise Ratio between two consecutive frames $F_k$ and $F_{k+1}$. PSNR measures how similar one image is to another. Since the consecutive frames in the stabilized video should be more continuous than those in the original video, ITF can be used to evaluate the stabilization degree of a video. Hence, a larger ITF implies better video stabilization performance.
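The metric is straightforward to compute directly from its definition (a sketch using OpenCV only for frame decoding; the 8-bit peak value of 255 is assumed):

```python
import cv2
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak Signal-to-Noise Ratio between two equally sized 8-bit frames."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def itf(video_path):
    """ITF = mean PSNR over all consecutive frame pairs of the video."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    scores = []
    while ok:
        ok, curr = cap.read()
        if not ok:
            break
        scores.append(psnr(prev, curr))
        prev = curr
    cap.release()
    return sum(scores) / len(scores)  # averages the N_F - 1 pair scores
```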
We compared our solution with three baseline solutions, i.e., 1) the IMU-based solution, which estimates the camera rotation via the gyroscope and the camera translation via the accelerometer [12]; 2) the CV-based solution, which estimates the camera rotation and translation via a CV-based method [13], [15]; and 3) Warp Stabilizer, which is built upon a CV-based method [30] and adopted in the state-of-the-art commercial offline system Adobe After Effects CC 2018.

8.3.1 Performance comparison with COTS solutions

We compared our solution with the optical image stabilization method, which is widely adopted in commercial mobile devices, e.g., iPhone 8 and Samsung S9. Specifically, by