This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2019.2961313, IEEE Transactions on Mobile Computing.

1536-1233 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
5 PROBLEM FORMULATION AND MODELING

5.1 Problem Formulation

According to the observations in the empirical study, to achieve video stabilization we need to accurately track the rotation and translation during the video shot, so as to effectively remove the jitters caused by them. Meanwhile, we need to perform the rotation/translation estimation in a lightweight manner, so that the computation overhead is suitable for real-time processing. Therefore, based on the above understanding, it is essential to statistically minimize the expectations of both the rotation estimation error and the translation estimation error during the shooting process, while effectively limiting the expected computation overhead within a certain threshold, say τ. Specifically, let the rotation estimation error and translation estimation error be δ_r and δ_t, respectively, and let the computation overheads for rotation estimation and translation estimation be c_r and c_t, respectively. We use the function exp(·) to denote the expectation. Then the objective of our solution is

    min exp(δ_r) + exp(δ_t),                    (2)
    subject to: exp(c_r) + exp(c_t) ≤ τ.

To achieve the above objective, we first analyze the pros and cons of the IMU-based and CV-based approaches, as shown in Table 1. For translation tracking, only the CV-based approach achieves high accuracy, so we use the CV-based approach to estimate the translation. For rotation tracking, on one hand, both the IMU-based and CV-based approaches achieve high accuracy; on the other hand, the computational complexity of the CV-based approach is relatively high, especially when all 6 degrees of freedom (DoF) are undetermined.
Hence, we use the IMU-based approach to estimate the rotation, due to its low computational complexity. In this way, the computational overhead of the CV-based approach is greatly reduced, since the number of undetermined DoF for CV-based processing drops from 6 to 3.

TABLE 1
Pros and cons of IMU-based and CV-based approaches for video stabilization.

              Rotation Tracking       Translation Tracking    Compute Complexity
    IMU-based  High Accuracy (3 DoF)  Low Accuracy (3 DoF)    Low
    CV-based   High Accuracy (3 DoF)  High Accuracy (3 DoF)   High

Therefore, after formulating the video stabilization problem in an expectation-minimization framework, we can decompose this complex optimization problem into two subproblems: using the IMU-based approach to estimate the rotation, and using the CV-based approach to estimate the translation.

5.2 Camera Projection Model

According to the pinhole camera model, for an arbitrary 3D point P on a stationary object in the scene, the corresponding 2D projection P′ in the image plane remains unchanged as long as the camera is static. However, when the body frame of the camera is dynamically moving in 3D space, the camera coordinate system, as well as the image plane, is also continuously moving, which involves both rotation and translation. In this way, even if the point P keeps still in 3D space, the corresponding projection P′ changes dynamically in the 2D plane, further leading to video shaking in the image plane.

As any 3D motion can be decomposed into a combination of rotation and translation, we can use a rotation matrix R_{t0,t} and a vector T_{t0,t} to represent the rotation and translation of the camera coordinate system, respectively, from time t0 to time t. Then, for a target point P_i in the camera coordinate system, if its coordinate at time t0 is denoted as P_{i,t0}, then, after the rotation and translation of the camera coordinate system, its coordinate P_{i,t} at time t can be computed by

    P_{i,t} = R_{t0,t} · P_{i,t0} + T_{t0,t}.                    (3)
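The rigid-motion update of Eq. (3) can be sketched in a few lines of NumPy. The rotation (a small turn about the z-axis) and the translation vector below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Rigid motion of the camera coordinate system (Eq. (3)):
#   P_{i,t} = R_{t0,t} @ P_{i,t0} + T_{t0,t}

theta = np.deg2rad(5.0)                 # assumed small rotation about the z-axis
R_t0_t = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
T_t0_t = np.array([0.01, -0.02, 0.0])   # assumed translation (meters)

P_i_t0 = np.array([1.0, 2.0, 5.0])      # point coordinate at time t0
P_i_t = R_t0_t @ P_i_t0 + T_t0_t        # coordinate at time t, per Eq. (3)

# Sanity check: R_{t0,t} is a proper rotation (orthonormal, det = 1),
# so the transform preserves distances between jointly moved points.
assert np.allclose(R_t0_t @ R_t0_t.T, np.eye(3))
assert np.isclose(np.linalg.det(R_t0_t), 1.0)
```

Since the assumed rotation is about the z-axis, the point's z-coordinate is unchanged by R and only shifted by the z-component of T (here zero).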
Therefore, according to Eq. (1), for the point P_{i,t} at time t, the corresponding projection in the image plane, i.e., P′_{i,t} = [u_{i,t}, v_{i,t}]^T, can be computed by

    Z_{i,t} · [u_{i,t}, v_{i,t}, 1]^T = K · P_{i,t} = K (R_{t0,t} · P_{i,t0} + T_{t0,t}),    (4)

where Z_{i,t} is the z-axis coordinate of P_{i,t} in the camera coordinate system at time t, and K is the camera intrinsic matrix.

5.3 Camera Motion Model

5.3.1 Coordinate Transformation

As mobile devices are usually equipped with Inertial Measurement Units (IMUs), the motion of the camera can be measured by the IMU in the local coordinate system of the body frame, as shown in Fig. 4. However, as mentioned in Section 5.2, the camera projection is measured in the camera coordinate system. Hence, once we figure out the camera's motion from the inertial measurements in the local coordinate system, it is essential to transform the camera's motion into the camera coordinate system.

Fig. 4. The local coordinate system (origin O_L) and the camera coordinate system (origin O_C) of the rear camera.

For the embedded camera of the mobile device, we take the most commonly used rear camera as an example. Fig. 4 shows the camera coordinate system and the local coordinate system, respectively. According to the relationship between the camera coordinate system and the local coordinate system, we can use the 3×3 rotation matrix

    M = [  0  -1   0
          -1   0   0
           0   0  -1 ]

to denote the coordinate transformation between the two coordinate systems. For any other camera, we can also use a similar rotation matrix M′ to denote the corresponding coordinate transformation.
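The projection of Eq. (4) and the local-to-camera transformation by M can be sketched together. The intrinsic matrix K below (focal length 500 px, principal point at (320, 240)) is an illustrative assumption, not a calibration from the paper; M is the fixed rotation given in Section 5.3.1:

```python
import numpy as np

# Assumed intrinsic matrix K (illustrative values, not from the paper).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Rotation between the local (IMU) frame and the rear-camera frame (Sec. 5.3.1).
M = np.array([[ 0.0, -1.0,  0.0],
              [-1.0,  0.0,  0.0],
              [ 0.0,  0.0, -1.0]])

def project(P_camera):
    """Project a 3D point in the camera frame onto the image plane (Eq. (4))."""
    p = K @ P_camera          # homogeneous pixel coordinates, scaled by Z_{i,t}
    return p[:2] / p[2]       # divide by Z_{i,t} to recover (u_{i,t}, v_{i,t})

# A point on the optical axis projects to the principal point.
u, v = project(np.array([0.0, 0.0, 5.0]))
print(u, v)                   # -> 320.0 240.0

# An IMU-frame vector is mapped into the camera frame by M:
# the local x-axis lands on the camera's negative y-axis.
v_local = np.array([1.0, 0.0, 0.0])
v_camera = M @ v_local        # -> [0, -1, 0]
```

Note that M is itself a proper rotation (orthonormal, determinant 1), so transforming IMU-measured rotations into the camera frame preserves their geometry.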