3D photography on your desk

Jean-Yves Bouguet† and Pietro Perona†‡
† California Institute of Technology, 136-93, Pasadena, CA 91125, USA
‡ Università di Padova, Italy
{bouguetj,perona}@vision.caltech.edu

Abstract

A simple and inexpensive approach for extracting the three-dimensional shape of objects is presented. It is based on 'weak structured lighting'; it differs from other conventional structured lighting approaches in that it requires very little hardware besides the camera: a desk-lamp, a pencil and a checkerboard. The camera faces the object, which is illuminated by the desk-lamp. The user moves a pencil in front of the light source, casting a moving shadow on the object. The 3D shape of the object is extracted from the spatial and temporal location of the observed shadow. Experimental results are presented on three different scenes demonstrating that the error in reconstructing the surface is less than 1%.

1 Introduction and Motivation

One of the most valuable functions of our visual system is informing us about the shape of the objects that surround us. Manipulation, recognition, and navigation are amongst the tasks that we can better accomplish by seeing shape. Ever-faster computers, progress in computer graphics, and the widespread expansion of the Internet have recently generated much interest in systems that may be used for imaging both the geometry and surface texture of objects. The applications are numerous. Perhaps the most important ones are animation and entertainment, industrial design, archiving, virtual visits to museums and commercial on-line catalogues.

In designing a system for recovering shape, different engineering tradeoffs are proposed by each application. The main parameters to be considered are: cost, accuracy, ease of use and speed of acquisition. So far, the commercial 3D scanners (e.g. the Cyberware scanner) have emphasized accuracy over the other parameters. These systems use motorized transport of the object, and active (laser, LCD projector) lighting of the scene, which makes them very accurate, but expensive and bulky [1, 15, 16, 12, 2].

An interesting challenge for computer vision researchers is to take the opposite point of view: emphasize cost and simplicity, perhaps sacrificing some amount of accuracy, and design 3D scanners that demand little more hardware than a PC and a video camera, by now almost standard equipment both in offices and at home, by making better use of the data that is available in the images.

Figure 1: The general setup of the proposed method: The camera is facing the scene illuminated by a halogen desk lamp (left). The scene consists of objects on a plane (the desk). When an operator freely moves a stick in front of the lamp (over the desk), a shadow is cast on the scene. The camera acquires a sequence of images I(x, y, t) as the operator moves the stick so that the shadow scans the entire scene. This constitutes the input data to the 3D reconstruction system. The variables x and y are the pixel coordinates (also referred to as spatial coordinates), and t the time (or frame number). The three-dimensional shape of the scene is reconstructed using the spatial and temporal properties of the shadow boundary throughout the input sequence. The right-hand figure shows the necessary equipment besides the camera: a desk lamp, a calibration grid and a pencil for calibration, and a stick for the shadow. One could use the pencil instead of the stick.

A number of passive cues have long been known to contain information on 3D shape: stereoscopic disparity, texture, motion parallax, (de)focus, shadows, shading and specularities, occluding contours and other surface discontinuities amongst them. At the current state of vision research, stereoscopic disparity is the single passive cue that gives reasonable accuracy. Unfortunately it has two major drawbacks: (a) it requires two cameras, thus increasing complexity and cost; (b) it cannot be used on untextured surfaces (which are common for industrially manufactured objects).

We propose a method for capturing 3D surfaces that is based on 'weak structured lighting'. It yields good accuracy and requires minimal equipment besides a computer and a camera: a pencil (two uses), a checkerboard and a desk-lamp, all readily available in most homes; some intervention by a human operator, acting as a low precision motor, is also required.
We start with a description of the method in Sec. 2, followed in Sec. 3 by a noise sensitivity analysis, and in Sec. 4 by a number of experiments that assess the convenience and accuracy of the system. We end with a discussion and conclusions in Sec. 5.

Figure 2: Geometrical principle of the method: Approximate the light source with a point S, and denote by Π_d the desk plane. Assume that the positions of the light source S and the plane Π_d in the camera reference frame are known from calibration. The goal is to estimate the 3D location of the point P in space corresponding to every pixel x̄_c in the image. Call t the time at which a given pixel x̄_c 'sees' the shadow boundary (later referred to as the shadow time). Denote by Π(t) the corresponding shadow plane at that time t. Assume that two portions of the shadow projected on the desk plane are visible on two given rows of the image (top and bottom rows in the figure). After extracting the shadow boundary along those rows, x̄_top(t) and x̄_bot(t), we find two points on the shadow plane, A(t) and B(t), by intersecting Π_d with the optical rays (O_c, x̄_top(t)) and (O_c, x̄_bot(t)) respectively. The shadow plane Π(t) is then inferred from the three points in space S, A(t) and B(t). Finally, the point P corresponding to x̄_c is retrieved by intersecting Π(t) with the optical ray (O_c, x̄_c). This final stage is called triangulation. Notice that the key steps in the whole scheme are: (a) estimate the shadow time t_s(x̄_c) at every pixel x̄_c (temporal processing), and (b) locate the reference points x̄_top(t) and x̄_bot(t) at every time instant t (spatial processing). These two are discussed in detail in section 2.2.

2 Description of the method

The general principle consists of casting a shadow onto the scene with a pencil or another stick, and using the image of the deformed shadow to estimate the three-dimensional shape of the scene. Figure 1 shows the required hardware and the setup of the system. The objective is to extract scene depth at every pixel in the image. Figure 2 gives a geometrical description of the method that we propose to achieve that goal.

2.1 Calibration

The goal of calibration is to recover the geometry of the setup (that is, the location of the desk plane Π_d and that of the light source S) as well as the intrinsic parameters of the camera (focal length, optical center and radial distortion factor). We decompose the procedure into two successive stages: first camera calibration and then lamp calibration.

Camera calibration: Estimate the intrinsic camera parameters and the location of the desk plane Π_d (tabletop) with respect to the camera. The procedure consists of first placing a planar checkerboard pattern (see figure 1) on the desk in the location of the objects to scan. From the image captured by the camera, we infer the intrinsic and extrinsic (rigid motion between camera and desk reference frames) parameters of the camera, by matching the projections onto the image plane of the known grid corners with their locations directly measured on the image (the extracted corners of the grid). This method is very much inspired by the algorithm proposed by Tsai [13]. Note that since our calibration rig is planar, the optical center cannot be recovered through that process, and it is therefore assumed to be fixed at the center of the image. A description of the whole procedure can be found in [3]. The reader can also refer to Faugeras [6] for further insights on camera calibration. Notice that the extrinsic parameters directly lead to the position of the tabletop Π_d in the camera reference frame.
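For readers who want to reproduce this stage with off-the-shelf tools, the following fragment sketches a single-view planar calibration in the spirit of the procedure above, using OpenCV rather than the authors' own implementation (which is documented in [3]). The image file name, the grid dimensions and the square size are illustrative assumptions.

    import cv2
    import numpy as np

    # Assumed checkerboard: 8 x 9 inner corners, 3 cm squares (illustrative values).
    pattern = (8, 9)
    square = 3.0  # cm
    grid = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    grid[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

    img = cv2.imread("checkerboard_on_desk.png", cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(img, pattern)
    assert found, "checkerboard not detected"
    corners = cv2.cornerSubPix(
        img, corners, (5, 5), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))

    # With a single planar view the principal point is poorly constrained,
    # so it is kept fixed at the image center (as noted in the text).
    flags = cv2.CALIB_FIX_PRINCIPAL_POINT | cv2.CALIB_ZERO_TANGENT_DIST
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        [grid], [corners], img.shape[::-1], None, None, flags=flags)

    # Desk plane Pi_d in the camera frame, obtained from the extrinsic parameters:
    R, _ = cv2.Rodrigues(rvecs[0])
    n_d = R[:, 2]                        # desk normal expressed in camera coordinates
    d_d = float(n_d @ tvecs[0].ravel())  # signed distance of the desk plane

The last two lines illustrate the remark above: the desk plane Π_d (normal n_d and distance d_d) follows directly from the extrinsic parameters of the grid pose.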
Lamp calibration: After camera calibration, estimate the 3D location of the point light source S. Figure 3 gives a description of our method.

2.2 Spatial and temporal shadow edge localization

A fundamental stage of the method is the detection of the line of intersection of the shadow plane Π(t) with the desktop Π_d; a simple approach may be used if we make sure that the top and bottom edges of the image are free from objects. Then the two tasks to accomplish are: (a) localize the edge of the shadow that is directly projected on the tabletop (x̄_top(t), x̄_bot(t)) at every time instant t (every frame), leading to the set of all shadow planes Π(t); (b) estimate the time t_s(x̄_c) (the shadow time) at which the edge of the shadow passes through any given pixel x̄_c = (x_c, y_c) in the image. Curless and Levoy demonstrated in [4] that such a spatio-temporal approach is appropriate to preserve sharp discontinuities in the scene. Details of our implementation are given in figure 4. Notice that the shadow was scanned from the left to the right side of the scene. This explains why the right edge of the shadow corresponds to the front edge of the temporal profile in figure 4.
Figure 3: Lamp calibration: The operator places a pencil on the desk plane Π_d, orthogonal to it (top-left). The camera observes the shadow of the pencil projected on the tabletop. The acquired image is shown on the top-right. From the two points b and t_s on this image, one can infer the positions in space of K and T_s, respectively the base of the pencil and the tip of the pencil shadow (see bottom figure). This is done by intersecting the optical rays (O_c, b) and (O_c, t_s) with Π_d (known from camera calibration). In addition, given that the height h of the pencil is known, the coordinates of its tip T can be directly inferred from K. Then, the light source point S has to lie on the line Δ = (T, T_s) in space. This yields one linear constraint on the light source position. By taking a second view, with the pencil at a different location on the desk, one can retrieve a second independent constraint with another line Δ'. A closed form solution for the 3D coordinate of S is then derived by intersecting the two lines Δ and Δ' (in the least squares sense). Notice that since the problem is linear, one can easily integrate the information from more than 2 views and make the estimation more accurate. If N > 2 images are used, one can obtain a closed form solution for the best intersection point S of the N inferred lines (in the least squares sense). We also estimate the uncertainty on that estimate from the distance of S from each one of the Δ lines. That indicates how consistently the lines intersect a single point in space. Refer to [3] for the complete derivations.
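The least-squares intersection of the Δ lines admits the closed form mentioned above. The routine below is a small illustrative NumPy version of that computation (the exact derivation used by the authors is in [3]); the function name and interface are assumptions.

    import numpy as np

    def intersect_lines(points, directions):
        """Least-squares intersection of N 3D lines.

        Each line is given by a point T_i (here, a pencil tip) and a direction
        d_i (here, toward the corresponding shadow tip T_s,i).  The point S
        minimizing the sum of squared distances to the lines satisfies a 3x3
        linear system, so a closed-form solution exists for any N >= 2.
        """
        A = np.zeros((3, 3))
        b = np.zeros(3)
        projectors = []
        for T, d in zip(points, directions):
            T = np.asarray(T, float)
            d = np.asarray(d, float)
            d = d / np.linalg.norm(d)
            P = np.eye(3) - np.outer(d, d)   # projector orthogonal to the line
            projectors.append((P, T))
            A += P
            b += P @ T
        S = np.linalg.solve(A, b)
        # residual distances of S to the lines: a consistency indicator
        residuals = [np.linalg.norm(P @ (S - T)) for P, T in projectors]
        return S, residuals

The residuals play the role of the uncertainty indicator mentioned in the caption: if the lines nearly meet at a single point, all the distances are small.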
Figure 4: Spatial and temporal shadow location: The first step consists of localizing spatially the shadow edge (x̄_top(t_0), x̄_bot(t_0)) at every integer time t_0 (i.e. every frame). The top and bottom rows are y_top = 10 and y_bot = 230 on the top figure. This leads to an estimate of the shadow plane Π(t_0) at every frame. The second processing step consists of extracting, at every pixel x̄_c, the time t_s(x̄_c) of passage of the shadow edge. For any given pixel x̄_c = (x, y), define I_min(x, y) = min_t(I(x, y, t)) and I_max(x, y) = max_t(I(x, y, t)) as its minimum and maximum brightness throughout the entire sequence. We then define the shadow edge to be the locations (in space-time) where the image I(x, y, t) intersects with the threshold image I_shadow(x, y) = (I_min(x, y) + I_max(x, y)) / 2. This may also be regarded as the zero crossings of the difference image ΔI(x, y, t) = I(x, y, t) − I_shadow(x, y). The two bottom plots illustrate the shadow edge detection in the spatial domain (to find x̄_top and x̄_bot) and in the temporal domain (to find t_s(x̄_c)). The bottom-left figure shows the profile of ΔI(x, y, t) along the top reference row y = y_top = 10 at time t = t_0 = 134 versus the column pixel coordinate x. The second zero crossing of that profile corresponds to the top reference point x̄_top(t_0) = (118.42, 10) (computed at sub-pixel accuracy). Identical processing is applied on the bottom row to obtain x̄_bot(t_0) = (130.6, 230). Similarly, the bottom-right figure shows the temporal profile ΔI(x_c, y_c, t) at the pixel x̄_c = (x_c, y_c) = (104, 128) versus time t (or frame number). The shadow time at that pixel is defined as the first zero crossing location of that profile: t_s(104, 128) = 133.27 (computed at sub-frame accuracy).

Notice that the pixels corresponding to regions in the scene that are not illuminated by the lamp (shadows due to occlusions) do not provide any relevant depth information. For this reason we can restrict the processing to pixels that have sufficient swing between maximum and minimum brightness. Therefore, we only process pixels with contrast value I_contrast(x, y) = I_max(x, y) − I_min(x, y) larger than a pre-defined threshold I_thresh. This threshold was 70 in all experiments reported in this paper (recall that the intensity values are encoded from 0 for black to 255 for white).
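As an illustration of the spatial step, the snippet below extracts the sub-pixel shadow edge along one reference row from the difference profile ΔI. It is a minimal sketch of the operation described in figure 4, not the authors' code, and it assumes the profile is given as a NumPy array.

    import numpy as np

    def row_shadow_edge(delta_row):
        """Sub-pixel x coordinate of the shadow edge along one reference row.

        `delta_row` holds Delta_I(x, y_row, t0) = I - I_shadow for a fixed row
        and frame.  Following figure 4, the second zero crossing (the right
        edge of the shadow band) is kept and refined by linear interpolation.
        """
        s = np.sign(delta_row)
        idx = np.where(s[:-1] * s[1:] < 0)[0]   # crossings between samples i and i+1
        if len(idx) < 2:
            return None                         # shadow band not fully visible on this row
        i = idx[1]
        frac = delta_row[i] / (delta_row[i] - delta_row[i + 1])
        return i + frac

Applied at every frame to both reference rows, this yields the sequences x_top(t) and x_bot(t) from which the shadow planes Π(t) are built.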
We do not apply any spatial filtering on the images; that would generate undesired blending in the final depth estimates, especially noticeable at depth discontinuities (at occlusions for example). However, it would be acceptable to low-pass filter the brightness profiles of the top and bottom rows (there is no depth discontinuity on the tabletop) and to low-pass filter the temporal brightness profiles at every pixel. These operations would preserve sharp spatial discontinuities, and might decrease the effect of local processing noise by accounting for smoothness in the motion of the stick.

Experimentally, we found that this thresholding approach for shadow edge detection allows for some internal reflections in the scene [9, 8, 14]. However, if the light source is not close to an ideal point source, the mean value between maximum and minimum brightness may not always constitute the optimal value for the threshold image I_shadow. Indeed, the shadow edge profile becomes shallower as the distance between the stick and the surface increases. In addition, it deforms asymmetrically as the surface normal changes. These effects could make the task of detecting the shadow boundary points challenging. In the future, we intend to develop a geometrical model of extended light sources and incorporate it in the system.

Although I_min and I_max are needed to compute I_shadow, there exists an implementation of the algorithm that does not require storage of the complete image sequence in memory and therefore lends itself to real-time implementations. All that one needs to do is update at each frame five different arrays, I_max(x, y), I_min(x, y), I_contrast(x, y), I_shadow(x, y) and the shadow time t_s(x, y), as the images I(x, y, t) are acquired.

For a given pixel (x, y), the maximum brightness I_max(x, y) is collected at the very beginning of the sequence (the first frame), and then, as time goes by, the incoming images are used to update the minimum brightness I_min(x, y) and the contrast I_contrast(x, y). Once I_contrast(x, y) crosses I_thresh, the adaptive threshold I_shadow(x, y) starts being computed and updated at every frame (and activated). This process goes on until the pixel brightness I(x, y, t) crosses I_shadow(x, y) for the first time (in the upwards direction). That time instant is registered as the shadow time t_s(x, y). In that form of implementation, the left edge of the shadow is tracked instead of the right one; however, the principle remains the same.
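The streaming update just described can be written compactly. The sketch below is an illustrative NumPy version, not the authors' code; the array names and the contrast threshold of 70 follow the text, while the first-frame initialization of I_max and the sub-frame interpolation at the crossing are assumptions consistent with the description.

    import numpy as np

    def update_shadow_state(t, frame, state, i_thresh=70.0):
        """Process one incoming frame I(x, y, t) (float array, 0..255)."""
        if state is None:                       # first frame: I_max is taken here
            return {"i_max": frame.copy(), "i_min": frame.copy(),
                    "i_contrast": np.zeros_like(frame),
                    "i_shadow": np.full_like(frame, np.nan),
                    "t_shadow": np.full_like(frame, np.nan),
                    "prev": frame.copy()}

        state["i_min"] = np.minimum(state["i_min"], frame)
        state["i_contrast"] = state["i_max"] - state["i_min"]

        active = state["i_contrast"] > i_thresh   # enough swing between bright and dark
        state["i_shadow"][active] = 0.5 * (state["i_max"][active] + state["i_min"][active])

        # first upward crossing of the adaptive threshold -> shadow time,
        # refined to sub-frame accuracy by linear interpolation
        prev, thr = state["prev"], state["i_shadow"]
        crossing = active & np.isnan(state["t_shadow"]) & (prev < thr) & (frame >= thr)
        frac = (thr[crossing] - prev[crossing]) / (frame[crossing] - prev[crossing])
        state["t_shadow"][crossing] = (t - 1) + frac

        state["prev"] = frame.copy()
        return state

A driver loop would call this once per acquired frame; after the last frame, t_shadow holds t_s(x, y) wherever the contrast test succeeded.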
2.3 Triangulation

Once the shadow time t_s(x̄_c) is estimated at a given pixel x̄_c, one can identify the corresponding shadow plane Π(t_s(x̄_c)). Then, the 3D point P associated to x̄_c is retrieved by intersecting Π(t_s(x̄_c)) with the optical ray (O_c, x̄_c) (see figure 2). Notice that the shadow time t_s(x̄_c) acts as an index into the shadow plane list Π(t). Since t_s(x̄_c) is estimated at sub-frame accuracy, the final plane Π(t_s(x̄_c)) actually results from linear interpolation between the two planes Π(t_0 − 1) and Π(t_0) if t_0 − 1 < t_s(x̄_c) < t_0 with t_0 integer. Once the range data are recovered, a mesh may be generated by connecting neighboring points in triangles. Rendered views of three reconstructed surface structures can be seen in figures 6, 7 and 8.
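To make the triangulation step concrete, the fragment below builds the shadow plane from S, A(t) and B(t) and intersects it with the optical ray of a pixel. It is an illustrative sketch rather than the authors' implementation; in particular, interpolating the plane parameters (n, d) between the two bracketing frames is one reasonable way to realize the linear interpolation mentioned above.

    import numpy as np

    def shadow_plane(S, A, B):
        """Plane through S, A, B, returned as (n, d) with <n, X> = d (camera frame)."""
        n = np.cross(A - S, B - S)
        return n, float(np.dot(n, S))

    def triangulate(pixel_ray, t_s, planes):
        """3D point for a pixel with normalized ray (x_c, y_c, 1) and shadow time t_s.

        `planes` is the per-frame list of (n, d) shadow planes.
        """
        t0 = int(np.floor(t_s))
        a = t_s - t0                                  # sub-frame fraction
        n0, d0 = planes[t0]
        n1, d1 = planes[min(t0 + 1, len(planes) - 1)]
        n = (1.0 - a) * n0 + a * n1                   # interpolate the bracketing planes
        d = (1.0 - a) * d0 + a * d1
        Z_c = d / np.dot(n, pixel_ray)                # ray-plane intersection
        return Z_c * pixel_ray                        # P = Z_c * (x_c, y_c, 1)

The depth Z_c returned here is the quantity whose noise sensitivity is analyzed in the next section.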
3 Noise Sensitivity

The overall scheme is based on first extracting from every frame (i.e. every time instant t) the x coordinates of the two reference points x_top(t) and x_bot(t), and second estimating the shadow time t_s(x̄_c) at every pixel x̄_c. Those input data are used to estimate the depth Z_c at every pixel. The purpose of the noise sensitivity analysis is to quantify the effect of the noise in the measurement data {x_top(t), x_bot(t), t_s(x̄_c)} on the final reconstructed scene depth map. One key step in the analysis is to transfer the noise affecting the shadow time t_s(x̄_c) into a scalar noise affecting the x coordinate of x̄_c, after scaling by the local shadow speed on the image at that pixel. Let V be the volume of the parallelepiped formed by the three vectors O_cA, O_cB and O_cS, originating at O_c (see figure 2):

V = X_S · ((X_B − X_S) × (X_A − X_S))

where X_S = [X_S Y_S Z_S]^T, X_A = [X_A Y_A Z_A]^T and X_B = [X_B Y_B Z_B]^T are the coordinate vectors of S, A and B in the camera reference frame (× is the standard cross product operator). Notice that V is computed at the triangulation stage, and therefore is always available (see [3]).
Define X_c = [X_c Y_c Z_c]^T as the coordinate vector, in the camera reference frame, of the point in space corresponding to x̄_c. Assume that the x coordinates of the top and bottom reference points (after normalization) are affected by additive white Gaussian noise with zero mean and variances σ²_top and σ²_bot respectively. Assume in addition that the variance on the x coordinate of x̄_c is σ²_xc (different at every pixel). The following expression for the variance σ²_Zc of the induced noise on the depth estimate Z_c was derived by taking first order derivatives of Z_c with respect to the 'new' noisy input variables x_top, x_bot and x_c (notice that the time variable does not appear any longer in the analysis):

σ²_Zc = (Z_c² / V²) { W² h_S² Z_c² σ²_xc + (α₁ + β₁ Y_c + γ₁ Z_c)² σ²_top + (α₂ + β₂ Y_c + γ₂ Z_c)² σ²_bot }    (1)
where W, h_S, α₁, β₁, γ₁, α₂, β₂ and γ₂ are constants depending only on the geometry (see figure 5):

α₁ = Z_A (Z_B Y_S − Y_B Z_S)        α₂ = Z_B (Y_A Z_S − Z_A Y_S)
β₁ = −Z_A (Z_B − Z_S)               β₂ = Z_B (Z_A − Z_S)
γ₁ = Z_A (Y_B − Y_S)                γ₂ = −Z_B (Y_A − Y_S)

The first term in equation 1 comes from the temporal noise (on t_s(x̄_c), transferred to x_c); the second and third terms come from the spatial noise (on x_top and x_bot). Let σ_I be the standard deviation of the image brightness noise. Given that we use linear interpolation of the temporal brightness profile to calculate the shadow time t_s(x̄_c), we can write σ_xc as a function of the horizontal spatial image gradient I_x(x̄_c) at x̄_c at time t = t_s(x̄_c):

σ_xc = σ_I / |I_x(x̄_c)|    (2)

Since σ_xc is inversely proportional to the image gradient, the accuracy improves with shadow edge sharpness. This justifies the improvement in experiment 3 after removing the lamp reflector (thereby significantly increasing sharpness). In addition, observe that σ_xc does not depend on the local shadow speed. Therefore, decreasing the scanning speed would not increase accuracy. However, for the analysis leading to equation 2 to remain valid, the temporal pixel profile must be sufficiently sampled within the transition area of the shadow edge (the penumbra). Therefore, if the shadow edge were sharper, the scanning should also be slower so that the temporal profile at every pixel would be properly sampled. Decreasing the scanning speed further would benefit the accuracy only if the temporal profile were appropriately low-pass filtered before extraction of t_s(x̄_c). This is an issue for future research.

Notice that σ_Zc, aside from quantifying the uncertainty on the depth estimate Z_c at every pixel x̄_c, also constitutes a good indicator of the overall accuracy of the reconstruction, since most of the errors are located along the Z direction of the camera frame. In addition, we found numerically that most of the variations in the variance σ²_Zc are due to the variation of the volume V within a single scan. This explains why the reconstruction noise is systematically larger in portions of the scene further away from the lamp (see figures 6, 7 and 8). Indeed, it can be shown that, as the shadow moves in the direction opposite to the lamp (e.g. to the right if the lamp is on the left of the camera), the absolute value of the volume |V| strictly decreases, making σ²_Zc larger (see [3] for details).

Figure 5: Geometric setup: The camera is positioned at a distance d_d away from the desk plane Π_d and tilted down towards it at an angle θ. The light source is located at a height h_S, with its direction defined by the azimuth and elevation angles ξ and φ. Notice that the sign of cos ξ directly relates to which side of the camera the lamp is standing: positive on the right, and negative on the left. The bottom figure is a side view of the system (in the (O_c, Y_c, Z_c) plane). The points A and B are the reference points on the desk plane (see figure 2).
In order to obtain a uniformly accurate reconstruction of the entire scene, one may take two scans of the same scene with the lamp at two different locations (on the left (L) and on the right (R) of the camera), and merge them together using at each pixel the estimated reliability of the two measurements. Assume that the camera position, as well as the height h_S of the lamp, are kept identical for both scans. Suppose in addition that the scanning speeds were approximately the same. Then, at every pixel x̄_c in the image, the two scan data sets provide two estimates Z_c^L and Z_c^R of the same depth Z_c, with respective reliabilities σ²_L and σ²_R given by equation 1. In addition, if we call V_L and V_R the two respective volumes, then the relative uncertainty between Z_c^L and Z_c^R reduces to a function of the volumes: σ²_R / σ²_L = (V_L / V_R)². Notice that calculating that relative uncertainty does not require any extra computation, since V_L and V_R are available from the two triangulations. The final depth is computed by a weighted average of Z_c^L and Z_c^R: Z_c = w_L Z_c^L + w_R Z_c^R. If Z_c^L and Z_c^R were Gaussian distributed, and independent, they would be optimally averaged using the inverse of the variances as weights [10]: w_L = σ²_R / (σ²_L + σ²_R) = α² / (1 + α²) and w_R = σ²_L / (σ²_L + σ²_R) = 1 / (1 + α²), where α = V_L / V_R. Experimentally, we found that this choice does not yield very good merged surfaces. It makes the noisy areas of one view interact too significantly with the clean corresponding areas in the other view, degrading the overall final reconstruction. This happens possibly because the random variables Z_c^L and Z_c^R are not Gaussian. A heuristic solution to that problem is to use sigmoid functions to calculate the weights: w_L = (1 + exp{−β ΔV})⁻¹ and w_R = (1 + exp{β ΔV})⁻¹, with ΔV = (V_L² − V_R²) / (V_L² + V_R²) = (α² − 1) / (α² + 1). The positive coefficient β controls the amount of diffusion between the left and the right regions, and should be determined experimentally. In the limit, as β tends to infinity, merging reduces to a hard decision: Z_c = Z_c^L if |V_L| > |V_R|, and Z_c = Z_c^R otherwise. Our merging technique presents two advantages: (a) obtaining more coverage of the scene and (b) reducing the estimation noise. Moreover, since we do not move the camera between scans, we do not have to solve for the difficult problem of view alignment [11, 7, 5]. One merging example is presented in experiment 3.
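In code, the merging rule reads as follows. This is a schematic NumPy version of the weighting described above (not the authors' implementation); β is left as a user-chosen parameter, to be tuned experimentally, and the handling of pixels covered by only one scan is an added assumption.

    import numpy as np

    def merge_scans(Z_L, Z_R, V_L, V_R, beta=5.0):
        """Merge two depth maps of the same scene (lamp on the left / on the right).

        V_L and V_R are the per-pixel parallelepiped volumes available from the
        two triangulations; NaN marks pixels with no depth estimate.
        """
        dV = (V_L**2 - V_R**2) / (V_L**2 + V_R**2)
        w_L = 1.0 / (1.0 + np.exp(-beta * dV))        # sigmoid weight for the left scan
        w_R = 1.0 - w_L                               # equals 1 / (1 + exp(+beta * dV))
        Z = w_L * Z_L + w_R * Z_R
        Z = np.where(np.isnan(Z_L), Z_R, Z)           # keep the single available estimate
        Z = np.where(np.isnan(Z_R) & ~np.isnan(Z_L), Z_L, Z)
        return Z

As β grows, the sigmoid approaches the hard decision described above; a small β blends the two scans over a wider transition region.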
Independently from local variations in accuracy within one scan, one would also wish to maximize the global (or average) accuracy of reconstruction throughout the entire scene. In this paper, scanning is vertical (shadow parallel to the y axis of the image). Therefore, the average relative depth error |σ_Zc / Z_c| is inversely proportional to |cos ξ| (see [3]). The two best values for the azimuth angle are then ξ = 0 and ξ = π, corresponding to the lamp standing either to the right (ξ = 0) or to the left (ξ = π) of the camera (see figure 5-top).

4 Experimental Results

4.1 Calibration accuracies

Camera calibration. For a given setup, we acquired 10 images of the checkerboard (see figure 1) and performed independent calibrations on them. The checkerboard consisted of approximately 90 visible corners on an 8 x 9 grid. Then, we computed both mean values and standard deviations of all the parameters independently: the focal length f_c, radial distortion factor k_c and desk plane position Π_d. Regarding the desk plane position, it is convenient to look at the height d_d and the surface normal vector n_d of Π_d expressed in the camera reference frame. An additional geometrical quantity related to n_d is the tilt angle θ (see figure 5). The following table summarizes the calibration results (notice that the relative error on the angle θ is computed referring to 360 degrees):

Parameters      Estimates                                               Relative errors
f_c (pixels)    857.3 ± 1.3                                             0.2%
k_c             -0.199 ± 0.002                                          1%
d_d (cm)        16.69 ± 0.02                                            0.1%
n_d             [-0.0427 ± 0.0003, 0.7515 ± 0.0003, 0.6594 ± 0.0004]    0.06%
θ (degrees)     41.27 ± 0.02                                            0.006%

Lamp calibration. Similarly, we collected 10 images of the pencil shadow (like figure 3-top-right) and performed calibration of the light source on them. See section 2.1. Notice that the points b and t_s were manually extracted from the images. Define S_c as the coordinate vector of the light source in the camera frame. The following table summarizes the calibration results (refer to figure 5 for notation):

Parameters      Estimates                                 Relative errors
S_c (cm)        [-13.7 ± 0.1, -17.2 ± 0.3, -2.9 ± 0.1]    ≈ 2%
h_S (cm)        34.04 ± 0.15                              0.5%
ξ (degrees)     146.0 ± 0.8                               0.2%
φ (degrees)     64.6 ± 0.2                                0.06%

The estimated lamp height agrees with the manual measure (with a ruler) of 34 ± 0.5 cm.
Our method yields an accuracy of approximately 3 mm (in standard deviation) in localizing the light source. This accuracy is sufficient for final shape recovery without significant deformation, as we discuss in the next section.

4.2 Scene reconstructions

On the first scene (figure 6), we evaluated the accuracy of reconstruction based on the sizes and shapes of the plane at the bottom left corner and the corner object on the top of the scene (see figure 4-top).

Planarity of the plane: We fit a plane across the points lying on the planar patch and estimated the standard deviation of the set of residual distances of the points to the plane to 0.23 mm. This corresponds to the granularity (or roughness) noise on the planar surface. The fit was done over a surface patch of approximate size 4 cm x 6 cm. This leads to a relative non-planarity of approximately 0.23 mm / 5 cm = 0.4%. To check for possible global deformations due to errors in calibration, we also fit a quadratic patch across those points. We noticed a decrease of approximately 6% in residual standard deviation after quadratic warping. This leads us to believe that global geometric deformations are negligible compared to local surface noise. In other words, one may assume that the errors of calibration do not induce significant global deformations on the final reconstruction.
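The planarity figure quoted above corresponds to a standard total-least-squares plane fit; the snippet below shows a generic way to compute that residual (it is an illustration, not the code used for the paper).

    import numpy as np

    def plane_fit_residual_std(points):
        """Fit a plane to an (N, 3) point cloud and return the standard deviation
        of the point-to-plane distances (the surface 'roughness')."""
        centroid = points.mean(axis=0)
        # total least squares: the normal is the right singular vector associated
        # with the smallest singular value of the centered cloud
        _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
        normal = vt[-1]
        distances = (points - centroid) @ normal
        return distances.std()

Applied to the roughly 4 cm x 6 cm patch of experiment 1, this is the quantity reported as 0.23 mm; replacing the plane by a quadratic patch gives the analogous fit used to test for global deformations.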
Figure 6: Experiment 1 - The plane/ball/corner scene: Two views of the mesh generated from the cloud of points obtained after triangulation. The original sequence was 270 frames long, the images being 320 x 240 pixels each. At 60 Hz acquisition frequency, the entire scan takes about 5 seconds. The camera was positioned at a distance d = 16.7 cm from the desk plane, tilted down by θ = 41.3 degrees. The light source was at height h_s = 37.7 cm, to the left of the camera, at angles ξ = 157.1 degrees and φ = 64.8 degrees. From the right-hand figure we notice that the right-hand side of the reconstructed scene is noisier than the left-hand side. This was expected since the lamp was standing on the left of the camera (refer to section 3 for details).

Geometry of the corner: We fit two planes to the corner structure, one corresponding to the top surface (the horizontal plane) and the other to the frontal surface (the vertical plane). We estimated the surface noise of the top surface at 0.125 mm and that of the frontal face at 0.8 mm (almost 7 times larger). This noise difference between the two planes can be observed in figure 6. Once again, after fitting quadratic patches to the two planar portions, we did not notice any significant global geometric distortion in the scene (from planar to quadratic warping, the residual noise decreased by only 5% in standard deviation). From the reconstruction, we estimated the height H and width D of the right-angle structure, as well as the angle between the two reconstructed planes, and compared them to their true values:

Parameters         Estimates       True values     Relative errors
H (cm)             2.57 ± 0.02     2.65 ± 0.02     3%
D (cm)             3.06 ± 0.02     3.02 ± 0.02     1.3%
Angle (degrees)    86.21           90              1%

The overall reconstructed structure does not show any major noticeable global deformation (the calibration process appears to give good enough estimates). The most noticeable source of error is the surface noise due to local image processing. A figure of merit to keep in mind is a surface noise between 0.1 mm (for planes roughly parallel to the desk) and 0.8 mm (for the frontal plane in the right corner). In most portions of the scene the errors are of the order of 0.3 mm, i.e. less than 1%. Note that these figures may well vary from experiment to experiment, especially depending on how fast the scanning is performed. In all the presented experiments, we kept the speed of the shadow approximately uniform.
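The table entries above follow from simple computations on the fitted planes: the angle between the two reconstructed planes is the angle between their fitted normals, and each relative error is |estimate - true| / true. Below is a minimal sketch of these two computations; top_pts and front_pts are hypothetical arrays of reconstructed 3D points on the horizontal and vertical faces, not variables from the original implementation.

import numpy as np

def fit_plane_normal(points):
    """Unit normal of a total-least-squares plane fit to (N, 3) points."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[-1]

def dihedral_angle_deg(pts_a, pts_b):
    """Angle (degrees) between two planes fitted to two point clouds."""
    na, nb = fit_plane_normal(pts_a), fit_plane_normal(pts_b)
    cosang = abs(float(np.dot(na, nb)))   # sign of each normal is arbitrary
    return float(np.degrees(np.arccos(np.clip(cosang, 0.0, 1.0))))

def relative_error(estimate, true_value):
    return abs(estimate - true_value) / true_value

# Hypothetical usage with the corner of Experiment 1:
# angle = dihedral_angle_deg(top_pts, front_pts)   # expected close to 90 degrees
print(f"H: {relative_error(2.57, 2.65):.1%}")      # about 3%
print(f"D: {relative_error(3.06, 3.02):.1%}")      # about 1.3%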
Figure 7: Experiment 2 - The cup/plane/ball scene: The scanned objects were a cup, the plane, and the ball. The initial image of the scene is shown on the left, and the final reconstructed mesh on the right. We found agreement between the height of the cup estimated from the 3D reconstruction, 11.04 ± 0.09 cm, and the measured height (obtained using a ruler), 10.95 ± 0.05 cm. Once again the right portion of the reconstructed scene is noisier than the left portion. This was expected since the light source was, once again, standing to the left of the camera. Geometrical parameters: d = 22.6 cm, θ = 38.2 degrees, h_s = 43.2 cm, ξ = 155.9 degrees, and φ = 69 degrees.

Figures 7 and 8 report the reconstruction results achieved on two other scenes.

5 Conclusion and future work

We have presented a simple, low-cost system for extracting the surface shape of objects. The method requires very little processing and image storage, so that it can be implemented in real time. The accuracies we obtained on the final reconstructions are reasonable (at most 1%, or 0.5 mm of noise error) considering the minimal hardware requirements. The user can adjust the scanning speed to obtain the desired accuracy. In addition, the final outcome is a dense coverage of the surface (one point in space for each pixel in the image), allowing for direct texture mapping.

An error analysis was presented, together with the description of a simple technique for merging multiple 3D scans in order to (a) obtain a better coverage of the scene and (b) reduce the estimation noise. The overall calibration procedure, even in the case of multiple scans, is very intuitive, simple, and sufficiently accurate.

Another advantage of our approach is that it easily scales to larger scenarios, indoors (using more powerful lamps such as photo-floods) and outdoors, where the sun may be used as a calibrated light source (given latitude, longitude, and time of day). These are experiments that we wish to carry out in the future.

Other extensions of this work relate to multiple-view integration. We wish to extend the alignment technique to a method allowing the user to move the object freely in front of the camera and the lamp between scans in order to achieve full coverage; this is necessary to construct complete 3D models. It is also part of future work to incorporate a geometrical model of the extended light source into the shadow edge detection process, in addition to developing an uncalibrated (or projective) version of the method.
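As a rough illustration of the outdoor scenario mentioned above, the sun's direction can be predicted from latitude, longitude, and time of day using standard astronomical approximations. The sketch below is not part of the system described in this paper: it uses a simple declination formula and ignores the equation of time and atmospheric refraction, so it is only accurate to within a degree or two, which may still be adequate for a first calibrated-light-source experiment.

import math

def sun_direction(lat_deg, lon_deg, day_of_year, utc_hour):
    """Approximate solar elevation and azimuth in degrees.
    Azimuth is measured from North, positive towards East."""
    lat = math.radians(lat_deg)
    # Approximate solar declination (simple cosine model of the year).
    decl = math.radians(-23.44 * math.cos(2 * math.pi * (day_of_year + 10) / 365.0))
    # Local solar time from UTC and longitude (15 degrees of longitude per hour).
    solar_time = utc_hour + lon_deg / 15.0
    hour_angle = math.radians(15.0 * (solar_time - 12.0))
    # Elevation above the horizon.
    sin_elev = (math.sin(lat) * math.sin(decl)
                + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    elev = math.asin(max(-1.0, min(1.0, sin_elev)))
    # Azimuth from the standard equatorial-to-horizontal conversion.
    az = math.atan2(-math.cos(decl) * math.sin(hour_angle),
                    math.cos(lat) * math.sin(decl)
                    - math.sin(lat) * math.cos(decl) * math.cos(hour_angle))
    return math.degrees(elev), math.degrees(az) % 360.0

# Example: Pasadena, CA (34.15 N, -118.14 E) in late June, 19:00 UTC
# (about noon local daylight time); the sun should be high and near due south.
print(sun_direction(34.15, -118.14, 172, 19.0))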
Figure 8: Experiment 3 - The angel scene: We took two scans of the angel with the lamp first on the left side (top-left) and then on the right side (top-right) of the camera. The two resulting meshes are shown on the second row, left and right. As expected, the portions further away from the light source are noisier. The two meshes were then merged following the technique described in section 3, with diffusion coefficient β = 15. Four different views of the final mesh (47,076 triangles) are presented. Notice the small surface noise: we estimated it at 0.09 mm throughout the entire reconstructed surface. Over a depth variation of approximately 10 cm, this corresponds to a relative error of 0.1%. The few white holes correspond to the occluded portions of the scene (not observed from the camera or not illuminated). Most of the geometrical constants of the setup were kept roughly identical in both scans: d = 22 cm, θ = 40 degrees, h_s = 62 cm, φ ≈ 70 degrees; we only changed the azimuth angle ξ from π (lamp on the left) to 0 (lamp on the right). In this experiment we took the lamp reflector off, leaving the bulb bare. Consequently, we noticed a significant improvement in the sharpness of the projected shadow compared to the first two experiments. We believe that this was the main reason for the noticeable improvement in reconstruction quality. Once again, there was no significant global deformation in the final reconstructed surface: we fit a quadratic model through the reconstructed set of points on the desk plane and noticed, from planar to quadratic warping, a decrease of only 2% in the standard deviation of the surface noise.

Acknowledgments

This work is supported in part by the California Institute of Technology; an NSF National Young Investigator Award to P.P.; the Center for Neuromorphic Systems Engineering funded by the National Science Foundation at the California Institute of Technology; and by the California Trade and Commerce Agency, Office of Strategic Technology. We wish to thank all the colleagues that helped us throughout this work, especially Luis Goncalves, George Barbastathis, Mario Munich, and Arrigo Benedetti for very useful discussions. Comments from the anonymous reviewers were very helpful in improving a previous version of the paper.

References

[1] Paul Besl, Advances in Machine Vision, chapter 1 - Active optical range imaging sensors, pages 1-63, Springer-Verlag, 1989.
[2] P.J. Besl and N.D. McKay, "A method for registration of 3-D shapes", IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239-256, 1992.
[3] Jean-Yves Bouguet and Pietro Perona, "3D photography on your desk", Technical report, California Institute of Technology, 1997, available at: http://www.vision.caltech.edu/bouguetj/ICCV98.
[4] Brian Curless and Marc Levoy, "Better optical triangulation through spacetime analysis", Proc. 5th Int. Conf. Computer Vision, pages 987-993, 1995.
[5] Brian Curless and Marc Levoy, "A volumetric method for building complex models from range images", SIGGRAPH 96, Computer Graphics Proceedings, 1996.
[6] O.D. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, 1993.
[7] Berthold K.P. Horn, "Closed-form solution of absolute orientation using unit quaternions", J. Opt. Soc. Am. A, 4(4):629-642, 1987.
[8] Jurgen R. Meyer-Arendt, "Radiometry and photometry: Units and conversion factors", Applied Optics, 7(10):2081-2084, October 1968.
[9] Shree K. Nayar, Katsushi Ikeuchi, and Takeo Kanade, "Shape from interreflections", Int. J. of Computer Vision, 6(3):173-195, 1991.
[10] Athanasios Papoulis, Probability, Random Variables and Stochastic Processes, McGraw-Hill, Third Edition, 1991.
[11] A.J. Stoddart and A. Hilton, "Registration of multiple point sets", Proceedings of the 13th Int. Conf. on Pattern Recognition, 1996.
[12] Marjan Trobina, "Error model of a coded-light range sensor", Technical Report BIWI-TR-164, ETH-Zentrum, 1995.
[13] R.Y. Tsai, "A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses", IEEE J. Robotics Automat., RA-3(4):323-344, 1987.
[14] John W.T. Walsh, Photometry, Dover, NY, 1965.
[15] Y.F. Wang, "Characterizing three-dimensional surface structures from visual images", IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(1):52-60, 1991.
[16] Z. Yang and Y.F. Wang, "Error analysis of 3D shape construction from structured lighting", Pattern Recognition, 29(2):189-206, 1996.