Matthew Leotta
Brown University
Kristin Boyle
Brown University
 
SIGGRAPH 2005 Poster

Figure 1

(a) Original Image. One of the original images from our sequence.

(b) Point Cloud Reconstruction. Reconstructed 3D points from all images projected into the same image plane.

(c) Continuous Depth Map. The depth map computed for this image.

(d) Augmented Image. Virtual objects inserted into the image with shadows and occlusion.

Introduction

Augmented reality systems generally fall into one of two categories: those that require user input or prior information to model the world, and those that automatically model the world from images. The former can provide rich interaction between real and virtual objects at the expense of user effort to model the scene. The latter can render virtual objects into the images but usually do not account for any interaction with the world.

We present a system that automatically builds a 3D model of an environment from an unordered set of images. The model allows us to insert virtual objects into the scene and run plausible, real-time physics simulations between real and virtual objects. The scene model also accounts for occlusion of virtual objects and allows us to cast shadows of virtual objects onto the real objects in the scene.

Our approach is currently limited by the amount of texture in the scene. A well-textured scene yields a dense sampling of detected points, which improves both the depth estimation and the plausibility of the physics simulations.

Scene Modeling

Recent work by Iryna Skrypnyk and David Lowe [Skrypnyk2004] shows how a static scene can be modeled automatically given only images taken from several viewpoints. The scene is modeled by identifying salient image points via the Scale-Invariant Feature Transform (SIFT). The 3D locations of the feature points and the camera parameters used in image formation are estimated simultaneously using bundle adjustment. Skrypnyk and Lowe use this point cloud model to calibrate additional images in real time.
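Bundle adjustment is commonly posed as minimizing the total reprojection error over all 3D points and cameras. The following is a minimal sketch of that error term only, not the optimizer. It assumes a simple pinhole camera with a single focal length and the principal point at the image origin. The camera model and the names project and reprojection_error are illustrative assumptions, not taken from [Skrypnyk2004].

```python
def project(point, rotation, translation, focal):
    """Project a 3D world point with a simple pinhole camera (assumed model).

    rotation is a 3x3 row-major matrix, translation a 3-vector, and focal a
    focal length in pixels. Returns ((u, v), depth).
    """
    # Transform the point into the camera frame: X_cam = R * X + t.
    cam = [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
           for r in range(3)]
    # Perspective division; cam[2] is the depth along the optical axis.
    return (focal * cam[0] / cam[2], focal * cam[1] / cam[2]), cam[2]


def reprojection_error(points, cameras, observations):
    """Sum of squared pixel distances between observed and projected features.

    cameras is a list of (rotation, translation, focal) tuples and
    observations is a list of (camera_index, point_index, (u, v)) tuples.
    """
    total = 0.0
    for cam_idx, pt_idx, (u, v) in observations:
        rotation, translation, focal = cameras[cam_idx]
        (pu, pv), _ = project(points[pt_idx], rotation, translation, focal)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total
```

A bundle adjuster would vary both the point coordinates and the camera parameters to drive this sum toward its minimum, typically with a nonlinear least-squares solver.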

We extend this work in a different direction. Our focus is to improve the scene model so that we can create a real-time interactive environment viewed through augmentation of the original image set. After reconstructing a sparse point cloud [Figure 1(b)], we estimate a continuous depth map for each of the images [Figure 1(c)]. To this end, the subset of 3D points originally detected in each image is projected back into that image. This provides a set of 2D points with known depths. For each image, we apply thin-plate spline interpolation to these depth values. The result is a smooth function of depth that both interpolates and extrapolates the data.
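The depth interpolation step can be illustrated with the standard 2D thin-plate spline, which combines an affine term with radial basis terms U(r) = r^2 log r centered on the samples. The sketch below is a minimal pure-Python version under that assumption. The function names (tps_kernel, solve, fit_depth_map) are hypothetical, and the exact formulation used in our system may differ. It needs at least three non-collinear sample points.

```python
import math


def tps_kernel(r):
    """Thin-plate radial basis U(r) = r^2 log r, with U(0) = 0."""
    return 0.0 if r == 0.0 else r * r * math.log(r)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_depth_map(pixels, depths):
    """Return a function depth(u, v) that interpolates the given samples.

    pixels is a list of (u, v) image positions and depths the matching depths.
    """
    n = len(pixels)
    a = [[0.0] * (n + 3) for _ in range(n + 3)]
    for i, (ui, vi) in enumerate(pixels):
        for j, (uj, vj) in enumerate(pixels):
            a[i][j] = tps_kernel(math.hypot(ui - uj, vi - vj))
        # Affine part and the side conditions that make the system square.
        a[i][n], a[i][n + 1], a[i][n + 2] = 1.0, ui, vi
        a[n][i], a[n + 1][i], a[n + 2][i] = 1.0, ui, vi
    coeffs = solve(a, list(depths) + [0.0, 0.0, 0.0])
    weights, (a0, a1, a2) = coeffs[:n], coeffs[n:]

    def depth(u, v):
        value = a0 + a1 * u + a2 * v
        for w, (ui, vi) in zip(weights, pixels):
            value += w * tps_kernel(math.hypot(u - ui, v - vi))
        return value

    return depth
```

The returned function passes through every sample and, because of the affine term, extends smoothly beyond the convex hull of the samples, which is what allows a depth value to be assigned to every pixel of the image.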

Plausible Physics and Interaction

To perform plausible physics we implemented a real-time simulation based upon the work of Guendelman et al. [Guendelman2003]. To account for interactions with our scene model, we treat each reconstructed feature as a point of infinite mass. To simulate the force due to gravity in our augmented world, we let the user determine the scale and orientation of the world. This removes the inherent ambiguity in the reconstruction. We also provide an interface for inserting and manipulating virtual objects in the scene. Users can assign initial positions and velocities to objects, view the scene from any of the viewpoints, and run simulations at various speeds. While the simulation does run in real time, running at slower speeds (with shorter time steps) results in more accurate simulations.
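To illustrate the role of the infinite-mass points and of the user-chosen scale and orientation, the following is a deliberately simplified sketch: a single virtual sphere, semi-implicit Euler integration, and an impulse-style bounce against fixed scene points. It is not the algorithm of [Guendelman2003], and every name and parameter in it (gravity_vector, step_sphere, units_per_meter, restitution) is a hypothetical choice made for the example.

```python
import math


def gravity_vector(up, units_per_meter, g=9.81):
    """Gravity in reconstruction units from a user-chosen up direction and scale."""
    norm = math.sqrt(sum(c * c for c in up))
    return [-g * units_per_meter * c / norm for c in up]


def step_sphere(position, velocity, radius, scene_points, gravity, dt,
                restitution=0.5):
    """Advance one sphere by one time step against immovable scene points."""
    # Semi-implicit Euler: update the velocity first, then the position.
    velocity = [v + g * dt for v, g in zip(velocity, gravity)]
    position = [p + v * dt for p, v in zip(position, velocity)]
    for point in scene_points:
        offset = [p - q for p, q in zip(position, point)]
        dist = math.sqrt(sum(c * c for c in offset))
        if 0.0 < dist < radius:
            normal = [c / dist for c in offset]
            approach = sum(v * n for v, n in zip(velocity, normal))
            if approach < 0.0:
                # The point has infinite mass, so only the sphere's velocity changes.
                velocity = [v - (1.0 + restitution) * approach * n
                            for v, n in zip(velocity, normal)]
            # Push the sphere out so the scene point lies on its surface.
            position = [q + radius * n for q, n in zip(point, normal)]
    return position, velocity
```

Because the scene points never move, a collision alters only the virtual object's state. Halving dt and calling step_sphere twice as often follows the trajectory more closely, which mirrors the speed versus accuracy trade-off described above.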

Rendering

To run physics simulations in real time we must be able to render the scene efficiently. Using OpenGL, we render the images using texture-mapped planes. We render the depth maps by sampling our depth function on a regular grid, back-projecting these points into 3D, and rendering a triangular mesh of these points into the depth buffer. Rendering the scene data in this way is much more efficient than calling glDrawPixels. The number of vertices in the depth grid is a free variable that controls the trade-off between rendering speed and adherence to the depth function. Since the depth function is very smooth, we are able to down-sample considerably without noticeable loss in quality. The resulting depth maps play a dual role. First, they allow the real objects to occlude the virtual ones. Second, they allow shadows to be cast from virtual objects onto the surfaces in the real scene. For the latter, we use shadow volumes and make the assumption that half of each image's intensity comes from ambient light while the rest comes from direct lighting.
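The grid sampling and back-projection described above can be sketched as follows. The sketch builds only the vertex and triangle lists and leaves out the OpenGL calls. It assumes a pinhole camera with focal length focal and principal point (cx, cy), and requires at least two grid rows and columns. The names depth_grid_mesh and shadowed_intensity are illustrative, and the shadow helper simply applies the half-ambient assumption to a single pixel.

```python
def depth_grid_mesh(depth_fn, width, height, cols, rows, focal, cx, cy):
    """Sample depth on a regular grid and back-project it to a triangle mesh.

    depth_fn(u, v) returns the depth at pixel (u, v). Returns (vertices,
    triangles), where triangles holds index triples into vertices.
    """
    vertices = []
    for r in range(rows):
        v = r * (height - 1) / (rows - 1)
        for c in range(cols):
            u = c * (width - 1) / (cols - 1)
            z = depth_fn(u, v)
            # Invert the pinhole projection u = focal * x / z + cx.
            vertices.append(((u - cx) * z / focal, (v - cy) * z / focal, z))
    triangles = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c
            # Split each grid cell into two triangles.
            triangles.append((i, i + 1, i + cols))
            triangles.append((i + 1, i + cols + 1, i + cols))
    return vertices, triangles


def shadowed_intensity(intensity, ambient_fraction=0.5):
    """Intensity of a real pixel inside a virtual shadow, keeping only ambient light."""
    return intensity * ambient_fraction
```

The cols and rows arguments are the free variable mentioned above: a coarser grid produces fewer triangles to rasterize, at the cost of following the depth function less closely.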

References

[Guendelman2003]
GUENDELMAN, E., BRIDSON, R., AND FEDKIW, R. 2003. Nonconvex rigid bodies with stacking. ACM Trans. Graph. 22, 3, 871-878.
[Skrypnyk2004]
SKRYPNYK, I., AND LOWE, D. G. 2004. Scene modelling, recognition and tracking with invariant image features. In ISMAR, 110-119.