EN161 Image Understanding Projects

 

All projects entail carefully reading and understanding 1-2 main papers, plus several supplementary papers, as the foundation for implementing and testing a current method in your chosen topic.  You will be expected to discuss the strengths and weaknesses of the method.

 

Contact TA: MingChing Chang

 

Projects with are highly recommended for this course.

 



1. Mean Shift

 

Mean Shift: A Robust Approach Toward Feature Space Analysis

Dorin Comaniciu and Peter Meer, PAMI 2002.

 

Abstract

A general nonparametric technique is proposed for the analysis of a complex multimodal feature space and to delineate arbitrarily shaped clusters in it. The basic computational module of the technique is an old pattern recognition procedure, the mean shift. We prove for discrete data the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and, thus, its utility in detecting the modes of the density. The relation of the mean shift procedure to the Nadaraya-Watson estimator from kernel regression and the robust M-estimators of location is also established. Algorithms for two low-level vision tasks, discontinuity preserving smoothing and image segmentation, are described as applications. In these algorithms, the only user set parameter is the resolution of the analysis and either gray level or color images are accepted as input. Extensive experimental results illustrate their excellent performance.

 

Link, Paper, More results, Code from Comaniciu, Code from Meer.
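
As a starting point for this project, the mode-seeking iteration described in the abstract can be sketched as follows. This is a minimal illustration only, assuming one-dimensional data and a Gaussian kernel; the function name, parameters, and toy data are hypothetical and are not taken from the authors' code.

```python
import math

def mean_shift_mode(points, start, bandwidth, tol=1e-6, max_iter=500):
    """Move `start` uphill on a Gaussian kernel density estimate of `points`."""
    x = float(start)
    for _ in range(max_iter):
        # Weight every sample by its kernel response at the current position.
        weights = [math.exp(-0.5 * ((p - x) / bandwidth) ** 2) for p in points]
        # The mean shift step moves x to the weighted mean of the samples.
        new_x = sum(w * p for w, p in zip(weights, points)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Toy data with two clusters; each sample is shifted to its nearest mode.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]
modes = [mean_shift_mode(data, p, bandwidth=0.5) for p in data]
```

Samples that converge to the same mode form one cluster. The bandwidth plays the role of the "resolution of the analysis" mentioned in the abstract.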


2. Integral Histogram

 

Integral Histogram: A Fast Way To Extract Histograms in Cartesian Spaces

Fatih Porikli, CVPR 2005.

 

Abstract

We present a novel method, which we refer as an integral histogram, to compute the histograms of all possible target regions in a Cartesian data space. Our method has three distinct advantages: 1- It is computationally superior to the conventional approach. The integral histogram method makes it possible to employ even an exhaustive search process in real-time, which was impractical before. 2- It can be extended to higher data dimensions, uniform and non-uniform bin formations, and multiple target scales without sacrificing its computational advantages. 3- It enables the description of higher level histogram features. We exploit the spatial arrangement of data points, and recursively propagate an aggregated histogram by starting from the origin and traversing through the remaining points along either a scan-line or a wave-front. At each step, we update a single bin using the values of integral histogram at the previously visited neighboring data points. After the integral histogram is propagated, histogram of any target region can be computed easily by using simple arithmetic operations.

 

Link, Paper.
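
The scan-line propagation and the constant-time region query described in the abstract can be sketched as below. This is a simplified illustration, assuming a 2D image whose pixel values are already bin indices; the function names are hypothetical. For clarity the sketch copies all bins at each step, rather than updating a single bin as the paper describes.

```python
def integral_histogram(image, num_bins):
    """H[y][x][b] = count of bin b among pixels in rows < y and columns < x."""
    rows, cols = len(image), len(image[0])
    H = [[[0] * num_bins for _ in range(cols + 1)] for _ in range(rows + 1)]
    for y in range(rows):
        for x in range(cols):
            cell = H[y + 1][x + 1]
            for b in range(num_bins):
                # Combine the three previously visited neighbours.
                cell[b] = H[y][x + 1][b] + H[y + 1][x][b] - H[y][x][b]
            # Add the current pixel to its bin.
            cell[image[y][x]] += 1
    return H

def region_histogram(H, top, left, bottom, right):
    """Histogram of rows top..bottom-1 and columns left..right-1."""
    num_bins = len(H[0][0])
    return [H[bottom][right][b] - H[top][right][b]
            - H[bottom][left][b] + H[top][left][b]
            for b in range(num_bins)]

image = [[0, 1, 2],
         [1, 1, 0],
         [2, 0, 1]]
H = integral_histogram(image, num_bins=3)
patch = region_histogram(H, 0, 1, 2, 3)   # rows 0-1, columns 1-2
```

Each region query costs four lookups per bin regardless of the region size, which is what makes exhaustive search over many candidate windows affordable.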


3. Chamfer Distance for Pedestrian Detection

 

Pedestrian Detection in Crowded Scenes

B. Leibe, E. Seemann, and B. Schiele, CVPR 2005.

 

Abstract

In this paper, we address the problem of detecting pedestrians in crowded real-world scenes with severe overlaps. Our basic premise is that this problem is too difficult for any type of model or feature alone. Instead, we present a novel algorithm that integrates evidence in multiple iterations and from different sources. The core part of our method is the combination of local and global cues via a probabilistic top-down segmentation. Altogether, this approach allows to examine and compare object hypotheses with high precision down to the pixel level. Qualitative and quantitative results on a large data set confirm that our method is able to reliably detect pedestrians in crowded scenes, even when they overlap and partially occlude each other. In addition, the flexible nature of our approach allows it to operate on very small training sets.

 

Link, Author's Page, Paper.

Shape Context and Chamfer Matching in Cluttered Scenes

A. Thayananthan, B. Stenger, P. H. S. Torr, R. Cipolla, CVPR 2003.

 

Abstract

This paper compares two methods for object localization from contours: shape context and chamfer matching of templates. In the light of our experiments, we suggest improvements to the shape context: shape contexts are used to find corresponding features between model and image. In real images it is shown that the shape context is highly influenced by clutters; furthermore, even when the object is correctly localized, the feature correspondence may be poor. We show that the robustness of shape matching can be increased by including a figural continuity constraint. The combined shape and continuity cost is minimized using the Viterbi algorithm on features, resulting in improved localization and correspondence. Our algorithm can be generally applied to any feature based shape matching method. Chamfer matching correlates model templates with the distance transform of the edge image. This can be done efficiently using a coarse-to-fine search over the transformation parameters. The method is robust in clutter, however, multiple templates are needed to handle scale, rotation and shape variation. We compare both methods for locating hand shapes in cluttered images, and applied to word recognition in EZ-Gimpy images.

 

Link, Paper.
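
To make the chamfer matching idea concrete (correlating a template with the distance transform of the edge image), here is a minimal sketch. It assumes a binary edge map, a city-block distance computed with a two-pass scan, and a search over translations only; the coarse-to-fine search and the handling of scale and rotation are omitted. All names are hypothetical.

```python
def distance_transform(edges):
    """City-block distance from every pixel to the nearest edge pixel."""
    rows, cols = len(edges), len(edges[0])
    far = rows + cols  # upper bound on any city-block distance in the image
    d = [[0 if edges[y][x] else far for x in range(cols)] for y in range(rows)]
    for y in range(rows):                      # forward pass
        for x in range(cols):
            if y > 0:
                d[y][x] = min(d[y][x], d[y - 1][x] + 1)
            if x > 0:
                d[y][x] = min(d[y][x], d[y][x - 1] + 1)
    for y in reversed(range(rows)):            # backward pass
        for x in reversed(range(cols)):
            if y < rows - 1:
                d[y][x] = min(d[y][x], d[y + 1][x] + 1)
            if x < cols - 1:
                d[y][x] = min(d[y][x], d[y][x + 1] + 1)
    return d

def chamfer_score(dt, template_points, dy, dx):
    """Mean distance-transform value under the translated template points."""
    return sum(dt[ty + dy][tx + dx] for ty, tx in template_points) / len(template_points)

def best_translation(dt, template_points, max_dy, max_dx):
    candidates = [(dy, dx) for dy in range(max_dy + 1) for dx in range(max_dx + 1)]
    return min(candidates, key=lambda t: chamfer_score(dt, template_points, *t))

# An L-shaped contour in a 5x5 edge map and the matching template.
edges = [[0, 0, 0, 0, 0],
         [0, 1, 0, 0, 0],
         [0, 1, 0, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 0, 0, 0]]
template = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
dt = distance_transform(edges)
offset = best_translation(dt, template, max_dy=2, max_dx=2)
```

A score of zero means every template point lands on an edge pixel; larger scores indicate a poorer fit.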

Real-Time Object Detection for Smart Vehicles

D. Gavrila and V. Philomin, ICCV 1999.

 

Abstract

This paper presents an efficient shape-based object detection method based on Distance Transforms and describes its use for real-time vision on-board vehicles. The method uses a template hierarchy to capture the variety of object shapes; efficient hierarchies can be generated offline for given shape distributions using stochastic optimization techniques (i.e. simulated annealing). Online, matching involves a simultaneous coarse-to-fine approach over the shape hierarchy and over the transformation parameters. Very large speedup factors are typically obtained when comparing this approach with the equivalent brute-force formulation; we have measured gains of several orders of magnitudes. We present experimental results on the real-time detection of traffic signs and pedestrians from a moving vehicle. Because of the highly time sensitive nature of these vision tasks, we also discuss some hardware-specific implementations of the proposed method as far as SIMD parallelism is concerned.

 

Link, Paper.

4. Human Detection

 

Histograms of Oriented Gradients for Human Detection

Navneet Dalal and Bill Triggs, CVPR 2005.

 

Abstract

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

 

Link, Paper.

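
The core of the descriptor, an orientation histogram over one cell followed by contrast normalization, can be sketched as follows. This is a simplified illustration, assuming 9 unsigned orientation bins, centered differences for the gradient, hard bin assignment (no vote interpolation), and plain L2 normalization of a single cell rather than of overlapping blocks. Function names are hypothetical.

```python
import math

def cell_orientation_histogram(image, num_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    rows, cols = len(image), len(image[0])
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = image[y][x + 1] - image[y][x - 1]   # centered differences
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % num_bins] += magnitude
    return hist

def l2_normalize(hist, eps=1e-6):
    norm = math.sqrt(sum(v * v for v in hist) + eps * eps)
    return [v / norm for v in hist]

# Intensity ramp along x: every gradient points along the x axis (bin 0).
ramp_x = [[10 * x for x in range(5)] for _ in range(5)]
# Intensity ramp along y: every gradient points along the y axis (90 degrees).
ramp_y = [[10 * y for _ in range(5)] for y in range(5)]
hist_x = l2_normalize(cell_orientation_histogram(ramp_x))
hist_y = l2_normalize(cell_orientation_histogram(ramp_y))
```

A full detector would concatenate the normalized histograms of all cells in a detection window and feed the result to a linear SVM, as the abstract describes.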
5. Space Time Interest Points

 

On Space-Time Interest Points

Ivan Laptev, IJCV 2005.

 

Abstract

Local image features or interest points provide compact and abstract representations of patterns in an image. In this paper, we extend the notion of spatial interest points into the spatio-temporal domain and show how the resulting features often reflect interesting events that can be used for a compact representation of video data as well as for interpretation of spatio-temporal events.
To detect spatio-temporal events, we build on the idea of the Harris and Förstner interest point operators and detect local structures in space-time where the image values have significant local variations in both space and time. We estimate the spatio-temporal extents of the detected events by maximizing a normalized spatio-temporal Laplacian operator over spatial and temporal scales. To represent the detected events, we then compute local, spatio-temporal, scale-invariant N-jets and classify each event with respect to its jet descriptor. For the problem of human motion analysis, we illustrate how a video representation in terms of local space-time features allows for detection of walking people in scenes with occlusions and dynamic cluttered backgrounds.

Link, Paper.

A Distance Measure and a Feature Likelihood Map Concept for Scale-Invariant Model Matching

Ivan Laptev and Tony Lindeberg, IJCV 2003.

 

Abstract

This paper presents two approaches for evaluating multi-scale feature-based object models. Within the first approach, a scale-invariant distance measure is proposed for comparing two image representations in terms of multi-scale features. Based on this measure, the maximisation of the likelihood of parameterised feature models allows for simultaneous model selection and parameter estimation.
The idea of the second approach is to avoid an explicit feature extraction step and to evaluate models using a function defined directly from the image data. For this purpose, we propose the concept of a feature likelihood map, which is a function normalised to the interval [0, 1], and that approximates the likelihood of image features at all points in scale-space.
To illustrate the applicability of both methods, we consider the area of hand gesture analysis and show how the proposed evaluation schemes can be integrated within a particle filtering approach for performing simultaneous tracking and recognition of hand models under variations in the position, orientation, size and posture of the hand. The experiments demonstrate the feasibility of the approach, and that real time performance can be obtained by pyramid implementations of the proposed concepts.

 

Link, Paper.

Local Descriptors for Spatio-Temporal Recognition

Ivan Laptev and Tony Lindeberg, ECCV Workshop "Spatial Coherence for Visual Motion Analysis" 2004.

Abstract

This paper presents and investigates a set of local space-time descriptors for representing and recognizing motion patterns in video. Following the idea of local features in the spatial domain, we use the notion of space-time interest points and represent video data in terms of local space-time events. To describe such events, we define several types of image descriptors over local spatio-temporal neighborhoods and evaluate these descriptors in the context of recognizing human activities. In particular, we compare motion representations in terms of spatio-temporal jets, position dependent histograms, position independent histograms, and principal component analysis computed for either spatio-temporal gradients or optic flow. An experimental evaluation on a video database with human actions shows that high classification performance can be achieved, and that there is a clear advantage of using local position dependent histograms, consistent with previously reported findings regarding spatial recognition.

 

Link, Data, Paper.

 "Velocity adaptation of space-time interest points" (2004),
I. Laptev and T. Lindeberg; in Proc. ICPR'04, Cambridge, UK, pp.I:52-56.

"Galilean-diagonalized spatio-temporal interest operators" (2004),
T. Lindeberg, A. Akbarzadeh and I. Laptev; in Proc. ICPR'04, Cambridge, UK, pp.I:57-62.

"Recognizing Human Actions: A Local SVM Approach" (2004),
Christian Schuldt, Ivan Laptev and Barbara Caputo; in Proc. ICPR'04, Cambridge, UK, pp.III:32--36.

Laptev's Page.
6. Efficient Graph-Based Image Segmentation

 

Efficient Graph-Based Image Segmentation

Pedro F. Felzenszwalb and Daniel P. Huttenlocher, IJCV 2004.

 

Abstract

This paper addresses the problem of segmenting an image into regions. We define a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image. We then develop an efficient segmentation algorithm based on this predicate, and show that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties. We apply the algorithm to image segmentation using two different kinds of local neighborhoods in constructing the graph, and illustrate the results with both real and synthetic images. The algorithm runs in time nearly linear in the number of graph edges and is also fast in practice. An important characteristic of the method is its ability to preserve detail in low-variability image regions while ignoring detail in high-variability regions.

 

Link, Paper, Code.

7. Efficient Graph-Based Image Segmentation

 


CP method of Ferrari and van Gool 2001 (ICCV or CVPR).
which one in here?

8. Deblurring

 


9. Design and Implementation of bag-of-keypoints image classification

The purpose of the project is to label images depending on their object content, for instance an image containing a car should be labeled with the keyword car. Visual features are extracted and clustered to form visual codewords and the occurrence statistics of these codewords in the images discriminate between image classes. A query image is labeled with the keyword of a class if visual codewords of that class occur frequently on the image.

The main paper is: Csurka, Dance, Fan, Willamowski, Bray, Visual categorization with bags of keypoints, ECCV04

Main steps:

1) Construction of the codebook from the training image set:

a) SIFT interest points and descriptors will be used as visual features. We have code to detect SIFT interest points and extract the SIFT descriptors. The student's task is to become familiar with SIFT features in practice, such as sensibly selecting the parameters.

b) For each image in the training set: detect interest points and extract descriptors.

c) Cluster the descriptors using the k-means clustering algorithm. The cluster centers constitute the visual codewords.

d) Implement a routine that colors the detected keypoints on an image with the color of the nearest codeword.

2) Classify images using a Naive Bayes classifier: collect occurrence statistics of codewords from the training set for each class, and use these statistics to classify the query image.

3) If time permits, design and implement a process that bypasses keypoint detection and extracts SIFT descriptors from a dense grid over the image, then repeat the experiment with dense features.
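
The steps above (assigning descriptors to codewords, then classifying from codeword counts) can be sketched as follows. This is a toy illustration only: it assumes a codebook that has already been built by k-means, uses 2-D descriptors instead of 128-D SIFT descriptors, and uses a multinomial Naive Bayes model with Laplace smoothing. All function names and data are hypothetical.

```python
import math

def nearest_codeword(descriptor, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[i])))

def codeword_histogram(descriptors, codebook):
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest_codeword(d, codebook)] += 1
    return hist

def train_naive_bayes(histograms, labels, num_words):
    """Return log priors and Laplace-smoothed log P(codeword | class)."""
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        counts = [1] * num_words            # Laplace smoothing
        n_images = 0
        for h, y in zip(histograms, labels):
            if y == c:
                n_images += 1
                for w in range(num_words):
                    counts[w] += h[w]
        total = sum(counts)
        log_prior[c] = math.log(n_images / len(labels))
        log_likelihood[c] = [math.log(v / total) for v in counts]
    return log_prior, log_likelihood

def classify(hist, log_prior, log_likelihood):
    def score(c):
        return log_prior[c] + sum(n * l for n, l in zip(hist, log_likelihood[c]))
    return max(log_prior, key=score)

codebook = [[0.0, 0.0], [10.0, 10.0]]                 # two toy codewords
query_hist = codeword_histogram([[1, 0], [9, 9], [0, 2]], codebook)
train_hists = [[5, 1], [4, 0], [1, 6], [0, 3]]
train_labels = ["car", "car", "face", "face"]
prior, likelihood = train_naive_bayes(train_hists, train_labels, num_words=2)
predicted = classify([3, 1], prior, likelihood)
```

Working in log space avoids numerical underflow when an image contains many keypoints.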

Please talk to Ozge Can Ozcanli B&H 317 for further assistance


 


10. Design and implement the spatial pyramid matching backbone

 

The idea is to partition the image into rectangular regions as a hierarchy and extract the codeword histograms from each partition. Codeword construction is very similar to the previous project.
The main paper is: Lazebnik, Schmid, Ponce, CVPR06
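
The hierarchical partitioning can be sketched as below. This is a minimal illustration, assuming each feature is given as (x, y, codeword) with coordinates normalized to [0, 1), and that level l splits the image into 2^l by 2^l cells. The per-level weighting used in the paper's matching kernel is deliberately omitted, and the function name is hypothetical.

```python
def spatial_pyramid(points, num_words, levels):
    """Concatenate codeword histograms of every cell at every pyramid level."""
    feature = []
    for level in range(levels + 1):
        cells = 2 ** level
        hists = [[0] * num_words for _ in range(cells * cells)]
        for x, y, word in points:
            cx = min(int(x * cells), cells - 1)
            cy = min(int(y * cells), cells - 1)
            hists[cy * cells + cx][word] += 1
        for h in hists:
            feature.extend(h)
    return feature

# Three features quantized into two codewords, pyramid with levels 0 and 1.
points = [(0.1, 0.1, 0), (0.9, 0.1, 1), (0.9, 0.9, 1)]
feature = spatial_pyramid(points, num_words=2, levels=1)
```

Level 0 reduces to the plain bag-of-keypoints histogram from the previous project; finer levels add coarse spatial layout.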

Please talk to Ozge Can Ozcanli B&H 317 for further assistance


11. Design and implement an algorithm to select maximally informative features from a training set.

 

Given a set of feature vectors, a distance function to match the features, a set of labeled training feature vectors: design and implement the backbone to select the most informative subset to discriminate the classes.
The main paper is Vidal-Naquet, Ullman, ICCV03
Please talk to Ozge Can Ozcanli B&H 317 for further assistance


12. Reconstruction and Auto-Calibration Using a Handheld Camera

 

The purpose of this project is to implement a system that takes an input video
sequence of a static scene and outputs the position of the camera at each frame,
estimates of the intrinsic parameters, and reconstructed 3D points.  The student
will develop familiarity with recent methods of automatic reconstruction from
uncalibrated cameras, which are used in 3D photography, match-move applications
in the entertainment industry, and photogrammetry.  We will provide C++ and
matlab code for many of the parts of this project, and the student will put them
together to form a more complete system.

                           


The steps for implementing the system are:

1) Detect interest points in each image and match them using similarity of
appearance.
  - Usually SIFT features are used [Lowe]. We have code for this; the student's
    task is to become familiar with SIFT features in practice, such as
    sensibly selecting the parameters.

2) Compute the epipolar geometry / fundamental matrix between pairs of views.
  - This is performed using a robust estimation strategy, called RANSAC.
  - Only reliable feature matches are retained.
  - We also have code for this part. The task of the student is to develop a
    working knowledge of fundamental matrix estimation and the geometry of
    multiple views.

3) From two selected views, called "key frames" in this project, obtain two
canonical camera matrices as described in [Hartley and Zisserman]. These camera
matrices differ from the true ones by an 8-parameter ambiguity, but we can
nevertheless work with them for the purpose of enforcing consistency across the
whole video sequence.
  - This stage involves mere conceptual understanding, and can be performed in
    one line of C++ code.

4) "Projective Reconstruction" - Starting from the two key frames,
incrementally add another frame, forming
the key frame set.  Impose consistency of the new frame with the previous key
frame set.  This is performed using camera pose (also called resectioning) from
3D points to 2D points, assuming the canonical cameras as true ones. A robust
RANSAC strategy guides this process in order to eliminate false feature
correspondences. Keep adding frames and imposing consistency until no new frame
can be added.
  - We have code for the basic components, but the student will have to put it
    together in order to obtain a camera resectioning routine that works with
    RANSAC. This can be done with the aid of students in our lab.

5) The camera parameters (rotations, translations, and intrinsic parameters) and
the 3D reconstruction of matching feature points are now known up to 8 degrees of
freedom. The hardest part of the project is now done. The student must evaluate
this reconstruction before proceeding. One way of doing this is to obtain the
intrinsic parameters of the camera using a traditional calibration board and
then use these parameters, together with the projective reconstruction
obtained in step 4, to generate a final metric reconstruction. By displaying
the positions of the cameras, the student can verify that the result is plausible.
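
Steps 2 and 4 both rely on RANSAC. The hypothesize-and-verify loop can be sketched on a much simpler model than the fundamental matrix: the illustration below fits a 2D line to points contaminated by outliers. It is only meant to show the structure of the loop; the function name, threshold, and data are hypothetical, and the real system would sample point correspondences and estimate a fundamental matrix or camera pose instead.

```python
import random

def ransac_line(points, threshold, iterations, seed=0):
    """Fit y = slope * x + intercept while ignoring outliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # 1. Hypothesize: fit the model to a minimal random sample.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue                      # degenerate sample for this model
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # 2. Verify: count the points that agree with the hypothesis.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) < threshold]
        # 3. Keep the hypothesis with the largest consensus set.
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (slope, intercept), inliers
    return best_model, best_inliers

# Ten points on y = 2x + 1 plus three gross outliers.
points = [(x, 2 * x + 1) for x in range(10)] + [(1, 10), (4, -3), (7, 30)]
model, inliers = ransac_line(points, threshold=0.1, iterations=200)
```

In practice the final model is usually re-estimated from all inliers, and only the inlier matches are retained for the later stages.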


Optional Steps (extra credit)

6) Even without knowing the intrinsic parameters of the cameras, the student can
still generate a final reconstruction from 4), bypassing any manual
calibration procedure as done in 5). This is performed by solving so-called
auto-calibration equations. We have no code for this part, but the student can
ask for help in our lab.

7) Dense stereo and Texture-mapping. This will provide a visually pleasing
reconstruction of an object captured by the video sequence.


Note that the system as a whole is quite complicated, but we have code for many
of the necessary modules, and help is available from people in our
lab whenever necessary.


References
==========

[1] Marc Pollefeys et al. "Visual Modeling with a Handheld Camera",
International Journal of Computer Vision, 2004. See also associated PhD
thesis, Marc Pollefeys, K. U. Leuven 1999. Advisor: L. Van Gool:
    http://cortex.lems.brown.edu/~rfabbri/stuff/PhD-Pollefeys.pdf
[2] Hartley and Zisserman's book, Cambridge University Press.
[3] Iryna Gordon's MSc thesis, U. of British Columbia. Advisor: D. Lowe.

paper1

Please talk to Ricardo Fabbri B&H 317 for further assistance.


13.  Texture Classification

 

Locally Invariant Fractal Features for Statistical Texture Classification

ICCV 2007.

Please talk to Amir Tamrakar B&H 317 for details.


14. Object Detection Using Pictorial Structures
A pictorial structure is a model for object detection/localization: given an image containing a particular object, e.g. a person or a car, we wish to determine the positions of that object and its parts. In this model, an object is represented as a collection of parts arranged in a deformable configuration. Each part encodes local visual properties of the object, and the deformable configuration is characterized by spring-like connections between certain pairs of parts. The best match of such a model to an image is found by minimizing an energy function that is the sum of two terms: the degree of mismatch for each part (typically computed from a part detector) and the degree of deformation for each pair of connected parts (compared to a generic configuration). In its most general case, optimizing this energy function is very difficult due to the high number of parameters and the complexity of the dependency graph. Felzenszwalb and Huttenlocher developed an efficient algorithm to solve the problem in linear time. In this project, students will:

- study the model and the algorithm proposed by Felzenszwalb and Huttenlocher (there is no free lunch: the authors achieve linear time by putting certain restrictions on the model).

- implement the algorithm to detect a generic object class, e.g. a pedestrian or a car. 
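
One way to see where the linear running time can come from is the distance transform of a sampled cost function under a squared-distance deformation cost. The sketch below is a simplified one-dimensional illustration: it assumes a root part and a single child part whose ideal displacement from the root is zero, and finite match costs. The function names and toy costs are hypothetical.

```python
def squared_distance_transform(cost):
    """Return D with D[p] = min over q of cost[q] + (p - q)**2, in linear time."""
    n = len(cost)
    inf = float("inf")
    v = [0] * n              # positions of the parabolas in the lower envelope
    z = [0.0] * (n + 1)      # boundaries between consecutive parabolas
    z[0], z[1] = -inf, inf
    k = 0
    for q in range(1, n):
        while True:
            p = v[k]
            # Intersection of the parabola rooted at q with the one rooted at p.
            s = ((cost[q] + q * q) - (cost[p] + p * p)) / (2 * q - 2 * p)
            if s > z[k]:
                break
            k -= 1           # the parabola at p is hidden; discard it
        k += 1
        v[k] = q
        z[k], z[k + 1] = s, inf
    result = [0] * n
    k = 0
    for p in range(n):
        while z[k + 1] < p:
            k += 1
        result[p] = cost[v[k]] + (p - v[k]) ** 2
    return result

def best_root_location(root_cost, part_cost):
    """Minimize root mismatch plus the best (mismatch + deformation) of the part."""
    deform = squared_distance_transform(part_cost)
    total = [r + d for r, d in zip(root_cost, deform)]
    return min(range(len(total)), key=total.__getitem__), total

root_cost = [5, 2, 6, 1, 8]      # mismatch of the root at each location
part_cost = [4, 1, 7, 0, 9]      # mismatch of the child part at each location
best, total = best_root_location(root_cost, part_cost)
```

Without the distance transform, the inner minimization over part locations would cost quadratic time in the number of locations.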

References:
1. Pedro Felzenszwalb and Huttenlocher, CVPR 2000, "Efficient matching of pictorial structures".

2. Pedro Felzenszwalb and Huttenlocher, IJCV 2005, "Pictorial structures for object recognition".

Please talk to Nhon Trinh B&H 317 for details.


15. Segmentation of knee cartilage in MRI images using pixel classification

Accurate segmentation of the articular cartilage from knee MRI images is an essential step in diagnosing osteoarthritis. One approach to segmenting the cartilage is to recognize that all voxels in the cartilage are made of the same type of tissue and therefore should possess similar local properties that other voxels do not. In this project, students will:
- investigate which local properties separate cartilage voxels from their neighborhood and build a cartilage/non-cartilage classifier based on these properties.
- develop an algorithm to group the potential cartilage voxels together to yield an accurate segmentation.

References:
[1] Folkesson et al., TMI 2007, "Segmenting articular cartilage automatically using a voxel classification approach".

Available data:  MRI images and manual segmentation of 5 cadavers and 1 living person.

Please talk to Nhon Trinh B&H 317 for details.


16. Improved watershed transform using prior information and its application to segment knee cartilage in MRI images.

 

The watershed transform is a popular and intuitive segmentation method coming from the field of mathematical morphology: if we consider the image as a topographic relief, where the height of each point is directly related to its gray value, and consider rain gradually falling on the terrain, then the watersheds are the lines separating the "lakes", or catchment basins. Each of the "lakes" is a segmentation region. The traditional watershed suffers from drawbacks such as oversegmentation, sensitivity to noise, and poor detection of thin structures. The referenced paper improves the traditional watershed method by incorporating prior information, such as the absolute or relative intensity of the objects, into the process. The method has been shown to work well for segmenting knee cartilage and brain images.
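
As a baseline for this project, the flooding idea can be sketched with a priority queue. This is a minimal illustration of a plain marker-based watershed, assuming 4-connectivity and user-supplied seed labels (0 means unlabeled); it does not include the prior information that the referenced paper adds, and the function name is hypothetical.

```python
import heapq

def seeded_watershed(height, markers):
    """Flood the relief from the markers, lowest water level first."""
    rows, cols = len(height), len(height[0])
    labels = [row[:] for row in markers]
    heap = []
    for y in range(rows):
        for x in range(cols):
            if labels[y][x] != 0:
                heapq.heappush(heap, (height[y][x], y, x))
    while heap:
        level, y, x = heapq.heappop(heap)
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < rows and 0 <= nx < cols and labels[ny][nx] == 0:
                # The neighbour joins the basin that reaches it first.
                labels[ny][nx] = labels[y][x]
                heapq.heappush(heap, (max(level, height[ny][nx]), ny, nx))
    return labels

# A one-row relief with two valleys separated by a ridge of height 9.
height = [[1, 2, 3, 9, 3, 2, 1]]
markers = [[1, 0, 0, 0, 0, 0, 2]]
labels = seeded_watershed(height, markers)
```

In this sketch the ridge pixel is assigned to whichever basin reaches it first rather than being marked as a watershed line.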

 

References:
[1] Grau et al., TMI 2004, "Improved Watershed Transform for Medical Image Segmentation Using Prior Information".

Available data:  MRI images and manual segmentation of 5 cadavers and 1 living person.

Please talk to Nhon Trinh B&H 317 for details.

 


17. Using active shape models for bifurcating contours to segment tibia and femur from X-ray images of human knees

 

Segmentation of the tibia and the femur from x-ray images of human knees can be used to determine the joint space width (JSW) of a knee, an important standard in the assessment of osteoarthritis. Direct application of the traditional active shape model (ASM) often fails because the traditional ASM relies on explicit inter-image correspondence being established between landmark points, while the outline of the tibia in x-ray images often contains loops, and the positions of bifurcation points, often used as landmarks, vary in a complex way relative to the object. The referenced paper [1] proposes an active shape model for bifurcating contours, describes a method to learn the model, and applies it to segment x-ray images of human knees.

 

References:
[1] Seise et al., TMI 2007, "Learning active shape models for bifurcating contours".

[2] Seise et al., BMVC 2005, "Double Contour Active Shape Models".

Available data:  MRI and x-ray images of human knees (OAI dataset).

Please talk to Nhon Trinh B&H 317 for details.

 

 


18. Shape Regularized Active Contour using Iterative Global Search and Local Optimization

 

Paper abstract:

 

Recently, nonlinear shape models have been shown to improve the robustness and flexibility of segmentation. In this paper, we propose shape regularized active contour (ShRAC) that incorporates existing nonlinear shape models into the classical active contour approach. ShRAC uses a discrete representation of the contour to allow efficient combinatorial search. The search for optimal contour is performed by coarse-to-fine algorithm that iterates between combinatorial search and gradient-based local optimization. First, multi-solution dynamic programming (MSDP) is used to generate initial candidates by minimizing only the image energy. In the second step, a combination of image energy and shape energy determined by a given prior shape model is minimized for the initial candidates using a local optimization method and the best one is selected. To have diverse initial candidates, we employ a clustered solution pruning procedure in the MSDP search space. Finally, local shape regularization is used to feed shape constraints back into the new MSDP search space of the next iteration. Our search strategy combines the advantages of global combinatorial search and local optimization, and has shown excellent robustness to local minima caused by distracting suboptimal segmentations. Experimental results on segmentation of different anatomical structures using ShRAC are provided.

 

References:
[1] Yu et al., CVPR 2005, "Shape regularized active contour using iterative global search and local optimization".

 

Please talk to Nhon Trinh B&H 317 for details.

 

 


19. 2D-3D Registration of Spine X-ray projection (2D) to CT image (3D)

 

Paper: Image Similarity Using Mutual Information of Regions

Abstract: Mutual information (MI) has emerged in recent years as an effective similarity measure for comparing images. One drawback of MI, however, is that it is calculated on a pixel by pixel basis, meaning that it takes into account only the relationships between corresponding individual pixels and not those of each pixel’s respective neighborhood. As a result, much of the spatial information inherent in images is not utilized. In this paper, we propose a novel extension to MI called regional mutual information (RMI). This extension efficiently takes neighborhood regions of corresponding pixels into account. We demonstrate the usefulness of RMI by applying it to a real-world problem in the medical domain: intensity-based 2D-3D registration of X-ray projection images (2D) to a CT image (3D). Using a gold-standard spine image data set, we show that RMI is a more robust similarity measure for image registration than MI.
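
The pixel-by-pixel MI that the paper takes as its starting point can be sketched as below. This is a minimal illustration, assuming two images of equal size with a small number of discrete intensity levels and estimating probabilities from the joint histogram; it does not implement the regional extension (RMI). The function name is hypothetical.

```python
import math
from collections import Counter

def mutual_information(image_a, image_b):
    """MI in bits between corresponding pixel intensities of two images."""
    pairs = [(a, b)
             for row_a, row_b in zip(image_a, image_b)
             for a, b in zip(row_a, row_b)]
    n = len(pairs)
    joint = Counter(pairs)                      # joint intensity histogram
    marginal_a = Counter(a for a, _ in pairs)
    marginal_b = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((marginal_a[a] / n) * (marginal_b[b] / n)))
    return mi

reference = [[0, 0], [1, 1]]
remapped = [[5, 5], [7, 7]]      # same structure, different intensities
unrelated = [[5, 7], [5, 7]]     # intensities independent of the reference
mi_high = mutual_information(reference, remapped)
mi_low = mutual_information(reference, unrelated)
```

Because MI depends only on the statistical relationship between intensities, it stays high when one image is an intensity remapping of the other, which is why it suits comparing images from different modalities.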

 

References:
[1] Russakoff et al., ECCV 2004, "Image Similarity Using Mutual Information of Regions".

 

Please talk to Nhon Trinh B&H 317 for details.