EN161 Image Understanding Projects
All projects will entail careful reading and understanding of 1-2 main papers, plus reading several supplementary papers, as the foundation for implementing and testing a current method in your chosen topic. You will be expected to be able to discuss the strengths and weaknesses of the method.
Contact TA: MingChing Chang
Projects marked with ![]() are highly recommended for this course.
1. Mean Shift
Mean Shift: A Robust Approach Toward Feature Space Analysis
Dorin Comaniciu and Peter Meer, PAMI 2002.
Abstract
A general nonparametric technique is proposed for the analysis of a complex multimodal feature space and to delineate arbitrarily shaped clusters in it. The basic computational module of the technique is an old pattern recognition procedure, the mean shift. We prove for discrete data the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and, thus, its utility in detecting the modes of the density. The relation of the mean shift procedure to the Nadaraya-Watson estimator from kernel regression and the robust M-estimators of location is also established. Algorithms for two low-level vision tasks, discontinuity preserving smoothing and image segmentation, are described as applications. In these algorithms, the only user set parameter is the resolution of the analysis and either gray level or color images are accepted as input. Extensive experimental results illustrate their excellent performance.
Link, Paper, More results, Code from Comaniciu, Code from Meer.
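As a rough illustration of the mode-seeking step described in the abstract, the following is a minimal sketch of the mean shift iteration. It assumes a Gaussian kernel and a single hand-picked bandwidth; the function name `mean_shift_mode` and the toy points are illustrative only and are not taken from the authors' code.

```python
import math

def mean_shift_mode(points, start, bandwidth, tol=1e-6, max_iter=500):
    """Repeat mean shift steps from `start` until the shift drops below `tol`.

    Assumption: Gaussian kernel with the given bandwidth.
    `points` is a list of equal-length tuples of floats.
    """
    x = tuple(start)
    for _ in range(max_iter):
        # Kernel weight of every data point relative to the current estimate.
        weights = [
            math.exp(-sum((pi - xi) ** 2 for pi, xi in zip(p, x)) / (2.0 * bandwidth ** 2))
            for p in points
        ]
        total = sum(weights)
        # The new estimate is the kernel-weighted mean of the data.
        new_x = tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))
        )
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        x = new_x
        if shift < tol:
            break
    return x

# Two well-separated toy clusters in a 2D feature space.
points = [(0.0, 0.0), (0.2, -0.1), (-0.1, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
mode_a = mean_shift_mode(points, start=(1.0, 1.0), bandwidth=0.5)
mode_b = mean_shift_mode(points, start=(4.0, 4.0), bandwidth=0.5)
```

Starting points that converge to the same mode can then be grouped into one cluster, which is the basis of the segmentation application mentioned in the abstract.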
Integral Histogram: A Fast Way To Extract Histograms in Cartesian Spaces
Fatih Porikli, CVPR 2005.
Abstract
We present a novel method, which we refer as an integral histogram, to compute the histograms of all possible target regions in a Cartesian data space. Our method has three distinct advantages: 1- It is computationally superior to the conventional approach. The integral histogram method makes it possible to employ even an exhaustive search process in real-time, which was impractical before. 2- It can be extended to higher data dimensions, uniform and non-uniform bin formations, and multiple target scales without sacrificing its computational advantages. 3- It enables the description of higher level histogram features. We exploit the spatial arrangement of data points, and recursively propagate an aggregated histogram by starting from the origin and traversing through the remaining points along either a scan-line or a wave-front. At each step, we update a single bin using the values of integral histogram at the previously visited neighboring data points. After the integral histogram is propagated, histogram of any target region can be computed easily by using simple arithmetic operations.
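The scan-line propagation and the region query described in the abstract can be sketched as follows. This is a minimal 2D illustration that assumes the pixel values are already bin indices; the function names are hypothetical.

```python
def integral_histogram(image, num_bins):
    """Return H where H[r][c][b] counts bin b over image rows < r and columns < c.

    Assumption: each pixel value is already a bin index in [0, num_bins).
    """
    rows, cols = len(image), len(image[0])
    H = [[[0] * num_bins for _ in range(cols + 1)] for _ in range(rows + 1)]
    for r in range(1, rows + 1):          # scan-line traversal from the origin
        for c in range(1, cols + 1):
            cell = H[r][c]
            up, left, diag = H[r - 1][c], H[r][c - 1], H[r - 1][c - 1]
            for b in range(num_bins):
                cell[b] = up[b] + left[b] - diag[b]
            cell[image[r - 1][c - 1]] += 1  # only the current pixel's bin changes
    return H

def region_histogram(H, top, left, bottom, right):
    """Histogram of rows top..bottom-1 and columns left..right-1."""
    return [
        H[bottom][right][b] - H[top][right][b] - H[bottom][left][b] + H[top][left][b]
        for b in range(len(H[0][0]))
    ]

image = [
    [0, 1, 1, 2],
    [0, 0, 2, 2],
    [1, 1, 0, 2],
]
H = integral_histogram(image, num_bins=3)
whole = region_histogram(H, 0, 0, 3, 4)
inner = region_histogram(H, 1, 1, 3, 3)
```

Once `H` is built, every rectangular region costs four lookups per bin regardless of its size, which is what makes exhaustive search over target regions affordable.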
Pedestrian Detection in Crowded Scenes
B. Leibe, E. Seemann, and B. Schiele, CVPR 2005.
Abstract
In this paper, we address the problem of detecting pedestrians in crowded real-world scenes with severe overlaps. Our basic premise is that this problem is too difficult for any type of model or feature alone. Instead, we present a novel algorithm that integrates evidence in multiple iterations and from different sources. The core part of our method is the combination of local and global cues via a probabilistic top-down segmentation. Altogether, this approach allows to examine and compare object hypotheses with high precision down to the pixel level. Qualitative and quantitative results on a large data set confirm that our method is able to reliably detect pedestrians in crowded scenes, even when they overlap and partially occlude each other. In addition, the flexible nature of our approach allows it to operate on very small training sets.
Link, Author's Page, Paper.
Shape Context and Chamfer Matching in Cluttered Scenes
A. Thayananthan, B. Stenger, P. H. S. Torr, R. Cipolla, CVPR 2003.
Abstract
This paper compares two methods for object localization from contours: shape context and chamfer matching of templates. In the light of our experiments, we suggest improvements to the shape context: shape contexts are used to find corresponding features between model and image. In real images it is shown that the shape context is highly influenced by clutters; furthermore, even when the object is correctly localized, the feature correspondence may be poor. We show that the robustness of shape matching can be increased by including a figural continuity constraint. The combined shape and continuity cost is minimized using the Viterbi algorithm on features, resulting in improved localization and correspondence. Our algorithm can be generally applied to any feature based shape matching method. Chamfer matching correlates model templates with the distance transform of the edge image. This can be done efficiently using a coarse-to-fine search over the transformation parameters. The method is robust in clutter, however, multiple templates are needed to handle scale, rotation and shape variation. We compare both methods for locating hand shapes in cluttered images, and applied to word recognition in EZ-Gimpy images.
Link, Paper.
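To make the chamfer matching idea concrete, here is a minimal sketch that correlates a template with the distance transform of an edge image. It assumes a binary edge map, a city-block distance transform, and a brute-force search over translations only (the coarse-to-fine search over scale and rotation is not shown). All names are illustrative.

```python
def distance_transform(edges):
    """City-block distance from every pixel to the nearest edge pixel (two-pass)."""
    rows, cols = len(edges), len(edges[0])
    inf = rows + cols  # larger than any possible city-block distance in the image
    dist = [[0 if edges[r][c] else inf for c in range(cols)] for r in range(rows)]
    for r in range(rows):                      # forward pass: top-left to bottom-right
        for c in range(cols):
            if r > 0:
                dist[r][c] = min(dist[r][c], dist[r - 1][c] + 1)
            if c > 0:
                dist[r][c] = min(dist[r][c], dist[r][c - 1] + 1)
    for r in range(rows - 1, -1, -1):          # backward pass: bottom-right to top-left
        for c in range(cols - 1, -1, -1):
            if r < rows - 1:
                dist[r][c] = min(dist[r][c], dist[r + 1][c] + 1)
            if c < cols - 1:
                dist[r][c] = min(dist[r][c], dist[r][c + 1] + 1)
    return dist

def chamfer_score(dist, template, offset):
    """Mean distance-transform value under the template points placed at `offset`."""
    dr, dc = offset
    return sum(dist[r + dr][c + dc] for r, c in template) / len(template)

def best_offset(dist, template):
    """Exhaustively try every translation that keeps the template inside the image."""
    rows, cols = len(dist), len(dist[0])
    height = max(r for r, _ in template) + 1
    width = max(c for _, c in template) + 1
    candidates = [(dr, dc) for dr in range(rows - height + 1) for dc in range(cols - width + 1)]
    return min(candidates, key=lambda off: chamfer_score(dist, template, off))

# Toy edge image: an L-shaped contour plus one clutter pixel at (0, 0).
edges = [[0] * 6 for _ in range(6)]
for r, c in [(2, 3), (3, 3), (4, 3), (4, 4), (0, 0)]:
    edges[r][c] = 1
template = [(0, 0), (1, 0), (2, 0), (2, 1)]  # the same L shape in template coordinates
dist = distance_transform(edges)
offset = best_offset(dist, template)
```

A low score means the template points lie close to image edges; handling scale, rotation, and shape variation would require scoring multiple templates, as the abstract notes.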
Real-Time Object Detection for Smart Vehicles
D. Gavrila and V. Philomin, ICCV 1999.
Abstract
This paper presents an efficient shape-based object detection method based on Distance Transforms and describes its use for real-time vision on-board vehicles. The method uses a template hierarchy to capture the variety of object shapes; efficient hierarchies can be generated offline for given shape distributions using stochastic optimization techniques (i.e. simulated annealing). Online, matching involves a simultaneous coarse-to-fine approach over the shape hierarchy and over the transformation parameters. Very large speedup factors are typically obtained when comparing this approach with the equivalent brute-force formulation; we have measured gains of several orders of magnitudes. We present experimental results on the real-time detection of traffic signs and pedestrians from a moving vehicle. Because of the highly time sensitive nature of these vision tasks, we also discuss some hardware-specific implementations of the proposed method as far as SIMD parallelism is concerned.
Link, Paper.
Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs, CVPR 2005.
Abstract
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
Link, Paper.
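The core building block, a gradient orientation histogram for one cell, can be sketched as below. This is a simplified illustration only: it assumes centered differences, unsigned orientations in [0, 180) degrees, 9 bins, and hard (non-interpolated) voting, and it normalizes a single histogram rather than overlapping blocks. The function names and parameter values are assumptions, not the authors' settings.

```python
import math

def cell_orientation_histogram(cell, num_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations in one cell."""
    rows, cols = len(cell), len(cell[0])
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]   # centered difference along x
            gy = cell[r + 1][c] - cell[r - 1][c]   # centered difference along y
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            hist[int(angle // bin_width) % num_bins] += magnitude
    return hist

def l2_normalize(vec, eps=1e-6):
    """Contrast normalization of one descriptor vector (L2 norm with a small epsilon)."""
    norm = math.sqrt(sum(v * v for v in vec) + eps ** 2)
    return [v / norm for v in vec]

# Toy cells: an intensity step along x and an intensity step along y.
vertical_edge = [[0 if c < 4 else 1 for c in range(8)] for r in range(8)]
horizontal_edge = [[0 if r < 4 else 1 for c in range(8)] for r in range(8)]
hist_v = cell_orientation_histogram(vertical_edge)
hist_h = cell_orientation_histogram(horizontal_edge)
descriptor = l2_normalize(hist_v + hist_h)
```

In a full detector, many such cell histograms are grouped into overlapping blocks, normalized per block, concatenated, and passed to a linear SVM.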
On Space-Time Interest Points
Ivan Laptev, IJCV 2005.
Abstract
Local
image features or interest points provide compact and abstract
representations of patterns in an image. In this paper, we extend the
notion of spatial interest points into the spatio-temporal domain and
show how the resulting features often reflect interesting events that
can be used for a compact representation of video data as well as for
interpretation of spatio-temporal events.
To detect spatio-temporal events, we build on the idea of the Harris
and Förstner interest point operators and detect local structures
in space-time where the image values have significant local variations
in both space and time. We estimate the spatio-temporal extents of the
detected events by maximizing a normalized spatio-temporal Laplacian
operator over spatial and temporal scales. To represent the detected
events, we then compute local, spatio-temporal, scale-invariant N-jets
and classify each event with respect to its jet descriptor. For the
problem of human motion analysis, we illustrate how a video
representation in terms of local space-time features allows for
detection of walking people in scenes with occlusions and dynamic
cluttered backgrounds.
A Distance Measure and a Feature Likelihood Map Concept for Scale-Invariant Model Matching
Ivan Laptev and Tony Lindeberg, IJCV 2003.
Abstract
This paper presents two approaches for evaluating multi-scale feature-based object models. Within the first approach, a scale-invariant distance measure is proposed for comparing two image representations in terms of multi-scale features. Based on this measure, the maximisation of the likelihood of parameterised feature models allows for simultaneous model selection and parameter estimation.
Link, Paper.
Local Descriptors for Spatio-Temporal Recognition
Ivan Laptev and Tony Lindeberg, ECCV Workshop "Spatial Coherence for Visual Motion Analysis" 2004.
Abstract
This paper presents and investigates a set of local space-time descriptors for representing and recognizing motion patterns in video. Following the idea of local features in the spatial domain, we use the notion of space-time interest points and represent video data in terms of local space-time events. To describe such events, we define several types of image descriptors over local spatio-temporal neighborhoods and evaluate these descriptors in the context of recognizing human activities. In particular, we compare motion representations in terms of spatio-temporal jets, position dependent histograms, position independent histograms, and principal component analysis computed for either spatio-temporal gradients or optic flow. An experimental evaluation on a video database with human actions shows that high classification performance can be achieved, and that there is a clear advantage of using local position dependent histograms, consistent with previously reported findings regarding spatial recognition.
Link, Data, Paper.
Efficient Graph-Based Image Segmentation
Pedro F. Felzenszwalb and Daniel P. Huttenlocher, IJCV 2004.
Abstract
This paper addresses the problem of segmenting an image into regions. We define a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image. We then develop an efficient segmentation algorithm based on this predicate, and show that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties. We apply the algorithm to image segmentation using two different kinds of local neighborhoods in constructing the graph, and illustrate the results with both real and synthetic images. The algorithm runs in time nearly linear in the number of graph edges and is also fast in practice. An important characteristic of the method is its ability to preserve detail in low-variability image regions while ignoring detail in high-variability regions.
Link, Paper, Code.
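A minimal sketch of the greedy, predicate-based merging described in the abstract is given below. It assumes a 4-connected grid graph on a grayscale image, absolute intensity difference as the edge weight, and a scale parameter `k`; the Gaussian pre-smoothing and small-region cleanup found in typical implementations are omitted. Names are illustrative.

```python
def segment(image, k):
    """Greedy graph-based segmentation on a 4-connected grid; returns a label image."""
    rows, cols = len(image), len(image[0])
    n = rows * cols
    parent = list(range(n))
    size = [1] * n
    internal = [0.0] * n  # largest edge weight merged so far inside each component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                edges.append((abs(image[r][c] - image[r][c + 1]), r * cols + c, r * cols + c + 1))
            if r + 1 < rows:
                edges.append((abs(image[r][c] - image[r + 1][c]), r * cols + c, (r + 1) * cols + c))
    edges.sort()  # process edges in order of increasing weight

    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        # Merge only if the edge is no heavier than both components' internal
        # variation plus a size-dependent tolerance k / |C|.
        threshold = min(internal[ra] + k / size[ra], internal[rb] + k / size[rb])
        if w <= threshold:
            parent[rb] = ra
            size[ra] += size[rb]
            internal[ra] = w  # valid because weights arrive in nondecreasing order
    return [[find(r * cols + c) for c in range(cols)] for r in range(rows)]

# Toy image: a dark region on the left and a bright region on the right, with mild noise.
image = [
    [10, 12, 11, 200, 202],
    [11, 10, 12, 201, 200],
    [12, 11, 10, 199, 201],
]
labels = segment(image, k=20)
segment_ids = {label for row in labels for label in row}
```

Larger values of `k` favor larger components; sorting the edges dominates the running time, which is consistent with the near-linear complexity stated in the abstract.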
The purpose of the project is to label images according to their object content; for instance, an image containing a car should be labeled with the keyword "car". Visual features are extracted and clustered to form visual codewords, and the occurrence statistics of these codewords in the images discriminate between image classes. A query image is labeled with the keyword of a class if visual codewords of that class occur frequently in the image.
The main paper is: Csurka, Dance, Fan, Willamowski, Bray, Visual categorization with bags of keypoints, ECCV04
Main steps:
1) Construction of the codebook from the training image set:
a) SIFT interest points and descriptors will be used as visual features. We have code to detect SIFT interest points and extract the SIFT descriptors. The student's task is to become familiar with SIFT features in practice, such as sensibly selecting the parameters.
b) For each image in the training set: detect interest points and extract descriptors.
c) Cluster the descriptors using the k-means clustering algorithm. The cluster centers constitute the visual codewords.
d) Implement a routine that colors the detected keypoints on an image with the color of the nearest codeword.
2) Classify images using a Naive Bayes classifier: collect occurrence statistics of codewords from the training set for each class, and use these statistics to classify the query image.
3) If time permits, design and implement a process that bypasses keypoint detection and extracts SIFT descriptors from a dense grid over the image, then repeat the experiment with dense features.
Please talk to Ozge Can Ozcanli B&H 317 for further assistance
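The codebook and classification steps above can be sketched roughly as follows. The sketch assumes descriptors are already extracted (toy 2D tuples stand in for 128-dimensional SIFT descriptors), uses caller-supplied initial centers for k-means, and applies Laplace smoothing in the Naive Bayes model. All function names and data are hypothetical.

```python
import math

def nearest(centers, x):
    """Index of the codeword closest to descriptor x (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(centers[j], x)))

def kmeans(descriptors, init_centers, iterations=20):
    centers = [tuple(c) for c in init_centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for x in descriptors:
            groups[nearest(centers, x)].append(x)
        # Recompute each center as the mean of its group; keep empty clusters fixed.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers

def codeword_counts(centers, descriptors):
    counts = [0] * len(centers)
    for x in descriptors:
        counts[nearest(centers, x)] += 1
    return counts

def train_naive_bayes(centers, training_images):
    """training_images: list of (label, descriptors) pairs."""
    totals, image_counts = {}, {}
    for label, descriptors in training_images:
        running = totals.setdefault(label, [1] * len(centers))  # Laplace smoothing
        for j, n in enumerate(codeword_counts(centers, descriptors)):
            running[j] += n
        image_counts[label] = image_counts.get(label, 0) + 1
    n_images = sum(image_counts.values())
    return {
        label: (math.log(image_counts[label] / n_images),
                [math.log(n / sum(running)) for n in running])
        for label, running in totals.items()
    }

def classify(centers, model, descriptors):
    counts = codeword_counts(centers, descriptors)
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(n * ll for n, ll in zip(counts, log_like))
    return max(model, key=score)

training = [
    ("car", [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0)]),
    ("face", [(5.1, 5.0), (4.9, 5.2), (5.0, 4.8), (0.1, 0.0)]),
]
all_descriptors = [d for _, ds in training for d in ds]
codebook = kmeans(all_descriptors, init_centers=[(0.0, 0.0), (5.0, 5.0)])
model = train_naive_bayes(codebook, training)
label_a = classify(codebook, model, [(0.1, 0.1), (0.0, 0.2), (5.0, 5.1)])
label_b = classify(codebook, model, [(5.0, 5.0), (5.2, 5.1)])
```

For step 1d, the index returned by `nearest` could be mapped to a color when drawing each keypoint.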
10. Design and implement the spatial pyramid matching backbone
The idea
is to partition the image into rectangular regions as a hierarchy and
extract the codeword histograms from each partition. Codeword
construction is very similar to the previous project.
The main paper is: Lazebnik, Schmid, Ponce, CVPR06
Please talk to Ozge Can Ozcanli B&H 317 for further assistance
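One possible shape for the partitioning step is sketched below. It assumes keypoints have already been assigned codeword indices and simply concatenates the per-cell histograms of every pyramid level; the level-dependent weighting used in the paper's matching kernel is deliberately left out. The function names are illustrative.

```python
def spatial_pyramid(keypoints, width, height, num_codewords, levels):
    """Concatenate codeword histograms over a hierarchy of rectangular cells.

    keypoints: list of (x, y, codeword_index) tuples.
    Level l splits the image into 2**l by 2**l cells, for l = 0..levels.
    """
    feature = []
    for level in range(levels + 1):
        cells = 2 ** level
        hists = [[0] * num_codewords for _ in range(cells * cells)]
        for x, y, word in keypoints:
            cx = min(int(x * cells / width), cells - 1)   # clamp points on the border
            cy = min(int(y * cells / height), cells - 1)
            hists[cy * cells + cx][word] += 1
        for h in hists:
            feature.extend(h)
    return feature

def histogram_intersection(a, b):
    """Similarity between two pyramid features of equal length."""
    return sum(min(u, v) for u, v in zip(a, b))

keypoints = [(10, 10, 0), (90, 10, 1), (10, 90, 1), (90, 90, 0), (60, 60, 0)]
feature = spatial_pyramid(keypoints, width=100, height=100, num_codewords=2, levels=1)
```

Level 0 reproduces the plain bag-of-codewords histogram from the previous project, so the two projects can share the codebook code.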
11. Design and implement an algorithm to select maximally informative features from a training set.
Given a
set of feature vectors, a distance function to match the features, and a
set of labeled training feature vectors, design and implement the
backbone to select the most informative subset for discriminating between the
classes.
The main paper is Vidal-Naquet, Ullman, ICCV03
Please talk to Ozge Can Ozcanli B&H 317 for further assistance
The purpose of this project is to implement a system that takes an input video
sequence of a static scene and outputs the position of the camera at each frame,
estimates of the intrinsic parameters, and reconstructed 3D points. The student
will develop familiarity with recent methods of automatic reconstruction from
uncalibrated cameras, which are used in 3D photography, match-move applications
in the entertainment industry, and photogrammetry. We will provide C++ and
matlab code for many of the parts of this project, and the student will put them
together to form a more complete system.
The steps for implementing the system are:
1) Detect interest points in each image and match them using similarity of
appearance.
- Usually SIFT features are used [Lowe]. We have code for this; the student's
task is to become familiar with SIFT features in practice, such as
sensibly selecting the parameters.
2) Compute the epipolar geometry / fundamental matrix between pairs of views.
- This is performed using a robust estimation strategy, called RANSAC.
- Only reliable feature matches are retained.
- We also have code for this part. The task of the student is to develop a
working knowledge of fundamental matrix estimation and the geometry of
multiple views.
3) From two selected views, called "key frames" in this project, obtain two
canonical camera matrices as described in [Hartley and Zisserman]. These camera
matrices differ from the true ones by an 8-parameter ambiguity, but we can
nevertheless work with them for the purpose of enforcing consistency across the
whole video sequence.
- This stage requires only conceptual understanding, and can be performed in
one line of C++ code.
4) "Projective Reconstruction" - Starting from the two key frames,
incrementally add another frame to
the key frame set. Impose consistency of the new frame with the previous key
frame set. This is performed using camera pose estimation (also called resectioning) from
3D points to 2D points, treating the canonical cameras as the true ones. A robust
RANSAC strategy guides this process in order to eliminate false feature
correspondences. Keep adding frames and imposing consistency until no new frame
can be added.
- We have code for the basic components, but the student will have to put it
together in order to obtain a camera resectioning routine that works with
RANSAC. This can be done with the aid of students in our lab.
5) The camera parameters (rotations, translations, and intrinsic parameters) and
the 3D reconstruction of matching feature points are now known up to 8 degrees of
freedom. The hardest part of the project is now done. The student must evaluate
this reconstruction before proceeding. One way of doing so is to obtain the
intrinsic parameters of the camera using a traditional calibration board
and then, using these parameters together with the projective reconstruction
obtained in step 4, generate a final metric reconstruction. By displaying
the positions of the cameras, the student can verify that the result looks plausible.
Optional Step (extra grade)
6) Even without knowing the intrinsic parameters of the cameras, the student can
still generate a final reconstruction from 4), bypassing any manual
calibration procedure as done in 5). This is performed by solving so-called
auto-calibration equations. We have no code for this part, but the student can
ask for help in our lab.
7) Dense stereo and Texture-mapping. This will provide a visually pleasing
reconstruction of an object captured by the video sequence.
Note that the system as a whole is quite complicated, but we have code for many
of the necessary modules, and help is available from people in our
lab whenever necessary.
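Steps 2 and 4 both rely on RANSAC. The following minimal sketch shows the hypothesize-and-verify loop on a deliberately simple stand-in problem, fitting a 2D line to points with outliers. In the project the model would instead be a fundamental matrix or a camera matrix and the residual a geometric error; the function names, threshold, and data here are illustrative assumptions.

```python
import random

def fit_line(p, q):
    """Line a*x + b*y + c = 0 through two points, with (a, b) of unit length."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    return a / norm, b / norm, -(a * x1 + b * y1) / norm

def ransac_line(points, threshold, iterations, rng):
    best_model, best_inliers = None, []
    for _ in range(iterations):
        p, q = rng.sample(points, 2)          # minimal sample for a line
        if p == q:
            continue                          # degenerate sample
        a, b, c = fit_line(p, q)
        # Consensus set: points whose distance to the line is below the threshold.
        inliers = [pt for pt in points if abs(a * pt[0] + b * pt[1] + c) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b, c), inliers
    return best_model, best_inliers

# Ten points on y = 2x + 1 plus four gross outliers (false "matches").
inlier_points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
outlier_points = [(1.0, 9.0), (3.0, -4.0), (7.0, 2.0), (8.0, 30.0)]
model, inliers = ransac_line(inlier_points + outlier_points,
                             threshold=0.1, iterations=200, rng=random.Random(0))
```

Only the consensus set is kept for later stages, which mirrors the statement in step 2 that only reliable feature matches are retained.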
References
==========
[1] Marc Pollefeys et al. "Visual Modeling with a Handheld Camera",
International Journal of Computer Vision, 2004. See also associated PhD
thesis, Marc Pollefeys, K. U. Leuven 1999. Advisor: L. Van Gool:
http://cortex.lems.brown.edu/~rfabbri/stuff/PhD-Pollefeys.pdf
[2] Hartley and Zisserman's book, Cambridge University Press.
[3] Iryna Gordon's MSc thesis, U. of British Columbia. Advisor: D. Lowe.
Please talk to Ricardo Fabbri B&H 317 for further assistance.
13. Texture Classification
Locally Invariant Fractal Features for Statistical Texture Classification
ICCV 2007.
Please talk to Amir Tamrakar B&H 317 for details.
- study the model and the algorithm proposed by Felzenszwalb and Huttenlocher (there is no free lunch: the authors achieve linear time by putting certain restrictions on the model).
- implement the algorithm to detect a
generic object class, e.g. a pedestrian or a car.
References:
1. Pedro Felzenszwalb and Huttenlocher,
CVPR 2000, “Efficient matching of pictorial structures”
2. Pedro Felzenszwalb and Huttenlocher, IJCV 2005, "Pictorial structures for object recognition".
Please talk to Nhon Trinh B&H 317 for details.
15. Segmentation of knee cartilage in MRI images using pixel classification
Accurate segmentation of the articular
cartilage from knee MRI images is an essential step in diagnosing
osteoarthritis. One approach to segmenting the cartilage is to observe that all
voxels in the cartilage are made of the same type of tissue and therefore
should possess similar local properties that other voxels do not. In this
project, students will:
- investigate which local properties
separate cartilage voxels from their neighborhood and build a
cartilage/non-cartilage classifier based on these properties.
- develop an algorithm to group the
potential cartilage voxels together to yield an accurate segmentation.
References:
[1] Folkesson et al, TMI 2007, “Segmenting
articular cartilage automatically using a voxel classification approach”
Available data: MRI images and manual segmentation of 5 cadavers and 1 living person.
Please talk to Nhon Trinh B&H 317 for details.
16. Improved watershed transform using prior information and its application to segment knee cartilage in MRI images.
The watershed
transform is a popular and intuitive segmentation method coming from the field
of mathematical morphology: if we consider the image as a topographic relief,
where the height of each point is directly related to its gray value, and
consider rain gradually falling on the terrain, then the watersheds are the
lines separating the "lakes", or catchment basins. Each of the "lakes" is a
segmentation region. The traditional watershed suffers from drawbacks such as
oversegmentation, sensitivity to noise, and poor detection of thin structures.
The referenced paper improves the traditional watershed method by incorporating prior
information such as absolute or relative intensity of the objects into the
process. The method has been shown to work well for segmenting knee cartilage
and brain images.
References:
[1] Grau et al., TMI 2004, "Improved
Watershed Transform for Medical Image Segmentation Using Prior Information"
Available data: MRI images and manual segmentation of 5 cadavers and 1 living person.
Please talk to Nhon Trinh B&H 317 for details.
17. Using active shape models for bifurcating contours to segment tibia and femur from X-ray images of human knees
Segmentation of the tibia and the femur from x-ray images of human knees can be used to determine the joint space width (JSW) of a knee, an important standard in the assessment of osteoarthritis. Direct application of the traditional active shape model (ASM) often fails because the traditional ASM relies on explicit inter-image correspondence being established between landmark points, while the outline of the tibia in x-ray images often contains loops, and the positions of bifurcation points, often used as landmarks, vary in a complex way relative to the object. The referenced paper proposes an active shape model for bifurcating contours. It describes a method to learn the model and applies it to segment x-ray images of human knees.
References:
[1] Seise et al., TMI 2007, "Learning
active shape models for bifurcating contours"
[2] Seise et al., BMVC 2005, "Double Contour Active Shape Models"
Available data: MRI and x-ray images of human knees (OAI dataset).
Please talk to Nhon Trinh B&H 317 for details.
18. Shape Regularized Active Contour using Iterative Global Search and Local Optimization
Paper abstract:
Recently, nonlinear shape models have been shown to improve the robustness and flexibility of segmentation. In this paper, we propose shape regularized active contour (ShRAC) that incorporates existing nonlinear shape models into the classical active contour approach. ShRAC uses a discrete representation of the contour to allow efficient combinatorial search. The search for optimal contour is performed by coarse-to-fine algorithm that iterates between combinatorial search and gradient-based local optimization. First, multi-solution dynamic programming (MSDP) is used to generate initial candidates by minimizing only the image energy. In the second step, a combination of image energy and shape energy determined by a given prior shape model is minimized for the initial candidates using a local optimization method and the best one is selected. To have diverse initial candidates, we employ a clustered solution pruning procedure in the MSDP search space. Finally, local shape regularization is used to feed shape constraints back into the new MSDP search space of the next iteration. Our search strategy combines the advantages of global combinatorial search and local optimization, and has shown excellent robustness to local minima caused by distracting suboptimal segmentations. Experimental results on segmentation of different anatomical structures using ShRAC are provided.
References:
[1] Yu et al., CVPR 2005, "Shape
regularized active contour using iterative global search and local optimization"
Please talk to Nhon Trinh B&H 317 for details.
19. 2D-3D Registration of Spine X-ray projection (2D) to CT image (3D)
Paper: Image Similarity Using Mutual Information of Regions
Abstract: Mutual information (MI) has emerged in recent years as an effective similarity measure for comparing images. One drawback of MI, however, is that it is calculated on a pixel by pixel basis, meaning that it takes into account only the relationships between corresponding individual pixels and not those of each pixel's respective neighborhood. As a result, much of the spatial information inherent in images is not utilized. In this paper, we propose a novel extension to MI called regional mutual information (RMI). This extension efficiently takes neighborhood regions of corresponding pixels into account. We demonstrate the usefulness of RMI by applying it to a real-world problem in the medical domain: intensity-based 2D-3D registration of X-ray projection images (2D) to a CT image (3D). Using a gold-standard spine image data set, we show that RMI is a more robust similarity measure for image registration than MI.
References:
[1] Russakoff et al., ECCV 2004, "Image
Similarity Using Mutual Information of Regions"
Please talk to Nhon Trinh B&H 317 for details.
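As background for project 19, the pixel-by-pixel MI baseline that the abstract sets out to improve can be sketched as follows. This is plain MI computed from a joint histogram, not the regional extension (RMI), and the registration is reduced to a brute-force search over cyclic horizontal shifts of a tiny 2D image. The function names and toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def mutual_information(image_a, image_b):
    """MI in bits, from the joint histogram of corresponding pixel values."""
    pairs = [(a, b) for row_a, row_b in zip(image_a, image_b) for a, b in zip(row_a, row_b)]
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written with raw counts.
    return sum((k / n) * math.log2(k * n / (pa[a] * pb[b])) for (a, b), k in joint.items())

def shift_columns(image, shift):
    """Cyclically shift every row to the right by `shift` pixels."""
    return [row[-shift:] + row[:-shift] if shift else list(row) for row in image]

fixed = [
    [0, 0, 1, 2, 2, 1],
    [0, 1, 1, 2, 1, 0],
    [0, 0, 2, 2, 1, 1],
]
# The "moving" image shows the same structure with different intensities
# (as in a different modality) and is displaced by two columns.
remap = {0: 7, 1: 3, 2: 5}
moving = shift_columns([[remap[v] for v in row] for row in fixed], 2)

# Registration by exhaustive search: pick the shift that maximizes MI.
best_shift = max(range(6), key=lambda s: mutual_information(fixed, shift_columns(moving, s)))
```

Because MI depends only on the statistical relationship between intensities, the search recovers the alignment even though the two images use different gray values; a regional variant would build the joint statistics from pixel neighborhoods rather than single pixels.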