EN195 Independent Study Project

A Curve Matching Approach to Gesture Recognition

Fall 2001

Nick Diakopoulos

 

1 Project Goal

The goal of the project was to build a Dynamic Pose Static Location (DPSL) [2] hand gesture recognition prototype based on hand silhouette outlines and a curve matching algorithm developed by Kimia and Sebastian [4].  In other words, we want to be able to determine both the gesture identity and pose based on an input image taken from an arbitrary viewpoint with respect to the hand.

 

2 Introduction

The power of gesture-based computer interfaces has been recognized for some time, going back at least as far as Ivan Sutherland’s 1963 PhD thesis, "Sketchpad: A Man-Machine Graphical Communication System."  The push toward better, more natural interaction mechanisms between people and computers is greater now than it has ever been.  Additionally, camera-based approaches to gesture recognition [3] leave the user's motion unconstrained and thus offer an even more natural mode of interaction.  Gesture recognition systems offer possibilities not only for general interaction paradigms but also for interfaces that let handicapped people communicate with computers and with other people.

 

3 Approach / Methods

Following an aspect-graph-based approach to shape recognition [1], the first step was to acquire data (i.e., images) of each gesture around the equatorial axis at 5-degree intervals, giving 72 views per gesture.  We were confronted with several options concerning the model acquisition.  One option would have been to use an ordinary still camera and tripod to capture an actual person’s gesture from each of the 72 needed views.  The downside of this approach is that it is very time consuming and prone to inaccuracies.  We instead explored the possibility of using a 3D polygonal model and a synthetic camera to generate the 72 views quickly and easily.  Finding a 3D model that was both sufficiently detailed and able to be articulated proved difficult, however.  We therefore decided to use a 3D scanning system to generate 3D polygonal models from a real-world articulatable rubber hand model.  Using this procedure we were able to generate the 5 models seen in Figure 1.

 

Figure 1. Renderings of the 5 models used to generate the gesture database.  From left to right, top to bottom, the gestures represent letters I, R, U, W, and V in the American Sign Language (ASL) finger alphabet.

 

Once the hand gestures were scanned and existed as a 3D polygonal representation, the generation of the 72 views was greatly simplified.  Each of the views was rendered in a flat tone so that the ensuing curve extraction algorithm was also simplified (Figure 2).  The silhouette outlines were defined at the boundaries of the flat rendering and the curves were extracted and stored as a list of ordered (x,y) coordinates.
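The view sampling described above can be sketched as follows.  This is only an illustration: the camera convention (model at the origin, y axis up, camera on a horizontal circle) and the names view_angles and camera_position are assumptions made for the example, not details of the renderer actually used.

```python
import math

GESTURES = ["I", "R", "U", "W", "V"]  # the five ASL letters of Figure 1
STEP_DEGREES = 5                      # angular spacing between views


def view_angles(step=STEP_DEGREES):
    """Equatorial viewing angles in degrees: 0, 5, ..., 355."""
    return list(range(0, 360, step))


def camera_position(angle_degrees, radius=1.0):
    """Synthetic camera location on a horizontal circle around the model.

    Assumed convention: the model sits at the origin and y points up.
    """
    theta = math.radians(angle_degrees)
    return (radius * math.cos(theta), 0.0, radius * math.sin(theta))


# One database entry per (gesture, angle) pair.
database_keys = [(g, a) for g in GESTURES for a in view_angles()]

print(len(view_angles()), "views per gesture")
print(len(database_keys), "database images in total")
```

With a 5-degree step this gives 72 views per gesture and 72 * 5 = 360 database entries.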

 

Figure 2.  A 3D gesture model is rendered in a flat tone (Top). The silhouette curve is then extracted (Bottom).  This process is repeated for all 72 views, three of which are shown.

 

The test gestures were collected in a more traditional manner by taking several digital pictures at varying angles of a person wearing a dark glove making each hand gesture.  A threshold was applied to these images and the silhouette curve was extracted in much the same way as for the database images.
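A minimal sketch of the threshold and outline extraction steps is given below.  It assumes a grayscale image stored as a list of rows, a glove darker than the background, and a single connected silhouette; the tracing step is a basic Moore-neighbour boundary follower chosen for illustration and is not necessarily the extraction procedure used in this project.

```python
# 8 neighbours in clockwise order (image coordinates, y grows downward),
# starting from west: W, NW, N, NE, E, SE, S, SW.
DIRS = [(-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1)]


def threshold(gray, level):
    """Mark pixels darker than `level` as foreground (1), others as 0."""
    return [[1 if value < level else 0 for value in row] for row in gray]


def trace_outline(mask):
    """Return the outer boundary of the silhouette as ordered (x, y) points.

    Simple stopping rule: tracing ends when the start pixel is reached
    again, which is adequate for simple blobs but not for every shape.
    """
    height, width = len(mask), len(mask[0])

    def is_fg(x, y):
        return 0 <= x < width and 0 <= y < height and mask[y][x] == 1

    start = next(((x, y) for y in range(height) for x in range(width)
                  if mask[y][x] == 1), None)
    if start is None:
        return []

    outline = [start]
    current, scan_from = start, 0  # west of the start pixel is background
    for _ in range(4 * width * height):  # safety bound on the walk length
        for i in range(8):
            d = (scan_from + i) % 8
            nxt = (current[0] + DIRS[d][0], current[1] + DIRS[d][1])
            if is_fg(*nxt):
                break
        else:
            break  # isolated pixel: no foreground neighbour
        if nxt == start:
            break
        outline.append(nxt)
        current = nxt
        # Resume scanning from the last background pixel that was examined.
        scan_from = (d + 6) % 8 if d % 2 == 0 else (d + 5) % 8
    return outline


# A 5x5 test image: dark 3x3 square (value 10) on a light background (200).
image = [[10 if 1 <= x <= 3 and 1 <= y <= 3 else 200 for x in range(5)]
         for y in range(5)]
curve = trace_outline(threshold(image, level=128))
print(curve)
```

On the small test image the function visits the eight border pixels of the square in clockwise order and skips the interior pixel.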

 

Figure 3.  Test data was collected using a digital camera and the silhouette curves were then extracted.

 

The shape-similarity metric used for matching one gesture to another in the first prototype system was a curve matching algorithm based on a Euclidean distance cost applied to the outermost silhouette curves.  Other shape-similarity metrics such as those based on shock-graph matching [5] may also be appropriate.
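To make the idea of a Euclidean-distance matching cost concrete, the sketch below aligns two point lists with a generic dynamic-programming (dynamic time warping) recurrence and ranks database curves by the resulting cost.  This is a stand-in for illustration only: it does not reproduce the algorithm of [4], and the function names alignment_cost and top_matches are hypothetical.

```python
import math


def alignment_cost(curve_a, curve_b):
    """Dynamic-programming alignment cost between two ordered point lists.

    Each aligned pair of points contributes its Euclidean distance.
    The table has len(curve_a) * len(curve_b) cells.
    """
    n, m = len(curve_a), len(curve_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(curve_a[i - 1], curve_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance along curve_a
                                 cost[i][j - 1],      # advance along curve_b
                                 cost[i - 1][j - 1])  # advance along both
    return cost[n][m]


def top_matches(query, database, k=3):
    """Return the k database entries with the lowest cost as (label, cost)."""
    scored = [(label, alignment_cost(query, curve))
              for label, curve in database.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:k]


# Toy example with three-point "curves"; the labels are made up.
query = [(0, 0), (1, 0), (2, 0)]
database = {
    "near": [(0, 1), (1, 1), (2, 1)],
    "far": [(0, 5), (1, 5), (2, 5)],
}
print(top_matches(query, database, k=2))
```

Because the table has one cell per pair of points, the work grows with the product of the two curve lengths, and a query must be compared against every curve in the database.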

 

4 Initial Results

To obtain some idea of the effectiveness of curve matching as the similarity metric in hand gesture recognition, an initial experiment was carried out.  The equatorial images of the five gestures shown in Figure 1 resulted in a gesture image database of 72 * 5 = 360 images.  Nineteen digital pictures of someone making the actual gestures (see Figure 3) were then collected from different viewing angles and matched against the database images using the curve matching algorithm and a Euclidean distance cost function.  Curve resolution was on the order of 150-200 points, and each input gesture took approximately 10 minutes on an Octane2 to compare against the entire database.  Results are shown in Table 1.

 

Gesture Input   1st Match   Cost   2nd Match   Cost   3rd Match   Cost
I ~0°           I 310°      2.67   I 320°      2.76   I 330°      2.91
I ~90°          U 340°      2.09   U 325°      2.44   I 280°      2.57
I ~180°         I 195°      2.94   I 185°      2.98   I 220°      3.31
I ~270°         V 255°      2.35   U 295°      2.47   V 250°      2.49
R ~80°          U 110°      2.08   U 100°      2.10   U 40°       2.11
R ~90°          R 30°       2.14   U 15°       2.20   R 10°       2.23
R ~180°         R 230°      1.50   R 225°      1.63   U 240°      1.63
R ~270°         R 270°      1.85   R 290°      1.89   R 295°      1.94
U ~90°          W 115°      3.04   U 75°       3.25   U 250°      3.25
U ~180°         U 195°      1.48   U 175°      1.52   U 210°      1.54
U ~270°         U 240°      2.91   U 155°      2.91   U 160°      2.99
V ~0°           V 355°      2.81   V 335°      3.14   V 340°      3.26
V ~90°          U 120°      2.96   U 115°      3.08   U 15°       3.26
V ~180°         V 195°      2.25   V 180°      2.37   V 165°      2.62
V ~240°         V 150°      2.65   V 155°      2.85   V 160°      2.99
W ~0°           V 30°       3.22   V 40°       3.41   V 35°       3.61
W ~90°          R 25°       3.34   V 105°      3.44   U 15°       3.57
W ~180°         V 210°      2.89   V 215°      2.95   W 185°      2.98
W ~270°         U 185°      3.17   U 165°      3.49   W 140°      3.94

Table 1.  Results of a matching experiment showing the input gesture and its approximate rotation with regard to the camera and the top 3 matches with their rotations and match scores. Input gestures that were correctly identified within the top 3 matches are shown in red whereas unsuccessful matches are shown in blue.

 

5 Discussion of Results

If we define a match between gestures to mean that the correct identity of the gesture is found within the top three matches, the accuracy of the algorithm is 14/19 ≈ 74%.  However, a more realistic measure of matching effectiveness is whether the correct gesture was identified as the top match, in which case the accuracy drops to 10/19 ≈ 53%.  An even higher goal would be not only to identify a gesture correctly but also to identify its pose.  If we define matching success as identifying the correct gesture as the top match and the correct pose to within ±30°, then the accuracy is 6/19 ≈ 32%.  There are several possible explanations for these rather low matching results.
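The three accuracy figures can be recomputed directly from the rows of Table 1.  The sketch below transcribes the table (gesture, approximate input angle, and the three matched views) and applies the three success criteria; the pose test uses the circular difference between angles, so that 355° and 0° are 5° apart.  The data layout and function names are choices made for this example.

```python
# Table 1 as (input gesture, approx. input angle, [(match gesture, angle) x 3]).
RESULTS = [
    ("I", 0,   [("I", 310), ("I", 320), ("I", 330)]),
    ("I", 90,  [("U", 340), ("U", 325), ("I", 280)]),
    ("I", 180, [("I", 195), ("I", 185), ("I", 220)]),
    ("I", 270, [("V", 255), ("U", 295), ("V", 250)]),
    ("R", 80,  [("U", 110), ("U", 100), ("U", 40)]),
    ("R", 90,  [("R", 30),  ("U", 15),  ("R", 10)]),
    ("R", 180, [("R", 230), ("R", 225), ("U", 240)]),
    ("R", 270, [("R", 270), ("R", 290), ("R", 295)]),
    ("U", 90,  [("W", 115), ("U", 75),  ("U", 250)]),
    ("U", 180, [("U", 195), ("U", 175), ("U", 210)]),
    ("U", 270, [("U", 240), ("U", 155), ("U", 160)]),
    ("V", 0,   [("V", 355), ("V", 335), ("V", 340)]),
    ("V", 90,  [("U", 120), ("U", 115), ("U", 15)]),
    ("V", 180, [("V", 195), ("V", 180), ("V", 165)]),
    ("V", 240, [("V", 150), ("V", 155), ("V", 160)]),
    ("W", 0,   [("V", 30),  ("V", 40),  ("V", 35)]),
    ("W", 90,  [("R", 25),  ("V", 105), ("U", 15)]),
    ("W", 180, [("V", 210), ("V", 215), ("W", 185)]),
    ("W", 270, [("U", 185), ("U", 165), ("W", 140)]),
]


def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def count_successes(results, pose_tolerance=30):
    """Count successes under the three criteria discussed in the text."""
    top3 = sum(any(g == gesture for g, _ in matches)
               for gesture, _, matches in results)
    top1 = sum(matches[0][0] == gesture for gesture, _, matches in results)
    top1_pose = sum(
        matches[0][0] == gesture
        and angular_difference(angle, matches[0][1]) <= pose_tolerance
        for gesture, angle, matches in results
    )
    return top3, top1, top1_pose


top3, top1, top1_pose = count_successes(RESULTS)
total = len(RESULTS)
for label, hits in [("identity in top 3", top3),
                    ("identity as top match", top1),
                    ("top match and pose within 30 degrees", top1_pose)]:
    print(f"{label}: {hits}/{total} = {hits / total:.0%}")
```

Run on the transcribed table, the counts are 14, 10, and 6 out of 19, matching the figures quoted above.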

The resolution of the curves used for matching was on the order of 150-200 points.  The primary reason for using this range is the inherent time complexity of curve matching.  However, the low resolution of the curves may have adversely affected the accuracy of the matching algorithm.  This could have been the case for gestures that are broadly similar but differ in fine details, such as R and U, where the only difference is a crossing of the index and middle fingers.  Higher resolution curves would better capture this fine detail and perhaps lead to a better match, but at the expense of matching speed.
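One way to control curve resolution is to resample each outline to a fixed number of points spaced evenly along its length.  The sketch below shows this for an open polyline; how the curves in this experiment were actually reduced to 150-200 points is not stated, so the method and the name resample are assumptions made for illustration.

```python
import math


def resample(points, n):
    """Resample an ordered polyline to n points evenly spaced by arc length.

    The input is treated as an open polyline; for a closed outline the
    first point can be appended to the end before calling this function.
    """
    if n < 2 or len(points) < 2:
        raise ValueError("need at least two input points and n >= 2")

    # Cumulative arc length at each input point.
    cumulative = [0.0]
    for p, q in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.dist(p, q))
    total = cumulative[-1]

    resampled = []
    seg = 0  # index of the segment containing the current target length
    for k in range(n):
        target = total * k / (n - 1)
        while seg < len(points) - 2 and cumulative[seg + 1] < target:
            seg += 1
        span = cumulative[seg + 1] - cumulative[seg]
        t = 0.0 if span == 0 else (target - cumulative[seg]) / span
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return resampled


# An L-shaped polyline of total length 8, resampled to 5 points.
corner = [(0, 0), (4, 0), (4, 4)]
print(resample(corner, 5))
```

Choosing a larger n preserves finer detail, such as the crossing of two fingers, at the price of a larger matching problem for every database comparison.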

A major factor contributing to matching success is the accuracy of the model used to generate the database images.  The articulatable rubber model produced fairly good silhouette curves; however, there are still subtle differences between the way it deforms and the way a real human hand deforms.  For instance, compare the gesture R 90° generated from the model with the R ~90° image of a human hand.

 

The image on the right, taken from the actual human gesture, differs from the image on the left, generated from the model, in that the middle finger produces a larger bulge where it crosses the index finger.  The way the model behaves is therefore a key factor in determining how closely the real data will fit it.  To achieve better performance, the database images could instead be collected from many people making the gestures.  The curves could then be averaged together to form a typical gesture.  In this way the actual behavior of the hand would be better captured by the model.  Alternatively, a 3D model of a hand that deforms in a physically and physiologically realistic way could be used to generate the database images.

Examining the cases in Table 1 for which good and bad matching results were obtained gives some insight into why the curve matching metric produces the given results.  By inspection, the majority of the good results, in which the identity of the gesture was successfully recognized, occurred for views around 0° and 180°.  This is because at these angles the fingers generate more distinctive silhouette curves than they do at 90° and 270°.  For example, the gestures U and V look very similar from a 90° view,

 

but are rather well differentiated from a view at 0°.

 

The self-occlusion of the fingers makes matching much more difficult for the 90° and 270° views.  The subtle differences at these viewing angles might only be apparent in a silhouette image at much higher curve resolutions than were used in this experiment.  In going from the gesture images to the silhouette curves, occlusion causes a definite loss of information that cannot be recovered.

 

6 Conclusions

Applying curve matching to the problem of recognizing hand gestures based on silhouette curves has been shown to be only moderately effective.  Technical factors contributing to the accuracy of the matching range from the resolution of the outline curves to the realism of the model used to generate the database curves.  Other shape matching metrics such as shock graph matching [5] may be more appropriate.

For an object as complexly articulated as the hand, subtle variations in the way the fingers are used and the way in which the hand itself deforms make recognition in the dimensionally reduced domain of images difficult.  In going from the binocular vision of humans, to monocular camera vision, to just looking at silhouette curves, a lot of information is lost.  A more robust recognition system would look not only at shape but also at other perceptually important information such as shading and depth, and perhaps even at higher-level visual understanding such as a priori knowledge of a hand model.


References

[1]    C. Cyr and B. Kimia. 3D Object recognition using shape similarity-based aspect graph. IEEE International Conference on Computer Vision (ICCV), 2001.

[2]    A. Edwards. Progress in Sign Language Recognition. Gesture and sign language in human-computer interaction, Proceedings International gesture workshop, 1997.

[3]    J. Lee and T. Kunii. Model-based analysis of hand posture. IEEE Computer Graphics and Applications, pp 77-86, Sept. 1995.

[4]    T. Sebastian, P. Klein, and B. Kimia. Alignment-based recognition of shape outlines. Proc. 4th International Workshop on Visual Form (IWVF4), pp 606-618, May 2001.

[5]    T. Sebastian, P. Klein, and B. Kimia. Recognition of shapes by editing shock graphs. IEEE International Conference on Computer Vision (ICCV), pp 755-762, 2001.