Marc Johannes
CS295-3
Machine Vision and Learning
November 6, 2000
"Expression Recognition"
 
 

Part I: Fitting the Data

The first order of business is to develop an algorithm that, given an N-dimensional space generated from data, estimates a mixture of N-dimensional Gaussians that best fits the space spanned by the data.  To reduce the dimensionality of this space, Principal Component Analysis (PCA) was run on the data using a standard singular value decomposition (SVD) technique.  Taking 90% of the variance to be a sufficient representation of the data, "lip space" was reduced to 7 dimensions by keeping the seven most significant eigenvectors from the SVD.  The problem of fitting a mixture of Gaussians to this space was solved with an EM algorithm, EMND.m, that iteratively improves the parameters of the Gaussians in repeated "Expectation" and "Maximization" steps.  Figures 2-4 illustrate results from the EM algorithm.

Note that this algorithm requires a starting point, so some initial parameter estimate is necessary.  A likely candidate for this initial estimate is a k-means clustering algorithm, although a more naive method was used to generate the results shown here.  The initial estimate of the center of each Gaussian was taken as an interpolated value along the line from the minimum to the maximum data vector in each dimension.  Figure 1 depicts these estimated centers for the first two dimensions of the data.  The first plot shows the interpolation from the actual minimum to the actual maximum, and the second shows the minimum and maximum squeezed in towards the mean and then interpolated.  Different fits do arise from different initial estimates.

The initial estimate for each covariance is simply the covariance of the data set, and the priors are all initialized at 1/N, where N here is the number of Gaussians.  For consistency, all results were derived from an initial center estimate with a squeeze value of .5 (i.e., the minimum and maximum are taken to be halfway between the actual minimum and maximum and the mean).
Note also that all estimates include the mean, even though the linear interpolation from min to max might not pass through it.  You can see this clearly in the first figure, where the true mean "sags" off the line.
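
The initial center estimate described above can be sketched as follows.  This is a minimal Python illustration, not the author's MATLAB code: the function name `initial_centers` and its parameters are hypothetical, and the way the mean is included (here, by replacing the middle center) is an assumption, since the report does not say how it was done.

```python
from statistics import fmean

def initial_centers(data, k, squeeze=0.5):
    """Return k initial centers for `data`, a list of equal-length vectors."""
    dims = range(len(data[0]))
    lo = [min(x[d] for x in data) for d in dims]
    hi = [max(x[d] for x in data) for d in dims]
    mean = [fmean(x[d] for x in data) for d in dims]
    # Pull both endpoints toward the mean by the squeeze fraction
    # (0 keeps the actual min and max, 1 collapses both onto the mean).
    lo = [l + squeeze * (m - l) for l, m in zip(lo, mean)]
    hi = [h - squeeze * (h - m) for h, m in zip(hi, mean)]
    # Evenly spaced points on the segment from lo to hi.
    centers = []
    for i in range(k):
        t = i / (k - 1) if k > 1 else 0.5
        centers.append([l + t * (h - l) for l, h in zip(lo, hi)])
    # Assumption: the middle center is replaced by the data mean so that
    # the mean is always one of the estimates, even when it lies off the line.
    centers[k // 2] = mean
    return centers
```

With a squeeze of 0 this reproduces the plain min-to-max interpolation; with .5 the endpoints sit halfway between the extremes and the mean.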
 
 


Figure 1




The dimensionality of the data can be arbitrarily large, and the covariance matrices of this space can become unwieldy, near singular, near zero, or blown up.  Each of these possibilities poses serious difficulties for the algorithm.  To cope with this, the covariance matrix can be assumed diagonal in situations where the full covariance matrix causes difficulties.  Figure 2 depicts a set of fits using the full covariance matrix, shown for different pairs of dimensions (titled Dn on each figure) and for varying numbers of Gaussians (10, 7, 3).

10 Gaussians


 

7 Gaussians

 

3 Gaussians

Figure 2




Figure 3 illustrates the same information as Figure 2, but uses the diagonal covariance assumption instead.  Note that all ellipses are aligned with the axes Dn1, Dn2.

10 Gaussians


 

7 Gaussians

 

3 Gaussians

Figure 3
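
For reference, one EM iteration under the diagonal covariance assumption can be sketched as below.  This is a minimal Python illustration and not the contents of the author's EM routine; the names `diag_gauss_pdf`, `em_step`, and `var_floor` are hypothetical, and the variance floor and the uniform fallback for underflow are assumptions added to keep the sketch numerically safe.

```python
import math

def diag_gauss_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance (per-dimension variances)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def em_step(data, priors, means, variances, var_floor=1e-6):
    """Run one EM iteration and return updated (priors, means, variances)."""
    k, n, dims = len(priors), len(data), len(data[0])
    # Expectation: responsibility of each Gaussian for each data vector.
    resp = []
    for x in data:
        w = [priors[j] * diag_gauss_pdf(x, means[j], variances[j])
             for j in range(k)]
        total = sum(w)
        # Assumption: fall back to uniform weights if every density underflows.
        resp.append([wj / total for wj in w] if total > 0.0 else [1.0 / k] * k)
    # Maximization: re-estimate each Gaussian from its weighted data.
    new_priors, new_means, new_vars = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        if nj < 1e-12:
            # This Gaussian owns no data; keep its shape and drop its prior.
            new_priors.append(0.0)
            new_means.append(list(means[j]))
            new_vars.append(list(variances[j]))
            continue
        mu = [sum(r[j] * x[d] for r, x in zip(resp, data)) / nj
              for d in range(dims)]
        var = [max(var_floor,
                   sum(r[j] * (x[d] - mu[d]) ** 2
                       for r, x in zip(resp, data)) / nj)
               for d in range(dims)]
        new_priors.append(nj / n)
        new_means.append(mu)
        new_vars.append(var)
    return new_priors, new_means, new_vars
```

Because only per-dimension variances are estimated, every fitted ellipse is axis-aligned, which matches what Figure 3 shows.  Calling `em_step` repeatedly, starting from the initial estimates, plays the role of the iterative fit.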





I found that my results were more robust using the diagonal assumption and, unless otherwise stated, all results were calculated using a fit under this assumption.  Below are some additional interesting fits using the diagonal assumption.  The final two graphs show the different fits that arise from changing only the initial estimate for the centers of the Gaussians.



 

.3 squeeze value                                      .7 squeeze value

Figure 4
 

Part II:  Lip Detection

With a probabilistic model of the data, specific image regions can be evaluated against this probability density function to determine how close each region is to "lip space".  In a probabilistic sense, this technique highlights the regions that are most likely to contain lip-like structure.  Figures 5-13 illustrate this detection procedure on face images that contain no distinguishable expression.

This procedure tests EVERY pixel in the image.  Each pixel is considered to be the top left corner of a region (i.e., the r(0,0) of a particular region) of the same size as the training data.  This region is cast into lip space by projecting it onto the basis vectors derived from the SVD.  This reduced-dimension representation of the region is then evaluated by the Gaussian Mixture Model (GMM) estimated by the EM algorithm to determine the likelihood that the region is part of lip space.  Repeating the procedure in an exhaustive raster scan of the test image yields a likelihood for each pixel.

The second image in each of Figures 5-13 depicts these likelihoods and can be thought of as the likelihood map of this procedure.  From each likelihood map, the maximum value was extracted and its corresponding region drawn on the image as a box.  This is the best guess at the region that contains the lips.  Other very likely region origins are also plotted as 'plus' markers in the following color scheme:

Red:  Likelihood > 90% of the maximum value
Yellow:  Likelihood > 80% of the maximum value
Blue:  Likelihood > 70% of the maximum value
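
The color scheme above amounts to a simple threshold on the ratio of a likelihood to the maximum of the map.  A minimal Python sketch follows; the function name `marker_color` is hypothetical.

```python
def marker_color(likelihood, max_likelihood):
    """Return the marker color for a region origin, or None if it is not plotted."""
    if max_likelihood <= 0:
        return None  # nothing in the map is lip-like
    ratio = likelihood / max_likelihood
    if ratio > 0.9:
        return "red"
    if ratio > 0.8:
        return "yellow"
    if ratio > 0.7:
        return "blue"
    return None
```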

Note:  The fact that the box is blue has no significance in the above color scheme; it simply denotes the region of highest likelihood.  Some of the later images have red boxes, as I made the switch when it occurred to me that red would be a more intuitive color to use, given the above color scheme.

Further Note:  The light regions in the likelihood map do not seem to line up with the box and critical points labeled on the image.  They actually do, but the two images have different axes.  I stopped evaluation on the right and bottom sides of the image when the region would no longer fit in the image.  The likelihood map is therefore the equivalent of a morphological erosion of the right and bottom sides of the image, with the structuring element being the 66x36 box corresponding to the region of interest.
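
The raster scan, and the reason the likelihood map is smaller than the image, can be sketched as follows.  This is a minimal Python illustration rather than the author's code: `likelihood_map`, `project`, and `best_region` are hypothetical names, `score` stands for whatever function evaluates a region (projection followed by the GMM), and subtracting the training mean before projecting is an assumption based on the usual PCA convention.

```python
def likelihood_map(image, region_h, region_w, score):
    """Score every region that fits entirely inside `image` (a list of rows).

    The result has (rows - region_h + 1) by (cols - region_w + 1) entries,
    which is why it is cropped on the right and bottom relative to the image.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - region_h + 1):
        out_row = []
        for c in range(cols - region_w + 1):
            # (r, c) is the top left corner, i.e. r(0,0), of this region.
            region = [row[c:c + region_w] for row in image[r:r + region_h]]
            out_row.append(score(region))
        out.append(out_row)
    return out

def project(region, mean, basis):
    """Flatten a region and project it onto the retained basis vectors."""
    flat = [v for row in region for v in row]
    # Assumption: the training mean is removed before projecting.
    centered = [v - m for v, m in zip(flat, mean)]
    return [sum(b * v for b, v in zip(vec, centered)) for vec in basis]

def best_region(lmap):
    """Return the (row, col) origin of the first maximum in the map."""
    candidates = [(val, r, c)
                  for r, row in enumerate(lmap)
                  for c, val in enumerate(row)]
    _, r, c = max(candidates, key=lambda t: t[0])
    return r, c
```

The origin returned by `best_region` is in image coordinates, so the box can be drawn directly on the test image even though the map itself has fewer rows and columns.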
 
 

Null Expression


Figure 5


Figure 6


Figure 7


Figure 8
 


Figure 9


Figure 10


Figure 11
 


Figure 12


Figure 13





Note:  The technique is certainly not perfect, and even when it does find lip regions the centering is not exact.  However, considering the training data, I think this procedure works remarkably well (I will discuss this later).  Below is the same procedure run using images at the "peak" of a distinguishable expression of emotion.
 
 

Peak Expression
 


Figure 14


Figure 15


Figure 16


Figure 17


Figure 18
 


Figure 19
 


Figure 20
 


Figure 21
 


Figure 22
 


Figure 23




I believe in truth in advertising, so I have displayed all results from running batch tests on the Null expression and the "Peak" expression.  However, I also ran the tests on the half-way points between Null and Peak for use in expression detection, and for this set I have displayed only the eight images where lips or partial lips were detected.  Assume that the other two images were detected with non-lip-like maximum likelihood regions.  Figures 24 and 25 show mid-expression results.
 
 

Mid-Expression






Figure 24
 

Peak-Expression with Full Covariances




Figure 25




Note:  Take special note of the last three images: they are dead on!  These last three lip regions were detected using a GMM with full covariances.  So the question is, why not always use full covariances?  The answer is: I probably should have.  I did not because, in my initial tests with diagonal and full GMMs, diagonal GMMs yielded more detected regions containing lips or partial lips than full GMMs.  From these results I made the decision to run all tests using the diagonal GMM - for better or for worse.  I chose quantity over quality, which may or may not make sense.  In the case of "Expression Detection" it does not make sense because, as the results in Part III will show, an exact or near exact estimate of the lip region is necessary for reliable expression detection.

Part III:  Expression Detection

The training set is split into component parts: those containing each of the three primary expressions (anger, sad, and joy).  The same procedure as in Part I is used to develop a GMM for each expression.  The result is three separate models: GMManger, GMMsad, and GMMjoy.

Figures 26, 27, and 28 show ellipse fits for the three expressions.
 



Fits from the Anger Training Set


 
 

Fits from the Sad Training Set


 


Fits from Joy Training Set

Having developed these models, detected lip regions can be evaluated by each of the three expression GMMs to determine how likely each region is to be of the corresponding expression.  The following results illustrate this idea.  Each table contains, by column, the image of interest, the detected lip region, the likelihood values of this region for each expression, and the final expression decision based upon the maximum likelihood.  Table 1 contains the results taken directly from the mid-expression lip detection procedures in Part II; these results should look familiar.  Likewise, Table 2 contains data taken directly from the peak-expression results in Part II.  Take particular note of the decisions for the last three images, where the lip regions are near exact.

Mid-Expression



Columns: Test Image, Detected Lip Region, Likelihood Anger, Likelihood Sad, Likelihood Joy, Decision (the first two columns are images)
0 .022 .0299 Joy
0 .0024 .0011 Sad
0 .1403 Joy
0 .0062 .0008 Sad
0 .0006 0 Sad
Table 1
 

Peak-Expression


Columns: Test Image, Detected Lip Region, Likelihood Anger, Likelihood Sad, Likelihood Joy, Decision (the first two columns are images)
.0005 .0007 .0115 Joy
0 0 0 None
0 0 .1329 Joy
0 .0735 .0033 Sad
0 .0006 0 Sad
.0009 0 0 Anger
.0038 0 0 Anger
0 0 .0276 Joy
Table 2
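
The decision column in Tables 1 and 2 follows from a maximum-likelihood rule over the three expression models.  A minimal Python sketch is given below; the function name `decide_expression` is hypothetical, and returning "None" when every likelihood is zero is inferred from the all-zero row in Table 2.

```python
def decide_expression(likelihoods):
    """Pick the expression with the largest likelihood.

    `likelihoods` maps an expression name to the likelihood that the
    corresponding GMM assigns to the detected lip region.
    """
    name, value = max(likelihoods.items(), key=lambda kv: kv[1])
    # If no model gives the region any likelihood, make no decision.
    return name if value > 0 else "None"
```

For example, the first row of Table 2 corresponds to `decide_expression({"Anger": .0005, "Sad": .0007, "Joy": .0115})`, which selects Joy.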

Note:  The last three lip regions were detected using Gaussian models with full covariances, as mentioned above.  Note that when the region is dead on, the expression detector actually works!  This motivates the next step: to further test the expression recognizer on lip regions that are exact or near exact (i.e., in this case hand selected).
 

The procedure here is to hand select the lip region and then to make an expression decision on each image in an expression series as it transitions from the Null expression to the Peak expression.  The three tables below show this procedure for series containing each of the three distinguishable expressions.  I only ran this procedure on three series, one for each expression, so these tables aren't an 'optimistic' presentation of the results, but an exhaustive one.  The tables are a bit clumsy to read, and should be interpreted as follows.  The first cell is the image and the lip region of the Null expression.  The first row is dedicated to displaying images.  Underneath the Null image is the expression decision.  The expression decisions are then displayed for each image in a series in raster order (i.e., left to right, top to bottom).  Over cells that contain an expression change I have displayed the image and the lip region for the two images that correspond to this expression transition.  In the case of the "Anger" series I have displayed only one transition image because the preceding expression was Null.  The final image is underneath the final expression decision and corresponds to the last image in the expression series.
 
 
 

Anger Anger Anger Anger Joy
Joy Joy Joy Joy Joy
Joy Joy Joy Joy Joy
Joy Joy Joy Joy Joy
Table 3


None None None Anger Anger
Anger Anger Anger Anger Anger
Anger Anger Anger Anger Anger
Table 4



 
 
 

Anger Anger Anger Sad Sad
Sad Sad Sad Sad Sad
Sad Sad
Table 5






Note:  The results are rather good.  Early expressions without distinction tend to get categorized as "anger" or have no probabilities at all, and then after only a few images the decision transitions into the proper expression.  In each case the correct expression was latched soon after the initial Null expression image.
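
The point at which each series latches onto its final expression can be read off mechanically from the sequence of decisions.  A minimal Python sketch follows; the function name `transitions` is hypothetical, and the example sequence is the one shown in Table 3.

```python
def transitions(decisions):
    """Return (index, previous, new) for every change of decision in a series."""
    return [(i, decisions[i - 1], decisions[i])
            for i in range(1, len(decisions))
            if decisions[i] != decisions[i - 1]]

# Decisions from Table 3, in raster order.
table3 = ["Anger"] * 4 + ["Joy"] * 16
print(transitions(table3))
```

A series with a single entry in this list has changed decision exactly once, which is the behavior seen in all three tables.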
 
 

Conclusions






Lips vary a lot and in strange ways.  After taking a look at the training images I noted that they were plagued with variation, not only with respect to expression but also with respect to translation.  A good number of training regions contained some other feature, either part of the nose or the contour of the chin.  I think this is the primary drawback of the training set.  Also, the variation between teeth and no teeth, pout, and snarl makes for a space that might not look like anything or cluster anywhere.  As I was doing Part I, I was very concerned that the generated "lip space" might not model lips any better than other areas of the face.  However, the large number of transition images and stable expressions seems to have overwhelmed the outliers, and a space that can be considered "lip-like" was indeed generated.

I now also suspect that the diagonal GMM might be the 'second-best' recognizer.  Even though it generally finds more near-lip regions, the sloppier fit given by the diagonal GMM is perhaps not precise enough to zero in on the exact lip region.  The 'choosier' full GMM mis-detects more often, but when it does detect it locates the lip region exactly, which turns out to be an integral part of expression detection.

One can really get different fits from different initial estimates.  Squeezing toward the mean tended to result in more closely clustered mixtures, while spreading tended to allow the Gaussians to separate and seek out local clusters.  I would have liked to try a k-means clustering solution and also to generate some data indicating how much the differences in the fits matter when detecting lips and expressions.
 

Code: In order of importance

EMND.m  (Handles the EM algorithm to fit data with mixture of Gaussians)
modelData.m (
locateLips.m (Find the regions of high likelihoods in lip space)
detExpress.m (Find the most likely expression of a given image region)

Helper Functions:

getReducedSpace.m
marginalProbability.m
runSet.m
showStatistics.m
evaluateSwatch.m
evaluateSample.m
getSwatch.m
getRegion.m
generateEllipses.m