Part I: Fitting the Data
The first order of business is to develop an algorithm that, given an
N-dimensional space generated from data, estimates the mixture of
N-dimensional Gaussians that best fits the space spanned by the data.
To reduce the dimensionality of this space, Principal Component
Analysis (PCA) was run on the data using a standard singular value
decomposition (SVD) technique. Taking 90% of the variance to be a
sufficient description of the data, "lip space" was reduced to 7
dimensions by keeping the seven most significant eigenvectors from the
SVD. The problem of fitting a mixture of Gaussians to this space was
solved using an EM algorithm, EMDN.m, that iteratively improves the
parameters of the Gaussians in repeated "Expectation" and
"Maximization" steps. Figures 2-4 illustrate results from the EM
algorithm. Note that this algorithm requires a starting point, so some
initial parameter estimate is necessary. A likely candidate for this
initial estimate would be a k-means clustering algorithm, although a
more naive method was used for the results here. The initial estimate
of the center of each Gaussian was taken as an interpolated value along
the line from the minimum to the maximum data vector in each dimension.
Figure 1 depicts these estimated centers for the first two dimensions
of the data. The first plot shows the interpolation from the actual
minimum to the actual maximum, and the second shows the minimum and
maximum squeezed in towards the mean and then interpolated. Different
fits do arise from different initial estimates. The initial estimate
for each covariance is simply the covariance of the data set, and the
priors are all initialized at 1/N, where N here is the number of
Gaussians. For consistency, all results were derived with an initial
center estimate using a squeeze value of .5 (i.e., the minimum and
maximum are taken to be halfway between the actual minimum and maximum
and the mean). Note also that all estimates include the mean, even
though the linear interpolation from min to max might not include it.
You can see this clearly in the first figure, where the true mean
"sags" off the line.
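The dimensionality-reduction step described at the start of this part can be sketched roughly as follows. This is not the MATLAB code used for the report; it is a minimal Python illustration that assumes the singular values are already available, sorted from largest to smallest. The function name and the example values are hypothetical.

```python
def components_for_variance(singular_values, target=0.90):
    """Return the smallest k whose leading components reach `target` of the variance."""
    # The variance captured along each principal direction is proportional
    # to the square of the corresponding singular value.
    energies = [s * s for s in singular_values]
    total = sum(energies)
    if total == 0:
        raise ValueError("data has no variance")
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running / total >= target:
            return k
    return len(energies)


# Hypothetical singular values, for illustration only (not the report's data).
example = [10.0, 6.0, 4.0, 2.0, 1.0]
print(components_for_variance(example))  # 3
```

With the report's data, the same rule at a 90% target is what led to keeping seven eigenvectors.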

Figure 1
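The naive initialization shown in Figure 1 might look something like the sketch below. It is an illustration in Python, not the original MATLAB code. It assumes that the squeeze value is the fraction of the way each end point is pulled towards the mean (so 0 keeps the true minimum and maximum, and .5 is halfway). The function names are hypothetical, and the step that forces the mean into the set of estimates is not reproduced.

```python
def initial_centers(data, n_gaussians, squeeze=0.5):
    """Place n_gaussians centers along the squeezed min-to-max line of the data."""
    n_dims = len(data[0])
    columns = [[row[d] for row in data] for d in range(n_dims)]
    mean = [sum(col) / len(col) for col in columns]
    # Pull each end point a fraction `squeeze` of the way towards the mean
    # (assumed meaning of the squeeze value).
    low = [min(col) + squeeze * (m - min(col)) for col, m in zip(columns, mean)]
    high = [max(col) + squeeze * (m - max(col)) for col, m in zip(columns, mean)]
    centers = []
    for i in range(n_gaussians):
        # Evenly spaced interpolation parameter between the two end points.
        t = i / (n_gaussians - 1) if n_gaussians > 1 else 0.5
        centers.append([lo + t * (hi - lo) for lo, hi in zip(low, high)])
    return centers


def initial_priors(n_gaussians):
    """Every Gaussian starts with the same prior."""
    return [1.0 / n_gaussians] * n_gaussians


# Tiny made-up data set: the mean is (2, 3), but the middle center is (2, 3.5),
# so the mean does not lie on the interpolation line.
points = [[0.0, 0.0], [4.0, 8.0], [2.0, 1.0]]
print(initial_centers(points, 3))
```

The made-up example shows the same effect noted in the text: the interpolation line need not pass through the true mean.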
The dimensionality of the data can be arbitrarily large, and the covariance matrices of this space can become unwieldy, near singular, near zero, or blown up. Each of these possibilities poses serious difficulties for the algorithm. To cope with this, the covariance matrix can be assumed diagonal in situations where the full covariance matrix causes difficulties. Figure 2 depicts a set of fits using the full covariance matrix, showing the estimates for different dimensions (titled Dn on each figure) and varying numbers of Gaussians (10, 7 and 3).
10 Gaussians


7 Gaussians

3 Gaussians

Figure 2
Figure 3 illustrates the same information as Figure 2, but uses the diagonal covariance assumption instead. Note that all ellipses are aligned with the axes Dn1, Dn2.
10 Gaussians


7 Gaussians

3 Gaussians

Figure 3
I found that my results were more robust using the diagonal assumption and, unless otherwise stated, all results were calculated using a fit under this assumption. Below are some additional interesting fits using the diagonal assumption. The final two graphs show the different fits that arise from changing only the initial estimate for the centers of the Gaussians.


.3 squeeze value
.7 squeeze value

Figure 4
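For readers who want to see the shape of the fitting loop, here is a minimal sketch of one "Expectation" plus "Maximization" pass under the diagonal covariance assumption. It is written in plain Python for illustration and is not the original MATLAB implementation. The function names and the small variance floor `var_floor` are assumptions added here; the floor is only a guard against a variance collapsing to zero.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density of an axis-aligned Gaussian: a product of 1-D normal densities."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def em_step(data, priors, means, variances, var_floor=1e-6):
    """One EM pass; returns the updated (priors, means, variances)."""
    k, n_dims, n = len(priors), len(data[0]), len(data)
    # Expectation: responsibility of each Gaussian for each data vector.
    resp = []
    for x in data:
        weighted = [priors[j] * diag_gaussian_pdf(x, means[j], variances[j])
                    for j in range(k)]
        total = sum(weighted)
        # If every density underflows to zero, share the point equally.
        resp.append([w / total for w in weighted] if total > 0 else [1.0 / k] * k)
    # Maximization: re-estimate each Gaussian from its weighted data.
    new_priors, new_means, new_vars = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        if nj == 0.0:
            # A Gaussian that owns no data keeps its previous shape.
            new_priors.append(0.0)
            new_means.append(list(means[j]))
            new_vars.append(list(variances[j]))
            continue
        mean_j = [sum(r[j] * x[d] for r, x in zip(resp, data)) / nj
                  for d in range(n_dims)]
        var_j = [max(var_floor,
                     sum(r[j] * (x[d] - mean_j[d]) ** 2 for r, x in zip(resp, data)) / nj)
                 for d in range(n_dims)]
        new_priors.append(nj / n)
        new_means.append(mean_j)
        new_vars.append(var_j)
    return new_priors, new_means, new_vars


# Made-up 1-D data with two obvious clusters, for illustration only.
samples = [[0.0], [0.2], [-0.2], [10.0], [10.2], [9.8]]
params = ([0.5, 0.5], [[1.0], [9.0]], [[1.0], [1.0]])
for _ in range(5):
    params = em_step(samples, *params)
print(params)
```

Because only per-dimension variances are estimated, the fitted ellipses stay aligned with the axes, as in Figure 3.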
Part II: Lip Detection
With a probabilistic model of the data, specific image regions can be evaluated against this probability density function to determine how close each region is to "lip space". In a probabilistic sense, this technique highlights the regions that are most likely to contain lip-like structure. Figures 5-13 illustrate this detection procedure on face images that contain no distinguishable expression. The procedure tests EVERY pixel in the image. Each pixel is treated as the top-left corner of a region (i.e., the r(0,0) of that region) of the same size as the training data. The region is cast into lip space by projecting it onto the basis vectors derived from the SVD. This reduced-dimension representation of the region is then evaluated by the Gaussian Mixture Model (GMM) estimated by the EM algorithm to determine the likelihood of the region being part of lip space. Repeating the procedure in an exhaustive raster scan of the test image yields a probability for each pixel. The second image in each of Figures 5-13 depicts these probabilities and can be thought of as a likelihood map. From each likelihood map, the maximum value was extracted and its corresponding region drawn on the image as a box. This is the best guess at the region that contains the lips. Other very likely region origins are also plotted as 'plus' marks in the following color scheme:
Red: Likelihood > 90% of the maximum value
Yellow: Likelihood > 80% of the maximum value
Blue: Likelihood > 70% of the maximum value
Note: The fact that the box is blue has no significance in the above color scheme; it simply denotes the region of highest likelihood. Some of the later images have red boxes, as I made the switch when it occurred to me that red would be a more intuitive color to use given the above color scheme.
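The marking scheme above can be sketched as follows. This is an illustrative Python version, not the original code. It assumes the bands are exclusive, so a point above 90% of the maximum is marked red only and not also yellow and blue. The function name and the small example map are made up.

```python
def mark_candidates(likelihood_map):
    """Return the peak location and the other region origins grouped by color band."""
    peak_value, peak_pos = -1.0, None
    for r, row in enumerate(likelihood_map):
        for c, value in enumerate(row):
            if value > peak_value:
                peak_value, peak_pos = value, (r, c)
    bands = {"red": [], "yellow": [], "blue": []}
    for r, row in enumerate(likelihood_map):
        for c, value in enumerate(row):
            if (r, c) == peak_pos:
                continue  # the peak itself is drawn as the box, not as a plus
            if value > 0.9 * peak_value:
                bands["red"].append((r, c))
            elif value > 0.8 * peak_value:
                bands["yellow"].append((r, c))
            elif value > 0.7 * peak_value:
                bands["blue"].append((r, c))
    return peak_pos, bands


# Made-up likelihood values, for illustration only.
demo_map = [[0.10, 0.95, 0.50],
            [0.85, 1.00, 0.75]]
print(mark_candidates(demo_map))
```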
Further Note: The light regions in the likelihood map don't seem to
line up with the box and critical points labeled on the image. They
actually do, but the two images have different axes. I stopped
evaluation on the right and bottom sides of the image when the region
would no longer fit in the image. The likelihood map is therefore the
equivalent of a morphological erosion of the right and bottom sides of
the image, with the structuring element being the 66x36 box
corresponding to the region of interest.
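A rough sketch of the exhaustive raster scan, and of why the likelihood map comes out smaller than the image, is given below. It is plain Python for illustration only. The `score` argument is a hypothetical stand-in for "project the region into lip space and evaluate it with the GMM", and the function name is an assumption. For an image H pixels high and W pixels wide and a region h high and w wide, the map has H - h + 1 rows and W - w + 1 columns; with the 66x36 box this would be W - 65 columns and H - 35 rows if the box is 66 wide and 36 high, which is an assumption about the orientation.

```python
def scan_image(image, region_h, region_w, score):
    """Score every region that fits entirely inside the image, in raster order."""
    img_h, img_w = len(image), len(image[0])
    out_h, out_w = img_h - region_h + 1, img_w - region_w + 1
    if out_h <= 0 or out_w <= 0:
        raise ValueError("region is larger than the image")
    result = []
    for top in range(out_h):          # top to bottom
        row = []
        for left in range(out_w):     # left to right
            # (top, left) is the r(0,0) corner of the region being tested.
            region = [image[top + dr][left:left + region_w] for dr in range(region_h)]
            row.append(score(region))
        result.append(row)
    return result


# Made-up 4x5 "image" and a stand-in score (sum of pixel values).
toy_image = [[r * 5 + c for c in range(5)] for r in range(4)]
toy_map = scan_image(toy_image, 2, 3, lambda region: sum(sum(line) for line in region))
print(len(toy_map), len(toy_map[0]))  # 3 3
```

An entry at row r and column c of the map therefore refers to the box whose top-left corner is at pixel (r, c) in the image, which is why the two pictures appear to be on different axes.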
Null Expression

Figure 5

Figure 6

Figure 7

Figure 8

Figure 9

Figure 10

Figure 11

Figure 12

Figure 13
Note: The technique is certainly not perfect, and even when it does
find lip regions the centering is not exact. However, considering the
training data, I think this procedure works remarkably well (I will
discuss this later). Below is the same procedure run on images at the
"peak" of a distinguishable expression of emotion.
Peak Expression

Figure 14

Figure 15

Figure 16

Figure 17

Figure 18

Figure 19

Figure 20

Figure 21

Figure 22

Figure 23
I believe in truth in advertising, so I have displayed all results from
running batch tests on the Null expression and the "Peak" expression.
However, I also ran the tests on the half-way points between Null and
Peak for use in expression detection, and for this set I have displayed
only the eight images where lips or partial lips were detected. Assume
that in the other two images the maximum-likelihood regions were not
lip-like. Figure 24 shows these mid-expression results, and Figure 25
shows peak-expression results obtained with full covariances.
Mid-Expression
Figure 24
Peak-Expression with Full Covariances
Figure 25
Note: Take special note of the last three images: they are dead on! These last three lip regions were detected using a GMM with full covariances. So the question is, why not always use full covariances? The answer is: I probably should have. I did not because, in my initial tests with diagonal and full GMMs, the diagonal GMMs yielded more detected regions containing lips or partial lips than the full GMMs. From these results I decided to run all tests using the diagonal GMM, for better or for worse. I chose quantity over quality, which may or may not make sense. For expression detection it does not make sense because, as the results in Part III will show, an exact or near-exact estimate of the lip region is necessary for reliable expression detection.
Part III: Expression Detection
The training set is split into component parts: those containing each of the three primary expressions (anger, sad, and joy). The same procedure as in Part I is used to develop a GMM for each expression. The result is three separate models: GMManger, GMMsad, and GMMjoy.
Figures 26, 27 and 28 show ellipse fits for the three expressions.
Fits from the Anger Training Set


Fits from the Sad Training Set


Fits from Joy Training Set


Having developed these models, detected lip regions can be evaluated by each of the three expression GMMs to determine how likely each region is to be of the corresponding expression. The following results illustrate this idea. Each table contains, by column, the image of interest, the detected lip region, the likelihood values of this region for each expression, and the final expression decision based on the maximum likelihood. Table 1 contains the results taken directly from the mid-expression lip detection procedure in Part II, so these results should look familiar. Likewise, Table 2 contains data taken directly from the peak-expression results in Part II. Take particular note of the decisions for the last three images, where the lip regions are nearly exact.
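The decision rule can be sketched in a few lines. This is an illustrative Python version rather than the original detExpress.m. The function name is hypothetical, and the "None" outcome for an all-zero row is inferred from the tables that follow. The example values are rows taken from those tables.

```python
def decide_expression(likelihoods):
    """Pick the expression with the highest likelihood, or "None" if all are zero."""
    # On a tie, the first expression listed wins (an arbitrary choice made here).
    best = max(likelihoods, key=likelihoods.get)
    return best if likelihoods[best] > 0 else "None"


print(decide_expression({"Anger": 0, "Sad": .022, "Joy": .0299}))  # Joy
print(decide_expression({"Anger": 0, "Sad": .0735, "Joy": .0033}))  # Sad
print(decide_expression({"Anger": 0, "Sad": 0, "Joy": 0}))  # None
```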
Table 1: Mid-Expression
| Test Image | Detected Lip Region | Likelihood Anger | Likelihood Sad | Likelihood Joy | Decision |
|---|---|---|---|---|---|
| ![]() | ![]() | 0 | .022 | .0299 | Joy |
| ![]() | ![]() | 0 | .0024 | .0011 | Sad |
| ![]() | ![]() | 0 | 0 | .1403 | Joy |
| ![]() | ![]() | 0 | .0062 | .0008 | Sad |
| ![]() | ![]() | 0 | .0006 | 0 | Sad |
Table 2: Peak-Expression
| Test Image | Detected Lip Region | Likelihood Anger | Likelihood Sad | Likelihood Joy | Decision |
|---|---|---|---|---|---|
| ![]() | ![]() | .0005 | .0007 | .0115 | Joy |
| ![]() | ![]() | 0 | 0 | 0 | None |
| ![]() | ![]() | 0 | 0 | .1329 | Joy |
| ![]() | ![]() | 0 | .0735 | .0033 | Sad |
| ![]() | ![]() | 0 | .0006 | 0 | Sad |
| ![]() | ![]() | .0009 | 0 | 0 | Anger |
| ![]() | ![]() | .0038 | 0 | 0 | Anger |
| ![]() | ![]() | 0 | 0 | .0276 | Joy |
Note: The last three lip regions were detected using Gaussian models
with full covariances, as mentioned above. Note that when the region is
dead on, the expression detector actually works! This motivates the
next step: further testing the expression recognizer on lip regions
that are exact or near exact (i.e., in this case, hand selected).
The procedure here is to hand select the lip region and then to make
an expression decision on each image in an expression series as it
transitions from the Null expression to the Peak expression. The three
tables below show this procedure for series containing each of the
three distinguishable expressions. I only ran this procedure on three
series, one for each expression, so these tables aren't an
'optimistic' presentation of the results, but an exhaustive one. The
tables are a bit clumsy to read, and should be interpreted as follows.
The first cell is the image and the lip region of the Null expression.
The first row is dedicated to displaying images. Underneath the Null
images is the expression decision. The expression decisions are then
displayed for each image in the series in raster order (i.e., left to
right, top to bottom). Over cells that contain an expression change I
have displayed the image and the lip region for the two images that
correspond to this expression transition. In the case of the "Anger"
series I have displayed only one transition image because the
preceding expression was Null. The final image is underneath the final
expression decision and corresponds to the last image in the
expression series.
![]() ![]() |
![]() ![]() |
![]() ![]() |
||
| Anger | Anger | Anger | Anger | Joy |
| Joy | Joy | Joy | Joy | Joy |
| Joy | Joy | Joy | Joy | Joy |
| Joy | Joy | Joy | Joy | Joy |
![]() ![]() |
![]() ![]() |
![]() ![]() |
|||
| None | None | None | Anger | Anger |
| Anger | Anger | Anger | Anger | Anger |
| Anger | Anger | Anger | Anger | Anger |
![]() ![]() |
![]() ![]() |
![]() ![]() |
![]() ![]() |
||
| Anger | Anger | Anger | Sad | Sad |
| Sad | Sad | Sad | Sad | Sad |
| Sad | Sad | |||
![]() ![]() |
Note: The results are rather good. Early images without a distinct
expression tend to get categorized as "anger" or have no probabilities
at all, and then after only a few images the decision transitions to
the proper expression. In each case the correct expression was latched
soon after the initial Null expression image.
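How soon a series latches onto its final expression can be read off the decision sequences with a small helper like the one below. This is an illustrative Python sketch with a hypothetical function name. The two sequences are transcribed from the first two decision tables above, read left to right and top to bottom.

```python
def find_transitions(decisions):
    """Return (index, previous, current) for every change between consecutive decisions."""
    changes = []
    for i in range(1, len(decisions)):
        if decisions[i] != decisions[i - 1]:
            changes.append((i, decisions[i - 1], decisions[i]))
    return changes


first_table = ["Anger"] * 4 + ["Joy"] * 16
second_table = ["None"] * 3 + ["Anger"] * 12
print(find_transitions(first_table))   # [(4, 'Anger', 'Joy')]
print(find_transitions(second_table))  # [(3, 'None', 'Anger')]
```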
Conclusions
Lips vary a lot and in strange ways. After taking a look at the training images I noted that they were plagued with variation, not only with respect to expression but also with respect to translation. A good number of training regions contained some other feature, either part of the nose or the contour of the chin. I think this is the primary drawback of the training set. The variation between teeth and no teeth, pout, and snarl also makes for a space that might not look like anything or cluster anywhere. As I was doing Part I, I was very concerned that the generated "lip space" might not model lips any better than other areas of the face. However, the large number of transition images and stable expressions seems to have overwhelmed the outliers, and a space that can be considered "lip-like" was indeed generated.
I am also now under the impression that the diagonal GMM might be the 'second-best' recognizer. Even though it generally finds more near-lip regions, the sloppier fit given by the diagonal GMM is perhaps not precise enough to zero in on the exact lip region. The 'choosier' full GMM mis-detects more often, but when it does detect it locates the lip region exactly, which turns out to be an integral part of expression detection.
One really can get different fits from different initial estimates.
Squeezing toward the mean tended to result in more tightly clustered
mixtures, while spreading tended to allow the Gaussians to separate and
seek out local clusters. I would have liked to try a k-means clustering
initialization and also to generate some data indicating how much the
differences in the fits matter when detecting lips and expressions.
Code: In order of importance
EMND.m (Handles the EM algorithm to fit data with a mixture of Gaussians)
modelData.m (
locateLips.m (Finds the regions of high likelihood in lip space)
detExpress.m (Finds the most likely expression of a given image region)
Helper Functions:
getReducedSpace.m
marginalProbability.m
runSet.m
showStatistics.m
evaluateSwatch.m
evaluateSample.m
getSwatch.m
getRegion.m
generateEllipses.m