MASSACHUSETTS INSTITUTE OF TECHNOLOGY
ARTIFICIAL INTELLIGENCE LABORATORY
and
CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING
A.I. Memo No. 1521 December, 1994
C.B.C.L. Paper No. 112
Example-based Learning for View-based
Human Face Detection
Kah-Kay Sung and Tomaso Poggio
This publication can be retrieved by anonymous ftp to publications.ai.mit.edu.
Abstract
Finding human faces automatically in an image is a difficult yet important first step to a fully automatic
face recognition system. It is also an interesting academic problem because a successful face detection
system can provide valuable insight on how one might approach other similar object and pattern detection
problems. This paper presents an example-based learning approach for locating vertical frontal views of
human faces in complex scenes. The technique models the distribution of human face patterns by means
of a few view-based "face" and "non-face" prototype clusters. At each image location, a difference feature
vector is computed between the local image pattern and the distribution-based model. A trained classifier
determines, based on the difference feature vector, whether or not a human face exists at the current image
location. We show empirically that the prototypes we choose for our distribution-based model, and the
distance metric we adopt for computing difference feature vectors, are both critical for the success of our
system.
Copyright © Massachusetts Institute of Technology, 1994
This report describes research done at the Artificial Intelligence Laboratory and within the Center for Biological and Computa-
tional Learning in the Department of Brain and Cognitive Sciences at the Massachusetts Institute of Technology. This research
is sponsored by grants from ARPA-ONR under contract N00014-92-J-1879; and by a grant from the National Science Foun-
dation under contract ASC-9217041 (this award includes funds from ARPA provided under the HPCC program). Additional
support is provided by the North Atlantic Treaty Organization, ATR Audio and Visual Perception Research Laboratories,
Mitsubishi Electric Corporation, Sumitomo Metal Industries, and Siemens AG. Support for the A.I. Laboratory's artificial
intelligence research is provided by ARPA-ONR contract N00014-91-J-4038. Tomaso Poggio is supported by the Uncas and
Helen Whitaker Chair at MIT's Whitaker College.
1 Introduction
Feature and pattern detection in images is a classical
computer vision problem with many potential applica-
tions, ranging from automatic target recognition to in-
dustrial inspection tasks in assembly lines. While there
have been some successful pattern detection systems in
the past, especially those using pictorial templates [4]
[10] [11] or geometric models [7] for describing objects,
most such systems are limited by some serious con-
straints. For example, in some systems, the target ob-
jects must be rigid or at least made up of rigid parts. In
others, lighting must be controlled or the imaging geom-
etry must be known.
This paper presents a new feature and pattern detec-
tion technique for finding slightly deformable objects un-
der reasonable amounts of lighting variation. To demon-
strate our technique, we focus on the specific problem
of finding human faces in images. We have developed
a generic human face detection system that operates
successfully under a reasonably wide range of lighting
conditions and imaging environments. Our system de-
tects vertically oriented and unoccluded frontal views
of human faces in grey-level images. It handles faces
within a range of scales as specified by the user, and
works under different lighting conditions, even with mod-
erately strong shadows. We stress that our ultimate
goal is to propose a general methodology for taking on
feature detection tasks in multiple domains, including
hand-printed character recognition, industrial inspec-
tion, medical image analysis and terrain classification,
where target patterns may not be rigid or geometrically
parameterizable, and where imaging conditions may not
be within the user's control.
1.1 The Face Detection Problem
We begin by describing the problem of human face de-
tection. Given as input an arbitrary image, which could
be a digitized video signal or a scanned photograph, de-
termine whether or not there are any human faces in the
image, and if there are, return an encoding of the loca-
tion and spatial extent of each human face in the image.
An example encoding might be to fit each face in the
image with a bounding box, and return the image co-
ordinates of the box corners. A related problem to face
detection is face recognition: given an input image of a
face, compare the input face against models in a library
of known faces and report if a match is found. In recent
years, the face recognition problem has attracted much
attention because of its many possible applications in
automatic access control systems and human-computer
interfaces.
Why is automatic face detection an interesting prob-
lem? Applications wise, face detection has direct rele-
vance to the face recognition problem, because the first
important step of a fully automatic human face recog-
nizer is usually one of identifying and locating faces in
an unknown image. So far, the focus of face recognition
research has been mainly on distinguishing individual
faces from others in a database. The task of finding
faces in an arbitrary background is usually avoided by
either hand segmenting the input image, or by capturing
faces against a known uniform background.
Face detection also has potential applications in
human-computer interfaces and surveillance systems.
For example, a face finder can make workstations with
cameras more user friendly by turning monitors on and
keeping them active whenever there is someone in front
of the terminal. In some security and census systems,
one could determine the number of people in a scene by
counting the number of visible human faces.
From an academic standpoint, face detection is inter-
esting because faces make up a challenging class of nat-
urally structured but slightly deformable objects. There
are many other object classes and phenomena in the real
world that share similar characteristics, for example dif-
ferent hand or machine printed instances of the character
'A', tumor anomalies in MRI scans and material defects
in industrial products. A successful face detection sys-
tem can provide valuable insight on how one might ap-
proach other similar pattern and feature detection prob-
lems.
Face detection is difficult for three main reasons.
First, although most faces are similarly structured with
the same facial features arranged in roughly the same
spatial configuration, there can be a large component of
non-rigidity and textural differences among faces. For
the most part, these elements of variability are due to
the basic differences in "facial appearance" between in-
dividuals — person 'A' has a larger nose than person
'B', person 'C' has eyes that are farther apart than per-
son 'D', while person 'E' has a darker skin complexion
than person 'F'. Even between images of the same per-
son's face, there can still be significant geometrical or
textural differences due to changes in expression and the
presence or absence of facial makeup. As such, tradi-
tional fixed template matching techniques and geomet-
ric model-based object recognition approaches that work
well for rigid and articulate objects tend to perform in-
adequately for detecting faces.
Second, face detection is also made difficult because
certain common but significant features, such as glasses
or a moustache, can either be present or totally absent
from a face. Furthermore, these features, when present,
can cloud out other basic facial features (e.g. the glare
in one's glasses may de-emphasize the darkness of one's
eyes) and have a variable appearance themselves (e.g.
glasses come in many different designs). All this adds
more variability to the range of permissible face pat-
terns that a comprehensive face detection system must
handle.
Third, face detection can be further complicated by
unpredictable imaging conditions in an unconstrained
environment. Because faces are essentially 3-dimensional
structures, a change in light source distribution can cast
or remove significant shadows from a particular face,
hence bringing about even more variability to 2D facial
patterns.
1.2 Existing work
As discussed above, the key issue and difficulty in face
detection is to account for the wide range of allowable fa-
cial pattern variations in images. There have been three
main approaches for dealing with allowable pattern vari-
ations, namely: (1) the use of correlation templates, (2)
deformable templates, and (3) spatial image invariants.
Correlation Templates Fixed correlation templates
work like matched filters. They compute a difference
measurement between a fixed target pattern and candi-
date image locations, and the output is thresholded for
matches.
While the class of all face patterns is probably too var-
ied to be modeled by fixed correlation templates, there
are some face detection approaches that use a bank of
several correlation templates to detect major facial sub-
features in the image [2] [3] . The assumption here is that
the degree of non-rigidity in these facial subfeatures is
small enough to be adequately described by a few fixed
templates. At a later stage, the technique infers the
presence of faces by analyzing the spatial relationship
between the detected subfeatures.
A closely related approach to correlation templates is
that of view-based eigen-spaces [12]. The approach as-
sumes that the set of all possible face patterns occupies
a small and easily parameterizable sub-space in the origi-
nal high dimensional input image vector space. To detect
faces, the approach first recovers a representation of the
sub-space of face patterns as a reference for matching.
An image pattern is classified as "a face" if its distance
from the sub-space of faces is below a certain threshold,
according to an appropriate distance metric.
Typically, the approach approximates the sub-space
of face patterns using clusters and their principal com-
ponents from one or more sets of face images. Mathe-
matically, each set of face images is modeled as a cluster
of points in the original high-dimensional input image
vector space, and each cluster is characterized by the
principal components of the distribution of its elements.
The principal components, or eigenvectors, capture the
different types of variation that can occur between face
images in the set, with the largest eigenvectors usually
corresponding to the key variation types — those due to
illumination changes and predominant differences in fa-
cial appearance between individuals. The face sub-space
is formed using only a subset of the largest eigenvectors
from each of the clusters, so as to preserve only the se-
mantically meaningful variations. One commonly used
distance metric is the Euclidean distance of a pattern
from the nearest face sub-space location, which may be
interpreted as a difference measure between the observed
pattern and some "typical" face pattern [12]. So far, this
approach has only been demonstrated on images with
not-so-cluttered backgrounds.
The view-based eigen-space approach has also been
used for detecting and locating individual facial sub-
features. The purpose of these sub-feature detectors has
mainly been one of registering mugshot faces with model
faces for recognition, rather than as a preprocessor for
face detection.
Deformable Templates Deformable templates are
similar in principle to classical correlation templates, ex-
cept that the former has some built-in non-rigidity com-
ponent. To find faces in an image with a deformable tem-
plate, one essentially fits the template to different parts
of the image and thresholds the output for matches.
One deformable template approach uses hand con-
structed parameterized curves and surfaces to model the
non-rigid elements of faces and facial subfeatures, such as
the eyes, nose and lips [20]. The parameterized curves
and surfaces are fixed elastically to a global template
frame to allow for minor positional variations between
facial features. The matching process attempts to align
the template with one or more pre-processed versions of
the image, such as the peak, valley and edge maps. An
energy functional constrains alignment by attracting the
parameterized curves and surfaces to corresponding im-
age features, while penalizing deformation "stress" in the
template. The best fit configuration is found by mini-
mizing the energy functional, and the minimum value
also serves as a closeness measure for the match.
Image Invariants Image-invariance schemes assume
that even though faces may vary greatly in appearance
for a variety of reasons, there are some spatial image re-
lationships common and possibly unique to all face pat-
terns. To detect faces, one has to compile such a set of
image invariants and check for positive occurrences of
these invariants at all candidate image locations.
One image-invariance scheme is based on a set of ob-
served brightness invariants between different parts of
a human face [15]. The underlying observation is that
while illumination and other changes can signifi-
cantly alter brightness levels at different parts of a face,
the local ordinal structure of brightness distribution re-
mains largely unchanged. For example, the eye regions
of a face are almost always darker than the cheeks and
the forehead, except possibly under some very unlikely
lighting conditions. Similarly, the bridge of the nose
is always brighter than the two flanking eye regions.
The scheme encodes these observed brightness regulari-
ties as a ratio template, which it uses to pattern match
for faces. A ratio template is a coarse spatial template
of a face with a few appropriately chosen sub-regions,
roughly corresponding to key facial features. The bright-
ness constraints between facial parts are captured by
an appropriate set of pairwise brighter-darker relation-
ships between corresponding sub-regions. An image pat-
tern matches the template if it satisfies all the pairwise
brighter-darker relationships.
1.3 Example-based Learning and Face
Detection
In this paper, we formulate the face detection problem
as one of learning to recognize face patterns from ex-
amples. We use an initial database of about 1000 face
mugshots to construct a distribution-based generic face
model with all its permissible pattern variations in a
high dimensional image window vector space. We then
train a decision procedure on a sequence of "face" and
"non-face" examples, to empirically discover a set of op-
erating parameters and thresholds that separates "face"
patterns from "non-face" patterns. Our learning-based
approach has the following distinct advantages over ex-
isting techniques:
First, our modeling scheme does not depend much
on domain specific knowledge or special hand-crafting
techniques to parameterize face patterns and their var-
ious sources of variation. This immediately eliminates
one potential source of modeling error — that due to
inaccurate or incomplete knowledge. Unlike deformable
template approaches or image invariance techniques that
build models based on prior knowledge of facial struc-
ture, our scheme essentially builds models that describe
the distribution of face patterns it receives. Therefore,
as long as we provide our scheme with a sufficiently com-
prehensive set of example face patterns, we can expect
our face models to be more accurate and more descrip-
tive than the manually synthesized ones.
Second, unlike most non learning-based approaches
that typically obtain their operating parameters and
thresholds manually from a few trial cases, our detec-
tion scheme derives its parameters and thresholds au-
tomatically from a large number of input-output exam-
ples. This makes our scheme potentially superior in two
ways: (1) For a given set of free thresholds and parame-
ters, we can expect our scheme to arrive at a statistically
more reliable set of operating values because it is able to
process a wider sample of training data automatically.
(2) Because our scheme automatically learns thresholds
and parameters, it can be easily made to learn appro-
priate high-dimensional and non-linear relationships on
image measurements for distinguishing between "face"
and "non-face" patterns. These relationships, even if
they do exist, may be too complex for human observers
to discover manually.
Our resulting system also has the following desirable
characteristics. Performance wise, it can be made arbi-
trarily robust by increasing the size and variety of its
training examples. Both false positive and false negative
detection errors can be easily corrected by further train-
ing with the wrongly classified patterns. Functionality
wise, the system is also highly extensible. For instance,
to detect human faces over a wider range of poses, we
can apply the same example-based learning technique to
create a few separate systems that each model and iden-
tify faces at a different pose, and have the systems run
in parallel.
2 System Overview and Approach
In our view-based approach, faces are treated as a
class of local target patterns to be detected in an image.
Because faces are essentially structured objects with the
same key features geometrically arranged in roughly the
same fashion, it is possible to define a semantically sta-
ble "canonical" face structure in the image domain for
the purpose of pattern matching. Figure 1(a) shows the
canonical face structure adopted by our system. It cor-
responds to a square portion of the human face whose
upper boundary lies just above the eyes and whose lower
edge falls just below the mouth. The face detection task
thus becomes one of appropriately representing the class
of all such "face-like" window patterns, and finding in-
stances of these patterns in an image.
Our system detects faces by exhaustively scanning an
image for these face-like window patterns at all possi-
ble scales. Figure 2(a) describes the system's task at
one fixed scale. The image is divided into multiple,
possibly overlapping sub-images of the current window
size. At each window, the system attempts to classify
the enclosed image pattern as being either "a face" or
"not a face" . This involves taking a set of local im-
age measurements to extract pertinent information from
the enclosed pattern, and using a pre-defined decision
procedure to determine, based on the measurements ob-
tained, whether or not the enclosed pattern "resembles"
our canonical face structure. Each time a "matching"
window pattern is found, the system reports a face at the
window location, and the scale as given by the current
window size. Multiple scales are handled by examining
and classifying windows of different sizes. Our system
performs an equivalent operation by examining and clas-
sifying fixed sized window patterns on scaled versions of
the image. The idea is to resize the image appropri-
ately, so that the desired window size scales to the fixed
window dimensions assumed by the image measurement
routines.
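To make the scan concrete, here is a minimal Python sketch of the rescale-and-slide strategy; the routine names, the scale step, the number of scales and the stride are illustrative choices of ours, and classify_window stands in for the window classifier developed in the rest of the paper.

import numpy as np
from scipy.ndimage import zoom

WINDOW = 19  # fixed window side length assumed by the measurement routines

def scan_for_faces(image, classify_window, scale_step=1.2, n_scales=8, stride=2):
    """Exhaustively scan a 2-D grey-level image for face-like 19x19 window
    patterns at several scales, by shrinking the image so that windows of
    the desired size map onto fixed 19x19 windows."""
    detections = []
    for s in range(n_scales):
        factor = scale_step ** s                  # desired window size / 19
        resized = zoom(image.astype(float), 1.0 / factor)
        rows, cols = resized.shape
        for r in range(0, rows - WINDOW + 1, stride):
            for c in range(0, cols - WINDOW + 1, stride):
                window = resized[r:r + WINDOW, c:c + WINDOW]
                if classify_window(window):       # "a face" / "not a face"
                    detections.append((int(r * factor), int(c * factor),
                                       int(round(WINDOW * factor))))
    return detections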
Clearly, the most critical part of our system is the
algorithm for classifying window patterns as "faces" or
"non-faces" . The rest of this paper focuses on the algo-
rithm we developed. Figure 2(b) shows the key compo-
nents of the algorithm. Basically, the approach is one of
appropriately modeling the distribution of canonical face
patterns in some high dimensional image window vector
space, and learning a functional mapping of input pat-
tern measurements to output classes from a representa-
tive set of "face" and "non-face" window patterns. More
specifically, our approach works as follows:
1. We require that the window pattern to be classified
be 19 x 19 pixels in size and appropriately masked
(see Figure 1). All window patterns of different
dimensions must first be re-scaled to this size and
masked before further processing. Matching with a
fixed sized window simplifies our algorithm because
it allows us to use the same classification procedure
for all scales. The masking operation applies some
prior domain knowledge to the matching problem
by ignoring certain near-boundary pixels that may
fall outside the spatial boundaries of a face.
2. Using a database of canonical "face" window pat-
terns and a similar database of "non-face" window
patterns, we construct a distribution-based model
of canonical face patterns in the masked 19 x 19
dimensional image window vector space. Our mod-
eling scheme uses a few "face" pattern prototypes
to piece-wise approximate the distribution manifold
of canonical face patterns in the masked 19 x 19 di-
mensional image window vector space. It also uses a
few "non-face" pattern prototypes to help capture
the concavities in the distribution manifold more
accurately. Together, these "face" and "non-face"
pattern prototypes serve as a distribution-based
model for the class of canonical face views. The
"face" pattern prototypes are synthesized offline by
performing clustering on the example database of
"face" window patterns, and the "non-face" proto-
types are similarly synthesized from the database
of "non-face" window patterns. The prototypes are
Figure 1: (a): A "canonical" face pattern. (b): A 19 x 19 mask for eliminating near-boundary pixels of canonical face
patterns. (c): The resulting "canonical" face pattern after applying the mask.
hard-wired into the face detection system at com-
pile time. Each prototype is a multi-dimensional
Gaussian cluster with a centroid location and a co-
variance matrix that describes the local data distri-
bution around the centroid.
3. For each new window pattern to be classified, we
compute a set of image measurements as input to
the "face" or "non-face" decision procedure. Each
set of image measurements is a vector of distances
from the new window pattern to the window pat-
tern prototypes in the image window vector space.
To compute our vector of distances, we define and
use a new "Mahalanobis-like" distance metric for
measuring the distance between the input window
pattern and each prototype center. The distance
metric takes into account the shape of each proto-
type cluster, in order to penalize distances orthog-
onal to the local data distribution. The resulting
vector of distances coarsely encodes the new win-
dow pattern's location relative to the overall dis-
tribution of canonical face patterns in the image
window vector space. It serves as a notion of "dif-
ference" between the new window pattern and the
class of all canonical face patterns.
4. We train a multi-layer perceptron (MLP) net to
identify new window patterns as "faces" or "non-
faces" from their vector of distance measurements.
When trained, the multi-layer perceptron net takes
as input a vector of distance measurements and out-
puts a '1' if the vector arises from a face pattern,
and a '0' otherwise. We use a combined database
of 47316 "face" and "non-face" window patterns to
train the multi-layer perceptron classifier.
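As an informal summary of steps 1 to 4, the sketch below shows how one candidate window could flow through the classifier. Every helper passed in (the preprocessing routine, the prototype list, the 2-Value distance function and the trained net) is a placeholder for the components detailed in Sections 3 to 5, not code from the actual system.

import numpy as np

def classify_window(window, preprocess, prototypes, two_value_distance, mlp):
    """Decide whether a 19x19 candidate window contains a face.

    preprocess         : resizing/masking/normalization of Section 3.2
    prototypes         : 12 (centroid, covariance) clusters, 6 "face" + 6 "non-face"
    two_value_distance : returns the (D1, D2) pair of Section 4 for one cluster
    mlp                : trained multi-layer perceptron mapping 24 values to [0, 1]
    """
    x = preprocess(window)                       # step 1: normalize the pattern
    distances = []
    for centroid, covariance in prototypes:      # step 3: 12 x 2 distance values
        d1, d2 = two_value_distance(x, centroid, covariance)
        distances.extend([d1, d2])
    score = mlp(np.asarray(distances))           # step 4: trained classifier
    return score > 0.5                           # '1' means "face"

Step 2, the construction of the prototype clusters themselves, happens offline and is therefore not part of this per-window routine.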
The following sections describe our window pattern
classification algorithm in greater detail. Section 3 de-
scribes our distribution-based model for canonical face
window patterns. It explains the pre-processing and
clustering operations we perform for synthesizing "face"
and "non-face" window pattern prototypes. Section 4
describes our distance metric for computing classifier in-
put vectors. Section 5 discusses the classifier training
process, and in particular, our method of selecting a
comprehensive but tractable set of training examples.
In Section 6, we identify the algorithm's critical compo-
nents in terms of generating correct results, and evaluate
the system's overall performance.
3 A Distribution-based Face Model
Our window classification algorithm identifies faces by
comparing each new masked window pattern with a
canonical face model, and declaring a "face" when the
match is good. This section describes our canonical face
model. We use a distribution-based modeling scheme
that tries to represent canonical faces as the set of all
masked 19 x 19 pixel patterns that are canonical face
views. More specifically, our distribution-based mod-
eling scheme works as follows: Suppose we treat each
unmasked pixel of a 19 x 19 image as a vector dimen-
sion, then the class of all 19 x 19 pixel images forms
a vector space whose dimensionality equals the number
of unmasked image pixels. Each 19 x 19 image pattern
maps to a specific vector space location, and the set of
all 19 x 19 pixel canonical face patterns maps to a fixed
region in this multi-dimensional vector space. So, in the-
ory, one can model the class of all canonical face views by
identifying the portion of this multi-dimensional image
vector space that corresponds to canonical face patterns,
and representing the region in some tractable fashion.
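In code, mapping a masked window to a point in this vector space is just a flattening over the unmasked pixels, as the small numpy illustration below shows; the all-true mask used in the example is a stand-in for the actual mask of Figure 1(b).

import numpy as np

def to_vector(window, mask):
    """Map a 19x19 grey-level pattern to a point in the image window vector
    space: one dimension per unmasked pixel (283 dimensions for the paper's mask)."""
    return np.asarray(window, dtype=float)[mask]

mask = np.ones((19, 19), dtype=bool)    # placeholder mask keeping all 361 pixels
x = to_vector(np.zeros((19, 19)), mask) # here x.shape == (361,)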
3.1 Identifying and Representing the
Canonical Face Manifold
In practice, one does not have the set of all 19 x 19
pixel canonical face patterns to recover the sub-space
of canonical face views exactly. Figure 3 explains how
we localize and represent the region of canonical face
patterns with limited data. Basically, we use a reason-
ably large example database of canonical face patterns
and a carefully chosen database of non-face patterns to
infer the spatial extent of canonical face views in the
multi-dimensional image vector space. We assume that
our "face" database contains a sufficiently representative
sample of canonical face patterns that evenly populates
the actual vector sub-space of canonical face views. So
by building a model of the "face" sample distribution,
one can still obtain a coarse but fairly reliable represen-
tation of the actual canonical face manifold. Our "non-
face" data samples are specially selected patterns that lie
near the boundaries of the canonical face manifold. We
use the "non-face" patterns to help localize and refine the
boundaries of the canonical face manifold by explicitly
[Figure 2 block diagram: (a) Input Image → Compute Image Measurements → Face/Non-Face Classifier; (b) Test Pattern → Pre-process & Resize → "difference" measurements against the Canonical Face Model → Face/Non-Face Classifier.]
Figure 2: Overview of our face detection system. (a): The system's task at each scale. The image is divided into many
possibly overlapping windows. Each window pattern gets classified as either "a face" or "not a face", based on a set of
local image measurements. (b): The key components of the Face Pattern Recognizer block in greater detail. The Face
Pattern Recognizer block classifies new patterns as "faces" or "non-faces". The algorithm uses a distribution-based model
to represent the space of all possible canonical face patterns. For each new pattern to be classified, it computes a set of
"difference" measurements between the new pattern and the canonical face model. A trained classifier identifies the new
pattern as being either "a face" or "not a face", based on the set of "difference" measurements.
carving out regions around the "face" sample distribu-
tion that do not correspond to canonical face views. We
shall explain how we synthesize our special database of
"non-face" patterns in Section 5.2.
Our approach uses six prototype "face" clusters to
piece-wise approximate the multi-dimensional distribu-
tion of canonical face patterns, and six "non-face" clus-
ters to model the distribution of "non-face" patterns.
Qualitatively, one can view the six "face" clusters as a
coarse distribution-based representation of the canoni-
cal face manifold. The six "non-face" clusters help de-
fine boundaries in and around the manifold by carv-
ing out nearby regions in the vector space that do not
correspond to face patterns. Each prototype cluster is
a multi-dimensional Gaussian with a centroid location
and a covariance matrix that describes the local data
distribution. We adopt a piece-wise continuous mod-
eling scheme because we believe that the actual face
pattern manifold is continuous and smoothly varying in
our multi-dimensional image vector space - i.e., more of-
ten than not, a face pattern with minor spatial and/or
grey-level perturbations still looks like another valid face
pattern. Similarly, a non-face pattern with minor varia-
tions would most likely still appear as a non-face pattern.
The piecewise smooth modeling scheme serves two im-
portant functions. First, it performs generalization by
applying a prior smoothness assumption to the observed
data sample distribution. This results in a stored data
distribution function that is well defined even in regions
of the image vector space where no data samples have
been observed. Second, it serves as a tractable scheme
for representing an arbitrary data distribution by means
of a few Gaussian basis functions.
The bottom right image of Figure 3 shows the 12
prototype centroids in our canonical face model. The
six "face" prototypes are synthesized by clustering the
database of canonical face patterns, while the six "non-
face" prototypes are similarly derived from the database
of non-face patterns. The rest of this section describes
our process for synthesizing prototypes in greater detail.
3.2 Preprocessing
The first step of synthesizing prototypes is to normalize
the sample window patterns in the databases. Normal-
ization compensates for certain sources of image varia-
tion. It reduces the range of possible window patterns
the subsequent stages have to consider, thus making the
modeling and classification tasks easier. Our preprocess-
ing stage consists of the following sequence of image win-
dow operations:
1. Window resizing: Recall that our scheme per-
forms modeling and classification on 19 x 19 grey-
level window patterns. We choose a 19 x 19 window
size to keep the dimensionality of the window vec-
tor space manageably small, but also large enough
to preserve details that distinguish between "face"
and "non-face" window patterns. This operation
re-scales square window patterns of different sizes
to 19 x 19 pixels.
2. Masking: We use the 19 x 19 binary pixel mask
in Figure 1(b) to zero-out some near-boundary pix-
els of each window pattern. For "face" patterns,
these masked pixels usually correspond to back-
ground pixels irrelevant to the description of a face.
Zeroing out these pixels ensures that our model-
ing scheme does not wrongly encode any unwanted
background structure in our canonical face repre-
sentation. It also reduces the dimensionality of our
image window vector space, which helps to make
Figure 3: Our distribution-based canonical face model. Top Row: We use a representative sample of canonical face
patterns to approximate the volume of canonical face views in a masked 19 x 19 pixel image vector space. We model
the "face" sample distribution with 6 multi-dimensional Gaussian clusters. Center Row: We use a selection of non-
face patterns to help refine the boundaries of our Gaussian mixture approximation. We model the "non-face" sample
distribution with 6 Gaussian clusters. Bottom Row: Our final model consists of 6 "face" clusters and 6 "non-face"
clusters. Each cluster is defined by a centroid and a covariance matrix. The 12 centroids are shown on the right. Note:
All the distribution plots are fictitious and are shown only to help with our explanation. The 12 centroids are real.
the modeling and classification tasks a little more
tractable.
3. Illumination gradient correction: This oper-
ation subtracts a best-fit brightness plane from the
unmasked window pixels. For face patterns, it does
a fair job at reducing the strength of heavy shadows
caused by extreme lighting angles.
4. Histogram equalization: This operation ad-
justs for several geometry independent sources of
window pattern variation, including changes in il-
lumination brightness and differences in camera re-
sponse curves.
Notice that the same preprocessing steps must also
be applied to all new window patterns being classified at
runtime.
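The numpy sketch below is one plausible reading of this preprocessing chain, not the system's actual code: the resizing routine is arbitrary, the brightness plane is fitted by ordinary least squares over the unmasked pixels, and the histogram equalization is a simple rank-based form.

import numpy as np
from scipy.ndimage import zoom

def preprocess(window, mask):
    """Normalize a square grey-level window pattern as in Section 3.2.
    `mask` is a 19x19 boolean array; True marks the pixels that are kept."""
    window = np.asarray(window, dtype=float)

    # 1. Window resizing: re-scale the pattern to 19x19 pixels.
    w = zoom(window, (19.0 / window.shape[0], 19.0 / window.shape[1]))

    # 2. Masking: zero out near-boundary pixels irrelevant to the face.
    w = np.where(mask, w, 0.0)

    # 3. Illumination gradient correction: subtract a best-fit brightness
    #    plane a*row + b*col + c fitted to the unmasked pixels.
    rows, cols = np.nonzero(mask)
    A = np.column_stack([rows, cols, np.ones(len(rows))])
    coeffs, *_ = np.linalg.lstsq(A, w[rows, cols], rcond=None)
    rr, cc = np.mgrid[0:19, 0:19]
    plane = coeffs[0] * rr + coeffs[1] * cc + coeffs[2]
    w = np.where(mask, w - plane, 0.0)

    # 4. Histogram equalization over the unmasked pixels (rank-based).
    vals = w[mask]
    ranks = np.argsort(np.argsort(vals))
    w[mask] = ranks / max(len(vals) - 1, 1)
    return w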
3.3 Modeling the Distribution of "Face"
Patterns — Clustering for Positive
Prototypes
We use a database of 4150 normalized canonical "face"
patterns to synthesize 6 "face" pattern prototypes in
our multi-dimensional image vector space. The database
consists of 1067 real face patterns, obtained from several
different image sources. We artificially enlarge the orig-
inal database by adding to it slightly rotated versions of
the original face patterns and their mirror images. This
helps to ensure that our final database contains a reason-
ably dense and representative sample of canonical face
patterns.
Each pattern prototype is a multi-dimensional Gaus-
sian cluster with a centroid location and a covariance
matrix that describes the local data distribution around
the centroid. Our modeling scheme approximates the
observed "face" sample distribution with only a small
number of prototype clusters because the sample size is
still very small compared to the image vector space di-
mensionality (4150 data samples in a 283 dimensional
masked image vector space). Using a large number of
prototype clusters could lead to overfitting the observed
data distribution with poor generalization results. Our
choice of 6 pattern prototypes is somewhat arbitrary.
The system's performance, i.e. its face detection rate
versus the number of false alarms, does not change much
with slightly fewer or more pattern prototypes.
We use a modified version of the k-means clustering
algorithm to compute 6 face pattern centroids and their
cluster covariance matrices from the enlarged database
of 4150 face patterns. Our clustering algorithm differs
from the traditional k-means algorithm in that it fits
directionally elongated (i.e. elliptical) Gaussian distri-
butions to the data sample instead of isotropic Gaussian
distributions. We adopt a non-isotropic mixture model
because we believe the actual face data distribution may
in fact be locally more elongated along certain vector
space directions than others.
To implement our elliptical k-means clustering algo-
rithm, we use an adaptively changing normalized Maha-
lanobis distance metric instead of a standard Euclidean
distance metric to partition the data sample into clus-
ters. The normalized Mahalanobis distance metric re-
Figure 4: An example of a naturally occurring "non-face"
pattern that resembles a face. Left: The pattern viewed
in isolation. Right: The pattern viewed in the context of
its environment.
turns a directionally dependent distance value between a new pattern $x$ (in column vector form) and a Gaussian cluster centroid $\mu$ (also a column vector). It has the form:
$$\mathcal{M}_n(x,\mu) = \frac{1}{2}\left(d\ln 2\pi + \ln|\Sigma| + (x-\mu)^T\Sigma^{-1}(x-\mu)\right),$$
where $\Sigma$ is the covariance matrix that encodes the cluster's shape and directions of elongation.
Mahalanobis distance differs from the Euclidean distance
in that it reduces the penalty of pattern differences along
a cluster's major directions of data distribution. This
penalty adjustment feature allows the clustering algo-
rithm to form elliptical clusters instead of spherical clus-
ters where there is a non-isotropic local data distribu-
tion. Section 4.2 explains the normalized Mahalanobis
distance metric in greater detail.
The following is a crude outline of our elliptical k-
means clustering algorithm:
1. Obtain k (6 in our case) initial pattern centers by
performing vector quantization with Euclidean dis-
tances on the enlarged face database. Divide the
data set into k partitions (clusters) by assigning
each data sample to the nearest pattern center in
Euclidean space.
2. Initialize the covariance matrices of all k clusters to
be the identity matrix.
3. Re-compute pattern centers to be the centroids of
the current data partitions.
4. Using the current set of k pattern centers and their
cluster covariance matrices, re-compute data parti-
tions by re-assigning each data sample to the near-
est pattern center in normalized Mahalanobis dis-
tance space. If the data partitions remain un-
changed or if the maximum number of inner-loop
(i.e. Steps 3 and 4) iterations has been exceeded,
proceed to Step 5. Otherwise, return to Step 3.
5. Re-compute the covariance matrices of all k clusters
from their respective data partitions.
6. Using the current set of k pattern centers and their
cluster covariance matrices, re-compute data parti-
tions by re-assigning each data sample to the near-
est pattern center in normalized Mahalanobis dis-
tance space. If the data partitions remain un-
changed or if the maximum number of outer-loop
Figure 5: An illustration on how "negative" prototypes can help model concavities in a distribution with fewer Gaussian
basis functions. Left: A hypothetical data distribution in 2 dimensions with a concave surface. Center: The
distribution can be reasonably approximated with one positive Gaussian prototype encircling the distribution and one
negative prototype carving out the concave region. Right: To model the same distribution with only positive prototypes,
one has to use more Gaussian basis functions. The difference in number of basis functions can be a lot larger in a higher
dimensional space.
(i.e. Steps 3 to 6) iterations has been exceeded,
proceed to Step 7. Otherwise, return to Step 3.
7. Return the current set of k pattern centers and their
cluster covariance matrices.
The inner loop (i.e. Steps 3 and 4) is analogous to the
traditional k-means algorithm. Given a fixed distance
metric, it finds a set of k pattern prototypes that stably
partitions the sample data set. Our algorithm differs
from the traditional k-means algorithm because of Steps
5 and 6 in the outer loop, where we try to iteratively
refine and recover the cluster shapes as well.
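A condensed Python rendering of the outline is given below. The initialization, the small regularization term added to the covariance estimates, and the iteration limits are our own simplifications; the assignment step uses the normalized Mahalanobis distance of Section 4.2, which with identity covariances reduces to a Euclidean assignment and so doubles as the Step 1 initialization here.

import numpy as np

def normalized_mahalanobis(x, mu, cov):
    """0.5 * (d ln 2*pi + ln|cov| + (x - mu)^T cov^{-1} (x - mu))."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi) + logdet
                  + diff @ np.linalg.solve(cov, diff))

def elliptical_kmeans(data, k=6, inner_iters=20, outer_iters=5, seed=0):
    """Cluster `data` (n_samples x n_dims) into k directionally elongated
    Gaussian clusters; returns (centers, covariance matrices)."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    centers = data[rng.choice(n, size=k, replace=False)].astype(float)  # Step 1
    covs = [np.eye(d) for _ in range(k)]                                # Step 2
    labels = -np.ones(n, dtype=int)
    for _ in range(outer_iters):                     # Steps 3 to 6
        for _ in range(inner_iters):                 # Steps 3 and 4
            new_labels = np.array([
                np.argmin([normalized_mahalanobis(x, centers[j], covs[j])
                           for j in range(k)]) for x in data])
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
            for j in range(k):                       # centers = partition centroids
                members = data[labels == j]
                if len(members):
                    centers[j] = members.mean(axis=0)
        for j in range(k):                           # Step 5: re-estimate covariances
            members = data[labels == j]
            if len(members) > 1:
                covs[j] = np.cov(members, rowvar=False) + 1e-6 * np.eye(d)
    return centers, covs                             # Step 7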
3.4 Modeling the Distribution of "Non-Face"
Patterns — Clustering for Negative
Prototypes
There are many naturally occurring "non-face" patterns
in the real world that look like faces when viewed in iso-
lation. Figure 4 shows one such example. In our multi-
dimensional image window vector space, these "face-
like" patterns lie along the boundaries of the canonical
face manifold. Because we are coarsely representing the
canonical face manifold with 6 Gaussian clusters, some
of these face-like patterns may even be located nearer the
"face" cluster centroids than some real "face" patterns.
This may give rise to misclassification problems, because
in general, we expect the opposite to be true, i.e. face
patterns should lie nearer the "face" cluster centroids
than non-face patterns.
In order to avoid possible misclassification, we try to
obtain a comprehensive sample of these face-like patterns
and explicitly model their distribution using 6 "non-
face" prototype clusters. These "non-face" clusters carve
out negative regions around the "face" clusters that do
not correspond to face patterns. Each time a new win-
dow pattern lies too close to a "non-face" prototype, we
favor a "non-face" hypothesis even if the pattern also lies
near a "face" prototype. Our choice of using exactly 6
pattern prototypes to model the "non-face" distribution
is also somewhat arbitrary, as in the case of modeling the
"face" pattern distribution. The system's face detection
rate versus false alarm ratio does not change much with
slightly fewer or more "non-face" pattern prototypes.
We use our elliptical k-means clustering algorithm to
obtain 6 "non-face" prototypes and their cluster covari-
ance matrices from a database of 6189 face-like patterns.
The database was incrementally generated in a "boot-
strap" fashion by first building a reduced version of our
face detection system with only "face" prototypes, and
collecting all the false positive patterns it detects over a
large set of random images. Section 5.2 elaborates fur-
ther on our "boot-strap" data generation technique.
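Schematically, the boot-strap collection amounts to the loop below. scan_for_faces is the scanning helper sketched in Section 2, faces_only_classifier and extract_window are placeholders, and the 6189-pattern budget is simply the database size quoted above.

def bootstrap_nonfaces(face_free_images, scan_for_faces, faces_only_classifier,
                       extract_window, max_patterns=6189):
    """Collect face-like "non-face" patterns: run a reduced detector that
    knows only the "face" prototypes over images containing no faces, and
    keep every window it (wrongly) reports as a face."""
    collected = []
    for image in face_free_images:
        for (row, col, size) in scan_for_faces(image, faces_only_classifier):
            collected.append(extract_window(image, row, col, size))  # false positive
            if len(collected) >= max_patterns:
                return collected
    return collected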
3.5 Summary and Remarks
One of the key difficulties in face detection is to account
for the wide range of permissible face pattern variations
that cannot be adequately captured by geometric mod-
els, correlation templates and other classical parametric
modeling techniques. Our approach addresses this prob-
lem by modeling the distribution of frontal face patterns
directly in a fixed 19 x 19 pixel image vector space. We
infer the actual distribution of face patterns in the im-
age vector space from a sufficiently representative set of
face views, and a carefully chosen set of non-face pat-
terns. The distribution-based modeling scheme statis-
tically encodes observable face pattern variations that
cannot otherwise be easily parameterized by classical
modeling techniques. To generalize from the slow vary-
ing nature of the face pattern distribution in the image
vector space, and to construct a tractable model for the
face pattern distribution, we interpolate and represent
the observed distribution in a piecewise-smooth fashion
using a few multi-dimensional Gaussian clusters.
Our final distribution-based model consists of 6 "face"
clusters for coarsely approximating the canonical face
pattern manifold in the image vector space, and 6 "non-
face" clusters for representing the distribution of a spe-
cially selected "non-face" data sample. The non-face
patterns come from non-face regions in the image vector
space near the canonical face manifold. We propose two
Figure 6: The measurements returned from matching a test pattern against our distribution-based model. (a): Each
set of measurements is a vector of 12 distances between the test pattern and the model's 12 cluster centroids. (b): Each
distance measurement between the test pattern and a cluster centroid is a 2-Value distance metric. The first component is a
Mahalanobis distance between the test pattern's projection and the cluster centroid in a subspace spanned by the cluster's
75 largest eigenvectors. The second component is the Euclidean distance between the test pattern and its projection in
the subspace. So, the entire set of 12 distance measurements is a vector of 24 values.
possibly related ways of reasoning about the 6 "non-face"
pattern clusters:
1. The 6 "non-face" clusters explicitly model the dis-
tribution of face-like patterns that are not actual
faces — i.e. "near miss" patterns that the face de-
tection system should classify as "non-faces" . They
define and represent a separate "near-miss" pattern
class in the multi-dimensional vector space, just like
how the 6 "face" clusters encode a "canonical face"
pattern class. Test patterns that fall under this
"near-miss" class are classified as non-faces.
2. The 6 "non-face" clusters carve out regions in and
around the "face" clusters that should not corre-
spond to face patterns. They help shape the bound-
aries in our Gaussian mixture model of the canoni-
cal face manifold. More interestingly, they can also
help one model concavities in the canonical face
manifold with fewer Gaussian basis functions (see
Figure 5). This advantage of using "non-face" clus-
ters becomes especially critical when one has too
few face data samples to model a concave distribu-
tion region with a large number of Gaussian "face"
clusters.
4 Matching Patterns with the Model
To detect faces, our system first matches each candi-
date window pattern in an input image against our
distribution-based canonical face model. Each match
returns a set of measurements that captures a notion
of "similarity" or "difference" between the test pattern
and the face model. At a later stage, a trained classifier
determines, based on the set of "match" measurements,
whether or not the test pattern is a frontal face view.
This section describes the set of "match" measure-
ments we compute for each new window pattern. Each
set of measurements is a vector of 12 distances between
the test window pattern and the model's 12 cluster cen-
troids in our multi-dimensional image vector space (see
Figure 6(a)). One way of interpreting our vector of dis-
tances is to treat each distance measurement as the test
pattern's actual distance from some key reference loca-
tion on or near the canonical face pattern manifold. The
set of all 12 distances can then be viewed as a crude "dif-
ference" notion between the test pattern and the entire
"canonical face" pattern class.
4.1 A 2-Value Distance Metric
We now define a new metric for measuring distance
between a test pattern and a prototype cluster centroid.
Ideally, we want a metric that returns small distances
between face patterns and the "face" prototypes, and ei-
ther large distances between non-face patterns and the
"face" prototypes, or small distances between non-face
patterns and the "non-face" centers. We propose one
such measure that normalizes distances by taking into
account both the local data distribution around a proto-
type centroid, and the reliability of the local distribution
estimate. The measure scales up distances orthogonal
to a cluster's major axes of elongation, and scales down
distances along the cluster's major elongation directions.
We argue that such a distance measure should meet our
ideal goals reasonably well, because we expect most face
patterns on the face manifold to lie along the major axes
of one or more "face" clusters, and hence register small
distances with one or more "face" prototypes. Similarly,
we expect most non-face patterns to lie either far away
from all the "face" clusters or along the major axes of
Figure 7: Graphs of eigenvalues in decreasing order of magnitude. Left: Decreasing eigenvalues for 6 "Face" clusters.
Right: Decreasing eigenvalues for 6 "Non-Face" clusters.
one or more "non-face" clusters, and hence either regis-
ter large distances with all the "face" prototypes or small
distances with some "non-face" prototypes.
Our proposed distance measure consists of two out-
put components (see Figure 6(b)). The first value is
a Mahalanobis-like distance between the test pattern
and the prototype centroid, defined within a lower-
dimensional sub-space of our masked 19 x 19 pixel image
vector space. The sub-space is spanned by the 75 largest
eigenvectors of the prototype cluster. This distance com-
ponent is directionally weighted to reflect the test pat-
tern's location relative to the major elongation direc-
tions in the local data distribution around the proto-
type center. The second value is a normalized Euclidean
distance between the test pattern and its projection in
the lower-dimensional sub-space. This is a uniformly
weighted distance component that accounts for pattern
differences not included in the first component due to
possible modeling inaccuracies. We elaborate further on
the two components below.
4.2 The Normalized Mahalanobis Distance
We begin with a brief review of the Mahalanobis dis-
tance. Let $x$ be a column vector test pattern, $\mu$ be a column vector prototype centroid, and $\Sigma$ be the covariance matrix describing the local data distribution near the centroid. The Mahalanobis distance between the test pattern and the prototype centroid is given by:
$$\mathcal{M}(x,\mu) = (x-\mu)^T\Sigma^{-1}(x-\mu).$$
Geometrically, the Mahalanobis distance can be inter-
preted as follows. If one models the prototype cluster
with a best-fit multi-dimensional Gaussian distribution,
then one gets a best-fit Gaussian distribution centered
at $\mu$ with covariance matrix $\Sigma$. All points at a given Mahalanobis distance from $\mu$ occupy a constant density surface in this multi-dimensional vector space. This interpretation of the Mahalanobis distance leads to a closely related distance measure, which we call the normalized Mahalanobis distance, given by:
$$\mathcal{M}_n(x,\mu) = \frac{1}{2}\left(d\ln 2\pi + \ln|\Sigma| + (x-\mu)^T\Sigma^{-1}(x-\mu)\right).$$
Here, $d$ is the vector space dimensionality and $|\Sigma|$ denotes the determinant of $\Sigma$.
The normalized Mahalanobis distance is simply the
negative natural logarithm of the best-fit Gaussian dis-
tribution described above. It is normalized in the sense
that it originates directly from a probability density
distribution that integrates to unity. Our modified k-
means clustering algorithm in Section 3 uses the normal-
ized Mahalanobis distance metric instead of the standard
form because of stability reasons. Notice that for a given
spatial displacement between a test pattern and a proto-
type centroid, the standard Mahalanobis distance tends
to be smaller for long clusters with large covariance ma-
trices than for small clusters. So, if one uses a standard
Mahalanobis distance metric to perform clustering, the
larger clusters will constantly increase in size and even-
tually overwhelm all the smaller clusters.
As a distance metric for establishing the class iden-
tity of test patterns, the Mahalanobis distance and its
normalized form are both intuitively pleasing for the fol-
lowing reason: They both measure pattern differences in
a distribution dependent manner, unlike the Euclidean
distance which measures pattern differences in an abso-
lute sense. More specifically, the distances they mea-
sure are indicative of how the test pattern ranks relative
to the overall location of other known patterns in the
pattern class. In this sense, they capture very well the
notion of "similarity" or "dis-similarity" between a test
pattern and a pattern class.
4.3 The First Distance Component —
Distance within a Normalized
Low-Dimensional Mahalanobis Subspace
As mentioned earlier, our distance metric consists of
2 output values as shown in Figure 6(b). The first
value, $\mathcal{D}_1$, is a Mahalanobis-like distance between the
Figure 8: The eigen-image model of faces by Turk and Pentland [19] [12]. The approach represents the space of all
face views (i.e. the "Face Space") as a linear combination of several orthonormal eigen-images. The eigen-images are
computed by performing PCA on an example database of face patterns. The modeling scheme assumes that all face views
lie within the "face space" , and each face view corresponds to a set of coefficients that describes its appearance as a linear
combination of the eigen-images. One can use the coefficients as features to recognize faces of different individuals. One
can also determine whether or not a test pattern is a face by computing and thresholding d, a measure of how badly the
eigen-images reconstruct the test pattern.
test pattern and the prototype center. This distance is
defined within a 75-dimensional sub-space of our masked
19 x 19 pixel image vector space, spanned by the 75
largest eigenvectors of the current prototype cluster. Ge-
ometrically, we can interpret the first distance value as
follows: The test pattern is first projected onto the 75
dimensional vector sub-space. We then compute the nor-
malized Mahalanobis distance between the test pattern's
projection and the prototype center as the first output
value. Like the standard Mahalanobis distance, this first
value locates and characterizes the test pattern relative
to the cluster's major directions of data distribution.
Mathematically, the first value is computed as follows:
Let $x$ be the column vector test pattern, $\mu$ be the prototype pattern, $E_{75}$ be a matrix with 75 columns, where column $i$ is a unit vector in the direction of the cluster's $i$th largest eigenvector, and $W_{75}$ be a diagonal matrix of the corresponding 75 largest eigenvalues. The covariance matrix for the cluster's data distribution in the 75 dimensional subspace is given by $\Sigma_{75} = E_{75} W_{75} E_{75}^T$, and the first distance value is:
$$\mathcal{D}_1(x,\mu) = \frac{1}{2}\left(75\ln 2\pi + \ln|\Sigma_{75}| + (x-\mu)^T\Sigma_{75}^{-1}(x-\mu)\right).$$
We emphasize again that this first value is not a "com-
plete" distance measure in our masked 19 x 19 pixel im-
age vector space, but rather a "partial" distance measure
in a subspace of the current cluster's 75 most signifi-
cant eigenvectors. The measure is not "complete" in the
sense that it does not account for pattern differences in
certain image vector space directions, namely differences
orthogonal in direction to the cluster's 75 most signifi-
cant eigenvectors. We omit the smaller eigenvectors in
this directionally dependent distance measure because
we believe that their corresponding eigenvalues may be
significantly inaccurate due to the small number of data
samples available to approximate each cluster. For in-
stance, we have, on the average, fewer than 700 data
points to approximate each "face" cluster in a 283 di-
mensional masked image vector space. Using the smaller
eigenvectors and eigenvalues to compute a distribution
dependent distance can therefore easily lead to meaning-
less results.
Figure 7 plots the eigenvalues of our 12 prototype clus-
ters in decreasing order of magnitude. Notice that for all
12 curves, only a small number of eigenvectors are sig-
nificantly larger than the rest. We use the following cri-
terion to arrive at 75 "significant" eigenvectors for each
cluster: We eliminate from each cluster the maximum
possible number of trailing eigenvectors, such that the
sum of all eliminated eigenvalues is still smaller than the
largest eigenvalue in the cluster. This criterion leaves
us with approximately 75 eigenvectors for each cluster,
which we standardize at 75 for the sake of simplicity.
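In numpy, one reading of this truncation rule is a cumulative-sum test on the sorted eigenvalues; the code below is our phrasing of the criterion, not code from the system.

import numpy as np

def count_significant(eigenvalues):
    """Keep the smallest number of leading eigenvectors such that the sum of
    all eliminated (trailing) eigenvalues stays below the largest eigenvalue."""
    w = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # decreasing order
    tail = np.cumsum(w[::-1])[::-1]          # tail[i] = w[i] + w[i+1] + ...
    candidates = np.nonzero(tail < w[0])[0]  # indices i where dropping w[i:] is allowed
    return int(candidates[0]) if len(candidates) else len(w)

Applied to the eigenvalue curves of Figure 7, the rule keeps roughly 75 eigenvectors per cluster, which the paper standardizes at exactly 75.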
4.4 The Second Distance Component —
Distance from the Low-Dimensional
Mahalanobis Subspace
The second output value of our distance metric is a stan-
dard Euclidean distance between the test pattern and
its projection in the 75 dimensional sub-space. This
distance component accounts for pattern differences not
Figure 9: The view-based model of volcanos by Burl et
al. [4]. The approach assumes that the set of all volcano
patterns lies within a low dimensional model space, spanned
by a small number of the largest principal component vec-
tors from a volcano image database. Each pattern in the
model space corresponds to a set of coefficients that de-
scribes the pattern's appearance as a linear combination of
the principal component images. Not all linear combina-
tions of principal component images are volcano patterns,
so one can detect volcanos by learning from examples the
range of coefficient values that corresponds to volcanos.
captured by the first component, namely pattern differ-
ences in the smaller eigenvector directions. Because we
may not have a reasonable estimate of the smaller eigen-
values, we simply assume an isotropic Gaussian sample
data distribution in the smaller eigenvector directions,
and hence a Euclidean distance measure.
Using the notation from the previous sub-section, we can show that the second component is simply the $L_2$ norm of the displacement vector between $x$ and its projection $x_p$:
$$\mathcal{D}_2(x,\mu) = \|x - x_p\| = \|(I - E_{75}E_{75}^T)(x-\mu)\|.$$
Notice that on its own, the second distance value is
also not a "complete" distance measure in our masked
19 x 19 pixel image vector space. Specifically, it does
not account for pattern differences in the subspace of
the cluster's 75 most significant eigenvectors. It com-
plements the first output value to form a "complete"
distance measure that captures pattern differences in all
directions of our masked 19 x 19 pixel image vector space.
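Putting the two components together, a numpy sketch of the 2-Value distance for a single prototype cluster might read as follows; the eigen-decomposition bookkeeping is our own, and the subspace dimension is fixed at h = 75 as in the paper.

import numpy as np

def two_value_distance(x, mu, cov, h=75):
    """Return (D1, D2) between a test pattern x and a Gaussian cluster with
    centroid mu and covariance cov, following Sections 4.3 and 4.4."""
    x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:h]
    W = eigvals[idx]                                 # h largest eigenvalues
    E = eigvecs[:, idx]                              # corresponding eigenvectors
    diff = x - mu
    y = E.T @ diff                                   # coordinates inside the subspace
    # D1: normalized Mahalanobis distance within the h-dimensional subspace.
    d1 = 0.5 * (h * np.log(2.0 * np.pi) + np.sum(np.log(W)) + np.sum(y * y / W))
    # D2: Euclidean distance from x to its projection onto the subspace,
    # i.e. ||(I - E E^T)(x - mu)||.
    d2 = np.linalg.norm(diff - E @ y)
    return d1, d2

Each of the 12 prototype clusters contributes one such (D1, D2) pair, giving the 24-value vector that is fed to the trained classifier.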
4.5 Relationship between our 2-Value Distance
and the Mahalanobis Distance
There is an interesting relationship between our 2-Value
distance metric and the "complete" normalized Maha-
lanobis distance involving all 283 eigenvectors of a pro-
totype cluster.¹ Recall from Section 4.2 that the Ma-
halanobis distance and its normalized form arise from
fitting a multi-dimensional full-covariance Gaussian to a
sample data distribution. For high dimensional vector
spaces, modeling a distribution with a full-covariance
Gaussian is often not feasible because one usually has
too few data samples to recover the covariance matrix
accurately. In a d dimensional vector space, each full-
covariance Gaussian requires d parameters to define its
centroid and another d(d + 1)/2 more parameters to de-
fine its covariance matrix. For d = 283, this amounts to
40469 parameters!
One way of reducing the number of model parameters
is to use a restricted Gaussian model with a diagonal co-
variance matrix, i.e., a multi-dimensional Gaussian dis-
tribution whose elongation axes are aligned to the vec-
tor space axes. In our domain of 19 x 19 pixel images,
this corresponds to a model whose individual pixel values
may have different variances, but whose pair-wise pixel
values are all uncorrelated. Clearly, this is a very poor
model for face patterns which are highly structured with
groups of pixels having very highly correlated intensities.
An alternative way of simplifying the Gaussian model
is to preserve only a small number of "significant" eigen-
vectors in the covariance matrix. One can show that a d dimensional Gaussian with h principal components has a covariance matrix of only h(2d − h + 1)/2 free parameters. This can easily be a tractable number of model parameters if h is small. Geometrically, the operation corresponds to projecting the original d dimensional Gaussian cluster onto an h dimensional subspace, spanned by the cluster's h most "significant" elongation directions. The h dimensional subspace projection preserves the most prominent correlations between pixels in the target pattern class. To exploit these pixel-wise correlations for pattern classification, one computes a directionally weighted Mahalanobis distance between the test pattern's projection and the Gaussian centroid in this h dimensional subspace — D1 of our 2-Value distance metric.
The orthogonal subspace is spanned by the d — h re-
maining eigenvectors of the original Gaussian cluster.
Because this subspace encodes pixel correlations that are
less prominent and possibly less reliable due to limited
training data, we simply assume an isotropic Gaussian
distribution of data samples in this subspace, i.e. a di-
agonal covariance matrix Gaussian with equal variances
along the diagonal. The isotropic Gaussian distribu-
tion requires only 1 free parameter to describe its vari-
ance. To measure distances in this subspace of isotropic data distribution, we use a directionally independent Euclidean distance — D2 of our 2-Value distance metric.
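To make the two components concrete, here is a minimal sketch in Python/NumPy of how one might compute a (D1, D2) pair for one prototype cluster. The function name and the use of a plain (un-normalized) Mahalanobis quadratic form are our own simplifications for illustration; the system's actual D1 is the normalized variant defined earlier in the paper.

    import numpy as np

    def two_value_distance(x, centroid, eigvecs, eigvals, h=75):
        # x, centroid: masked image vectors of dimension d (= 283 here).
        # eigvecs: (d, d) matrix whose columns are the cluster's unit eigenvectors,
        #          sorted by decreasing eigenvalue; eigvals: the d eigenvalues.
        # Only the leading h directions are trusted; the rest are treated isotropically.
        diff = x - centroid
        E_h = eigvecs[:, :h]
        coeffs = E_h.T @ diff                      # coordinates within the h-dim subspace
        # D1: directionally weighted (Mahalanobis-style) distance inside the subspace.
        d1 = float(np.sum(coeffs ** 2 / eigvals[:h]))
        # D2: Euclidean distance between x and its projection onto the subspace.
        x_p = centroid + E_h @ coeffs
        d2 = float(np.linalg.norm(x - x_p))
        return d1, d2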
We can thus view our 2-Value distance metric as a robust approximate Mahalanobis distance that one uses when there is insufficient training data to accurately recover all the eigenvectors and eigenvalues of a Gaussian model. The approximation degrades gracefully by using its limited degrees of freedom to capture the most prominent pixel correlations in the data distribution. Notice that as the data sample size increases, one can preserve a larger number of principal components in the Gaussian model. This results in D1 becoming increasingly like the "complete" Mahalanobis distance, with D2 vanishing in size and importance.

Figure 10: One of the multi-layer perceptron (MLP) net architectures we trained to identify face patterns from our vector of distance measurements. The net has 12 pairs of input units to accept the 12 pairs of distance values from our matching stage. When trained, the output unit returns a 1 for face patterns and a 0 otherwise. Notice that this net has one special layer of hidden units that combines each 2-Value distance pair into a single distance value. Our experiments show that the detailed net architecture is really not crucial in this application.
4.6 Relationship between our 2-Value Distance and some PCA-based Classical Distance Measures
Our 2-Value distance metric is also closely related to the
classical distance measures used by principal components
analysis (PCA, also called Karhunen-Loeve expansion)
based classification schemes [5] [6] [18]. We examine two
such distance measures below.
A standard way of modeling a 3D object for pattern
recognition is to represent its space of 2D views using a
few basis images, obtained by performing principal com-
ponents analysis on a comprehensive set of 2D views.
The modeling approach assumes that all possible 2D
views of the target object lie within a low dimensional
sub-space of the original high dimensional image space.
The low dimensional sub-space is linearly spanned by
the principal components. Each new view of the target
object can be represented by a set of coefficients that
describes the view as a linearly weighted superposition
of the principal components.
Kirby and Sirovich [16] [9] have used principal compo-
nents analysis methods to build low-dimensional mod-
els for characterizing the space of human faces. As a
representative recent example, let us consider Turk and
Pentland's application of PCA techniques to face recog-
nition [19] and facial feature detection [12]. To opti-
mally represent their "face space", Turk and Pentland
compute an orthonormal set of eigen-images from an ex-
ample database of face patterns to span the space. New
face views are represented by projecting their image pat-
terns onto the set of eigen-images to obtain their linear
combination coefficients. Because the technique assumes
that all face patterns actually lie within the "face space",
one can use the set of linear combination coefficients as
features to parameterize and recognize faces of different
individuals (see Figure 8). Similarly, one can determine
whether or not a given pattern is a face by measuring
how well or poorly the eigen-images reconstruct the pat-
tern — i.e., the pattern's distance (d in Figure 8) from
the "face space" [12].
In another recent application of classical techniques, Burl et al. [4] have used a different PCA-based approach for a terrain classification task of finding volcanos in SAR images of Venus. Their system builds a low dimensional view-based volcano model, parameterized by a small number of the largest principal component vectors from an example set of volcano images (see Figure 9). Within this low-dimensional model space, Burl et al. further observe that not all reconstructible patterns are actual volcano views; i.e., not all linear combinations of principal component images are volcano patterns. Thus, to detect volcanos, they use the set of linear combination coefficients as classification features, and learn from examples the range of coefficient values, i.e., distances within the volcano model space, that actually correspond to volcano patterns.
One can relate our 2-Value distance metric to the above PCA-based classification techniques as follows: if we treat each Gaussian cluster in our distribution-based model as locally representing a sub-class of "face" or "non-face" patterns, then the cluster's 75 largest eigenvectors can be viewed as a set of basis images that linearly span the sub-class. Each new pattern in the sub-class corresponds to a set of 75 coefficients in the model space.
Our distance component D2 is the Euclidean distance between a test pattern and the cluster's 75 largest eigenvector subspace. Assuming that the 75 eigenvectors are
Figure 11: Artificially generating virtual examples of face patterns (original, rotated by 5 and −5 degrees, mirrored, and mirrored with 5 and −5 degree rotations). For each original face pattern in our training database, we can generate a few new face patterns using some simple image transformations. We artificially enlarge our face database in this fashion to obtain a more comprehensive sample of face patterns.
the "basis views" that span our target sub-class, this dis-
tance component measures how poorly the basis views
reconstruct the test pattern, and hence how "different"
the test pattern is relative to the entire target sub-class.
This distance is analogous to Turk and Pentland's dis-
tance to "face space" metric (i.e. d in Figure 8), which
quantifies the "difference" between a test pattern and
the class of face views.
The distance component D1 is a Mahalanobis distance between the cluster centroid and a test pattern's projection in the 75 largest eigenvector subspace. Here, we implicitly assume, like Burl et al., that not all reconstructible patterns in our 75 eigenvector model space are views of the target sub-class, and that all views of the target sub-class are located within a certain Mahalanobis distance from the cluster centroid. This constrains the range of model coefficients that correspond to actual target patterns, just as Burl et al. constrain, by learning from examples, the range of model coefficients that correspond to actual volcano views.
Thus, our D1 and D2 distance measurements correspond to two classification criteria that have each been used, separately, in classical and recent work. As far as we know, this paper presents the first attempt to combine these two measures for pattern classification.
5 The Classifier
The classifier's task is to identify "face" test patterns
from "non-face" patterns based on their match feature
vectors of 12 distance measurements. Our approach
treats the classification stage as one of learning a func-
tional mapping from input feature distance vectors to
output classes using a representative set of training ex-
amples.
5.1 A Multi- Layer Perceptron Classifier
We use a Multi-Layer Perceptron (MLP) net to perform
the desired classification task. The net has 12 pairs of
input terminals, one output unit and 24 hidden units.
Each hidden and output unit computes a weighted sum
of its input links and performs sigmoidal thresholding
on its output. During classification, the net is given a
vector of the current test pattern's distance measure-
ments to the 12 prototype centers. Each input terminal
pair receives the distance values for a designated proto-
type pattern. The output unit returns a '1' if the input
distance vector arises from a "face" pattern, and a '0'
otherwise.
In our current system, the hidden units are partially connected to the input terminals and output unit as shown in Figure 10. The connections exploit some prior knowledge of the problem domain. Notice in particular that there is one layer of hidden units that combines each 2-Value distance pair into a single distance value. Our experiments in the next section will show that the number of hidden units and network connectivity structure do not significantly affect the classifier's performance.
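For concreteness, the following Python/NumPy sketch shows a forward pass through a classifier of this general shape. The class name, the weight initialization, and the particular split of the 24 hidden units into a pair-combining layer of 12 units followed by a second layer of 12 units are our own assumptions for illustration; this is not the trained network.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class PairCombiningMLP:
        def __init__(self, n_pairs=12, n_hidden=12, seed=0):
            rng = np.random.default_rng(seed)
            # One hidden unit per prototype combines that prototype's (D1, D2) pair.
            self.w_pair = rng.normal(size=(n_pairs, 2))
            self.b_pair = np.zeros(n_pairs)
            # A second hidden layer mixes the 12 combined distances.
            self.w_hid = rng.normal(size=(n_hidden, n_pairs))
            self.b_hid = np.zeros(n_hidden)
            self.w_out = rng.normal(size=n_hidden)
            self.b_out = 0.0

        def forward(self, distances):
            # distances: array of shape (12, 2), one (D1, D2) pair per prototype.
            combined = sigmoid(np.sum(self.w_pair * distances, axis=1) + self.b_pair)
            hidden = sigmoid(self.w_hid @ combined + self.b_hid)
            return sigmoid(self.w_out @ hidden + self.b_out)  # near 1 for "face"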
We train our multi-layer perceptron classifier on fea-
ture distance vectors from a database of 47316 window
patterns. There are 4150 positive examples of "face"
patterns in the database and the rest are "non-face"
patterns. The net is trained with a standard back-
propagation learning algorithm [14] until the output er-
ror stabilizes at a very small value. For the particular
multi-layer perceptron net architecture in Figure 10, we
actually get a training output error of 0.
5.2 Generating and Selecting Training
Examples
In many example-based learning applications, how well
a learner eventually performs on its task depends heav-
ily on the quality of examples it receives during training.
An ideal learning scenario would be to give the learner
as large a set of training examples as possible, in order
to attain a comprehensive sampling of the input space.
Unfortunately, there are some real-world considerations
that could seriously limit the size of training databases,
such as shortage of free disk space and computation re-
source constraints.
How do we build a comprehensive but tractable database of "face" and "non-face" patterns? For "face" patterns, the task at hand seems rather straightforward. We simply collect all the frontal views of faces we can find in mugshot databases and other image sources. Because we do not have access to many mugshot databases and the sizes of our mugshot databases are all fairly small, we do not encounter the problem of having to deal with an unmanageably large number of "face" patterns. In fact, to make our set of "face" patterns more comprehensive, we even artificially enlarged our data set by adding virtual examples [13] of faces to the "face" database. These virtual examples are mirror images and slightly rotated
versions of the original face patterns, as shown in Figure 11.
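A minimal sketch of this kind of augmentation, using NumPy and SciPy; the function name, the interpolation order, and the border handling are our own illustrative choices, not the system's actual preprocessing.

    import numpy as np
    from scipy.ndimage import rotate

    def virtual_examples(face, angles=(5.0, -5.0)):
        # face: a 2-D grey-level window (e.g. 19 x 19 pixels).
        mirrored = np.fliplr(face)
        examples = [face, mirrored]
        for angle in angles:
            # Small in-plane rotations of the original and its mirror image.
            examples.append(rotate(face, angle, reshape=False, mode='nearest'))
            examples.append(rotate(mirrored, angle, reshape=False, mode='nearest'))
        return examples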
Figure 12: Some false detects by an earlier version of our face detection system, marked by solid squares on the two images above. We use these false detects as new negative examples to re-train our system in a "boot-strap" fashion.
For "non-face" patterns, the task we have seems more
tricky. In essence, every square non-canonical face win-
dow pattern of any size in any image is a valid "non-face"
pattern. Clearly, even with a few source images, our set
of "non-face" patterns can grow intractably large if we
are to include all valid "non-face" patterns in our train-
ing database.
To constrain the number of "non-face" examples in
our database, we use a "boot-strap" strategy that in-
crementally selects only those "non-face" patterns with
high information value. The idea works as follows:
1. Start with a small and possibly incomplete set of
"non-face" examples in the training database.
2. Train the multi-layer perceptron classifier with the
current database of examples.
3. Run the face detector on a sequence of random
images. Collect all the "non-face" patterns that
the current system wrongly classifies as "faces" (see
Figure 12). Add these "non-face" patterns to the
training database as new negative examples.
4. Return to Step 2.
At the end of each iteration, the "boot-strap" strat-
egy enlarges the current set of "non-face" patterns with
new "non-face" patterns that the current system classi-
fies wrongly. We argue that this strategy of collecting
wrongly classified patterns as new training examples is
reasonable, because we expect these new examples to
improve the classifier's performance by steering it away
from the mistakes it currently commits.
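The loop can be summarized in a short sketch. The code below is a hypothetical outline of the strategy, not the actual system: the training, detection, and ground-truth routines are supplied as placeholder callables by the caller.

    def bootstrap_nonfaces(train_classifier, detect_faces, is_true_face,
                           face_examples, seed_nonfaces, images, n_iterations=5):
        # train_classifier(faces, nonfaces) -> classifier
        # detect_faces(classifier, image)   -> windows the classifier labels "face"
        # is_true_face(window)              -> ground-truth check for a window
        nonfaces = list(seed_nonfaces)                        # Step 1: small initial set
        for _ in range(n_iterations):
            classifier = train_classifier(face_examples, nonfaces)   # Step 2
            for image in images:                                     # Step 3
                for window in detect_faces(classifier, image):
                    if not is_true_face(window):
                        nonfaces.append(window)   # false detect -> new negative example
        return train_classifier(face_examples, nonfaces), nonfaces   # Step 4: repeat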
Notice that if necessary, we can use the same "boot-
strap" technique to enlarge the set of positive "face" pat-
terns in our training database. Also, notice that at the
end of each iteration, we can re-cluster our "face" and
"non-face" databases to generate new prototype patterns
that might model the distribution of face patterns more
accurately.
6 Results and Performance Analysis
We implemented and tested our face detection system
on a wide variety of images. Figure 13 shows some sam-
ple results. The system detects faces over a fairly large
range of scales, beginning with a window size of 19 x 19
pixels and ending with a window size of 100 x 100 pixels.
Between successive scales, the window width is enlarged
by a factor of 1.2.
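The sequence of window sizes this implies can be computed directly; a small sketch follows, where the rounding to integer pixel widths is our own assumption.

    sizes = []
    width = 19.0
    while width <= 100.0:
        sizes.append(int(round(width)))
        width *= 1.2          # enlarge the window width by a factor of 1.2 per scale
    print(sizes)              # -> [19, 23, 27, 33, 39, 47, 57, 68, 82, 98]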
The system writes its face detection results to an out-
put image. Each time a "face" window pattern is found
in the input, an appropriately sized dotted box is drawn
at the corresponding window location in the output im-
age. Notice that many of the face patterns in Figure 13
are enclosed by multiple dotted boxes. This is because
the system has detected those face patterns either at a
few different scales or at a few slightly offset window
positions or both.
The upper left and bottom right images of Figure 13
show that our system works reliably without making
many false positive errors (none in this case), even for
fairly complex scenes. Notice that the current system
does not detect Geordi's face. This is because Geordi's
face differs significantly from the notion of a "typical"
face pattern in our training database of faces — his eyes
are totally occluded by an opaque metallic visor. The
upper right input-output image pair demonstrates that
the same system detects real faces and hand-drawn faces
equally well, while the bottom left image shows that the system finds faces successfully at two very different scales.

Figure 13: Some face detection results by our system. The upper left and bottom right images show that the system finds faces in complex scenes without making many false detects. The upper right image pair shows that the same system detects real faces and hand drawn faces equally well. The bottom left image shows the system working successfully at two very different scales. Notice that in the cards image, the system detects only the vertically oriented face patterns, as it is meant to at this time.
6.1 Measuring the System's Performance
To quantitatively measure our system's performance, we
ran our system on two test databases and counted the
number of correct detections versus false alarms. All the
face patterns in both test databases are new patterns not
found in the training data set. The first test database
consists of 301 frontal and near-frontal face mugshots
of 71 different people. All the images are high quality
digitized images taken by a CCD camera in a labora-
tory environment. There is a fair amount of lighting
variation among images in this database, with about 10
images having very strong lighting shadows. We use this
database to obtain a "best case" detection rate for our
system on high quality input patterns.
The second database contains 23 images with a total
of 149 face patterns. There is a wide variation in quality
among the 23 images, ranging from high quality CCD
camera pictures to low quality newspaper scans. Most
of these images have complex background patterns with
faces taking up only a very small percentage of the total
image area. We use this database to obtain an "average
case" performance measure for our system on a more
representative sample of input images.
For the first database, our system correctly finds
96.3% of all the face patterns and makes only 3 false
detects. All the face patterns that it misses have either
strong illumination shadows or fairly large off-plane ro-
tation components or both. Even though this high de-
tection rate applies only to high quality input images,
we still find the result encouraging because often, one
can easily replace poorer sensors with better ones to ob-
tain comparable results. For the second database, our
system achieves a 79.9% detection rate with 5 false posi-
tives. The face patterns it misses are mostly either from
low quality newspaper scans or hand drawn pictures. We
consider this behavior acceptable because the system is
merely degrading gracefully with poorer image quality.
6.2 Analyzing the System's Components
We conducted the following additional experiments to
identify the key components of our face detection algo-
rithm. Specifically, we examined how the following three
aspects of our system affect its performance in terms of face detection rate versus false alarm ratio: (1) The classifier architecture, (2) our 2-Value distance metric versus other distance measures for computing feature distance vectors, and (3) the importance of "non-face" prototypes in our distribution-based face model.
Classifier Architecture The first experiment investi-
gates how varying the classifier's architecture affects our
system's overall performance. To do this, we create two
new systems with different classifier architectures, and
compare their face detection versus false alarm statistics
with those of our original system. Our two new classifier
architectures are:
1. A single perceptron unit. We replace the orig-
inal multi-layer perceptron net in Figure 10 with a
single perceptron unit connected directly to the 12
pairs of input terminals. The single perceptron unit
computes a sigmoidally thresholded weighted sum
of its input feature distance vector. It represents
the simplest possible architecture in the family of
multi-layer perceptron net classifiers, and its pur-
pose here is to provide an "extreme case" perfor-
mance figure for multi-layer perceptron net classi-
fiers in this problem domain.
2. A nearest neighbor classifier. We perform near-
est neighbor classification on feature distance vec-
tors to identify new test patterns. The nearest
neighbor classifier works as follows: For each train-
ing pattern, we compute and store its 24 value fea-
ture distance vector and output class at compile
time. When classifying a new test pattern, we com-
pute its 24 value feature distance vector and return
the output class of the closest stored feature vector
in Euclidean space. The nearest neighbor classifier
provides us with a performance figure for a different
classifier type in this problem domain.
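A minimal sketch of this nearest neighbor rule in Python/NumPy; the function name and array layout are our own, with the stored vectors standing in for the 24-value feature distance vectors computed for every training pattern.

    import numpy as np

    def nearest_neighbor_classify(stored_vectors, stored_labels, test_vector):
        # stored_vectors: (N, 24) array of training feature distance vectors.
        # stored_labels:  length-N array of output classes (1 = face, 0 = non-face).
        dists = np.linalg.norm(stored_vectors - test_vector, axis=1)
        return stored_labels[int(np.argmin(dists))]   # label of the closest stored vector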
The Distance Metric The second experiment inves-
tigates how using a different distance metric for com-
puting feature distance vectors in the matching stage
affects the system's performance. We compare our 2-Value distance metric with three other distance measures: (1) the normalized Mahalanobis distance within a 75 dimensional vector subspace spanned by the prototype cluster's 75 largest eigenvectors — i.e. the first component only (D1) of our 2-Value distance metric, (2) the Euclidean distance between the test pattern and its projection in the 75 dimensional subspace — i.e. the second component only (D2) of our 2-Value distance metric, and (3) the standard normalized Mahalanobis distance (Mn) between the test pattern and the prototype centroid within the full image vector space.
To conduct this experiment, we repeat the previous
set of classifier experiments three additional times, once
for each new distance metric we are comparing. Each
new set of experiments differs from the original set as fol-
lows: During the matching stage, we use one of the three
distance measures above instead of the original 2-Value
distance metric to compute feature distance vectors be-
tween new test patterns and the 12 prototype centroids.
Notice that because the three new distance measures are
all single-value measurements, our new feature distance
vectors have only 12 values instead of 24 values. This
means that we have to modify the classifier architectures
accordingly by reducing the number of input terminals
from 24 to 12.
"Non-Face" Prototypes This experiment looks at
how differently the system performs with and without
"non-face" prototypes in the distribution-based model.
We compare results from our original system, whose face
model contains both "face" and "non-face" prototypes,
against results from two new systems whose internal
models contain only "face" prototypes. The two new
systems are:
1. A system with 12 "face" prototypes and
no "non-face" prototypes. In this system, we
fix the total number of pattern prototypes in the
Distance Metric
Classifier Architecture
Multi-Layer
Single Unit
Nearest Nbr
2-Value
96
79
3% 3
9% 5
96
84
7% 3
6% 13
65.1% 1
2>i
(first component)
91
85
6% 21
1% 114
93
85
3% 15
1% 94
97.4% 208
v 2
(second component)
91
65
4% 4
1% 5
92
68
3% 3
2% 5
53.9% 1
M n
(Std. Mahalanobis)
84
42
1% 9
6% 5
93
58
0% 13
6% 11
71.8% 5
Table 1: Summary of performance figures from Experiments f (Classifier Architecture) and 2 (Distance Metric). Detection
rates versus number of false positives for different classifier architectures and distance metrics. The four numbers for each
entry are: Top Left: detection rate for first database. Top Right: number of false positives for first database. Bottom
Left: detection rate for second database. Bottom Right: number of false positives for second database. Notice that
we did not test the nearest neighbor architecture on the second database. This is because the test results from the first
database already show that the nearest neighbor classifier is significantly inferior to the other 2 classifiers in this problem
domain.
canonical face model, and hence the dimensionality
of the feature distance vector the matching stage
computes. We obtain the 12 "face" prototypes
by performing elliptical k-means clustering with 12
centers on our enlarged canonical face database of
4150 patterns (see Section 3.3). The matching stage
computes the same 2-Value distance metric that our original system uses; i.e., for each prototype cluster, D1 is the normalized Mahalanobis distance in a subspace of the 75 largest eigenvectors, and D2 is the Euclidean distance between the test pattern and the subspace.
2. A system with only 6 "face" prototypes. In
this system, we preserve only the "face" prototypes
from the original system. The matching stage com-
putes the same 2-Value distances between each test
pattern and the 6 "face" centroids. Notice that the
resulting feature distance vectors have only 6 pairs
of values.
We generate two sets of performance statistics for each of
the two new systems above. For the first set, we use a
trained multi-layer perceptron net with 12 hidden units
to classify new patterns from their feature distance vec-
tors. For the second set, we replace the multi-layer per-
ceptron net classifier with a trained single perceptron
unit classifier.
6.3 Performance Figures and Interpretation
Table 1 summarizes the performance statistics for
both the classifier architecture and distance metric ex-
periments, while Table 2 shows how the system performs
with and without "Non-Face" model prototypes.
Classifier Architecture The horizontal rows of Ta-
ble 1 show how different classifier architectures affect
the system's face detection rate versus false alarm in-
stances. Quantitatively, the two network-based classi-
fiers have very similar performance figures, while the
nearest neighbor classifier produces rather different per-
formance statistics. Depending on the distance metric
being used, the nearest neighbor classifier has either a
somewhat higher face detection rate with a lot more false
alarms, or a much lower face detection rate with some-
what fewer false alarms than the other two classifiers.
Because we do not have a reasonable measure of relative
importance between face detection rate and number of
false alarms, we empirically rank the performance fig-
ures by visually examining their corresponding output
images. In doing so, we find that the two network-based
classifiers produce results that are a lot more visually
appealing than the nearest neighbor classifier.
It is interesting that the two network-based classifiers
we tested produce very similar performance statistics,
especially on the first test database. The similarity here
suggests that the system's performance depends most
critically on the classifier type, i.e. a perceptron net,
and not on the specific net architecture. It is also somewhat surprising that one can get very encouraging performance figures even with an extremely simple single perceptron unit classifier, especially when using our 2-Value distance metric to compute feature distance vectors. This suggests that the distance vectors we get with our 2-Value distance metric form a highly linearly separable set of features for "face" and "non-face" patterns.
Why does the nearest neighbor classifier have a much
lower face detection rate than the other two network-
based classifiers when used with our 2-Value distance
metric? We believe this has to do with the much larger
number of non-face patterns in our training database
than face patterns (43166 non-face patterns versus 4150
face patterns). The good performance figures from the
single perceptron classifier suggest that the two pattern
classes are linearly highly separable in our 24 value fea-
ture distance vector space. Assuming that roughly the
same fraction of face and non-face patterns lie along
the linear class boundary, we get a boundary population
with 10 times more non-face samples than face samples.
A new face pattern near the class boundary will there-
fore have a very high chance of lying nearer a non-face
sample than a face sample, and hence be wrongly clas-
sified by a nearest neighbor scheme. Notice that despite the huge difference in the number of face and non-face training patterns, a perceptron classifier can still locate the face vs. non-face class boundary accurately if the two pattern classes are indeed linearly highly separable, and hence still produce good classification results.

Classifier Architecture   |  6 Face & 6 Non-Face  |  12 Face      |  6 Face
--------------------------+-----------------------+---------------+--------------
Multi-Layer Perceptron    |  96.3%    3           |  85.3%   21   |  59.7%   17
                          |  79.9%    5           |  69.6%   74   |  60.9%   41
Single Perceptron         |  96.7%    3           |  52.1%    6   |  66.6%   25
                          |  84.6%   13           |  49.7%   16   |  55.4%   56

Table 2: Summary of performance figures for Experiment 3 ("Non-Face" Prototypes). Detection rates versus number of false positives for different classifier architectures and composition of prototypes in the distribution-based model. The four numbers for each entry are: Top Left: detection rate for first database. Top Right: number of false positives for first database. Bottom Left: detection rate for second database. Bottom Right: number of false positives for second database.
The Distance Metric The vertical columns of Ta-
ble 1 show how the system's performance changes with
different distance metrics in the pattern matching stage.
Both the multi-layer perceptron net and single per-
ceptron classifier systems produce better performance
statistics with our 2-Value distance metric than with the other three distance measures, especially on images from the second test database. The observation should not at all be surprising. Our 2-Value distance metric consists of both D1 and D2, two of the three other distance measures we are comparing against. A system that uses our 2-Value distance should therefore produce performance figures that are at least as good as a similar system that uses either D1 or D2 only. In fact, since our 2-Value distance systems actually produce better classification results than the systems using only D1 or D2, one can further affirm that the two components D1 and D2 do not mutually contain totally redundant information.
Our 2-Value distance metric should also produce clas-
sification results that are at least as good as those ob-
tained with a Mahalanobis distance metric. This is be-
cause both metrics treat and quantify distance in essen-
tially the same way — as a "difference" notion between
a test pattern and a local data distribution. Further-
more, the "difference" notions they use are based on
two very similar Gaussian generative models of the local
data distribution. In our experiments with perceptron-
based classifiers, the 2-Value distance metric actually
out-performs the Mahalanobis distance metric consis-
tently. We suggest two possible reasons for this differ-
ence in performance: (1) As discussed in Section 4.5, we
have too few sample points in our local data distribution
to accurately recover a full Gaussian covariance matrix
for computing Mahalanobis distances. By naively trying
to do so, we get a distance measure that poorly reflects
the "difference" notion we want to capture. (2) The lo-
cal data distribution we are trying to model may not be
truly Gaussian. Our 2-Value distance metric provides
an additional degree of freedom that better describes a
test pattern's location relative to the local non-Gaussian
data distribution.
It is interesting that even a network-based classifier
with a "partial" D2 distance metric almost always out-performs a similar classifier with a "complete" Mahalanobis distance metric. At first glance, the observation can be surprising because unlike the "complete" Mahalanobis distance, the D2 metric alone does not account for pattern differences in certain image vector space directions. More specifically, the D2 metric only measures a pattern's Euclidean distance from a cluster's 75 most significant eigenvector subspace, and does not account for pattern differences within the subspace. We believe this comparison truly reveals that without sufficient data to accurately recover a full Gaussian covariance matrix, the resulting Mahalanobis distance can be a very poor notion of "pattern difference".
"Non-Face" Prototypes Table 2 summarizes the
performance statistics for comparing systems with and
without "non-face" prototypes. As expected, the sys-
tems with "non-face" prototypes clearly out-perform
those without "non-face" prototypes. Our results sug-
gest that not only are the "non-face" prototypes an ad-
ditional set of features for face detection, they are in fact
a very discriminative set of additional features.
7 Conclusion
We have successfully developed a system for finding un-
occluded vertical frontal views of human faces in images.
The approach is view based. It models the distribution of
face patterns by means of a few prototype clusters, and
learns from examples a set of distance parameters for
distinguishing between "face" and "non-face" test pat-
terns. We stress again, however, that our ultimate goal
is to develop our face detection approach into a general
methodology for taking on feature detection and pattern
recognition tasks in multiple domains.
We plan to further our work in the following two di-
rections. First, we would like to demonstrate the full
power of our face detection approach by building a more
comprehensive face detection system. One obvious ex-
tension would be to have the system detect faces over a
wider range of poses instead of just near-frontal views.
We believe that our current approach can in fact be used
without modification to perform this new task. All we
need is a means of obtaining or artificially generating a
sufficiently large example database of human faces at the
different poses [1].
Second, we would like to demonstrate the versatility
and generality of our approach by building a few more
feature and pattern detection applications in other prob-
lem domains. Some possibilities include industrial in-
spection applications for detecting defects in manufac-
tured products and terrain feature classification appli-
cations for SAR imagery. We believe that the key to
making this work would be to find appropriate trans-
formation spaces for each new task, wherein the target
pattern classes would be reasonably stable.
References
[1] D. Beymer, A. Shashua, and T. Poggio. Example
Based Image Analysis and Synthesis. A.I. Memo
No. 1431, Artificial Intelligence Laboratory, Mas-
sachusetts Institute of Technology, 1993.
[2] M. Bichsel. Strategies of Robust Object Recognition
for Automatic Identification of Human Faces. PhD
thesis, ETH, Zurich, 1991.
[3] R. Brunelli and T. Poggio. Face Recognition:
Features versus Templates. IEEE Transactions
on Pattern Analysis and Machine Intelligence,
15(10):1042-1052, 1993.
[4] M. C. Burl, U. Fayyad, P. Perona, P. Smyth, and
M. P. Burl. A Trainable Tool for Finding Small
Volcanoes in SAR Imagery of Venus. Technical Re-
port CNS TR 34, California Institute of Technology,
October 1993.
[5] Richard O. Duda and Peter E. Hart. Pattern Clas-
sification and Scene Analysis. John Wiley and Sons
Inc., New York, 1973.
[6] K. Fukunaga. Introduction to Statistical Pattern
Recognition. Academic Press, 1990.
[7] W. E. L. Grimson and T. Lozano-Perez. Model-
Based Recognition and Localization from Sparse
Range Data. In A. Rosenfeld, editor, Techniques
for 3-D Machine Perception. North-Holland, Ams-
terdam, 1985.
[8] G. Hinton, M. Revow, and P. Dayan. Recognizing
Handwritten Digits using Mixture of Linear Models.
In D. Touretzky, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 7, San Mateo, CA, 1995. Morgan Kaufmann.
[9] M. Kirby and L. Sirovich. Applications of the
Karhunen-Loeve Procedure for the Characteriza-
tion of Human Faces. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 12(1):103-
108, 1990.
[10] B. Kumar, D. Casasent, and H. Murakami. Prin-
cipal Component Imagery for Statistical Pattern
Recognition Correlators. Optical Engineering,
21(1), Jan/Feb 1982.
[11] A. Mahalanobis, A. Forman, N. Day, M. Bower,
and R. Cherry. Multi-Class SAR ATR using Shift-
Invariant Correlation Filters. Pattern Recognition,
27(4):619-626, April 1994.
[12] A. Pentland, B. Moghaddam, and T. Starner. View-
based and Modular Eigenspaces for Face Recog-
nition. In Proceedings IEEE Conf. on Computer
Vision and Pattern Recognition, pages 84-91, June
1994.
[13] T. Poggio and T. Vetter. Recognition and Struc-
ture from One (2D) Model View: Observations on
Prototypes, Object Classes, and Symmetries. A.I.
Memo No. 1347, Artificial Intelligence Laboratory,
Massachusetts Institute of Technology, 1992.
[14] D. Rumelhart and J. McClelland. Parallel Dis-
tributed Processing, volume 1. MIT Press, Cam-
bridge, Massachusetts, 1986.
[15] P. Sinha. Object Recognition via Image Invariants:
A Case Study. In Investigative Ophthalmology and
Visual Science, volume 35, pages 1735-1740, Sara-
sota, Florida, May 1994.
[16] L. Sirovich and M. Kirby. Low-dimensional Pro-
cedure for the Characterization of Human Faces.
Journal of the Optical Society of America, 4(3):519-
524, March 1987.
[17] K. Sung and T. Poggio. Example-based Learning for
View-based Human Face Detection. In Proceedings
Image Understanding Workshop, volume II, pages
843-850, Monterey, CA, November 1994.
[18] C. Therrien. Decision, Estimation and Classifica-
tion. John Wiley and Sons, Inc., 1989.
[19] M. Turk and A. Pentland. Eigenfaces for Recogni-
tion. Journal of Cognitive Neuroscience, 3(1):71-86,
1991.
[20] A. Yuille, P. Hallinan, and D. Cohen. Feature Ex-
traction from Faces using Deformable Templates.
International Journal of Computer Vision, 8(2):99-
111, 1992.