March 2002
Tuesday, March 19, 2002
Seminar at noon, B&H (Engineering) bldg., Room 190
Organized by
the SHAPE
Lab.
Many data analysis problems involve clustering of a large set of points in an arbitrary metric space. The input to such problems is specified by a set of points P={p_1...p_n}, and a distance function D(.,.). It is assumed that for any two points p,q, the value of D(p,q) can be computed within a fixed time (say T). The goal is to partition P into a set of k clusters C_1...C_k in order to minimize certain objective function describing the clustering quality.
For example, the clustering can be defined by a set of k points (called centers) c_1..c_k, and setting C_i to be the set of points in P for whom c_i is the closest center. The quality of the clustering can be then defined e.g., by a "sum-sum" function
Sum_{i=1..k} Sum_{p in C_i} D(p,c_i)
Since the number of points n is large, even computing D(p,q) for all pairs p,q, and then clustering based on this information is infeasible. In this talk we show how to overcome this difficulty. In particular, we show an algorithm which provably finds an *approximately* optimal clustering (w.r.t. "sum-sum" function) by doing only about O(kn) distance computations. Thus, when k << n, the algorithm runs in time *sublinear* in the description size of a metric (which is Omega(n^2)). We also show some extensions of this algorithm, as well as results for other clustering objective functions.
Last Updated: March. 30, 2002