Nick's Paper-Reading Blog: Supporting Ranking and Clustering as Generalized Order-By and Group-By

paper here: http://www-forward.cs.uiuc.edu/pubs/2007/clusterrank-sigmod07-lwlwc-mar07.pdf

I'll admit it: I didn't get all the way through this paper. But the idea here is to introduce an information retrieval-style operation into standard SQL. What the hell does that mean? Well, it seems to mean that they want to support clustering as a generalized group-by and ranking as a generalized order-by. The example they give is real estate. You'd like to look at houses that are either lower-priced in the suburbs or higher-priced with a nice view of the water. In this case, you want your query to return houses that fit into one of those clusters, and then you want to order the houses within each cluster. So, they do it. They are essentially running k-means with some weird little optimizations so that they don't have to materialize the entire database.
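To make the "group into clusters, then order within each cluster" idea concrete, here is a rough Python sketch. Everything in it is my own assumption rather than anything from the paper: the listings, the two hand-picked centers (which stand in for whatever the clustering step would actually find), the crude scaling in `nearest_center`, and the choice to rank by price.

```python
from collections import defaultdict

# Hypothetical listings: (id, price in $1000s, distance to water in km).
houses = [
    ("h1", 250, 30.0),
    ("h2", 280, 25.0),
    ("h3", 900, 0.5),
    ("h4", 850, 1.0),
    ("h5", 300, 28.0),
    ("h6", 950, 0.2),
]

# Hand-picked centers standing in for the output of a real clustering step.
centers = {
    "cheap_suburb": (275, 28.0),
    "pricey_waterfront": (900, 0.5),
}


def nearest_center(price, dist_km):
    """Return the name of the closest center, with crude per-attribute scaling."""
    def scaled_sq_dist(name):
        c_price, c_dist = centers[name]
        return ((price - c_price) / 1000) ** 2 + ((dist_km - c_dist) / 30) ** 2
    return min(centers, key=scaled_sq_dist)


def cluster_and_rank(rows, top_n):
    """Generalized group-by (cluster), then order-by (rank) inside each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[nearest_center(row[1], row[2])].append(row)
    # Rank each cluster by ascending price and keep the top_n rows.
    return {
        name: sorted(members, key=lambda r: r[1])[:top_n]
        for name, members in groups.items()
    }


result = cluster_and_rank(houses, top_n=2)
for name, rows in result.items():
    print(name, [r[0] for r in rows])
```

The point is just the shape of the result: one ranked list per cluster instead of one global ranking.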

It was pointed out in the discussion that their semantics are weird and inexact. Which is to say: k-means is unstable. Its result depends on where the centroids start, so it can give you different results when you run it multiple times on the same data. They take this instability and amplify it by approximating k-means: they build centroids of (in some sense) adjacent tuples and then run k-means on those centroids. Problem is, I don't know how the approximation relates to full k-means (which, again, has fuzzy semantics to begin with). So, I don't really know what guarantees I have on my results. Boo.
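Here is a small Python sketch of both complaints. To be clear, this is not the paper's algorithm: the fixed-size grid in `summarize`, the weighted Lloyd's loop in `kmeans`, and all the data are assumptions I made up for illustration. The first half shows the same four points converging to two different answers from two different starting centroids; the second half runs k-means on per-cell centroids instead of the raw tuples.

```python
from collections import defaultdict


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, weights=None, iters=100):
    """Weighted Lloyd's algorithm from explicit starting centroids."""
    weights = weights or [1.0] * len(points)
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the weighted mean of its members.
        new = []
        for j, old in enumerate(centroids):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == j]
            if not members:
                new.append(old)  # keep an empty cluster's centroid where it was
                continue
            total = sum(w for _, w in members)
            new.append(tuple(
                sum(p[d] * w for p, w in members) / total for d in range(len(old))
            ))
        if new == centroids:
            break
        centroids = new
    return centroids, labels


def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))


def summarize(points, cell):
    """Replace the points in each grid cell by one centroid plus a count."""
    buckets = defaultdict(list)
    for p in points:
        buckets[tuple(int(x // cell) for x in p)].append(p)
    cents = [tuple(sum(c) / len(m) for c in zip(*m)) for m in buckets.values()]
    counts = [float(len(m)) for m in buckets.values()]
    return cents, counts


# 1) Same data, different starting centroids, different local optima.
corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good_c, good_l = kmeans(corners, [(0, 0), (4, 0)])  # splits left / right
bad_c, bad_l = kmeans(corners, [(0, 0), (0, 1)])    # stuck splitting top / bottom
sse_good = sse(corners, good_c, good_l)
sse_bad = sse(corners, bad_c, bad_l)
print(sse_good, sse_bad)

# 2) Approximation: cluster the per-cell centroids instead of the raw tuples.
raw = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.8), (5.1, 5.2), (5.3, 5.9), (5.8, 5.5)]
summaries, counts = summarize(raw, cell=0.5)
approx_c, _ = kmeans(summaries, [summaries[0], summaries[-1]], weights=counts)
print(approx_c)
```

On this toy data the summarized run lands on the same two centers you would get from the raw points, but nothing in the sketch bounds how far apart the two can drift in general, which is exactly my complaint.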

Monday, May 14, 2007
