Paper here: http://www.cs.princeton.edu/~satyen/papers/privacy.ps
I'll be honest: I saw a lot of Greek symbols and ran away screaming, so I actually didn't get a whole lot out of this paper. This is doubly embarrassing as several of the authors are MSR-SVC-ers I know and love (Frank, Cynthia, Kunal).
Okay, so, the gist is that you want to release a "fuzzied" database with information about people in the form of "contingency tables," which are essentially histograms of various traits (think male vs. female, age ranges [20-29, 30-39, 40-49...], etc.). You want to maintain aggregate statistics about the data set without revealing information about particular people in the data set. The problem is that you want to maintain the accuracy of the data and keep it internally consistent in the process of making it private. This work focuses on keeping the data consistent.
The punchline is that instead of directly tweaking the data itself or the "marginals" (which they never define for us idiot-folk; as best I can tell, a marginal is just the smaller contingency table you get by summing out all but a few of the attributes), they translate the data into the Fourier domain and tweak it there. Turns out that has nice properties, though fuck me if I even know what that means or why it's true.
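To make "translate into the Fourier domain and tweak it there" slightly less magical to myself, here's a toy Python sketch for binary attributes. Everything in it (the noise scale, perturbing every coefficient, skipping whatever cleanup the paper does to get a consistent non-negative table back) is my own guesswork, not their actual algorithm.

```python
# Toy sketch of adding noise in the Fourier (Walsh-Hadamard) domain.
# Binary attributes only; the noise scale and all details are made up by me.
import numpy as np

rng = np.random.default_rng(0)

def walsh_hadamard(v):
    """Fast Walsh-Hadamard transform of a length-2^k vector."""
    h = np.array(v, dtype=float)
    step = 1
    while step < len(h):
        for i in range(0, len(h), 2 * step):
            for j in range(i, i + step):
                a, b = h[j], h[j + step]
                h[j], h[j + step] = a + b, a - b
        step *= 2
    return h

k = 3                                   # three binary attributes
table = rng.integers(0, 50, 2 ** k)     # contingency table: a count per cell

coeffs = walsh_hadamard(table) / 2 ** k            # to the Fourier domain
noisy = coeffs + rng.laplace(0, 1.0, len(coeffs))  # tweak it there
noisy_table = walsh_hadamard(noisy)                # back to counts

# A one-way marginal on attribute 0 (encoded as the low bit of the cell index):
bit0 = np.arange(2 ** k) & 1
print("true marginal: ", [int(table[bit0 == b].sum()) for b in (0, 1)])
print("noisy marginal:", [round(noisy_table[bit0 == b].sum(), 1) for b in (0, 1)])
```

My vague understanding of why this helps with consistency: each marginal only depends on a subset of the Fourier coefficients, and since every released marginal is derived from the same noisy coefficient vector, they can't contradict each other.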
Monday, May 14, 2007
Supporting Ranking and Clustering as Generalized Order-By and Group-By
Paper here: http://www-forward.cs.uiuc.edu/pubs/2007/clusterrank-sigmod07-lwlwc-mar07.pdf
I didn't get all the way through this paper...I'll admit it. But the idea here is to introduce an information retrieval-type operation into the standard SQL language. What the hell does that mean? Well, it seems to mean that they want to do some form of clustering. The example they give is realty. You'd like to look at a set of houses that are either lower-priced in the suburbs or higher-priced but with a nice view on the water. In this case, you want your query to return houses that fit into one of those clusters, and then you want to order houses within each cluster. So, they do it. And they are essentially running k-means with some weird little optimizations so that they don't have to materialize the entire database.
It was pointed out in the discussion that their semantics are weird and inexact. Which is to say: k-means is unstable, and it can give you different results running it multiple times on the same data. They take this instability and amplify it: what they actually run is an approximation of k-means that first collapses (in some sense) adjacent tuples into centroids and then runs k-means on those centroids. Problem is, I don't know how the approximation relates to full k-means (which, again, has fuzzy semantics to begin with), so I don't really know what guarantees I have on my results. Boo.
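For my own benefit, here's roughly how I picture that two-level scheme in Python. The grid-based pre-grouping, the numbers, and everything else here are my invention, just to illustrate "k-means on centroids of adjacent tuples" — it is not their algorithm.

```python
# Sketch of two-level k-means: pre-aggregate "adjacent" tuples into centroids,
# then cluster the centroids. All details here are mine, not the paper's.
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(1)

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; returns the final cluster centers."""
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers

# Toy "houses": (price in $100k, miles from the waterfront)
houses = np.vstack([
    rng.normal([3.0, 15.0], 0.5, (100, 2)),   # cheaper, out in the suburbs
    rng.normal([8.0, 1.0], 0.5, (100, 2)),    # pricier, on the water
])

# Level 1: bucket tuples by a coarse grid and replace each bucket by its centroid,
# standing in for "centroids of adjacent tuples".
buckets = defaultdict(list)
for h in houses:
    buckets[tuple(np.floor(h))].append(h)
pre_centroids = np.array([np.mean(b, axis=0) for b in buckets.values()])

# Level 2: run k-means on the (much smaller) set of pre-centroids.
print("cluster centers:", kmeans(pre_centroids, k=2))
```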
Sunday, May 13, 2007
WiLDNet: Design and Implementation of High Performance WiFi Based Long Distance Networks
Original paper here: http://www.cs.washington.edu/education/courses/590l/07sp/wildnet.pdf
Okay, my initial impression of this paper was that it was kind of cute, but not really groundbreaking. After reading some of the background, namely the 2P paper, I think most of the cool work was done there, and it's not entirely clear what new and interesting things happened here. (Side note: I really liked the 2P paper if for no other reason than they presented the idea very clearly. I got it almost immediately.)
The environment they're working in is long-haul wifi. We're talking about standard 802.11 used in a long-distance environment with point-to-point directional antennas. Specifically, a node has a different antenna for each communication partner. The problems in this environment are:
- the different antennas at a node can interfere with each other
- the propagation delay can be large enough to trigger link-layer timeouts
- the 802.11 specification mandates stop-and-wait semantics for frame acks (which take forever on those long links)
- external interference can wreak havoc
The solutions to these are fairly straightforward. Like 2P, these guys use time-slicing (TDMA) to eliminate collisions. They implement a sliding-window ack mechanism to deal with the long propagation delay. Finally, they add erasure coding to cover loss. An interesting tweak to the erasure coding mechanism is that they send the packets they want to get there in the clear first, and only then send the "backup" packets; if nothing is lost, the receiver never has to decode anything.
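As a toy illustration of the "send the real packets in the clear first, then the backups" trick, here's a Python sketch with a single XOR parity packet standing in for a real erasure code (WiLDNet uses something more capable; the names and framing are mine):

```python
# Toy systematic erasure coding: send the data packets as-is, then a "backup".
# Here the backup is one XOR parity packet, which can repair a single loss.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_packets):
    """Return the packets to transmit: originals first, then one parity packet."""
    parity = data_packets[0]
    for p in data_packets[1:]:
        parity = xor_bytes(parity, p)
    return list(data_packets) + [parity]

def decode(received):
    """received[i] is the packet or None if lost; the last slot is the parity."""
    data, parity = received[:-1], received[-1]
    missing = [i for i, p in enumerate(data) if p is None]
    if not missing:
        return data                      # common case: nothing to decode
    if len(missing) > 1 or parity is None:
        raise ValueError("too many losses for a single parity packet")
    # Reconstruct the missing packet as the XOR of the parity and the survivors.
    rec = parity
    for p in data:
        if p is not None:
            rec = xor_bytes(rec, p)
    data[missing[0]] = rec
    return data

packets = [b"AAAA", b"BBBB", b"CCCC"]
sent = encode(packets)
lossy = [sent[0], None, sent[2], sent[3]]    # packet 1 dropped on the link
print(decode(lossy))                          # -> [b'AAAA', b'BBBB', b'CCCC']
```

The point of putting the originals first shows up in decode(): when nothing was lost, the receiver just hands the data packets up and never touches the parity.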
That's about it...they do a nice examination of the design tradeoffs of things like overhead versus loss masking for both erasure coding and retransmissions.
First post! Woo hoo!
So, I've been meaning to create somewhere to store my random thoughts on the papers I read. For a while I agonized over whether this should be a wiki or a blog, where it should be hosted, etc. Eventually, I was paralyzed by these decisions, so at some point I said, "Fuck it...I'll just make a blogspot blog." If I decide otherwise later, I'll just migrate the entries. Somehow.
Anyway, this likely won't be coherent, organized, particularly thoughtful, or in any way useful. I just tend to forget what the fuck I've read, so I thought it would be useful to have a place to jot down random thoughts about technical papers I read.