Tuesday, October 16, 2007

VirtualPower: Coordinated Power Management in Virtualized Enterprise Systems

Authors: Ripal Nathuji (Georgia Institute of Technology) and Karsten Schwan (Georgia Institute of Technology)

Paper: http://www.sosp2007.org/papers/sosp111-nathuji.pdf

Datacenters need power management; cooling only makes things worse. Industry's answer so far is ACPI.

How do you maintain power information in the virtualization layer?

There is lots of platform heterogeneity in a datacenter: variations in power, performance, and manageability. Homogenize it using VMs. You can restrict the physical utilization of a VM ("soft scaling"), and the guest OS can adhere to virtualization SLAs without being specifically aware of them. Take advantage of the feedback loop.

Implementation: VPM events are created when the guest OS makes a power call; Dom0 can retrieve these events.
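
A minimal sketch of how I picture that event flow (the names and the soft-scaling policy here are my invention, not the paper's interface): guest power calls become queued VPM events, and Dom0 maps each one onto either a real hardware P-state or a scheduler cap.

    # Hypothetical sketch of the VPM-event idea: guest power requests are
    # queued as events; Dom0 maps them onto real P-states or "soft scaling"
    # (capping the VM's share of the CPU) when no matching state exists.
    from collections import deque

    HW_PSTATES = [3000, 2400, 1800]          # MHz this platform really supports

    class VPMChannel:
        def __init__(self):
            self.events = deque()
        def guest_set_freq(self, vm, mhz):   # guest's ACPI-style request
            self.events.append((vm, mhz))
        def dom0_poll(self):
            while self.events:
                vm, mhz = self.events.popleft()
                if mhz in HW_PSTATES:
                    print(f"{vm}: hardware P-state -> {mhz} MHz")
                else:                         # soft scaling: emulate the state
                    cap = mhz / HW_PSTATES[0]
                    print(f"{vm}: soft-scale, scheduler cap {cap:.0%}")

    chan = VPMChannel()
    chan.guest_set_freq("vm1", 2400)
    chan.guest_set_freq("vm2", 1200)          # not a real P-state on this box
    chan.dom0_poll()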

Future work: VPM tokens, idle power management (additional VPM C-states), efficient soft-scale consolidations

Q: Any estimate of whether adherence to OS power interface has a cost?
A: Want to look at it. Looking at a lightweight paravirtualization solution.

Q: What about consolidating nodes (i.e., shutting down physical nodes)?
A: Well, we can do this on-the-fly automatically.

Integrating Concurrency Control and Energy Management in Device Drivers

Authors: Kevin Klues (Stanford University, Washington University in St. Louis, Technical University of Berlin), Vlado Handziski (Technical University of Berlin), Chenyang Lu (Washington University in St. Louis), Adam Wolisz (Technical University of Berlin, University of California Berkeley), David Culler (Arch Rock Co., University of California Berkeley), David Gay (Intel Research Berkeley), and Philip Levis (Stanford University)

Paper: http://www.sosp2007.org/papers/sosp186-klues.pdf

(SOSP presentation)

Existing embedded devices usually rely on the application for power savings: manually shutting pieces off and on, which is a mess. ICEM makes I/O operations split-phase, i.e., asynchronous. Three types of device driver:
  • virtualized
    • only a functional interface
    • assume multiple users
    • buffer I/O requests for energy savings
    • must be able to tolerate longer latencies
  • dedicated
    • assume single user
    • no concurrency control
    • explicit energy management
  • shared
    • functional and lock interface
    • multiple users
    • explicit concurrency control through split-phase lock
    • implicit energy management based on pending requests
    • used for stringent timing requirements
Power locks: a split-phase lock with integrated energy/configuration management. Concretely, the lock hooks down into a dedicated driver and grants exclusive access to the device, and acquiring/releasing it takes care of powering the device on and off.
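
Here's my best guess at what that means, as a sketch (the API names are mine): a power lock is a split-phase lock whose grant/release path also drives the device's power state, so clients get implicit energy management for free.

    # Rough sketch of an ICEM-style "power lock" (my own names/API):
    # granting the lock powers the device on; releasing it with an empty
    # queue powers the device off, so clients never manage energy directly.
    from collections import deque

    class Radio:                                # stand-in dedicated driver
        def power_on(self):  print("radio: on")
        def power_off(self): print("radio: off")

    class PowerLock:
        """Split-phase lock that also owns the device's power state."""
        def __init__(self, device):
            self.device, self.holder, self.waiters = device, None, deque()
        def request(self, client):              # completion arrives via granted()
            if self.holder is None:
                self.holder = client
                self.device.power_on()          # first grant powers the device on
                client.granted(self)
            else:
                self.waiters.append(client)
        def release(self):
            if self.waiters:
                self.holder = self.waiters.popleft()
                self.holder.granted(self)
            else:
                self.holder = None
                self.device.power_off()         # queue drained: sleep the device

    class Sender:                               # toy client of a shared driver
        def __init__(self, name): self.name = name
        def granted(self, lock):
            print(f"{self.name}: sending")
            lock.release()

    lock = PowerLock(Radio())
    lock.request(Sender("app1"))
    lock.request(Sender("app2"))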

Q: Doesn't it ultimately boil down to application decisions no matter what?
A: Thinking of letting the application send hints to the system.

Q: Does any of this apply to mainstream OSes?
A: Not yet...where we'd really like to see this is in mobile phone OSes.

Q: How does the programming model change for app writers?
A: Very much like async I/O.

Q: Can any of the transaction work apply here? You're sort of grouping operations into a transaction.
A: Hadn't thought about it.

Q: Send is bottleneck. Done anything about that?
A: We're just specifying an architecture. You can specify policy.

AutoBash: Improving Configuration Management with Operating System Causality Analysis

Configuration management sucks. Want to automate:
  • Replay mode - automatically search for solution
  • Observation mode - highly interactive problem solving
  • Healthcare mode - make sure things stay working
In observation mode, testing whether the app now works is tedious. Set up predicates to determine functionality status. Use Speculator (SOSP 2005) to do lightweight checkpointing. Automatically run regression tests after a problem is "solved," using causality tracking to determine which predicates actually need to be re-run.
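
A toy version of the replay-mode loop as I understood it (a dict copy stands in for Speculator's OS-level checkpointing; all names here are mine):

    # Try each candidate fix inside a checkpoint, keep it only if the
    # failing predicate now passes, and roll back otherwise.
    import copy

    def replay_search(state, fixes, predicates):
        for fix in fixes:
            checkpoint = copy.deepcopy(state)        # "speculate"
            fix(state)
            if all(p(state) for p in predicates):
                return state                         # commit: problem solved
            state.clear(); state.update(checkpoint)  # roll back the bad fix
        return None

    config = {"apache.conf": "Listen 8080"}
    fixes = [lambda s: s.update({"apache.conf": "Listen 80"})]
    predicates = [lambda s: s["apache.conf"] == "Listen 80"]
    print(replay_search(config, fixes, predicates))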

Q: How far would just having transactions in Linux get you?
A: mumble.

Q: How well would this work in a distributed file system environment? Also, how important is it to have predicates for every piece of software?
A: Speculator doesn't handle that distributed case today, but the approach should extend to more general distributed systems.

Q: Maybe you could mine installation process for information about what the correct configuration is supposed to be. Now the Q: where can you get the predicates?
A: mumble.

Q: How well would this work with persistent state transactions (rather than speculation)?
A: If you work only with persistent state, bad state can become incorporated into that persistent state. That's bad.

Q: What if the things you try only partially fix the problem, or just give you a clue about what to try next?
A: Which predicates work should tell you something about what needs to happen.

Q: What if you need to apply multiple solutions to get it to work? Can the system figure that out?
A: Future work.

Staged Deployment in Mirage, an Integrated Software Upgrade Testing and Distribution System

Authors: Olivier Crameri (EPFL), Nikola Knezevic (EPFL), Dejan Kostic (EPFL), Ricardo Bianchini (Rutgers), and Willy Zwaenepoel (EPFL)

Paper: http://www.sosp2007.org/papers/sosp076-crameri.pdf

This paper seems to focus on the issue of clustering machines that behave identically with respect to an upgrade.

Heuristically identify the dependencies (resources) used at runtime; rules can also be user-defined. Fingerprint each resource, then categorize machines based on their sets of resource fingerprints.
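
A toy version of the clustering step (the representation is my invention): hash the content of each upgrade-relevant resource, combine the hashes into a per-machine fingerprint, and bucket machines by fingerprint so that testing one machine per bucket covers the lot.

    # Machines whose upgrade-relevant resources hash identically land in
    # the same bucket, so one tested machine per bucket covers the cluster.
    import hashlib
    from collections import defaultdict

    def fingerprint(resources):                 # resources: {path: content}
        h = hashlib.sha1()
        for path in sorted(resources):
            h.update(path.encode())
            h.update(hashlib.sha1(resources[path].encode()).digest())
        return h.hexdigest()

    machines = {
        "m1": {"/etc/app.conf": "threads=4", "libfoo.so": "v1"},
        "m2": {"/etc/app.conf": "threads=4", "libfoo.so": "v1"},
        "m3": {"/etc/app.conf": "threads=8", "libfoo.so": "v1"},
    }
    clusters = defaultdict(list)
    for name, res in machines.items():
        clusters[fingerprint(res)].append(name)
    print(sorted(clusters.values()))            # [['m1', 'm2'], ['m3']]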

How effective is automatic resource classification? Good: no errors, though a small (single-digit) number of vendor-specific rules was needed.

Q: Isn't this going to slow things down and make things easier for people to exploit security flaws?
A: As we said, there's a tradeoff.

Q: Could this help you narrow down differences in configuration that cause bugs?
A: Hopefully.

Q: Isn't the number of configurations subject to combinatorial explosion?
A: Sure, possible...in practice hopefully not? We're studying this now.

Dynamo: Amazon's Highly Available Key-Value Store

Authors: Giuseppe DeCandia (Amazon.com), Deniz Hastorun (Amazon.com), Madan Jampani (Amazon.com), Gunavardhan Kakulapati (Amazon.com), Avinash Lakshman (Amazon.com), Alex Pilchin (Amazon.com), Swami Sivasubramanian (Amazon.com), Peter Vosshall (Amazon.com), and Werner Vogels (Amazon.com)

Paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

Amazon runs a loosely coupled, service-oriented architecture. Each service is independent but must adhere to latency SLAs. Availability is paramount.

An RDBMS is a poor fit even though the key-value model is a good one: most features go unused, it scales up rather than out, and it has availability limitations.

Generally care more about availability than consistency.

Needs to be always writable, even in failure, even without previous context.

Want "knobs" to tune tradeoffs between cost, consistency, durability, and latency.

Overview:
  • consistent hashing
  • optimistic replication
  • "sloppy quorum"
  • anti-entropy mechanisms
  • object versioning
Use a full membership model rather than overlay routing. Use virtual nodes for load balancing.
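
A minimal consistent-hash ring with virtual nodes, to make those two lines concrete (a generic sketch, not Dynamo's code): each physical node owns many points on the ring, and a key's replicas are the first N distinct nodes clockwise from its hash.

    # Virtual nodes even out load and let capacity be added with minimal
    # data movement; the preference list picks N distinct physical nodes.
    import bisect, hashlib

    def h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes=64):
            self.points = sorted((h(f"{n}#{i}"), n)
                                 for n in nodes for i in range(vnodes))
            self.keys = [p for p, _ in self.points]
        def preference_list(self, key, n=3):   # first N distinct nodes clockwise
            i, out = bisect.bisect(self.keys, h(key)), []
            while len(out) < n:
                node = self.points[i % len(self.points)][1]
                if node not in out:
                    out.append(node)
                i += 1
            return out

    ring = Ring(["A", "B", "C", "D"])
    print(ring.preference_list("shopping-cart:123"))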

"Sloppy quorum": N replicas in ideal state, read from at least R nodes, write to at least W nodes. "Sloppy" because the membership is dynamic based on node availability. Different values for N, R, and W yield different characteristics for the resulting system.

Each write is a new version. In the worst case, a read might return a stale version; a write based on that stale version creates a branch in the version history.

It is up to the application to resolve version history conflicts! All (relevant) versions returned to app!

Use vector clocks to take care of version history (preserves causality).
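
A sketch of how that works (a minimal encoding, mine): each coordinator bumps its own counter; one version supersedes another iff its clock dominates component-wise, and incomparable clocks mean a conflict that gets handed to the application.

    # Vector-clock versioning in miniature.
    def bump(clock, node):
        c = dict(clock); c[node] = c.get(node, 0) + 1
        return c

    def descends(a, b):                        # True if a >= b everywhere
        return all(a.get(n, 0) >= v for n, v in b.items())

    v1 = bump({}, "Sx")                        # write at node Sx
    v2 = bump(v1, "Sy")                        # update routed through Sy
    v3 = bump(v1, "Sz")                        # concurrent update through Sz
    print(descends(v2, v1))                    # True: v2 supersedes v1
    print(descends(v2, v3), descends(v3, v2))  # False, False: conflict -> app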

Lessons learned:
  • (missed first)
  • repartitioning is slow because propagating data to new nodes takes forever (it gets throttled; lots of random disk I/O)
    • fix: use fixed-size arcs of the ring, which allows transfer of a whole partition's database at once (a file copy, i.e., a linear read on disk)
Limitations:
  • no transactional semantics
  • no ordered traversal
  • no large objects
  • does not scale indefinitely
Q: Failure seems to add load. What kind of MTTF do you need to avoid filling the key space?
A: We overprovision to deal with typical failure scenarios, including whole datacenter dying.

Q: When you need to add capacity, don't you need to shed load off of everybody?
A: Nodes have lots of neighbors. Adding nodes does pull load away from a bunch of others.

Q: How do you do reconciliation?
A: Use Merkle hash trees for reconciliation.

Q: How do you prove that you met SLAs?
A: not sure

Q: Talk about the kind of conflicts you saw?
A: 99.94% of reads return a single value. Most of the rest return two versions; some of those might be write retries that happen in parallel.

Q: How often do you not achieve quorum?
A: Never!

Q: Ever been a partition?
A: Sure...racks fail. The client can't see it, though.

Q: Clients might potentially see lots of versions (even if it's rare). How do clients do reconciliation? No version ancestry?
A: Very application-specific; sometimes last-write-wins. It's a somewhat hard model to program to. Versions could potentially never get garbage collected, and there's no proof of convergence.

Q: Why did you go with consistent hashing?
A: High availability.

Q: What happens when some keys are more popular than others?
A: Often we don't see that. Often falls into the noise.

PeerReview: Practical Accountability for Distributed Systems

Authors: Andreas Haeberlen (MPI-SWS), Petr Kouznetsov (MPI-SWS), and Peter Druschel (MPI-SWS)

Paper: http://www.sosp2007.org/papers/sosp118-haeberlen.pdf

How do you detect faults when the system is federated and you can't see all of it? Specifically: how do you detect faults, how do you identify faulty nodes, and how do you convince others that a node is faulty? Obviously, we need verifiable evidence.

General solution: keep a log, and have an auditor that periodically inspects it. The log is a hash chain (to prevent changing the log ex post facto).
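
The hash-chain trick in miniature (a generic sketch, not PeerReview's actual log format): each entry commits to the hash of everything before it, so a node can't quietly rewrite history once it has shown anyone a later hash.

    # A tamper-evident log as a hash chain.
    import hashlib

    def append(log, entry):
        prev = log[-1][0] if log else "0" * 40
        digest = hashlib.sha1((prev + entry).encode()).hexdigest()
        log.append((digest, entry))

    def audit(log):
        prev = "0" * 40
        for digest, entry in log:
            if hashlib.sha1((prev + entry).encode()).hexdigest() != digest:
                return False                   # evidence of tampering
            prev = digest
        return True

    log = []
    append(log, "send m1 to B")
    append(log, "recv m2 from C")
    print(audit(log))                          # True
    log[0] = (log[0][0], "send m1' to B")      # node tries to rewrite history
    print(audit(log))                          # False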

Probabilistic log checking allows for scalability (otherwise the overhead would be quadratic).

Q: How do you prevent collusion?
A: We used consistent hashing to choose witnesses, then secure routing.

Q: How do you deal with selective processing?
A: (reiterates what said in the talk)

Q: This seems most appropriate for malicious faults, given that the nodes all run the same state machine. Is this useful for failing software?
A: (nothing useful...offline)

Q: (you misrepresented my CATS system...) How do you make logs visible in a secure way?
A: ??? Assume always at least one correct witness node.

Q: Why is non-repudiation work from 70s not applicable?
A: (Not sure what you're saying, offline)

Sinfonia: A New Paradigm for Building Scalable Distributed Systems

Authors: Marcos K. Aguilera (HP Labs), Arif Merchant (HP Labs), Mehul Shah (HP Labs), Alistair Veitch (HP Labs), and Christos Karamanolis (VMware)

Paper: http://www.sosp2007.org/papers/sosp064-aguilera.pdf

(SOSP presentation)

This is an infrastructure for distributed applications. The general idea is to create a set of linear memories, and a particular memory address is identified by a node id and an offset. They use small, short-lived transactions with the semantics that if all read values match what the transaction expects, then the transaction commits.
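
A sketch of those transaction semantics (the API names are mine, not Sinfonia's): a "minitransaction" carries compare, read, and write items; the writes apply only if every compare matches, all decided in the commit protocol itself.

    # Memory is (node id, offset)-addressed; compare items guard the commit.
    class MiniTransaction:
        def __init__(self):
            self.compares, self.reads, self.writes = [], [], []
        def cmp(self, addr, expected): self.compares.append((addr, expected))
        def read(self, addr):          self.reads.append(addr)
        def write(self, addr, value):  self.writes.append((addr, value))
        def execute(self, memory):     # memory: {(node, offset): value}
            if any(memory.get(a) != e for a, e in self.compares):
                return False, {}               # abort: world changed under us
            for a, v in self.writes:
                memory[a] = v
            return True, {a: memory.get(a) for a in self.reads}

    mem = {(0, 0): "old"}
    t = MiniTransaction()
    t.cmp((0, 0), "old")                       # optimistic check
    t.write((0, 0), "new")
    print(t.execute(mem), mem)                 # committed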

They have a simplified two-phase commit protocol with, essentially, no coordinator. A transaction blocks if _any_ of the participating application nodes crash, but they argue this isn't that big a deal: the transaction likely involves application data located on the crashed node, and if that node isn't available, you're screwed anyway (rough paraphrase).

They built a cluster file system and a group communication service.

Crap...wasn't paying attention to questions (at least, not writing them down). From what I remember:

- Why this and not a federated array of bricks?
- Are there any pathological cases you found that resulted in frequent rollback?
- other stuff...

Monday, October 15, 2007

MUVI: Automatically Inferring Multi-Variable Access Correlations and Detecting Related Semantic and Concurrency Bugs

Authors: Shan Lu (University of Illinois), Soyeon Park (University of Illinois), Chongfeng Hu (University of Illinois), Xiao Ma (University of Illinois), Weihang Jiang (University of Illinois), Zhenmin Li (University of Illinois), Raluca A. Popa (MIT), and Yuanyuan Zhou (University of Illinois)

Paper: http://www.sosp2007.org/papers/sosp061-lu.pdf

(SOSP presentation)

There are lots of correlated variables in the world (e.g., "when I access X, I should be accessing Y as well"), and it's a bug if you don't access them together, so look for violations. Their metric is "distance" in the source code: find variables that are usually accessed together and rarely accessed separately, using "itemset" techniques from data mining.
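
A toy run of the itemset idea (the thresholds and data are invented): treat each function's variable accesses as a set, count how often pairs appear together versus apart, and flag strongly correlated pairs; an access to only one of such a pair is a candidate bug.

    # Mine pairs that appear together often and apart rarely.
    from itertools import combinations
    from collections import Counter

    functions = [{"len", "buf"}, {"len", "buf"}, {"len", "buf"},
                 {"len", "buf"}, {"count"}, {"len"}]

    pair_n, single_n = Counter(), Counter()
    for acc in functions:
        single_n.update(acc)
        pair_n.update(frozenset(p) for p in combinations(sorted(acc), 2))

    for pair, n in pair_n.items():
        support = n / min(single_n[v] for v in pair)
        if n >= 3 and support >= 0.8:          # invented thresholds
            print("correlated:", set(pair), f"support={support:.2f}")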

They looked at codebases from Mozilla, MySQL, Linux, and PostgreSQL. They found about a thousand correlations, with roughly a 15% false-positive rate (macros, coincidences). They can't detect conditional correlations (which seems like a weird programming paradigm anyway).

Q: How difficult is it to resolve the false positives, especially in relation to things like RacerX?
A: ??? Hard to compare.

Q: Can it handle more complicated examples? (ed. didn't she say she couldn't deal with these cases?)
A: ??? Branching shouldn't be an issue.

Q: What about more complex correlations, e.g., c is a sum of a and b?
A: Boil down to considering pair as a unit. Room for improvement though.

TxLinux: Using and Managing Hardware Transactional Memory in the Operating System

Authors: Christopher J. Rossbach (UT Austin), Owen S. Hofmann (UT Austin), Donald E. Porter (UT Austin), Hany E. Ramadan (UT Austin), Aditya Bhandari (UT Austin), and Emmett Witchel (UT Austin)

Paper: http://www.sosp2007.org/papers/sosp056-rossbach.pdf

(SOSP 2007 presentation)

This paper presents an effort to use hardware transactional memory in the Linux kernel. The main contribution seems to be the insistence that transactions interoperate with locks, for legacy-code reasons. Their "cxspinlock" API allows this, which in turn lets you do I/O alongside transactions. They note that priority inversion can still happen in a transactional system, but they propose a simple hardware extension to alleviate it (I didn't quite follow exactly what this was...see the paper, I guess).
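
My rough mental model of the cxspinlock policy, as a sketch (every name here is my guess at the shape of the thing, not TxLinux's API): critical sections run transactionally by default, and a section that must do I/O restarts itself holding the lock exclusively.

    # Optimistically run the critical section as a transaction; fall back
    # to classic mutual exclusion when the section needs un-rollback-able
    # work like touching device registers.
    import threading

    class Abort(Exception): pass

    def cx_critical(lock, body):
        try:
            body(transactional=True)        # speculative, conflict-checked
        except Abort:
            lock.acquire()                  # restart holding the lock
            try:
                body(transactional=False)
            finally:
                lock.release()

    lock = threading.Lock()
    def body(transactional):
        if transactional:
            raise Abort("about to touch a device register")
        print("ran exclusively; I/O is safe here")

    cx_critical(lock, body)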

Questions:
Q) Are there pathological cases you didn't talk about? (He did talk about exponential backoff in transaction resolution being a problem.)
A) Overrunning the transactional memory's capacity (roughly) can be a problem; no good solution from the community yet.

Q) Is this how you would do things if we started over with transactions?
A) No, probably not. But this is the world we live in.

Q) Doesn't using a simulator goof your numbers? How do you calibrate?
A) mumble.

(there was another question I missed)