Monday, March 3, 2008

Overshadow: A Virtualization-Based Approach to Retrofitting Protection in Commodity Operating Systems

Authors: Xiaoxin Chen (VMware), Tal Garfinkel (VMware), E. Christopher Lewis (VMware), Pratap Subrahmanyam (VMware), Carl A. Waldspurger (VMware), Dan Boneh (Stanford), Jeffrey Dwoskin (Princeton), Dan R. K. Ports (MIT)

Questions:

Q: Cloaking needs to be atomic? How?
A: Has to be atomic w/ respect to OS. No real fallout for concurrent systems.

Q: Why is MARSHALL mini-benchmark worse than PASSTHRU?
A:

Q: Could you reverse it and use it in the OS to protect against malicious VM?
A: Huh. Maybe.

Q: What's your threat model?
A: Don't worry about I/O (things like SSL protect network).

Q: Why is mmap read performance worse than write?
A: Write has to touch disk anyway, so minimal additional overhead. Read needs an extra page fault.

Tuesday, October 16, 2007

VirtualPower: Coordinated Power Management in Virtualized Enterprise Systems

Authors: Ripal Nathuji (Georgia Institute of Technology) and Karsten Schwan (Georgia Institute of Technology)

Paper: http://www.sosp2007.org/papers/sosp111-nathuji.pdf

Datacenters need power management. Cooling makes things worse. Industry has deployed ACPI.

How do you maintain power information in virtualization layer?

Lots of platform heterogeneity in a datacenter: variations in power, performance, and manageability. Homogenize using VMs. Can restrict the physical utilization of a VM ("soft scaling"). Virtualization-level SLAs can be adhered to without the guest OS being specifically aware of them. Take advantage of the feedback loop.

Implementation: VPM events created when guest OS makes a power call. Dom0 can retrieve these events.
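
A hypothetical sketch of that flow in Python (the names, frequencies, and policy are made up for illustration; the paper's implementation is on Xen):

```python
# Illustrative-only sketch of the VirtualPower event flow: a guest's ACPI-style
# frequency request becomes a "VPM event" that a Dom0 policy later consumes.
from dataclasses import dataclass
from queue import Queue

@dataclass
class VPMEvent:
    vm_id: int
    requested_mhz: int          # what the guest's own power policy asked for

event_queue = Queue()

def guest_power_call(vm_id, requested_mhz):
    """Trap in the virtualization layer: record the guest's wish instead of acting on it."""
    event_queue.put(VPMEvent(vm_id, requested_mhz))

def dom0_policy(hw_freqs_mhz=(1000, 1800, 2400)):
    """Dom0 drains VPM events and chooses between hardware scaling and soft scaling."""
    while not event_queue.empty():
        ev = event_queue.get()
        # Pick the smallest real P-state at or above the request...
        hw = min((f for f in hw_freqs_mhz if f >= ev.requested_mhz),
                 default=max(hw_freqs_mhz))
        # ...and make up the difference with "soft scaling": cap the VCPU's share
        # of the physical CPU so the VM sees roughly the performance it asked for.
        cap = min(1.0, ev.requested_mhz / hw)
        print(f"VM {ev.vm_id}: hw P-state {hw} MHz, VCPU cap {cap:.2f}")

guest_power_call(1, 1200)
dom0_policy()                   # -> VM 1: hw P-state 1800 MHz, VCPU cap 0.67
```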

Future work: VPM tokens, idle power management (additional VPM C-states), efficient soft-scale consolidations

Q: Any estimate of whether adherence to OS power interface has a cost?
A: Want to look at it. Looking at a lightweight paravirtualization solution.

Q: What about consolidating nodes (i.e., shutting down physical nodes)?
A: Well, we can do this on-the-fly automatically.

Integrating Concurrency Control and Energy Management in Device Drivers

Authors: Kevin Klues (Stanford University, Washington University in St. Louis, Technical University of Berlin), Vlado Handziski (Technical University of Berlin), Chenyang Lu (Washington University in St. Louis), Adam Wolisz (Technical University of Berlin, University of California Berkeley), David Culler (Arch Rock Co., University of California Berkeley), David Gay (Intel Research Berkeley), and Philip Levis (Stanford University)

Paper: http://www.sosp2007.org/papers/sosp186-klues.pdf

(SOSP presentation)

Existing embedded devices usually rely on the application for power savings; manually shutting pieces off and on is painful. ICEM: split-phase I/O operations, i.e., make them asynchronous. Three types of device driver:
  • virtualized
    • only a functional interface
    • assume multiple users
    • buffer I/O requests for energy savings
    • must be able to tolerate longer latencies
  • dedicated
    • assume single user
    • no concurrency control
    • explicit energy management
  • shared
    • functional and lock interface
    • multiple users
    • explicit concurrency control through split-phase lock
    • implicit energy management based on pending requests
    • used for stringent timing requirements
Power locks: a split-phase lock with integrated energy/configuration management. Concretely, it hooks down into a dedicated driver: the locking mechanism grants exclusive access to the device, and the lock takes care of powering the device on and off.
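
A rough reconstruction of the power-lock idea in Python (ICEM itself is written in nesC for TinyOS; the class and method names here are illustrative): the arbiter grants exclusive access, powers the device on for the first requester, and powers it off once the request queue drains.

```python
# Sketch of a split-phase "power lock": exclusive access to a dedicated driver,
# with power-on at the first grant and power-off once the queue drains.
from collections import deque

class DedicatedDriver:
    def power_on(self):  print("device: on")
    def power_off(self): print("device: off")
    def do_io(self, req): print(f"device: handling {req}")

class PowerLock:
    def __init__(self, driver):
        self.driver = driver
        self.holder = None
        self.waiters = deque()

    def request(self, client, granted_cb):
        """Split-phase: returns immediately; granted_cb fires once the lock is ours."""
        if self.holder is None:
            self.driver.power_on()          # implicit energy management
            self.holder = client
            granted_cb(client)
        else:
            self.waiters.append((client, granted_cb))

    def release(self, client):
        assert client == self.holder
        if self.waiters:
            self.holder, cb = self.waiters.popleft()
            cb(self.holder)                 # next user runs with the device still on
        else:
            self.holder = None
            self.driver.power_off()         # no pending requests: power down

# Two clients share the device; neither ever calls power_on/power_off itself.
dev = DedicatedDriver()
lock = PowerLock(dev)
lock.request("sensor-A", lambda c: dev.do_io(f"{c} read"))   # powers device on
lock.request("sensor-B", lambda c: dev.do_io(f"{c} read"))   # queued behind A
lock.release("sensor-A")    # B is granted next; device stays on
lock.release("sensor-B")    # queue empty: device powers off
```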

Q: Doesn't it ultimately boil down to application decisions no matter what?
A: Thinking of letting application send hints to system

Q: Does any of this apply to mainstream OSes?
A: Not yet...where we'd really like to see this is in mobile phone OSes.

Q: How does the programming model change for app writers?
A: Very much like async I/O.

Q: Can any of the transaction work apply here? You're sort of grouping operations into a transaction.
A: Hadn't thought about it.

Q: Send is bottleneck. Done anything about that?
A: We're just specifying an architecture. You can specify policy.

AutoBash: Improving Configuration Management with Operating System Causality Analysis

Configuration management sucks. Want to automate:
  • Replay mode - automatically search for solution
  • Observation mode - highly interactive problem solving
  • Health monitoring mode - make sure things stay working
In observation mode, testing whether the app now works is tedious. Set up predicates to determine functionality status. Use Speculator (SOSP 2005) to do lightweight checkpointing. Automatically run regression tests after a problem is "solved." Use causality tracking to determine which predicates need to be re-run.
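
A sketch of that loop in Python (the predicates, state representation, and checkpoint mechanism are stand-ins; AutoBash uses shell-level predicates and Speculator's OS-level speculation rather than an in-memory copy):

```python
# Illustrative replay loop: predicates are small pass/fail tests; each candidate
# solution is tried under a checkpoint and rolled back unless all predicates pass.
import copy

def predicate_apache_serves(state):   # "does the web server work?"
    return state.get("apache.conf") == "Listen 80"

def predicate_cgi_runs(state):        # "do CGI scripts execute?"
    return state.get("cgi.enabled") is True

PREDICATES = [predicate_apache_serves, predicate_cgi_runs]

def try_solutions(state, solutions):
    """Apply candidate fixes speculatively; commit the first one that passes all predicates."""
    for name, fix in solutions:
        checkpoint = copy.deepcopy(state)          # stand-in for a Speculator checkpoint
        fix(state)
        if all(p(state) for p in PREDICATES):
            print(f"committed: {name}")
            return state
        state.clear(); state.update(checkpoint)    # roll back the failed attempt
        print(f"rolled back: {name}")
    return state

system = {"apache.conf": "Listen 8080", "cgi.enabled": True}
try_solutions(system, [
    ("enable cgi",      lambda s: s.update({"cgi.enabled": True})),
    ("fix listen port", lambda s: s.update({"apache.conf": "Listen 80"})),
])
```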

Q: How far would just having transactions in Linux get you?
A: mumble.

Q: How well would this work in a distributed file system environment? Also, how important is it to have predicates for every piece of software?
A: Speculator doesn't currently handle distributed file systems, but the approach should extend to more general distributed systems.

Q: Maybe you could mine the installation process for information about what the correct configuration is supposed to be. So the question is: where can you get the predicates?
A: mumble.

Q: How well would this work with persistent state transactions (rather than speculation)?
A: If only work in persistent state, bad state can become incorporated into the persistent state. That's bad.

Q: What if the things you try only partially fix the problem or just gives you a clue about what to try next?
A: Which predicates work should tell you something about what needs to happen.

Q: What if you need to apply multiple solutions to get it to work? Can the system figure that out?
A: Future work.

Staged Deployment in Mirage, an Integrated Software Upgrade Testing and Distribution System

Authors: Olivier Crameri (EPFL), Nikola Knezevic (EPFL), Dejan Kostic (EPFL), Ricardo Bianchini (Rutgers), and Willy Zwaenepoel (EPFL)

Paper: http://www.sosp2007.org/papers/sosp076-crameri.pdf

This paper seems to focus on the issue of clustering machines that behave identically with respect to an upgrade.

Heuristically categorize the dependencies used at runtime (rules can be user-defined). Fingerprint the resources, then cluster machines based on their sets of resources.
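
Roughly what the clustering step amounts to, sketched in Python (the resource names and hashing details are illustrative, not Mirage's actual implementation):

```python
# Sketch of Mirage-style clustering: fingerprint the resources an application
# actually depends on and group machines whose fingerprints match, so that one
# tested machine can stand in for its whole cluster during staged deployment.
import hashlib
from collections import defaultdict

def fingerprint(resources):
    """resources: {resource_name: canonicalized content/attributes}."""
    h = hashlib.sha1()
    for name in sorted(resources):
        h.update(name.encode())
        h.update(resources[name].encode())
    return h.hexdigest()

machines = {
    "host1": {"/etc/app.conf": "mode=a", "libssl": "0.9.8"},
    "host2": {"/etc/app.conf": "mode=a", "libssl": "0.9.8"},
    "host3": {"/etc/app.conf": "mode=b", "libssl": "0.9.8"},
}

clusters = defaultdict(list)
for host, res in machines.items():
    clusters[fingerprint(res)].append(host)

for fp, hosts in clusters.items():
    print(fp[:8], hosts)    # host1 and host2 land in the same cluster
```

One machine from each cluster can then be upgraded and tested as a representative of the rest.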

How effective is automatic resource classification? Good: no errors, though a small single-digit number of vendor-specific rules was needed.

Q: Isn't this going to slow things down and make things easier for people to exploit security flaws?
A: As we said, there's a tradeoff

Q: Could this help you narrow down differences in configuration that cause bugs?
A: Hopefully.

Q: Isn't the number of configurations subject to combinatorial explosion?
A: Sure, possible...in practice hopefully not? We're studying this now.

Dynamo: Amazon's Highly Available Key-Value Store

Authors: Giuseppe DeCandia (Amazon.com), Deniz Hastorun (Amazon.com), Madan Jampani (Amazon.com), Gunavardhan Kakulapati (Amazon.com), Avinash Lakshman (Amazon.com), Alex Pilchin (Amazon.com), Swami Sivasubramanian (Amazon.com), Peter Vosshall (Amazon.com), and Werner Vogels (Amazon.com)

Paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

Amazon is a loosely coupled, service-oriented architecture. Each service is independent, but must adhere to latency SLAs. Availability is paramount.

An RDBMS is a poor fit even though the data model is key-value: most of its features go unused, it scales up rather than out, and it has availability limitations.

Generally care more about availability than consistency.

Needs to be always writable, even in failure, even without previous context.

Want "knobs" to tune tradeoffs between cost, consistency, durability, and latency.

Overview:
  • consistent hashing
  • optimistic replication
  • "sloppy quorum"
  • anti-entropy mechanisms
  • object versioning
Use a full membership model rather than overlay routing. Use virtual nodes for load balancing.
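
A minimal sketch of consistent hashing with virtual nodes (parameters are illustrative; Dynamo's real ring additionally maintains preference lists and gossip-based membership):

```python
# Consistent hashing with virtual nodes: each physical node owns several ring
# positions ("tokens"), and a key's preference list is the first N distinct
# physical nodes clockwise from the key's hash.
import hashlib
from bisect import bisect_right

def h(s): return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def preference_list(self, key, n=3):
        start = bisect_right(self.points, h(key))
        owners, i = [], 0
        while len(owners) < n and i < len(self.ring):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in owners:
                owners.append(node)     # skip extra vnodes of the same host
            i += 1
        return owners

ring = Ring(["A", "B", "C", "D"])
print(ring.preference_list("cart:12345"))   # e.g. ['C', 'A', 'D']
```

Virtual nodes let one physical host own many small arcs of the ring, which spreads load more evenly and makes adding or removing a node cheaper.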

"Sloppy quorum": N replicas in ideal state, read from at least R nodes, write to at least W nodes. "Sloppy" because the membership is dynamic based on node availability. Different values for N, R, and W yield different characteristics for the resulting system.

Each write is a new version. In the worst case, a read might return a stale version; a write based on it creates a branch in the version history.

It is up to the application to resolve version history conflicts! All (relevant) versions returned to app!

Use vector clocks to take care of version history (preserves causality).
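
A small sketch of how that reconciliation could look in Python (the value/clock layout mirrors the paper's shopping-cart example; the function names are mine): a version whose clock is dominated by another's is obsolete, while incomparable clocks are concurrent branches handed back to the application.

```python
# Vector clocks for Dynamo-style versioning: a version whose clock descends from
# another's is obsolete; incomparable clocks mean a conflict for the app to resolve.
def descends(a, b):
    """True if clock a includes everything in clock b (a >= b element-wise)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(versions):
    """Drop versions dominated by some other version; return the conflicting rest."""
    return [v for v in versions
            if not any(o is not v and descends(o["clock"], v["clock"]) for o in versions)]

v1 = {"value": "cart=[book]",     "clock": {"Sx": 1}}
v2 = {"value": "cart=[book,pen]", "clock": {"Sx": 1, "Sy": 1}}   # descends from v1
v3 = {"value": "cart=[book,mug]", "clock": {"Sx": 1, "Sz": 1}}   # concurrent with v2

print(reconcile([v1, v2, v3]))   # v1 is dropped; v2 and v3 go back to the application
```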

Lessons learned:
  • (missed first)
  • repartitioning is slow because propagating data to new nodes takes forever (it gets throttled and involves lots of random disk I/O)
    • use fixed-size partitions (arcs of the ring); this allows transferring a partition's whole database as a file copy (a linear read on disk)
Limitations:
  • no transactional semantics
  • no ordered traversal
  • no large objects
  • does not scale indefinitely
Q: Failure seems to add load. What kind of MTTF do you need to avoid filling the key space?
A: We overprovision to deal with typical failure scenarios, including whole datacenter dying.

Q: When you need to add capacity, don't you need to shed load off of everybody?
A: Nodes have lots of neighbors. Adding nodes does pull load away from a bunch of others.

Q: How do you do reconciliation?
A: Use a Merkle hash tree for reconciliation.

Q: How do you prove that you met SLAs?
A: not sure

Q: Talk about the kind of conflicts you saw?
A: 99.94% of reads return a single value. Most of the others return two versions. Some of those might be due to write retries that happen in parallel.

Q: How often do you not achieve quorum?
A: Never!

Q: Ever been a partition?
A: Sure... racks fail. Clients can't see it, though.

Q: Clients might potentially see lots of versions (even if it's rare). How do clients do reconciliation? No version ancestry?
A: Very application-specific. Sometimes last-write wins. Somewhat hard model to program to. Could potentially not garbage collect. No proof of convergence.

Q: Why did you go with consistent hashing?
A: High availability.

Q: What happens when some keys are more popular than others?
A: Often we don't see that. Often falls into the noise.

PeerReview: Practical Accountability for Distributed Systems

Authors: Andreas Haeberlen (MPI-SWS), Petr Kouznetsov (MPI-SWS), and Peter Druschel (MPI-SWS)

Paper: http://www.sosp2007.org/papers/sosp118-haeberlen.pdf

How do you detect faults when the system is federated and you can't see all of it? Specifically: how do you detect faults, how do you identify the faulty nodes, and how do you convince others that a node is faulty? Obviously, we need verifiable evidence.

General solution: keep a log and have an auditor that periodically inspects it. The log is a hash chain (to prevent changing the log ex post facto).
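
A minimal sketch of such a tamper-evident log (assuming SHA-256; PeerReview additionally signs each entry and exchanges authenticators, which is omitted here):

```python
# Tamper-evident log as a hash chain: each entry's hash covers its content and
# the previous hash, so rewriting any past entry breaks every hash after it.
import hashlib

def entry_hash(prev_hash, seq, content):
    return hashlib.sha256(f"{prev_hash}|{seq}|{content}".encode()).hexdigest()

class Log:
    def __init__(self):
        self.entries = []          # list of (seq, content, hash)

    def append(self, content):
        prev = self.entries[-1][2] if self.entries else "0" * 64
        seq = len(self.entries)
        self.entries.append((seq, content, entry_hash(prev, seq, content)))

    def audit(self):
        """Recompute the chain; any in-place tampering breaks it."""
        prev = "0" * 64
        for seq, content, h in self.entries:
            if entry_hash(prev, seq, content) != h:
                return f"tampering detected at entry {seq}"
            prev = h
        return "log consistent"

log = Log()
log.append("SEND m1 to B")
log.append("RECV m2 from C")
print(log.audit())                                          # log consistent
log.entries[0] = (0, "SEND m1' to B", log.entries[0][2])    # rewrite history
print(log.audit())                                          # tampering detected at entry 0
```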

Probabilistic log checking allows for scalability (if every node audited every other node, overhead would grow quadratically in the number of nodes).

Q: How do you prevent collusion?
A: We used consistent hashing to choose witnesses, then secure routing.

Q: How do you deal with selective processing?
A: (reiterates what said in the talk)

Q: This seems most appropriate for malicious faults, given that all nodes run the same state machine. Is this useful for failing software?
A: (nothing useful...offline)

Q: (you misrepresented my CATS system...) How do you make logs visible in a secure way?
A: ??? Assume always at least one correct witness node.

Q: Why is non-repudiation work from 70s not applicable?
A: (Not sure what you're saying, offline)