Tuesday, October 16, 2007

Dynamo: Amazon's Highly Available Key-Value Store

Authors: Giuseppe DeCandia (Amazon.com), Deniz Hastorun (Amazon.com), Madan Jampani (Amazon.com), Gunavardhan Kakulapati (Amazon.com), Avinash Lakshman (Amazon.com), Alex Pilchin (Amazon.com), Swami Sivasubramanian (Amazon.com), Peter Vosshall (Amazon.com), and Werner Vogels (Amazon.com)

Paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

Amazon runs a loosely coupled, service-oriented architecture. Each service is independent but must adhere to latency SLAs. Availability is paramount.

An RDBMS is a poor fit even though the key-value access pattern would seem to suit one: most RDBMS features go unused, it scales up rather than out, and it has availability limitations.

Generally care more about availability than consistency.

Needs to be always writable, even during failures, even without prior context.

Want "knobs" to tune tradeoffs between cost, consistency, durability, and latency.

Overview:
  • consistent hashing
  • optimistic replication
  • "sloppy quorum"
  • anti-entropy mechanisms
  • object versioning
Use a full membership model rather than overlay routing. Use virtual nodes for load balancing.
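The consistent-hashing-with-virtual-nodes idea can be sketched as follows. This is an illustrative toy, not Amazon's implementation; the `Ring` class and node names are made up.

```python
# Illustrative sketch of consistent hashing with virtual nodes
# (not Dynamo's actual code; names are invented for the example).
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node owns several points ("virtual nodes") on the
        # ring, which smooths load when nodes join or leave.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )

    def owner(self, key: str) -> str:
        # Walk clockwise from the key's hash to the first virtual node.
        h = _hash(key)
        idx = bisect.bisect(self._points, (h,)) % len(self._points)
        return self._points[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("shopping-cart:123"))  # stable: same key -> same node
```

Because every node knows the full membership, any node can compute a key's owner locally and route in one hop, instead of the multi-hop routing of overlay systems.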

"Sloppy quorum": N replicas in ideal state, read from at least R nodes, write to at least W nodes. "Sloppy" because the membership is dynamic based on node availability. Different values for N, R, and W yield different characteristics for the resulting system.

Each write is a new version. In the worst case, a read might return a stale version; a write based on that stale version creates a branch in the version history.

It is up to the application to resolve version history conflicts! All (relevant) versions returned to app!

Use vector clocks to take care of version history (preserves causality).
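A minimal vector-clock sketch (illustrative, not Dynamo's actual code): each coordinating node increments its own counter on a write, and version A supersedes version B iff every counter in B is ≤ the matching counter in A; otherwise the versions are concurrent and must be reconciled.

```python
# Minimal vector clocks as dicts of node -> counter (illustrative sketch).

def increment(clock: dict, node: str) -> dict:
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a: dict, b: dict) -> bool:
    """True if version a causally includes version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

v1 = increment({}, "sx")   # write coordinated by node sx
v2 = increment(v1, "sx")   # later write at sx: descends from v1
v3 = increment(v1, "sy")   # concurrent write at sy, also based on v1
conflict = not descends(v2, v3) and not descends(v3, v2)
print(conflict)  # True: concurrent branches; the application must reconcile
```

This is exactly the case where Dynamo returns both versions to the application on a read.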

Lessons learned:
  • (missed first)
  • repartitioning is slow because propagating data to new nodes takes forever (gets throttled; lots of random disk I/O)
    • use fixed arcs; allow transfer of whole database (a file copy, linear read on disk)
Limitations:
  • no transactional semantics
  • no ordered traversal
  • no large objects
  • does not scale indefinitely
Q: Failure seems to add load. What kind of MTTF do you need to avoid filling the key space?
A: We overprovision to deal with typical failure scenarios, including whole datacenter dying.

Q: When you need to add capacity, don't you need to shed load off of everybody?
A: Nodes have lots of neighbors. Adding nodes does pull load away from a bunch of others.

Q: How do you do reconciliation?
A: Use a Merkle hash tree for reconciliation.
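The Merkle-tree comparison can be sketched like this (illustrative only, and assuming a power-of-two number of leaves for simplicity): replicas compare hashes top-down and only transfer keys under subtrees whose hashes differ.

```python
# Sketch of Merkle-tree comparison for anti-entropy (not Dynamo's code).
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build(leaves):
    """Build tree levels bottom-up from a list of leaf hashes."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels  # levels[-1][0] is the root hash

a = build([h(b"k1=v1"), h(b"k2=v2"), h(b"k3=v3"), h(b"k4=v4")])
b = build([h(b"k1=v1"), h(b"k2=XX"), h(b"k3=v3"), h(b"k4=v4")])
print(a[-1][0] == b[-1][0])  # False: roots differ, so replicas diverged
print(a[0][0] == b[0][0])    # True: first key range matches, skip it
print(a[0][1] == b[0][1])    # False: only this key range needs syncing
```

If the roots match, the replicas are in sync and no data moves at all; otherwise the divergent subtrees pinpoint which key ranges to transfer.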

Q: How do you prove that you met SLAs?
A: not sure

Q: Talk about the kind of conflicts you saw?
A: 99.94% of reads return a single value. Most of the rest return two versions. Some of those might be write retries that happened in parallel.

Q: How often do you not achieve quorum?
A: Never!

Q: Ever been a partition?
A: Sure... racks fail. The client can't see it, though.

Q: Clients might potentially see lots of versions (even if it's rare). How do clients do reconciliation? No version ancestry?
A: Very application-specific. Sometimes last-write wins. Somewhat hard model to program to. Could potentially not garbage collect. No proof of convergence.

Q: Why did you go with consistent hashing?
A: High availability.

Q: What happens when some keys are more popular than others?
A: Often we don't see that. Often falls into the noise.
