Authors: Ripal Nathuji (Georgia Institute of Technology) and Karsten Schwan (Georgia Institute of Technology)
Paper: http://www.sosp2007.org/papers/sosp111-nathuji.pdf
Datacenters need power management, and cooling makes things worse. Industry's answer so far is ACPI.
How do you maintain power information in the virtualization layer?
Lots of platform heterogeneity in a datacenter: variations in power, performance, and manageability. Homogenize it using VMs. You can restrict a VM's physical utilization ("soft scaling"). The guest OS can end up adhering to virtualization-level SLAs without being specifically aware of them. Take advantage of the feedback loop.
Implementation: VPM events are created when the guest OS makes a power call; Dom0 can retrieve these events.
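A toy sketch (my own, definitely not the actual VirtualPower/Xen code) of the flow just described: a guest's ACPI-style P-state request becomes a VPM event, and a Dom0 policy maps it onto either a real hardware state or "soft scaling" (capping the VM's CPU share). Every name and the cap formula here are made up for illustration.

    from collections import deque

    vpm_events = deque()                      # hypothetical event channel to Dom0

    def guest_set_pstate(vm_id, pstate):
        # Guest power call is trapped and turned into a VPM event.
        vpm_events.append({"vm": vm_id, "pstate": pstate})

    def dom0_policy(set_frequency, set_cpu_cap, hw_pstates=2):
        # Dom0 drains events and decides how to realize each request.
        while vpm_events:
            ev = vpm_events.popleft()
            if ev["pstate"] < hw_pstates:
                set_frequency(ev["pstate"])                     # real hardware P-state
            else:
                cap = 100 >> (ev["pstate"] - hw_pstates + 1)    # e.g. 50%, 25%, ...
                set_cpu_cap(ev["vm"], cap)                      # "soft scaling"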
Future work: VPM tokens, idle power management (additional VPM C-states), efficient soft-scale consolidations
Q: Any estimate of whether adherence to OS power interface has a cost?
A: Want to look at it. Looking at a lightweight paravirtualization solution.
Q: What about consolidating nodes (i.e., shutting down physical nodes)?
A: Well, we can do this on-the-fly automatically.
Tuesday, October 16, 2007
Integrating Concurrency Control and Energy Management in Device Drivers
Authors: Kevin Klues (Stanford University, Washington University in St. Louis, Technical University of Berlin), Vlado Handziski (Technical University of Berlin), Chenyang Lu (Washington University in St. Louis), Adam Wolisz (Technical University of Berlin, University of California Berkeley), David Culler (Arch Rock Co., University of California Berkeley), David Gay (Intel Research Berkeley), and Philip Levis (Stanford University)
Paper: http://www.sosp2007.org/papers/sosp186-klues.pdf
(SOSP presentation)
Existing embedded devices usually rely on the application for power savings; manually shutting pieces off and on = bleh. ICEM: split-phase I/O operations, i.e., make them asynchronous. Three types of device driver (a toy split-phase sketch follows the list):
- virtualized
  - only a functional interface
  - assume multiple users
  - buffer I/O requests for energy savings
  - must be able to tolerate longer latencies
- dedicated
  - assume single user
  - no concurrency control
  - explicit energy management
- shared
  - functional and lock interface
  - multiple users
  - explicit concurrency control through split-phase lock
  - implicit energy management based on pending requests
  - used for stringent timing requirements
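A toy sketch (mine, not ICEM/TinyOS code) of the split-phase lock idea from the shared-driver bullets above: request() returns immediately, the "granted" event fires later as a callback, and the driver can power the device down whenever nothing is pending.

    class SplitPhaseLock:
        def __init__(self):
            self.held = False
            self.waiters = []

        def request(self, granted_cb):
            # Returns immediately; completion arrives via the callback.
            if not self.held:
                self.held = True
                granted_cb()            # "granted" event: device powers on here
            else:
                self.waiters.append(granted_cb)

        def release(self):
            if self.waiters:
                self.waiters.pop(0)()   # hand the lock to the next pending user
            else:
                self.held = False       # nothing pending: device can power off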
Q: Doesn't it ultimately boil down to application decisions no matter what?
A: Thinking of letting application send hints to system
Q: Does any of this apply to mainstream OSes?
A: Not yet...where we'd really like to see this is in mobile phone OSes.
Q: How does the programming model change for app writers?
A: Very much like async I/O.
Q: Can any of the transaction work apply here? You're sort of grouping operations into a transaction.
A: Hadn't thought about it.
Q: Send is bottleneck. Done anything about that?
A: We're just specifying an architecture. You can specify policy.
AutoBash: Improving Configuration Management with Operating System Causality Analysis
Configuration management sucks. Want to automate:
- Replay mode - automatically search for solution
- Observation mode - highly interactive problem solving
- Healthcare mode - make sure things stay working
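A toy sketch (my own guess at the shape, not AutoBash code) of the predicate-driven search the Q&A below keeps coming back to: a predicate is just a test of whether some piece of configuration currently works, and the tool speculatively tries candidate fixes until the predicates pass.

    import subprocess

    def predicate(cmd, expect_rc=0):
        # e.g. predicate("ping -c1 printserver") or predicate("make test")
        return subprocess.run(cmd, shell=True, capture_output=True).returncode == expect_rc

    def try_solutions(solutions, predicates):
        # solutions: list of (apply_fix, undo_fix) callables
        for apply_fix, undo_fix in solutions:
            apply_fix()                              # AutoBash would do this speculatively
            if all(predicate(p) for p in predicates):
                return apply_fix                     # keep the fix that makes everything pass
            undo_fix()                               # roll back and try the next candidate
        return None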
Q: How far would just having transactions in Linux get you?
A: mumble.
Q: How well would this work in a distributed file system environment? Also, how important is it to have predicates for every piece of software?
A: Speculator doesn't work on distributed system. Should work on more general distributed systems.
Q: Maybe you could mine installation process for information about what the correct configuration is supposed to be. Now the Q: where can you get the predicates?
A: mumble.
Q: How well would this work with persistent state transactions (rather than speculation)?
A: If only work in persistent state, bad state can become incorporated into the persistent state. That's bad.
Q: What if the things you try only partially fix the problem or just gives you a clue about what to try next?
A: Which predicates work should tell you something about what needs to happen.
Q: What if you need to apply multiple solutions to get it to work? Can the system figure that out?
A: Future work.
Staged Deployment in Mirage, an Integrated Software Upgrade Testing and Distribution System
Authors: Olivier Crameri (EPFL), Nikola Knezevic (EPFL), Dejan Kostic (EPFL), Ricardo Bianchini (Rutgers), and Willy Zwaenepoel (EPFL)
Paper: http://www.sosp2007.org/papers/sosp076-crameri.pdf
This paper seems to focus on the issue of clustering machines that behave identically with respect to an upgrade.
Heuristically categorize dependencies used at runtime. Can be user-defined. Fingerprint resources. Categorize based on set of resources.
How effective is automatic resource classification? Good: no errors, though a small single-digit number of vendor-specific rules were needed.
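A rough sketch (my own, not Mirage's implementation) of the clustering step as I understood it: fingerprint each resource a machine uses, then group machines by the set of fingerprints, so machines expected to behave identically under the upgrade land in the same cluster.

    import hashlib
    from collections import defaultdict

    def fingerprint(name, content):
        # content is the resource's bytes (library, config file, ...)
        return (name, hashlib.sha1(content).hexdigest())

    def cluster(machines):
        # machines: dict machine_id -> iterable of (resource_name, content_bytes)
        clusters = defaultdict(list)
        for mid, resources in machines.items():
            key = frozenset(fingerprint(n, c) for n, c in resources)
            clusters[key].append(mid)
        return list(clusters.values())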
Q: Isn't this going to slow things down and make things easier for people to exploit security flaws?
A: As we said, there's a tradeoff
Q: Could this help you narrow down differences in configuration that cause bugs?
A: Hopefully.
Q: Isn't the number of configurations subject to combinatorial explosions?
A: Sure, possible...in practice hopefully not? We're studying this now.
Dynamo: Amazon's Highly Available Key-Value Store
Authors: Giuseppe DeCandia (Amazon.com), Deniz Hastorun (Amazon.com), Madan Jampani (Amazon.com), Gunavardhan Kakulapati (Amazon.com), Avinash Lakshman (Amazon.com), Alex Pilchin (Amazon.com), Swami Sivasubramanian (Amazon.com), Peter Vosshall (Amazon.com), and Werner Vogels (Amazon.com)
Paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
Amazon is a loosely coupled, service-oriented architecture. Each service is independent, but must adhere to latency SLAs. Availability is paramount.
An RDBMS is a poor fit even though key-value access is all they need: most features go unused, it scales up rather than out, and it has availability limitations.
Generally care more about availability than consistency.
Needs to be always writable, even in failure, even without previous context.
Want "knobs" to tune tradeoffs between cost, consistency, durability, and latency.
Overview:
- consistent hashing
- optimistic replication
- "sloppy quorum"
- anti-entropy mechanisms
- object versioning
"Sloppy quorum": N replicas in ideal state, read from at least R nodes, write to at least W nodes. "Sloppy" because the membership is dynamic based on node availability. Different values for N, R, and W yield different characteristics for the resulting system.
Each write is a new version. In the worst case, a read might return a stale version. A write based on this creates a branch in the version history.
It is up to the application to resolve version history conflicts! All (relevant) versions returned to app!
Use vector clocks to take care of version history (preserves causality).
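And a minimal vector-clock sketch (again mine, just to pin the idea down): a version descends from another iff its counters are >= the other's everywhere; otherwise the two are concurrent and both get handed back to the application.

    def increment(clock, node):
        clock = dict(clock)
        clock[node] = clock.get(node, 0) + 1
        return clock

    def descends(a, b):
        # True if the version with clock a is a descendant of (or equal to) clock b.
        return all(a.get(n, 0) >= c for n, c in b.items())

    v1 = increment({}, "sx")            # write handled by node sx
    v2 = increment(v1, "sy")            # later write handled by sy
    v3 = increment(v1, "sz")            # concurrent write handled by sz
    print(descends(v2, v1), descends(v2, v3), descends(v3, v2))   # True False False -> conflict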
Lessons learned:
- (missed first)
- repartitioning is slow because propagating data to new nodes takes forever (gets throttled; lots of random disk I/O)
- use fixed arcs; allow transfer of whole database (a file copy, linear read on disk)
- no transactional semantics
- no ordered traversal
- no large objects
- does not scale indefinitely
Q: (missed)
A: We overprovision to deal with typical failure scenarios, including a whole datacenter dying.
Q: When you need to add capacity, don't you need to shed load off of everybody?
A: Nodes have lots of neighbors. Adding nodes does pull load away from a bunch of others.
Q: How do you do reconciliation?
A: Use Merkle hash trees for reconciliation?
Q: How do you prove that you met SLAs?
A: not sure
Q: Talk about the kind of conflicts you saw?
A: 99.94% of reads return single value. Most others returning two versions. Some of those might be returning write retries that happen in parallel.
Q: How often do you not achieve quorum?
A: Never!
Q: Ever been a partition?
A: Sure...racks fail. Client can't see it though.
Q: Clients might potentially see lots of versions (even if it's rare). How do clients do reconciliation? No version ancestry?
A: Very application-specific. Sometimes last-write wins. Somewhat hard model to program to. Could potentially not garbage collect. No proof of convergence.
Q: Why did you go with consistent hashing?
A: High availability.
Q: What happens when some keys are more popular than others?
A: Often we don't see that. Often falls into the noise.
PeerReview: Practical Accountability for Distributed Systems
Authors: Andreas Haeberlen (MPI-SWS), Petr Kouznetsov (MPI-SWS), and Peter Druschel (MPI-SWS)
Paper: http://www.sosp2007.org/papers/sosp118-haeberlen.pdf
How do you detect faults when the system is federated and you can't see all of it? Specifically, how do you detect faults, how do you identify faulty nodes, and how do you convince others of it? Obviously, we need verifiable evidence.
General solution: keep a log, and have an auditor that periodically inspects the log. The log is a hash chain (to prevent changing the log ex post facto).
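A small hash-chain sketch (my illustration, not PeerReview's actual log format) of why the log can't be rewritten after the fact: every entry commits to everything before it, so tampering breaks verification from that point on.

    import hashlib

    def append(log, data):
        prev = log[-1][0] if log else b"\x00" * 32
        h = hashlib.sha256(prev + data).digest()
        log.append((h, data))

    def verify(log):
        prev = b"\x00" * 32
        for h, data in log:
            if h != hashlib.sha256(prev + data).digest():
                return False
            prev = h
        return True

    log = []
    append(log, b"send m1")
    append(log, b"recv m2")
    print(verify(log))                     # True
    log[0] = (log[0][0], b"tampered")
    print(verify(log))                     # False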
Probabilistic log checking allows for scalability (otherwise overhead would be quadratic).
Q: How do you prevent collusion?
A: We used consistent hashing to choose witnesses, then secure routing.
Q: How do you deal with selective processing?
A: (reiterates what said in the talk)
Q: Seems like most appropriate to malicious faults given that it's all the same state machines. Is this useful for failing software?
A: (nothing useful...offline)
Q: (you misrepresented my CATS system...) How do you make logs visible in a secure way?
A: ??? Assume always at least one correct witness node.
Q: Why is non-repudiation work from 70s not applicable?
A: (Not sure what you're saying, offline)
Sinfonia: A New Paradigm for Building Scalable Distributed Systems
Authors: Marcos K. Aguilera (HP Labs), Arif Merchant (HP Labs), Mehul Shah (HP Labs), Alistair Veitch (HP Labs), and Christos Karamanolis (VMWare)
Paper: http://www.sosp2007.org/papers/sosp064-aguilera.pdf
(SOSP presentation)
This is an infrastructure for distributed applications. The general idea is to create a set of linear memories, and a particular memory address is identified by a node id and an offset. They use small, short-lived transactions with the semantics that if all read values match what the transaction expects, then the transaction commits.
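A conceptual, single-node sketch (mine, not Sinfonia's protocol, which runs its commit across memory nodes) of those minitransaction semantics: the expected values are checked first, and only if they all match are the reads returned and the writes applied atomically.

    def minitransaction(memory, compares, reads, writes):
        # memory: dict mapping (node_id, addr) -> value
        if any(memory.get(addr) != expected for addr, expected in compares):
            return None                        # abort: some compare item failed
        result = {addr: memory.get(addr) for addr in reads}
        for addr, value in writes:
            memory[addr] = value               # commit: apply all write items
        return result

    mem = {(0, 0): "A", (1, 8): "B"}
    print(minitransaction(mem, [((0, 0), "A")], [(1, 8)], [((1, 8), "C")]))   # {(1, 8): 'B'}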
They have a simplified 2-phase commit protocol with, essentially, no coordinator. The transaction blocks if _any_ of the application nodes crash, but they argue this isn't that big a deal because the action may involve application data located on the application node, and if the application node isn't available, you're screwed anyway (rough paraphrase).
They built a cluster file system and a group communication service.
Crap...wasn't paying attention to questions (at least, not writing them down). From what I remember:
- Why this and not a federated array of bricks?
- Are there any pathological cases you found that resulted in frequent rollback?
- other stuff...
Monday, October 15, 2007
MUVI: Automatically Inferring Multi-Variable Access Correlations and Detecting Related Semantic and Concurrency Bugs
Authors: Shan Lu (University of Illinois), Soyeon Park (University of Illinois), Chongfeng Hu (University of Illinois), Xiao Ma (University of Illinois), Weihang Jiang (University of Illinois), Zhenmin Li (University of Illinois), Raluca A. Popa (MIT), and Yuanyuan Zhou (University of Illinois)
Paper: http://www.sosp2007.org/papers/sosp061-lu.pdf
(SOSP presentation)
There are lots of correlated variables in the world (e.g., "when I access X, I should be accessing Y as well"). It's a bug if you don't access them together. Look for this. Their metric is "distance" in the source code. Find variables that are usually accessed together and rarely accessed separately. Use techniques of "itemset" from data mining.
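A toy sketch (mine, not MUVI's analysis) of the itemset flavor of this: count how often variables are accessed together versus alone, and flag pairs that usually co-occur but sometimes show up by themselves.

    from collections import Counter
    from itertools import combinations

    def correlated_pairs(access_sets, min_support=10, min_confidence=0.9):
        # access_sets: one set of variable names per function/nearby code region
        single, pair = Counter(), Counter()
        for accesses in access_sets:
            single.update(accesses)
            pair.update(combinations(sorted(accesses), 2))
        out = []
        for (a, b), together in pair.items():
            if together < min_support:
                continue
            confidence = together / min(single[a], single[b])
            if confidence >= min_confidence:
                out.append((a, b, confidence))
        return out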
They looked at codebases from Mozilla, MySQL, Linux, and PostgreSQL. About a thousand correlations. About a 15% false-positive rate (macros, coincidences). They can't detect conditional correlations (that seems like a weird programming paradigm anyway).
Q: How difficult is it to resolve the false positives, especially in relation to things like RacerX?
A: ??? Hard to compare.
Q: Can it handle more complicated examples? (ed. didn't she say she couldn't deal with these cases?)
A: ??? Branching shouldn't be an issue.
Q: What about more complex correlations, e.g., c is a sum of a and b?
A: Boil down to considering pair as a unit. Room for improvement though.
TxLinux: Using and Managing Hardware Transactional Memory in the Operating System
Authors: Christopher J. Rossbach (UT Austin), Owen S. Hofmann (UT Austin), Donald E. Porter (UT Austin), Hany E. Ramadan (UT Austin), Aditya Bhandari (UT Austin), and Emmett Witchel (UT Austin)
Paper: http://www.sosp2007.org/papers/sosp056-rossbach.pdf
(SOSP 2007 presentation)
This paper presents an effort to use transactional memory in the Linux kernel. The main contribution seems to be the demand that transactions interoperate with locks for legacy code reasons. They have a "cxspinlock" API that allows this, which in turn allows you to do I/O with transactions. They note that priority inversion can still happen in a transactional system, but they propose a simple hardware extension to alleviate this (I didn't quite follow exactly what this was...see paper I guess).
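A rough sketch (my own, in Python rather than kernel C, and definitely not the cxspinlock API itself) of just the fallback shape I took away from the talk: try the optimistic/transactional path a bounded number of times, then fall back to plain mutual exclusion, which is also what you need when the critical section does I/O. The attempt_transactional hook is hypothetical.

    import threading

    MAX_RETRIES = 8

    class CooperativeLock:
        def __init__(self):
            self._lock = threading.Lock()

        def run(self, critical_section, attempt_transactional):
            # Optimistic path: attempt_transactional returns True if it committed.
            for _ in range(MAX_RETRIES):
                if attempt_transactional(critical_section):
                    return
            # Fallback path: plain exclusive lock (needed for I/O or heavy contention).
            with self._lock:
                critical_section()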
Questions:
Q) Are there pathological cases you didn't talk about (he did talk about exponential backoff in transaction resolution being a problem)
A) overrunning the transactional memory (roughly) can be a problem; no good solution from community
Q) Is this how you would do things if we started over with transactions?
A) No, probably not. But this is the world we live in.
Q) Doesn't using a simulator goof your numbers? How do you calibrate?
A) mumble.
(there was another question I missed)
Sunday, August 5, 2007
Building distributed applications using Sinfonia
Authors: Marcos K. Aguilera, Christos Karamanolis, Arif Merchant, Mehul Shah, and Alistair Veitch
Paper: http://www.hpl.hp.com/techreports/2006/HPL-2006-147.pdf
So...this paper presents a system called Sinfonia that is designed to be infrastructure for building distributed systems. Sinfonia presents 3 features:
- a global address space, sort of -- addresses are tuples of the form (node ID, address). Values can be read and written.
- "minitransactions" -- minitransactions are transactions restricted to be short lived and consist of distinct read, compare, and write phases (in that order), and they may only manipulate values in the global address space
- notifications -- a node may request to be notified when changes occur in the store in a particular range of addresses
Also, and as a side note, I'm always a bit dubious of the concept of backup servers. Are you really better off making a distinction between primary and backup servers as opposed to integrating all the servers you _would_ have employed as backups into the system itself? I.e., are you better off having n primaries and n backups instead of a system with 2n nodes? Seems like the latter would give you greater flexibility, in general, in allocating your resources. I suppose primary and backup are simply logical roles, and that any given physical server could play multiple roles at one time, but if you do that you might find yourself unintentionally compromising failure correlation assumptions (e.g., several of your logical backups live on a single physical machine that crashes, thus screwing you).
Anyway, moving on...the notifications don't seem that interesting except insofar as they interact with the minitransactions, except they don't really talk about this. As far as I can tell, the semantics of the notifications are very weak. If that's the case, would employing them in the same system as the mini-transactions undermine the supposedly strong semantics the mini-transactions are supposed to give you? I suppose if you treat notifications as purely advisory then you should be okay, but isn't that ultimately the same as interacting with a system that has the same weak semantics that you're trying to avoid?
Regardless, however, I suspect their system is very practical and works just fine. I just don't really see any new contribution here. The contribution would come in the form of easing application development and making ultimate end-systems more robust, but without a deployment in the real world to point to, that point is basically lost.
Tuesday, June 26, 2007
ORNL informal meeting notes (2)
Met with Ken Roche, physicist turned CS guy. Met him yesterday and wasn't diligent about taking notes immediately afterwards, so I'm sure I'll have forgotten things.
We had an interesting conversation I, sadly, didn't follow all of. The meeting reminded me how little math I actually know. Nonetheless, I could at least follow the system design aspects of what he was talking about. We spent some time talking about how in god's name you debug a program running on a hundreds-to-thousands of node supercomputer. They have some cute visualization tools that allow you to view the MPI wait times for each node in the cluster. Turns out that they saw, for instance, some outliers that limited the performance of the system. The results are actually the reverse of what one might naively expect: the outliers with very _low_ wait times are the ones holding up the pack (this immediately makes sense if you think about it: if there's an outlier with an exceptionally low wait time, everybody else may have been sitting there waiting on the slow poke who, of course, can immediately continue once he finishes with whatever he's doing). They suspect that the outlier may, in fact, be due to faulty hardware, particularly in the memory subsystem, which is interesting. I wouldn't expect that the memory interface for just one node could go bad, and specifically that it could go bad in a way that wouldn't crash the program. A little jarring.
Another interesting fact is that right now, although there are dual-core processors in Jaguar (their hot-rod supercomputer), the physical memory associated with the processor is strictly partitioned between the cores so as to preserve the programming model (I think...I was a little unclear on why they do this). Moreover, interrupts are handled only by the "master" core, so that for I/O, both cores may end up being blocked. This seems horribly inefficient given that the whole point of multi-core is fast communication between cores. They know this is inefficient. I'm not sure how they're going to fix it.
I think the issue is that they currently run this weird stripped-down, lean-and-mean OS known as Catamount (which amounts to Linux stripped of just about everything that makes it Linux) that doesn't have, for instance, virtual memory. I think this is what necessitates the strict partitioning.
Weird aside: there are apparently these physics guys at MIT (whose names I of course can't remember) that have designed their quantum mechanical simulators _from scratch_, _in assembly code_ in order to get every last ounce of performance out of the fuckers. That scares me. Those fuckers are crazy. Especially since they aren't even CS people. As Ken said, the computers were just an obstacle in the way of being able to get their physics results. I feel really, really dumb right about now...
Friday, June 22, 2007
Sequoia: Programming the Memory Hierarchy
Authors: Kayvon Fatahalian, Timothy J. Knight, Mike Houston, Mattan Erez, Daniel Reiter Horn, Larkhoon Leem, Ji Young Park, Manman Ren, Alex Aiken, William J. Dally, Pat Hanrahan
Paper: http://graphics.stanford.edu/papers/sequoia/sequoia_sc06.pdf
Sequoia is a compiler for parallel-processing machines. It is designed to be first and foremost portable between parallel machines with different memory hierarchies. This is not to say the language is arbitrarily general; a key assumption is that any Sequoia task is essentially recursively defined and that communication among nodes at the same level (and indeed between anything but a parent and its child) is explicitly forbidden. This allows the code to be customizable and portable.
An interesting feature that they don't spend much time talking about (I think because it's both implicit and not in and of itself very complicated) is the separation of the program logic from the specification of the machine. Basically, you embed parameters describing the size of memory chunks in your program, and then the compiler takes what amounts to a manifest describing your hardware and shoves in the relevant values. It's a cute idea.
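A toy sketch (mine, not Sequoia code) of that separation: the algorithm is written as tasks that recursively subdivide their working set, and a separate machine description supplies the chunk size for each level of the memory hierarchy.

    MACHINE = {"chunk_sizes": [1024, 64]}        # hypothetical per-level capacities

    def vec_add(dst, a, b, lo, hi, level=0):
        sizes = MACHINE["chunk_sizes"]
        if level >= len(sizes) or hi - lo <= sizes[level]:
            for i in range(lo, hi):              # leaf task: runs within one memory level
                dst[i] = a[i] + b[i]
            return
        mid = (lo + hi) // 2                     # inner task: subdivide for the level below
        vec_add(dst, a, b, lo, mid, level + 1)
        vec_add(dst, a, b, mid, hi, level + 1)

    n = 4096
    x, y, z = list(range(n)), list(range(n)), [0] * n
    vec_add(z, x, y, 0, n)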
Perhaps the way to do this kind of thing is specify different parts of the system in different places and then have something that synthesizes them at runtime depending on the conditions? Is that so vague as to be a totally useless question?
Compilation for Explicitly Managed Memory Hierarchies
Authors: Timothy J. Knight, Ji Young Park, Manman Ren, Mike Houston, Mattan Erez, Kayvon Fatahalian, Alex Aiken, William J. Dally, Pat Hanrahan
Paper: http://graphics.stanford.edu/papers/sequoia-compiler/sequoia_ppopp07.pdf
Cool little paper on optimizing IL code for parallel processors (ostensibly Cell). (As I read it, it became obvious I should have read the Sequoia paper first, but whatever). The interesting piece was the explicit modeling of memory as a tree. Consider, for instance, several processors each with their own local memory and then, say, a single shared main memory. The IL models operations based on this memory hierarchy, for example copying between memory layers, performing computation on a given layer, etc. It's not clear to me whether this does, in fact, model anything other than Cell (processor), but it's kind of a cool idea nonetheless.
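A tiny sketch (my own, not the paper's IL) of the memory-as-a-tree model: data moves only via explicit copies between a parent memory and its children, and computation happens within a particular level.

    class Memory:
        def __init__(self, name, parent=None):
            self.name, self.parent, self.data = name, parent, {}

        def copy_down(self, key):
            self.data[key] = list(self.parent.data[key])     # parent -> this level

        def copy_up(self, key):
            self.parent.data[key] = list(self.data[key])     # this level -> parent

    main = Memory("main")
    local0 = Memory("local0", parent=main)
    main.data["block"] = [1, 2, 3]
    local0.copy_down("block")
    local0.data["block"] = [x * 2 for x in local0.data["block"]]   # compute locally
    local0.copy_up("block")
    print(main.data["block"])   # [2, 4, 6]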
I was less interested in the actual optimizations they did, which seem to give benefit to Sequoia programs, than I was in how they model their system (because I'm thinking about programming models for heterogeneous processing environments at the moment). They seem to do some fairly straightforward things like introducing dependencies to ensure orderings where needed, loop hoisting, copy elimination, etc.
Wednesday, June 20, 2007
ORNL informal meeting notes
Met with Olaf Storaasli, former NASA guy:
- got into FPGAs
- NASA likes 'em because they're more energy efficient
- if you do it right, you can get the whole chip active rather than just one part at a time as in microprocessors
- used on actual spacecraft
- btw, the way you deal with radiation is you reload the program twice a day...go figure.
- not clear how to program the buggers
- most work still done in VHDL
- some C-to-FPGA work done (I am skeptical of this approach)
- one guy did Matlab to FPGA (this seems a little cooler to me)
- new Crays ship with 2 FPGAs per processor
Monday, May 14, 2007
Privacy, Accuracy, and Consistency Too: A Holistic Solution to Contingency Table Release
Paper here: http://www.cs.princeton.edu/~satyen/papers/privacy.ps
I'll be honest: I saw a lot of greek symbols and ran away screaming, so I actually didn't get a whole lot out of this paper. This is doubly embarrassing as several authors are MSR-SVC-ers I know and love (Frank, Cynthia, Kunal).
Okay, so, the gist is that you want to release a "fuzzied" database with information about people in the form of "contingency tables," which are essentially histograms of various traits (think male vs. female, age ranges [20-29, 30-39, 40-49...], etc.). You want to maintain aggregate statistics about the data set without revealing information about particular people in the data set. The problem is that you want to maintain the accuracy of the data and keep it internally consistent in the process of making it private. This work focuses on keeping the data consistent.
The punchline is that instead of directly tweaking the data itself or the "marginals" (which they never define for us idiot-folk), they translate the data into the Fourier domain and tweak the data there. Turns out that has nice properties, though fuck me if I know even what that means or why it's true.
Supporting Ranking and Clustering as Generalized Order-By and Group-By
paper here: http://www-forward.cs.uiuc.edu/pubs/2007/clusterrank-sigmod07-lwlwc-mar07.pdf
I didn't get all the way through this paper...I'll admit it. But the idea here is to introduce an information retrieval-type operation into the standard SQL language. What the hell does that mean? Well, it seems to mean that they want to do some form of clustering. The example they give is realty. You'd like to look at a set of houses that are either lower-priced in the suburbs or higher-priced but with a nice view on the water. In this case, you want your query to return houses that fit into one of those clusters, and then you want to order houses within each cluster. So, they do it. And they are essentially running k-means with some weird little optimizations so that they don't have to materialize the entire database.
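For reference, a minimal generic k-means sketch (not the paper's centroid-of-centroids approximation), since that's the primitive everything here builds on.

    import random

    def kmeans(points, k, iters=20):
        centers = random.sample(points, k)
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for p in points:
                i = min(range(k),
                        key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                groups[i].append(p)
            centers = [tuple(sum(xs) / len(xs) for xs in zip(*g)) if g else centers[i]
                       for i, g in enumerate(groups)]
        return centers

    print(kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], 2))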
It was pointed out in the discussion that their semantics are weird and inexact. Which is to say: k-means is unstable, and it can give you different results running it multiple times on the same data. They take this instability and amplify it by doing what is essentially an approximation of k-means by creating centroids of (in some sense) adjacent tuples and then running k-means on the centroids. Problem is, I don't know how the approximation relates to the full k-means (which, again, has fuzzy semantics to begin with). So, I don't really know what guarantees I have on my results. Boo.
Sunday, May 13, 2007
WiLDNet: Design and Implementation of High Performance WiFi Based Long Distance Networks
Original paper here: http://www.cs.washington.edu/education/courses/590l/07sp/wildnet.pdf
Okay, my initial impression of this paper was that it was kind of cute, but not really groundbreaking. After reading some of the background, namely the 2P paper, I think most of the cool work was done there, and it's not entirely clear what new and interesting things happened here. (Side note: I really liked the 2P paper if for no other reason than they presented the idea very clearly. I got it almost immediately.)
The environment they're working in is long-haul wifi. We're talking about standard 802.11 used in a long-distance environment with point-to-point directional antennas. Specifically, a node has a different antenna for each communication partner. The problems in this environment are:
- the different antennas at a node can interfere with each other
- the propagation delay can be large enough to trigger link-layer timeouts
- the 802.11 specification mandates stop-and-wait semantics for frame acks (which take forever on those long links)
- external interference can wreak havoc
The solutions to these are fairly straightforward. Like 2P, these guys use time-slicing (TDMA) to eliminate collisions. They implement a sliding window ack mechanism to deal with the long propagation delay. Finally, they add erasure coding to cover loss. An interesting tweak to the erasure coding mechanism is that they send the packets they want to get there in the clear first, and then they send the "backup" packets afterwards. This eliminates the necessity of decoding the packets if there is no loss.
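A minimal sketch (my own illustration, not WiLDNet's code) of the "data in the clear first, parity after" trick, using a single XOR parity packet so the receiver only has to do work when something was actually lost.

    def xor_parity(packets):
        # All packets padded to the same length in this toy.
        parity = bytes(len(packets[0]))
        for p in packets:
            parity = bytes(a ^ b for a, b in zip(parity, p))
        return parity

    def recover_one(received, parity):
        # received: the data packets in order, with exactly one None (the lost one).
        missing = parity
        for p in received:
            if p is not None:
                missing = bytes(a ^ b for a, b in zip(missing, p))
        return missing

    data = [b"pkt0", b"pkt1", b"pkt2"]
    parity = xor_parity(data)
    print(recover_one([data[0], None, data[2]], parity))   # b'pkt1'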
That's about it...they do a nice examination of the design tradeoffs of things like overhead versus loss masking for both erasure coding and retransmissions.
First post! Woo hoo!
So, I've been meaning to create somewhere to store my random thoughts on the papers I read. For a while I agonized over whether this should be a wiki or a blog, where it should be hosted, etc. Eventually, I was paralyzed by these decisions, so at some point I said, "Fuck it...I'll just make a blogspot blog." If I decide otherwise later, I'll just migrate the entries. Somehow.
Anyway, this likely won't be coherent, organized, particularly thoughtful, or in any way useful. I just tend to forget what the fuck I've read, so I thought it would be useful to have a place to jot down random thoughts about technical papers I read.