Tuesday, June 26, 2007

ORNL informal meeting notes (2)

Met with Ken Roche, a physicist turned CS guy. I met him yesterday and wasn't diligent about taking notes immediately afterwards, so I'm sure I'll have forgotten things.

We had an interesting conversation that, sadly, I didn't follow all of. The meeting reminded me how little math I actually know. Nonetheless, I could at least follow the system design aspects of what he was talking about. We spent some time talking about how in god's name you debug a program running on a supercomputer with hundreds to thousands of nodes. They have some cute visualization tools that let you view the MPI wait times for each node in the cluster. It turns out they saw, for instance, some outliers that limited the performance of the system. The results are actually the reverse of what one might naively expect: the outliers with very _low_ wait times are the ones holding up the pack (this immediately makes sense if you think about it: a node with an exceptionally low wait time is busy computing right up until each synchronization point, so everybody else sits there waiting on the slowpoke, who, of course, can continue immediately once he finishes whatever he's doing). They suspect that the outlier may, in fact, be due to faulty hardware, particularly in the memory subsystem, which is interesting. I wouldn't have expected that the memory interface for just one node could go bad, and specifically that it could go bad in a way that wouldn't crash the program. A little jarring.
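
To make that concrete, here's a minimal sketch (mine, not their actual tooling, and with made-up "work") of the kind of per-rank instrumentation such a visualization presumably sits on top of: time each rank's wait at a barrier and gather the numbers on rank 0. The slowpoke's wait time comes out near zero, while everyone else's is inflated by it.

    /* wait_times.c -- sketch of per-rank wait-time measurement at a barrier.
     * The rank doing the most computation arrives at the barrier last, so
     * *its* wait time is the low outlier; everyone else waits on it. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Stand-in for real work: rank 0 is the slowpoke
         * (e.g. the node with the flaky memory subsystem). */
        sleep(rank == 0 ? 2 : 1);

        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);      /* everyone blocks here... */
        double wait = MPI_Wtime() - t0;   /* ...except the last to arrive */

        double *waits = NULL;
        if (rank == 0)
            waits = malloc(size * sizeof *waits);
        MPI_Gather(&wait, 1, MPI_DOUBLE, waits, 1, MPI_DOUBLE, 0,
                   MPI_COMM_WORLD);

        if (rank == 0) {
            /* rank 0's wait is ~0 s; every other rank's is ~1 s */
            for (int i = 0; i < size; i++)
                printf("rank %d waited %.3f s\n", i, waits[i]);
            free(waits);
        }
        MPI_Finalize();
        return 0;
    }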

Another interesting fact: right now, although there are dual-core processors in Jaguar (their hot-rod supercomputer), the physical memory associated with each processor is strictly partitioned between the cores so as to preserve the programming model (I think... I was a little unclear on why they do this). Moreover, interrupts are handled only by the "master" core, so for I/O, both cores may end up being blocked. This seems horribly inefficient given that the whole point of multi-core is fast communication between cores. They know this is inefficient. I'm not sure how they're going to fix it.

I think the issue is that they currently run this weird stripped-down, lean-and-mean OS known as Catamount (which amounts to Linux stripped of just about everything that makes it Linux) that doesn't have, for instance, virtual memory. I think this is what necessitates the strict partitioning.

Weird aside: there are apparently these physics guys at MIT (whose names I of course can't remember) who designed their quantum mechanical simulators _from scratch_, _in assembly code_, in order to get every last ounce of performance out of the fuckers. That scares me. Those fuckers are crazy. Especially since they aren't even CS people. As Ken said, the computers were just an obstacle in the way of getting their physics results. I feel really, really dumb right about now...

Friday, June 22, 2007

Sequoia: Programming the Memory Hierarchy

Authors: Kayvon Fatahalian, Timothy J. Knight, Mike Houston, Mattan Erez, Daniel Reiter Horn, Larkhoon Leem, Ji Young Park, Manman Ren, Alex Aiken, William J. Dally, Pat Hanrahan

Paper: http://graphics.stanford.edu/papers/sequoia/sequoia_sc06.pdf

Sequoia is a programming language (and compiler) for parallel-processing machines, designed first and foremost to be portable across parallel machines with different memory hierarchies. This is not to say the language is arbitrarily general; a key assumption is that any Sequoia task is essentially recursively defined and that communication among nodes at the same level (and indeed between anything but a parent and its child) is explicitly forbidden. This restriction is what makes the code both customizable and portable.
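
To illustrate the shape of the model (this is plain C, not Sequoia syntax, and just my reading of the paper): a task either subdivides its working set and hands disjoint pieces to child tasks, or it's small enough to run as a leaf. Data flows only along parent-child edges (here, arguments and return values); siblings never talk, which is what lets each level of the recursion be mapped onto a level of the machine's memory hierarchy.

    #include <stdio.h>

    #define LEAF_SIZE 4  /* in Sequoia this would come from the machine spec */

    static double task_sum(const double *x, int n)
    {
        if (n <= LEAF_SIZE) {          /* leaf variant: fits in "local" memory */
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += x[i];
            return s;
        }
        /* inner variant: split the working set, hand disjoint halves to
         * child tasks; the children never communicate with each other */
        int half = n / 2;
        return task_sum(x, half) + task_sum(x + half, n - half);
    }

    int main(void)
    {
        double x[16];
        for (int i = 0; i < 16; i++)
            x[i] = i;
        printf("sum = %g\n", task_sum(x, 16));  /* prints 120 */
        return 0;
    }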

An interesting feature that they don't spend much time talking about (I think because it's both implicit and not in and of itself very complicated) is the separation of the program logic from the specification of the machine. Basically, you embed parameters describing the size of memory chunks in your program, and then the compiler takes what amounts to a manifest describing your hardware and shoves in the relevant values. It's a cute idea.
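
Something like this, in spirit (a hypothetical sketch, not Sequoia's actual mechanism; TILE_BYTES and the build flags are made up): the program is written against a symbolic chunk size, and the per-machine value gets injected at build time.

    /* Build per machine, e.g.:
     *   cc -DTILE_BYTES=262144 ...   (machine with 256 KB local stores)
     *   cc -DTILE_BYTES=32768  ...   (machine with 32 KB L1)          */
    #include <stdio.h>
    #include <stddef.h>

    #ifndef TILE_BYTES
    #define TILE_BYTES 32768  /* fallback; real value comes from the manifest */
    #endif

    enum { TILE_ELEMS = TILE_BYTES / sizeof(double) };

    static void scale(double *data, size_t n)
    {
        /* stream over the data one machine-sized chunk at a time */
        for (size_t i = 0; i < n; i += TILE_ELEMS) {
            size_t len = n - i < TILE_ELEMS ? n - i : TILE_ELEMS;
            for (size_t j = 0; j < len; j++)
                data[i + j] *= 2.0;
        }
    }

    int main(void)
    {
        double x[5] = { 1, 2, 3, 4, 5 };
        scale(x, 5);
        for (int i = 0; i < 5; i++)
            printf("%g ", x[i]);
        printf("\n");
        return 0;
    }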

Perhaps the way to do this kind of thing is to specify different parts of the system in different places and then have something that synthesizes them at runtime depending on the conditions? Or is that so vague as to be a totally useless question?

Compilation for Explicitly Managed Memory Hierarchies

Authors: Timothy J. Knight, Ji Young Park, Manman Ren, Mike Houston, Mattan Erez, Kayvon Fatahalian, Alex Aiken, William J. Dally, Pat Hanrahan

Paper: http://graphics.stanford.edu/papers/sequoia-compiler/sequoia_ppopp07.pdf

Cool little paper on optimizing intermediate-language (IL) code for parallel processors (ostensibly Cell). (As I read it, it became obvious I should have read the Sequoia paper first, but whatever.) The interesting piece was the explicit modeling of memory as a tree. Consider, for instance, several processors, each with its own local memory, and then, say, a single shared main memory. The IL models operations in terms of this memory hierarchy: copying between memory levels, performing computation at a given level, etc. It's not clear to me whether this does, in fact, model anything other than the Cell processor, but it's kind of a cool idea nonetheless.
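
Here's a toy rendering of that idea in C (my guess at the flavor; the names and the op set are invented, not the paper's actual IL): memory levels form a tree, and the primitives are copies along parent-child edges plus kernels pinned to a level.

    #include <stdio.h>

    struct mem_level {
        const char *name;          /* e.g. "main", "LS0" (a Cell local store) */
        struct mem_level *parent;  /* NULL at the root of the memory tree */
    };

    enum op_kind { OP_COPY_DOWN, OP_COPY_UP, OP_KERNEL };

    struct op {
        enum op_kind kind;
        struct mem_level *where;   /* child end of a copy, or kernel's level */
        const char *what;
    };

    static void print_op(const struct op *o)
    {
        switch (o->kind) {
        case OP_COPY_DOWN:  /* transfer from parent level down to child */
            printf("copy %s: %s -> %s\n", o->what,
                   o->where->parent->name, o->where->name);
            break;
        case OP_COPY_UP:    /* transfer from child level back up to parent */
            printf("copy %s: %s -> %s\n", o->what,
                   o->where->name, o->where->parent->name);
            break;
        case OP_KERNEL:     /* computation runs at a particular level */
            printf("run  %s @ %s\n", o->what, o->where->name);
            break;
        }
    }

    int main(void)
    {
        struct mem_level main_mem = { "main", NULL };
        struct mem_level ls0 = { "LS0", &main_mem };

        struct op prog[] = {
            { OP_COPY_DOWN, &ls0, "tile A" },
            { OP_KERNEL,    &ls0, "scale(A)" },
            { OP_COPY_UP,   &ls0, "tile A" },
        };
        for (size_t i = 0; i < sizeof prog / sizeof prog[0]; i++)
            print_op(&prog[i]);
        return 0;
    }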

I was less interested in the actual optimizations they did, which do seem to benefit Sequoia programs, than in how they model their system (because I'm thinking about programming models for heterogeneous processing environments at the moment). They do some fairly straightforward things: introducing dependencies to ensure orderings where needed, loop hoisting, copy elimination, etc.
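
For instance, hoisting a copy might look like this (an illustrative sketch of my own, not an example from the paper): a transfer whose source doesn't change inside the loop gets moved out, so the data goes down the memory tree once instead of N times.

    #include <stdio.h>
    #include <string.h>

    #define N    8
    #define TILE 4

    /* stand-in for a transfer between levels of the memory tree */
    static void copy_down(double *local, const double *global, int n)
    {
        memcpy(local, global, n * sizeof *local);
    }

    int main(void)
    {
        double global_b[TILE] = { 1, 2, 3, 4 };
        double out[N], local_b[TILE];

        /* before: copy_down(local_b, global_b, TILE) sat inside the loop,
         * re-fetching the same data every iteration; after hoisting it
         * runs once, since its source is loop-invariant: */
        copy_down(local_b, global_b, TILE);
        for (int i = 0; i < N; i++)
            out[i] = 2.0 * local_b[i % TILE];

        for (int i = 0; i < N; i++)
            printf("%g ", out[i]);
        printf("\n");
        return 0;
    }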

Wednesday, June 20, 2007

ORNL informal meeting notes

Met with Olaf Storaasli, former NASA guy:
  • got into FPGAs
  • NASA likes 'em because they're more energy efficient
    • if you do it right, you can get the whole chip active rather than just one part at a time as in microprocessors
    • used on actual spacecraft
    • btw, the way you deal with radiation is you reload the program twice a day...go figure.
  • not clear how to program the buggers
    • most work still done in VHDL
    • some C-to-FPGA work done (I am skeptical of this approach)
    • one guy did Matlab to FPGA (this seems a little cooler to me)
  • new Crays ship with 2 FPGAs per processor