HPC-Colony Project

Adaptive System Software For Improved Resiliency and Performance

News

2010 News & Highlights

  • Summer 2010: Experiments with our latest software show improved message logging (a 512-processor job had a 73% reduction in message log volume). We have developed a new synchronized clock scheme that performs much better than previous distributed protocols. An initial design of our spidercast communications service will be released this summer. We have developed a new DHT (distributed hash table) service (see Tock10b). Our performance results for topology-aware load balancing have been demonstrated with OpenAtom.
  • January 2010: More Great News! The Colony Project was selected to receive a supercomputing allocation through the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. The INCITE program promotes cutting-edge research that can only be conducted with state-of-the-art supercomputers. The Leadership Computing Facilities (LCFs) at Argonne and Oak Ridge national laboratories, supported by the U.S. Department of Energy Office of Science, operate the program. The LCFs award sizeable allocations on powerful supercomputers to researchers from academia, government, and industry addressing grand challenges in science and engineering such as developing new energy solutions and gaining a better understanding of climate change resulting from energy use. The DOE has issued a press release here.

2009 Highlights

  • November 2009: There will be a Birds-of-a-Feather (BOF) meeting for FastOS projects during the annual Supercomputing Conference in Portland, Oregon. The BOF, which will be held Wednesday Nov-18-2009 at 5:30, will include brief presentations from many of the projects funded by the FastOS program (see this link for more details).
  • September 2009: Colony II is officially underway! The three research teams (ORNL, UIUC, and IBM) have received their funding and we are now able to start the next phase of our research. (Funding was delayed to accommodate our PI, Terry Jones, who is joining ORNL.) Colony II will be funded for three years to study adaptive system software approaches to issues associated with load imbalances, faults, and extreme scale systems.

2008 Highlights

  • Our Project Principal Investigator, Terry Jones, will be joining Oak Ridge National Laboratory as a Staff R&D Member in the Computer Science and Mathematics organization. In the last few years, Oak Ridge has dramatically expanded its supercomputing facilities. Among the production systems at ORNL are a 4096-core Blue Gene/P machine and a 250 Tflop Cray, and much larger machines are currently being installed. Terry will be working with a team of system software researchers who have brought about such innovations as Parallel Virtual Machine and HPC OSCAR.
  • Great News! Colony has been selected by the DOE Office of Science to receive funding for three additional years! Over that timespan, we will focus on furthering our strategies and obtaining results with key scientific applications.

    March 2008: The Office of Advanced Scientific Computing's Computer Science program recently announced that it is awarding new funds to the Colony Project to continue its collaborative work on improving high performance computing (HPC) system software stacks. Today's system software stacks, including operating systems and runtime systems, unnecessarily limit performance or portability (or in some cases, both). Strategies developed by the Colony Project address a wide range of system software problems such as operating system interference (noise) while introducing important adaptive capabilities that free workloads from performance-reducing load imbalances.

    The Colony Project is a collaborative effort that includes the IBM T.J. Watson Research Center and the University of Illinois at Urbana-Champaign. Colony began its research effort in 2005 and has received its major funding through the DOE Office of Science Advanced Scientific Computing Research (ASCR) program (ASCR link here, ASCR's computer science projects link here).

  • The Colony project received computer time as part of the 2008 BGW Day. We performed a number of experiments to evaluate our latest coordinated scheduling kernel (including parameter space studies). A report describing our tests and results is available here.

2007 Highlights

  • Scaling results from July-26-2007 experiments conducted by the Colony team on their big-pages kernel at the Sixth BGW Day are now available in this report. Additional results in the areas of resource management and fault tolerance are also available from experiments we conducted during the Fourth BGW Day. These experiments were performed on a 20,000+ core system at IBM's T. J. Watson facility.
  • Compute node Linux was demonstrated running a NAS parallel benchmark, a Charm++ application, and other programs.
  • We assessed operating system evolution on the basis of several key factors related to system call functionality. These results were the basis for a paper presenting the system call usage trends for Linux and Linux-like lightweight kernels. Comparisons are made with several other operating systems employed in high performance computing environments including AIX, HP-UX, OpenSolaris, and FreeBSD.
  • We completed and demonstrated a prototype of our fault tolerance scheme based on message logging [Chakravorty07], showing that distributing the objects residing on a failing processor can significantly improve recovery time after the failure.
  • Our proactive fault-tolerance scheme was integrated into the regular Charm++ distribution and is now available to any Charm++/AMPI user.
  • We extended the set of load balancers available in Charm++ by integrating the recently developed balancers based on machine topology. These balancers use metrics based on volume of communication and number of hops as factors in their balancing decisions.
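The topology-based balancers in the last item weigh communication volume against the number of hops. One common way to combine the two is to multiply each message volume by the hop distance between the communicating objects and sum the products, a quantity often called hop-bytes. The sketch below illustrates that idea only; it is not the Charm++ balancers' code. The 3-D torus shape, the object names, the traffic figures, and the function names are all assumptions made for illustration.

```python
def torus_hops(a, b, dims):
    """Hop count between coordinates a and b on a torus with wraparound links."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(placement, comm_bytes, dims):
    """Sum over communicating pairs of (bytes exchanged) x (hops apart)."""
    return sum(
        volume * torus_hops(placement[src], placement[dst], dims)
        for (src, dst), volume in comm_bytes.items()
    )


dims = (4, 4, 4)                                   # hypothetical 4x4x4 torus
comm_bytes = {("A", "B"): 1000, ("B", "C"): 200}   # hypothetical traffic, in bytes

# Two candidate placements of objects A, B, C onto torus coordinates.
far = {"A": (0, 0, 0), "B": (2, 2, 2), "C": (0, 0, 1)}
near = {"A": (0, 0, 0), "B": (0, 0, 1), "C": (0, 0, 2)}

print("far placement: ", hop_bytes(far, comm_bytes, dims))
print("near placement:", hop_bytes(near, comm_bytes, dims))
```

A balancer that uses such a metric would prefer the second placement, because the heavily communicating pair sits one hop apart instead of six.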

2006 Highlights

  • First prototype Linux solution for Blue Gene compute nodes is operational.
  • We completed a detailed study of the difference in performance observed when running the same application using either Linux or the lightweight Compute Node Kernel (CNK) on the Blue Gene compute nodes. Included in the assessment was a study of the impact of operating system noise on the performance of Blue Gene.
  • We assessed the effectiveness of our in-memory checkpointing by performing tests on a large Blue Gene/L machine. In these tests, we used a 7-point stencil with 3-D domain decomposition, written in MPI. Our results are quite promising up to 20,480 processors.
  • Our proactive fault tolerance scheme is based on the hypothesis that some faults can be predicted. We leverage the migration capabilities of Charm++ to evacuate objects from a processor where faults are imminent. We assessed the performance penalty due to incurred overheads as well as memory footprint penalties for up to 20,480 processors.
  • To accomplish our goal for Global Resource Management, we have developed a new hybrid load balancing algorithm (HybridLB) that is designed for scientific applications with persistent computation and communication patterns. HybridLB utilizes a hierarchical load balancing tree to distribute tasks across processors. We demonstrated that this approach can effectively deal with certain problems encountered by centralized approaches (e.g. contention and unsatisfactory memory footprint).
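The tree-based idea in the last item can be illustrated with a small two-level sketch: a root that sees only per-group load totals shifts tasks between groups, and each group then balances its own processors locally. This is not HybridLB itself. The two-level tree, the greedy heuristics, the task loads, and the function names are assumptions chosen to keep the example short, and task loads are assumed to be positive.

```python
import heapq


def greedy_assign(task_loads, n_bins):
    """Largest-first greedy: give each task to the currently least-loaded bin."""
    heap = [(0.0, b) for b in range(n_bins)]
    bins = [[] for _ in range(n_bins)]
    for load in sorted(task_loads, reverse=True):
        total, b = heapq.heappop(heap)
        bins[b].append(load)
        heapq.heappush(heap, (total + load, b))
    return bins


def hierarchical_balance(group_tasks, procs_per_group):
    """Two-level balance: the root evens out groups, then each group balances locally."""
    groups = [sorted(tasks) for tasks in group_tasks]
    # Root level: works from per-group totals only. Move the smallest task of
    # the heaviest group to the lightest group while that narrows the gap.
    while True:
        totals = [sum(g) for g in groups]
        hi = max(range(len(groups)), key=totals.__getitem__)
        lo = min(range(len(groups)), key=totals.__getitem__)
        if not groups[hi]:
            break
        task = groups[hi][0]
        if totals[lo] + task >= totals[hi]:
            break
        groups[hi].pop(0)
        groups[lo].append(task)
        groups[lo].sort()
    # Leaf level: each group assigns its tasks to its own processors.
    return [greedy_assign(g, procs_per_group) for g in groups]


# Hypothetical input: two groups of two processors, one group heavily loaded.
result = hierarchical_balance([[8, 7, 6, 5], [1, 1]], procs_per_group=2)
for g, procs in enumerate(result):
    for p, tasks in enumerate(procs):
        print(f"group {g} proc {p}: tasks={tasks} load={sum(tasks)}")
```

Because the root works from aggregated totals and each group handles its own assignment, no single processor has to hold every task's load data, which is the kind of contention and memory pressure the text attributes to centralized approaches.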

2005 Highlights

  • We studied the behavior of one particular source of asynchronous events: the TLB misses incurred by dynamic memory management (which are absent in the production CNK). We modified CNK to support dynamic memory management with a parameterized page size and analyzed the impact of different strategies/page sizes on NAS kernels and Linpack.
  • We measured the effectiveness of parallel-aware scheduling for mitigating operating system interference (also referred to as OS noise and OS jitter in recent literature). Preliminary results on the Miranda parallel instability code indicate that parallel-aware scheduling across the machine can dramatically reduce variability in runtimes (standard deviation decreased from 108.45 seconds to 5.45 seconds) and total wallclock runtime (mean decreased from 452.52 seconds to 254.45 seconds).
  • We designed our in-memory checkpointing scheme.
  • We analyzed Charm++'s current resource management schemes and designed new, more scalable schemes.
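To put the Miranda figures quoted above in relative terms, the short calculation below converts the before and after values into percentage reductions. It uses only the four numbers reported in the text; the helper name is an assumption.

```python
def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, as a percentage."""
    return 100.0 * (before - after) / before


# Values reported for the Miranda runs, in seconds.
std_drop = percent_reduction(108.45, 5.45)
mean_drop = percent_reduction(452.52, 254.45)

print(f"standard deviation of runtime: {std_drop:.1f}% lower")
print(f"mean wallclock runtime:        {mean_drop:.1f}% lower")
```

On these figures the standard deviation falls by roughly 95% and the mean runtime by roughly 44%.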