News
2008 News & Highlights
- Great News! Colony has been selected by the DOE Office of Science to receive funding for three additional
years! Over that timespan, we will focus on furthering our strategies and obtaining results with
key scientific applications.
March, 2008: The Office of Advanced Scientific Computing's Computer Science program
recently announced that they are awarding new funds to the Colony Project to continue their
collaborative work on improving high performance computing (HPC) system software stacks.
Today's system software stacks, including operating systems and runtime systems,
unnecessarily limit performance or portability (or in some cases, both). Strategies
developed by the Colony Project address a wide range of system software problems such
as operating system interference (noise) while introducing important adaptive
capabilities that free workloads from performance-reducing load imbalances.
The Colony Project is a collaborative effort that includes the IBM T.J. Watson Research Center
and the University of Illinois at Urbana-Champaign. Colony began its research effort
in 2005 and has received its major funding through the DOE Office of Science Advanced
Scientific Computing Research (ASCR) program.
- The Colony project received computer time as part of the Spring 2008 BGW Day. On May-23-2008, we
performed a number of experiments to evaluate our latest coordinated scheduling kernel (including
parameter space studies). The tests were successful and we will be publishing our findings soon.
2007 Highlights
- Scaling results from July-26-2007 experiments conducted by the Colony team on their big-pages kernel at the Sixth
BGW Day are now available in this report. Additional results in
the areas of resource management and fault tolerance are also
available from experiments we conducted during the Fourth BGW Day. These experiments were performed on a
20,000+ core system at IBM's T. J. Watson facility.
- Compute node Linux was demonstrated running a NAS parallel benchmark, a Charm++ application, and other programs.
- We assessed operating system evolution on the basis of several key factors related to system call functionality.
These results were the basis for a paper presenting the system call usage trends for Linux and Linux-like
lightweight kernels. Comparisons are made with several other operating systems employed in high performance
computing environments including AIX, HP-UX, OpenSolaris, and FreeBSD.
- We completed and demonstrated a prototype of our fault tolerance scheme based on message-logging [Chakravorty07],
showing that the distribution of objects residing on a failing processor can significantly improve the recovery
time after the failure.
- Our proactive fault-tolerance scheme was integrated into the regular Charm++ distribution and is now available for
any Charm++/AMPI user.
- We extended the set of load balancers available in Charm++ by integrating the recently developed balancers
based on machine topology. These balancers use metrics based on volume of communication and number of hops
as factors in their balancing decisions.
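The topology-based balancers mentioned above weigh communication volume against hop count. As a rough illustration only, the sketch below computes a hop-bytes figure (bytes sent multiplied by hops travelled, summed over all messages) for a task placement on a 3-D torus. The function names, the torus dimensions, and the choice of hop-bytes as the combined metric are assumptions made for this example; they are not taken from the Charm++ balancers themselves.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]

def torus_hops(a: Coord, b: Coord, dims: Coord) -> int:
    """Shortest hop count between two nodes of a 3-D torus (links wrap around)."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def hop_bytes(messages: List[Tuple[str, str, int]],
              placement: Dict[str, Coord],
              dims: Coord) -> int:
    """Sum of (bytes * hops) over all messages; lower suggests less network load."""
    return sum(nbytes * torus_hops(placement[src], placement[dst], dims)
               for src, dst, nbytes in messages)

# Hypothetical example: three tasks placed on a 4 x 4 x 4 torus.
dims = (4, 4, 4)
placement = {"A": (0, 0, 0), "B": (3, 0, 0), "C": (2, 2, 0)}
messages = [("A", "B", 1000), ("A", "C", 10)]
print(hop_bytes(messages, placement, dims))  # 1000 * 1 hop + 10 * 4 hops = 1040
```

A balancer could compare this figure across candidate placements and prefer the one with the lower value, which keeps heavily communicating tasks close together.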
2006 Highlights
- First prototype Linux solution for Blue Gene compute nodes is operational.
- We completed a detailed study of the difference in performance observed when running the same application
using either Linux or the lightweight Compute Node Kernel (CNK) on the Blue Gene compute nodes.
Included in the assessment was a study of the impact of operating system noise on the performance of Blue Gene.
- We assessed the effectiveness of our in-memory checkpointing by performing tests on a large BlueGene/L
machine. In these tests, we used a 7-point stencil with 3-D domain decomposition, written in MPI.
Our results are quite promising at up to 20,480 processors.
- Our proactive fault tolerance scheme is based on the hypothesis that some faults can be predicted.
We leverage the migration capabilities of Charm++ to evacuate objects from a processor where faults
are imminent. We assessed the performance penalty due to incurred overheads as well as memory footprint
penalties for up to 20,480 processors.
- To accomplish our goal for Global Resource Management, we have developed a new hybrid load balancing
algorithm (HybridLB) that is designed for scientific applications with persistent computation and
communication patterns. HybridLB utilizes a load balancing hierarchical tree to distribute tasks
across processors. We demonstrated this approach can effectively deal with certain problems encountered
by centralized approaches (e.g. contention and unsatisfactory memory footprint).
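The hierarchical idea behind HybridLB can be pictured with a two-level toy example. The sketch below is not HybridLB: the function names, the greedy heuristics, and the two-level layout are assumptions chosen only to show how a root that sees per-group totals, plus group leaders that see their own tasks, can avoid gathering every task at a single central point. It assumes every group has at least one processor and that task loads are non-negative.

```python
from typing import List

Processor = List[int]      # task loads placed on one processor
Group = List[Processor]    # processors managed by one group leader

def greedy_assign(task_loads: List[int], num_procs: int) -> Group:
    """Leader-level step: put the heaviest remaining task on the lightest processor."""
    procs: Group = [[] for _ in range(num_procs)]
    totals = [0] * num_procs
    for load in sorted(task_loads, reverse=True):
        target = totals.index(min(totals))
        procs[target].append(load)
        totals[target] += load
    return procs

def hierarchical_balance(groups: List[Group]) -> List[Group]:
    """Two-level balancing: the root evens out group totals, leaders balance locally."""
    sizes = [len(group) for group in groups]
    # Each leader knows only the tasks inside its own group.
    pools = [[load for proc in group for load in proc] for group in groups]
    # The root works from one number per group, not from the full task list.
    totals = [sum(pool) for pool in pools]
    while True:
        src = totals.index(max(totals))
        dst = totals.index(min(totals))
        if src == dst or not pools[src]:
            break
        task = min(pools[src])  # the source leader offers its smallest task
        if task >= totals[src] - totals[dst]:
            break               # moving it would not reduce the imbalance
        pools[src].remove(task)
        pools[dst].append(task)
        totals[src] -= task
        totals[dst] += task
    # Every leader then balances its own processors independently.
    return [greedy_assign(pool, size) for pool, size in zip(pools, sizes)]

# Hypothetical input: group 0 holds all the work, group 1 is idle.
balanced = hierarchical_balance([[[4, 3], [2, 1]], [[], []]])
print(balanced)
```

In this toy run both groups end up with the same total load, and the root never needed more than one load figure per group to decide what to move.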
2005 Highlights
- We studied the behavior of one particular source of asynchronous events: the TLB misses incurred
by dynamic memory management (which are absent in the production CNK). We modified CNK to support
dynamic memory management with a parameterized page size and analyzed the impact of different
strategies/page sizes on NAS kernels and Linpack.
- We measured the effectiveness of parallel-aware scheduling for mitigating operating system
interference (also referred to as OS noise and OS jitter in recent literature). Preliminary results
on the Miranda parallel instability code indicate that parallel-aware scheduling across the
machine can dramatically reduce both the variability in runtimes (standard deviation decreased from
108.45 seconds to 5.45 seconds) and the total wallclock runtime (mean decreased from 452.52 seconds
to 254.45 seconds).
- We designed our in-memory checkpointing scheme.
- We analyzed the current Charm++ resource management schemes and designed new, more scalable schemes.
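Why coordinating operating system activity across nodes can help a tightly synchronized application can be seen in a toy model. The sketch below is not the Colony scheduling kernel and does not reproduce the Miranda measurements; the function name, the step-based timing model, and all parameter values are assumptions made for illustration. Each step ends with a global synchronization, so a step is delayed whenever any node is interrupted during it.

```python
from typing import Sequence

def total_runtime(steps: int, work: int, noise: int,
                  period: int, phases: Sequence[int]) -> int:
    """Runtime of a bulk-synchronous job where each node is interrupted periodically.

    phases[i] is the offset of node i's interruption timer. A step costs
    `work`, plus `noise` if at least one node is interrupted in that step,
    because all nodes wait for the slowest one at the synchronization point.
    """
    total = 0
    for step in range(steps):
        hit = any((step + phase) % period == 0 for phase in phases)
        total += work + (noise if hit else 0)
    return total

nodes = 64
staggered = [node % 10 for node in range(nodes)]  # interruptions spread over time
aligned = [0] * nodes                             # interruptions happen together

print(total_runtime(100, 10, 2, 10, staggered))  # 1200: every step is delayed
print(total_runtime(100, 10, 2, 10, aligned))    # 1020: only 10 of 100 steps are delayed
```

With staggered interruptions some node is always late, so every step pays the penalty; when the interruptions are aligned, the same amount of per-node noise delays far fewer steps.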
Publications
- Sayantan Chakravorty, Celso L. Mendes and Laxmikant V.
Kalé. Proactive Fault Tolerance in Large Systems. First Workshop on High
Performance Computing Reliability Issues at HPCA-11, San Francisco/CA,
February 2005.
- Gengbin Zheng. Achieving High Performance on Extremely
Large Parallel Machines, PhD Thesis, Dep. Computer Science, University
of Illinois, May 2005.
- Tarun Agarwal. Strategies for Topology-Aware Task
Mapping and for Rebalancing with Bounded Migrations, MS Thesis, Dep.
Computer Science, University of Illinois, June 2005.
- Sameer Kumar, Gheorghe Almasi, Chao Huang and Laxmikant
V. Kalé. Achieving Strong Scaling with NAMD on Blue Gene/L, University
of Illinois, October 2005, submitted for publication.
- Tarun Agarwal, Amit Sharma and Laxmikant V. Kalé.
Topology-aware task mapping for reducing communication contention on
large parallel machines, Proceedings of IEEE International Parallel and
Distributed Processing Symposium 2006, Greece, April 2006.
- Gengbin Zheng, Chao Huang and Laxmikant V. Kalé.
Performance Evaluation of Automatic Checkpoint-based Fault Tolerance
for AMPI and Charm++. ACM SIGOPS Operating Systems Review: Operating
and Runtime Systems for High-end Systems, 40(2), April 2006.
- Sayantan Chakravorty, Celso L. Mendes, Laxmikant V.
Kalé, Terry Jones, Andrew Tauferner, Todd Inglett and José Moreira.
HPC-Colony: Services and Interfaces for Very Large Systems. ACM SIGOPS
Operating Systems Review: Operating and Runtime Systems for High-end
Systems, 40(2), April 2006.
- Sayantan Chakravorty, Celso L. Mendes and Laxmikant V.
Kalé. Proactive Fault Tolerance in MPI Applications via Task Migration,
Accepted for HiPC2006, Bangalore, India, December 2006.
- Sayantan Chakravorty, Laxmikant V. Kalé. A Fault
Tolerance Protocol with Fast Fault Recovery. IEEE International
Parallel and Distributed Processing Symposium 2007, California, March
2007.
- Terry Jones, Andrew Tauferner, Todd Inglett. HPC System
Call Usage Trends, the 8th LCI International Conference on High
Performance Computing, South Lake Tahoe, CA, May 2007.
- Gregory A. Koenig and Laxmikant V. Kalé. Optimizing
Distributed Application Performance Using Dynamic Grid Topology-Aware
Load Balancing. Proceedings of the IEEE International Parallel and
Distributed Processing Symposium 2007, California, March 2007.
- Abhinav Bhatele. Application-specific Topology-aware
Mapping and Load Balancing for three-dimensional Torus Topologies.
Master's Thesis, Dep. of Computer Science, University of Illinois,
Urbana, 2007.
- Sayantan Chakravorty. Fault Tolerance Protocols for Fast
Recovery in Parallel Systems. PhD Thesis, Dep. of Computer Science,
University of Illinois, Urbana, 2007.
- Laxmikant V. Kalé, Eric Bohm, Celso L. Mendes, Terry
Wilmarth and Gengbin Zheng. Programming Petascale Applications with
Charm++ and AMPI. In "Petascale Computing: Algorithms and
Applications", CRC Press, 2008 (to appear).
Talks
- Terry Jones, Reducing the Impact of Operating System
Interference on Scientific Applications, ScicomP 11, Edinburgh,
Scotland, June 3, 2005.
- Laxmikant V. Kalé. Adaptive MPI: Intelligent Runtime
Strategies and Performance Prediction via Simulation, Oak Ridge
National Lab, Oak Ridge, TN, August 18, 2005.
- Laxmikant V. Kalé. Adaptive MPI: Intelligent Runtime
Strategies and Performance Prediction via Simulation, University of
Tennessee, Knoxville, TN, August 19, 2005.
- Laxmikant V. Kalé. Enhancing Performance and
Productivity for Science and Engineering Applications Across the
Computational Grid, University of Texas, Austin, TX, September 2005.
- Laxmikant V. Kalé. Exploiting the Predictability of
Message-Driven Objects to Scale the Memory Hierarchy, ScalPerf05,
Italy, October 12, 2005.
- Laxmikant V. Kalé. Charm++ and Adaptive MPI: Experiences
with a Novel Parallel Programming Approach, University of Paris-Sud,
France, October 14, 2005.
- Terry Jones, The HPC-Colony Project, BlueGene Consortium Meeting at SC05, Seattle, Washington, November 15, 2005.
- Terry Jones, Operating System Interference Effects at Extreme Scale,
SIAM Conference on Parallel Processing for Scientific Computing, San
Francisco, CA, February 2006.
- Abhinav Bhatele. Dynamic Load balancing in Charm++. Tutorial presented at the 5th Charm++ Workshop, Urbana, April 2007.
- Celso L. Mendes. How to Write Applications Using Adaptive MPI. Tutorial
presented at the 5th Charm++ Workshop, Urbana, April 2007.
- Laxmikant V. Kalé. State of Charm++. 5th Charm++ Workshop, Urbana, April 2007.
- Sayantan Chakravorty. The Charm++ Fault Tolerance Infrastructure. 5th Charm++ Workshop, Urbana, April 2007.
- Laxmikant V. Kalé. Parallel Programming Models in the Era of Multi-core
Processors. Manycore Computing Workshop, Seattle, June 2007.
- Laxmikant V. Kalé. Programming to Petascale with
Multicore Chips and Early Experience on Abe with Charm++. NCSA
Multicore Workshop, July 2007.
- Laxmikant V. Kalé. Petascale and Multicore Programming
Models: What is Needed. Keynote talk, 19th International Symposium on
Computer Architecture and High Performance Computing, Gramado-Brazil,
October 2007.