HPC-Colony Project

Services and Interfaces for Very Large Linux Clusters

 

 

 


 

Overview

The HPC-Colony project is a joint research effort with Lawrence Livermore National Laboratory, the IBM T.J. Watson Research Center, and the University of Illinois at Urbana-Champaign to create scalable Services and Interfaces that permit easy application porting for high-performance computing (HPC) systems with very large numbers of processors. Funding for the HPC-Colony Project is provided by a grant from the U.S. Department of Energy Office of Science.

The motivation for the HPC-Colony Project is two-fold:

  • Parallel resource management
    • Strategies for scheduling and load balancing must be improved. Difficulties in scheduling workloads and achieving balanced partitioning can limit scaling for complex problems on large machines.
  • Global system management
    • System management is inadequate. Parallel jobs require common operating system services (such as process scheduling, event notification, and job management) to scale to large machines.

Ever-increasing numbers of processors and the inherent restrictions found in today's system software impose artificial barriers on the capacity of our most capable HPC machines. For developers to scale applications to these new processor counts, system software must be freed of imbalances and scaling shortcomings. Moreover, the arduous task of balancing an application is best accomplished using dynamically enforced schemes with global knowledge -- a new opportunity for system software. Indeed, system software improvements are needed to deliver important benefits to users of HPC systems:

  • provide higher levels of application scalability; specifically, remove the problems associated with operating system interference (noise) as well as the problems associated with application load imbalances
  • permit application porting without syscall modifications
  • support familiar tools including a wide range of debugging and development tools on compute nodes
  • provide dynamic support for multiple management policies
  • provide support for fault tolerance
  • provide parallel awareness and optimization

The Colony project is developing a coordinated framework using Linux and the Charm++ run-time system to bring about these HPC goals for the benefit of parallel applications.

 


For further information on the Colony Project, contact Terry Jones (email: trj@cs.stanford.edu).

