Our lab, the Middleware and Runtime Systems (MARS) lab, focuses
on building middleware and runtime systems for parallel applications and systems.
Runtime Systems / Application Frameworks
Our lab works on building runtime systems for HPC applications on both accelerator and general HPC systems. We primarily focus on irregular applications including graph applications, N-Body simulations, Molecular Dynamics (MD), and Adaptive Mesh Refinement (AMR) applications.
We have also worked with applications in climate science and visualization in collaboration with researchers working in these areas.
Runtime Strategies and Programming Models on GPU systems:
This research develops runtime strategies for GPU systems: hybrid asynchronous execution of applications on
both CPU and GPU cores so that both are used effectively, dynamic scheduling, load balancing of computations within the GPUs, and data layout optimizations
for both graph-based and scientific applications.
- We have developed bin-packing-based load balancing on GPUs, a knapsack formulation of asynchronous execution on CPUs and GPUs, and kernel optimizations for AMR.
- We developed dynamic load balancing strategies for graph-based applications, including BFS and SSSP.
- We developed an algorithm for hybrid execution of betweenness centrality on both CPU and GPU cores.
- We plan to build hybrid execution, load balancing and data reorganization
strategies for more graph applications.
- In our work on programming models, the aim is to address the challenges that arise when executing different programming models on GPU systems. Our recent work develops user abstractions and runtime strategies for efficient execution of asynchronous message-passing applications written in Charm++ on GPUs. We developed runtime strategies for both regular applications, such as matrix computations, and irregular applications, such as N-Body and molecular dynamics. This work will be extended to include other programming models.
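To make the bin-packing idea in the list above concrete, here is a minimal sketch of a greedy load balancer. It is not the lab's GPU implementation: it uses the well-known longest-processing-time heuristic on the host, and the names `balance_loads`, `work_items` and `num_bins` are hypothetical, chosen for illustration. Each "bin" stands in for a unit of execution (for example a thread block), and each work item carries an estimated cost.

```python
import heapq


def balance_loads(work_items, num_bins):
    """Greedy bin-packing-style balancing (longest processing time first).

    work_items: dict mapping a (hypothetical) item id to its estimated cost.
    num_bins:   number of execution units to spread the items over.
    Returns (assignment, loads): the item ids placed in each bin and the
    total cost placed in each bin.
    """
    # Min-heap of (current_load, bin_index): the lightest bin is always on top.
    heap = [(0.0, b) for b in range(num_bins)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_bins)]

    # Place the heaviest items first so the light ones can fill the gaps.
    ordered = sorted(work_items.items(), key=lambda kv: kv[1], reverse=True)
    for item_id, cost in ordered:
        load, b = heapq.heappop(heap)
        assignment[b].append(item_id)
        heapq.heappush(heap, (load + cost, b))

    loads = [0.0] * num_bins
    for load, b in heap:
        loads[b] = load
    return assignment, loads


if __name__ == "__main__":
    # Illustrative costs only, e.g. particles per cell or edges per vertex.
    items = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
    assignment, loads = balance_loads(items, num_bins=2)
    print(assignment, loads)
```

The heap keeps the choice of the lightest bin cheap, which matters when the number of items is large; a real GPU version would also have to account for memory layout and kernel launch granularity, which this sketch ignores.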
Performance Modeling, Scalability, Mapping of Applications on Large-Scale Systems:
This research focuses on performance modeling, scalability studies and processor allocation of large applications on large systems, and mapping and remapping/rescheduling strategies on HPC network topologies.
- We have developed processor allocation, mapping and reallocation strategies for simultaneous execution of nested simulations in weather modeling applications that track dynamically varying weather phenomena such as cyclones and rain clouds.
- We developed techniques that match application signatures to predict the performance of large-scale runs from small-scale runs.
- We plan to extend our technique for performance modeling of large-scale runs to
auto-identify and auto-correct scalability bugs.
- Our current focus is on mapping techniques and scalability for graph applications on HPC network topologies.
Middleware is another primary research field in our lab. This includes middleware for supercomputer jobs, grid middleware and fault tolerance for parallel applications.
Middleware for Supercomputers, HPC Grids:
Batch systems and queues are used in many production and research-based supercomputer systems. Our research builds a middleware framework that interfaces between the users and the batch queues and systems. The middleware includes prediction techniques that estimate the queue waiting times and execution times of parallel jobs submitted to the batch queues. It also includes scheduling strategies that use these predictions to assign the appropriate batch queue and number of processors to each job, with the aim of reducing turnaround times for users and increasing the throughput of the system.
- We have developed techniques for predicting jobs that have short queue waiting times (quick starters).
- We extended this work to predict queue waiting times for all classes of jobs based on the history of job submissions.
- We also developed strategies to predict ranges of execution times based on the user's previous job submissions and the load on the system.
- We developed methods that automatically use these predictions for job molding (changing processor request size) and delayed submissions.
- We have also worked on middleware for metascheduling HPC jobs in a grid of supercomputers
in dynamic electricity markets. The middleware uses predictions of queue waiting times to predict the execution periods of jobs on the different supercomputers of the grid,
considers electricity price variations at the supercomputer sites during those
execution periods, and submits or migrates the jobs to the supercomputers predicted to have the lowest electricity cost during the predicted period and the lowest response time.
- Our current work is on automatically deciding the best queue configuration for a system based on the history of usage.
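As an illustration of how predictions might drive job molding in the list above, the following minimal sketch picks the queue and processor count with the smallest predicted turnaround (predicted wait plus predicted execution time). The function `mold_job`, the predictor callables and the toy numbers are all assumptions made for this example; they are not the lab's actual predictors or interfaces.

```python
def mold_job(candidates, predict_wait, predict_runtime):
    """Choose the (queue, procs) pair with the smallest predicted turnaround.

    candidates:      iterable of (queue_name, processor_count) options.
    predict_wait:    callable (queue, procs) -> predicted queue wait, seconds.
    predict_runtime: callable (procs) -> predicted execution time, seconds.
    Returns (turnaround, queue, procs), or None if there are no candidates.
    """
    best = None
    for queue, procs in candidates:
        turnaround = predict_wait(queue, procs) + predict_runtime(procs)
        if best is None or turnaround < best[0]:
            best = (turnaround, queue, procs)
    return best


# Toy predictors, for illustration only. A real system would derive these
# from the history of job submissions and the current system load.
TOY_WAITS = {("short", 16): 60, ("short", 32): 300, ("long", 64): 1800}


def toy_wait(queue, procs):
    return TOY_WAITS[(queue, procs)]


def toy_runtime(procs):
    # Assumed model: perfectly divisible work plus a fixed overhead.
    return 7200 / procs + 30


if __name__ == "__main__":
    print(mold_job(TOY_WAITS.keys(), toy_wait, toy_runtime))
```

In this toy setting the smallest request wins because its short queue wait outweighs its longer execution time. The same loop could be extended with a "submit later" option to model delayed submissions.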
Our lab has investigated the use of
replication for fault tolerance. The novelty is that instead of replicating all
the processes, thereby resulting in only about 50% application efficiency in the
presence of failures, our methods replicate a small subset of processes
(typically less than 1%) based on failure predictions. We demonstrated the
effectiveness of this strategy for current petascale and future exascale
systems. Our research also built an MPI library that uses this partial
July 08, 2015