Our lab, the Grid Applications Research Lab (GARL), focuses on research in High Performance Computing (HPC) involving challenging parallel applications (large-scale, long-running, dynamic, irregular, multi-component, etc.) and challenging parallel systems, namely GPUs, state-of-the-art supercomputers and grids. We are specifically interested in the following areas.
This work involves application optimization, refinement and tuning, and the development of parallel algorithms for large-scale applications on GPUs, state-of-the-art supercomputers such as BlueGene, and grids. We focus on developing scheduling, load balancing and rescheduling strategies for multi-component, long-running, dynamic and irregular applications. Our research attempts to build generic middleware or enabling frameworks that encapsulate our optimization techniques for performance improvement and seamless execution of the applications. Our lab has been conducting application-oriented research for effective execution of climate modeling, weather modeling and bioinformatics applications on grids.
Climate Modeling: Assessment of climate change requires long simulations with multi-component climate models, typically performed for periods on the order of centuries and at fine resolutions. Executing climate models on a single parallel system limits both the duration of the runs, because of the reliance on a single resource, and the speed of modeling. We contributed various techniques for long-running climate modeling applications on grids with multiple parallel systems, including a large-scale analysis of the potential benefits of executing components of climate models on grids with multiple parallel batch systems; a novel execution model in which the set of active batch systems available for execution is dynamically shrunk and expanded during execution; inter-component load balancing in multi-component applications; and a practical grid middleware framework called Morco for executing climate modeling and other long-running multi-component applications across multiple batch systems of a grid.
Weather Modeling: We have been conducting research on building an integrated framework for simultaneous parallel simulations and remote visualizations of critical climate applications, including tracking tropical cyclones and depressions. The objective of the work is to enable the climate science community to interact and collaborate through our remote visualization and steering framework for tracking critical climate events. We have developed novel optimization techniques for adaptively determining various parameters, including the frequency of climate output and the number of processors used for simulations.
Bioinformatics: Predicting future sequences in a phylogenetic or evolutionary tree is important for a variety of applications, including drug discovery, and requires large-scale analysis of fine-level mutations of DNA sequences. We proposed a novel framework for predicting future DNA sequences on grids. Our approach studies mutations using large-scale explorations of cellular automata rules for evolution. Our work involved a novel formulation of the problem as an ever-running and resource-greedy application suitable for grid computing. We performed analysis and predictions for three HIV sequences and three protein sequences by obtaining cellular automata rules on grids.
Our research efforts on applications have been in collaboration with scientists in climate modeling and computer science.
Besides developing grid-based solutions for specific applications, our lab also conducts fundamental research for generic grid applications. When executing long-running multi-phase parallel applications on grids, it is necessary to adapt the application execution by rescheduling in response to application and resource dynamics. Our lab conducts research on various aspects related to adaptivity and fault tolerance, namely, checkpointing, fault-detection and prediction, rescheduling and load balancing policies.
Checkpointing: We have developed a user-level semi-transparent checkpointing library, called SRS, that enables MPI parallel applications to be rescheduled to a different number of processors and to different clusters in the middle of execution. Our library automatically performs the necessary redistribution of application data for execution on the new set of processors. Using our library, malleable applications can be executed on heterogeneous environments including grids. One of the important parameters in a checkpointing system that provides fault tolerance is the interval at which the application's state is checkpointed. Our lab has developed strategies based on Markov models for determining efficient checkpointing intervals for malleable parallel applications, in which the number of processors can change between migrations. Our strategies lead to improved application performance in the presence of failures. We have also conducted research on a source-to-source precompiler for automatically inserting checkpointing calls in a parallel application. This work involves live variable analysis, determination of appropriate locations in the code for checkpointing, and automatic determination of data distributions.
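As a rough illustration of the data redistribution step described above, the following sketch computes which index ranges must move between processors when a one-dimensional, block-distributed array is resized from one processor count to another. It is not the SRS API: the function names `block_bounds` and `redistribution_plan` are hypothetical, and a simple block distribution is assumed purely for the example.

```python
def block_bounds(n_items, n_procs, rank):
    """Return the half-open range [lo, hi) owned by `rank` under a block distribution.

    Any remainder items are spread over the lowest-numbered ranks.
    """
    base, extra = divmod(n_items, n_procs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def redistribution_plan(n_items, old_procs, new_procs):
    """List the (old_rank, new_rank, lo, hi) transfers needed after resizing.

    Each entry says that items [lo, hi) saved by `old_rank` at checkpoint time
    must be delivered to `new_rank` on restart. (Hypothetical helper.)
    """
    plan = []
    for old in range(old_procs):
        old_lo, old_hi = block_bounds(n_items, old_procs, old)
        for new in range(new_procs):
            new_lo, new_hi = block_bounds(n_items, new_procs, new)
            # The overlap of the two ownership ranges is what has to move.
            lo, hi = max(old_lo, new_lo), min(old_hi, new_hi)
            if lo < hi:
                plan.append((old, new, lo, hi))
    return plan


if __name__ == "__main__":
    # Restart a 10-element array, checkpointed on 2 processors, on 3 processors.
    for transfer in redistribution_plan(10, 2, 3):
        print(transfer)
```

In an MPI setting each tuple would correspond to one message or one read from a stored checkpoint; the sketch only derives the plan and performs no communication.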
Adaptivity: Our research has developed three novel algorithms for rescheduling large-scale multi-phase parallel applications. The algorithms determine the points at which a multi-phase application should be rescheduled, taking both application and resource dynamics into account. We also plan to develop energy-aware rescheduling policies to balance power consumption in multi-user environments.
Batch systems and queues are used in many production and research-based supercomputer systems. We are interested in conducting research in analysis and predictions of batch queue dynamics, and development of an intelligent advisor framework that interfaces between users' jobs and batch queues of supercomputing system(s), and automatically decides and allocates users' applications among the batch queues.
We also developed a comprehensive set of performance modeling strategies for predicting execution times of parallel applications on both dedicated and non-dedicated grid resources. We developed different scheduling strategies that efficiently use the performance model functions for selecting a set of processors for execution of tightly-coupled parallel applications on multi-cluster grids.
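To make the idea of model-driven processor selection concrete, here is a minimal sketch under a deliberately simple, assumed performance model; it is not the lab's model or scheduling strategy. The assumptions are that work is split evenly, that a tightly-coupled application runs at the pace of its slowest processor, that a non-dedicated processor delivers `peak_speed * (1 - load)`, and that communication overhead grows linearly with the processor count. All names and numbers are hypothetical.

```python
def effective_speed(peak_speed, load):
    """Speed available to the application on a shared (non-dedicated) processor."""
    return peak_speed * (1.0 - load)


def select_processors(work, procs, comm_cost_per_proc):
    """Pick the processor subset with the lowest predicted execution time.

    procs maps a processor name to (peak_speed, load_fraction).
    Assumed model: time = work / (k * slowest_speed) + comm_cost_per_proc * k.
    Under this model the best subset of size k is always the k fastest
    processors, so scanning prefixes of the ranked list is sufficient.
    """
    ranked = sorted(procs, key=lambda p: effective_speed(*procs[p]), reverse=True)
    best_set, best_time = None, float("inf")
    for k in range(1, len(ranked) + 1):
        chosen = ranked[:k]
        slowest = effective_speed(*procs[chosen[-1]])  # last in ranked order
        predicted = work / (k * slowest) + comm_cost_per_proc * k
        if predicted < best_time:
            best_set, best_time = chosen, predicted
    return best_set, best_time


if __name__ == "__main__":
    # Hypothetical resources: (peak speed, current load fraction).
    procs = {"a": (10, 0.0), "b": (10, 0.5), "c": (8, 0.0), "d": (2, 0.0)}
    print(select_processors(work=1200, procs=procs, comm_cost_per_proc=5))
```

With these illustrative numbers the loaded processor `b` and the slow processor `d` are left out: adding either one slows every other processor down by more than the extra parallelism gains.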
Our lab has also developed generic solutions for enabling execution of high performance applications on grids. In our research, we developed a comprehensive set of adaptive techniques for efficient broadcast and allgather collective operations for long-running MPI parallel applications executing on computational grids. In our work on data grids, we have developed novel algorithms that optimally select, from many possible data replica sites, the data segments needed by a parallel application executing on a set of resources, and download them to the parallel computational resources.
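The replica selection problem can be illustrated with a small sketch. The code below uses a simple greedy heuristic (largest segment first, assigned to the replica site that would finish it earliest), which is a stand-in for illustration and not the optimal algorithms referred to above. It assumes each site serves its assigned segments one after another while different sites transfer in parallel; the function name, site names, sizes and bandwidths are all hypothetical.

```python
def assign_segments(segment_sizes, replicas, bandwidth):
    """Greedily map each data segment to one replica site.

    segment_sizes: segment name -> size in MB
    replicas:      segment name -> list of sites holding a copy
    bandwidth:     site name -> MB/s to the compute resources
    Returns (assignment, makespan_in_seconds).
    """
    busy_until = {site: 0.0 for site in bandwidth}
    assignment = {}
    # Place large segments first so they do not end up queued behind small ones.
    for seg in sorted(segment_sizes, key=segment_sizes.get, reverse=True):
        size = segment_sizes[seg]
        # Choose the site where this segment would finish downloading earliest.
        site = min(replicas[seg], key=lambda s: busy_until[s] + size / bandwidth[s])
        busy_until[site] += size / bandwidth[site]
        assignment[seg] = site
    return assignment, max(busy_until.values())


if __name__ == "__main__":
    sizes = {"s1": 100, "s2": 60, "s3": 40}
    replicas = {"s1": ["A", "B"], "s2": ["A", "B"], "s3": ["B"]}
    bandwidth = {"A": 10, "B": 20}
    print(assign_segments(sizes, replicas, bandwidth))
```

A greedy pass like this can miss the best assignment when replica sets overlap heavily, which is one reason an optimal formulation is of interest.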
Our aim with all of the above efforts is to make our research techniques culminate in generic enabling frameworks that can be used for a wide class of applications. Our work on climate modeling applications has resulted in a middleware framework called Morco that can be used for general long-running multi-component applications on grids. Our checkpointing infrastructure, SRS, helps create malleable parallel applications and has been proven for different classes of numerical applications.
During my research at the Innovative Computing Laboratory at the University of Tennessee, I worked on scheduling and metascheduling algorithms for parallel jobs in grids as part of the GrADS (later vGrads) project. I also worked on optimized MPI collective communications as part of the Harness project.