Parallel Programming (3:1)
The objective of this course is to give you some level of
confidence in parallel programming techniques, algorithms and tools. At the end of the course, you would (we
hope) be in a position to deal with reallife parallel problems and to
explore new avenues of research in the area of parallel programming.
Class
 Aditya Acharya
 Pavan Badarla
 Geetha Venkatesh
 Jayasimha Talur
 Aniruddha Acharya
 Aakriti Gupta
 Aashish Viswakarma
 Pavan Akulakrishna
 Jagvir Singh
Instructor
Sathish Vadhiyar
Reading Materials
 Introduction
to Parallel Computing
by
Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar Publisher:
Addison Wesley; (2003) ISBN: 0201648652
 Various publication materials, lecture slides
Additional Reading
 Petascale
Computing: Algorithms and Applications
David A. Bader (Ed.), Chapman & Hall/CRC Computational Science Series. 2007.
 The
Sourcebook of Parallel Computing
by Jack Dongarra (Editor), Ian
Foster, Geoffrey Fox (Editor), Ken Kennedy, Andy White (Editor), Linda
Torczon, Wiliiam Gropp
Publisher: Morgan Kaufmann; (November 2002) ISBN:
1558608710

Numerical
Linear Algebra for High Performance Computers
by
Jack Dongarra, Iain Duff, Danny Sorensen, Henk van der Vorst Publisher:
SIAM; (1998) ISBN: 0898714281
Meeting Hours
11:3013:00; Tuesday, Thursday; Room 202, SERC
Grading System
Syllabus
 Prerequisites 
Introduction,
MPI,
OpenMP,
CUDAbasics
 Parallel Programming
tools/languages/models  1
 MPI 
collective communication
implementations, communicator groups, process topologies
 Parallel Algorithms
 List Ranking and Parallel Prefix ppt
 Sorting ppt
 Graph ppt
 FFT ppt
 Matrix computations
 Dense matrix computations ppt
 Sparse matrix computations ppt
 Advanced Parallel Programming Constructs and Optimizations
 MPI2 ppt
 Parallel I/O optimizations ppt
 GPU programming  CUDA Optimizations
ppt, Advanced CUDA
Slides
 OpenCL  class ppt,
lecture 5 from open university kit
 Charm++ ppt
 Scientific Applications
 Molecular Dynamics ppt
 Population evolution or Game of Life ppt
 Cosmology/nnody simulations ppt
 Mesh applications ppt
 Parallel System
Software
 Scheduling in Parallel
Systems ppt
 Checkpointing and Fault
tolerance for large systems ppt
Reading Material

Introduction: Grama et al.  2.4, 3.1, 3.5, 5.1, 5.6

MPI1: Online tutorial "MPI Complete Reference". Google for
it.

Collective Communications: Lecture slides, and Google for
paper "Optimization of Collective Communication Operations in MPICH" by Thakur, Rabenseifner and Gropp,
IJHPCA 2005

OpenMP  Lecture slides, and
OpenMP tutorial:
http://www.llnl.gov/computing/tutorials/openMP

Sorting
 Paper:
On the versatility of parallel sorting by regular sampling. Li et al.
Parallel Computing. 1993. (Pages 16)

Paper: Parallel Sorting by regular sampling. Shi and Schaeffer. JPDC
1992. (Pages 24)
 Paper:
Highly scalable parallel sorting. Solomonic and Kale. IPDPS 2010. (Pages
17)
 Book: Introduction to parallel computing by Grama et al. Sections
9.2.1, 9.2.2, 9.5, 9.6.2
 FFT  FFT Chapter in "Introduction to Parallel Computing" book
 Graph algorithms
 Paper: A Scalable Distributed Parallel BreadthFirst Search
Algorithm on BlueGene/L. Yoo et al. SC 2005. (Pages 17)
 Paper:Accelerating large graph algorithms on the GPU using CUDA.
Harish and Narayanan. HiPC 2007. (Page 58)
 Book: Introduction to parallel computing by Grama et al. Sections
10.210.4, Sections 11.4.111.4.6
 List Ranking
 Lecture slides

Paper: Optimization of Linked List Prefix Computations
on Multithreaded GPUs Using CUDA (section III)

CUDA Optimizations  Chapter 5 of
CUDA programming guide and slides
3034 of advanced CUDA

Dense Linear Algebra
 Lecture slides
 Paper: Towards Dense Linear Algebra for Hybrid
Accelerated Manycore Systems. Parallel
Computing. 2010 (section 4.1).
 Paper: Enabling and Scaling Matrix Computations on
Heterogeneous MultiCore and MultiGPU Systems. ICS 2012 (sections 2 and
3).
 Sparse LA
 Sparse Matrix vector multiplication: Paper: Efficient Sparse
matrixvector multiplication on cachebased GPUs. Reguly, Giles. InPar
2012.
 For cholesky factorization and subsequent steps:
 Sources [You can get these papers from me and take photocopies]
 Parallel Algorithms for sparse linear systems  Heath, Ng and
Peyton
 Reordering sparse matrices for parallel elimination  Liu
 A parallel graph partitioning algorithm for a messagepassing
multiprocessor  Gilbert and Zmijewski
 Task scheduling for parallel sparse Cholesky factorization 
Geist and Ng
 Parallel algorithms for forward and back substitution in direct
solution of sparse linear systems  Anshul Gupta and Vipin Kumar
 Matrix computations  Golub and van Loan
 Lecture slides
 General steps of sparse matrix factorization: Heath, Ng and Peyton 
pages 420429
 Parallel ordering
 Heath, Ng and Peyton  pages 429435 till Kernighan Lin
 Liu  pages 75 and 89 (you can read other pages on reduction of
elimination tree heights if interested),
 for Kernighan Lin ordering  Gilbert and Zmijewski  pages
427433, 437440, your lecture slides
 For mapping: Heath, Ng and Peyton
 437439, figures 9 and 10; Geist and Ng  sections 3 and 4
 For numerical factorization: Heath, Ng and Peyton  442450
 Maximal Independent sets by Luby  "Introduction to Parallel
Computing Book"
 MPI2: Online tutorial:
http://wwwunix.mcs.anl.gov/mpi/mpistandard/mpireport2.0/mpi2report.htm
 Parallel
I/O Optimizations: Google for paper Rajeev Thakur, William Gropp, and Ewing
Lusk, “A Case for Using MPI's Derived Datatypes to. Improve I/O
Performance,” in Proceedings of SC98.
 Molecular Dynamics  lecture slides, the paper ``A New Parallel Method
for Molecular Dynamics Simulation of Macromolecular Systems'' by Plimpton
and Hendrickson. Sections 25.
 Game of Life  Lecture slides
 Cosmology
 Lecture slides
 The paper "Scalable parallel formulations of the BarnesHut method
for nbody simulations" by Grama, Kumar and Sameh. In Parallel Computing
24(1998) 797822.
 The paper "Load balancing and data locality in adaptive hierarchical
Nbody methods: BarnesHut, Fast Multipole, and Dardiosity" by Singh,
Holt, Totsuka, Gupta and Hennessey. In Journal of Parallel and
Distributed Computing, 1994
 Mesh applications
 Lecture slides
 Paper: "Multilevel diffusion schemes for repartitioning of adaptive
meshes" by Schloegel et al.
 Paper: "Dynamic repartitioning of adaptively refined meshes" by
Schloegel et al.
 Paper: "Dynamic Octree Load Balancing Using Spacefilling curves" by
Campbell et al.  Section 2.5
 Paper: "Irregularity in Multidimensional spacefilling curves with
applications in multimedia databases" by Mokbel and Aref  Section 4
 OpenCL
 Charm++
 Scheduling
 Paper: "Backfilling with lookahead to optimize the packing of
parallel jobs" by Shmueli and Feitelson. JPDC 2005.
 Paper: "A comparison study of eleven static mapping heuristics for a
class of metatasks on heterogeneous computing systems" by Tracy Braun
et al., HCW 1999.
 Checkpointing:
 Lecture slides
 Paper: "An overview of checkpointing in uniprocessor and distributed
systems, focusing on implementation and performance" by James Plank
 References also given in the lecture
slides
Assignments

Parallel Prefix

Sparse Matrixvector
Multiplication

Loadbalancing BFS

Sparse matrix
triangular solve
Final Project
The final project has to clearly demonstrate the uniqueness of
your work over existing work and show adequate performance improvements. It can
be in

parallelizing well known algorithms or problems,

parallelizing numerical methods

mapping (collective) communications to 3D torus of
Bluegene/L

prograparallelizing numerical methods

mapping (collective) communications to 3D torus of
Bluegene/L

programming interface or abstractions or generic
scheduling and load balancing strategies for a class of problems
Notes  contains important information on assignment deadlines,
policies on plagiarism.
Platform for Assignments and Project
Tyrone, Tesla
Parallel Profilers