Workshop on
Performance
& Productivity of
Extreme-Scale
Parallel Systems
Tuesday, October 17th, 2006
Organized by: Adolfy Hoisie,
Darren J. Kerbyson,
Dan Reed,
Invited Keynote
Presentation
9:00 Architecture, uses, and future of Cell Broadband Engine(TM)* technology
H. Peter Hofstee, Cell BE Chief Scientist, IBM Austin
Session
1: Software Issues in the Extreme
10:00 The Impact of Multicore on Math Software and Exploiting Single Precision Computing to Obtain Double Precision Results, Jack Dongarra, University of Tennessee
10:30 Coffee Break
11:00 Performance Analysis, Modeling and Tuning at Scale, Katherine Yelick, U.C. Berkeley and Lawrence Berkeley National Laboratory
11:30 Reducing HPC network requirements – exploiting overlap between Computation and Communication, Jose Sancho, Los Alamos National Laboratory
12:00 It's not too hot for penguins: a status check of Linux on the BlueGenes, George Almasi, IBM TJ Watson
12:30 Lunch (on your own)
Session 2: Performance
Analysis
2:00 Experiences with Modeling of the performance of the Krak Hydrodynamics Application, Kevin J. Barker, Los Alamos National Laboratory
2:30 Towards the Incorporation of Dynamic Adaptation into Operating Systems: Adaptive Disk I/O, Patricia J. Teller, University of Texas-El Paso
3:00 From Timing Programs to Programmers:
Measuring HPC Productivity, Jeff Hollingsworth,
3:30 Coffee Break
Session
3: Performance Tools and Experiences
4:00 A case study in performance tuning, John Feo, Cray
4:30 Scalable Performance Analysis of Large-Scale
Parallel Applications, Brian Wylie, John von Neumann Institute for
Computing,
5:00 Scalable System Measurement and Performance
Analysis: Recent Progress, Robert J. Fowler, Renaissance Computing
Institute,
5:30 The Red Storm System: Architecture, System Update and Performance Analysis, Doug Doerfler, Sandia National Laboratory
6:00 Closing Remarks
|
|
7th Symposium of the Los Alamos Computer Science Institute: LACSI 2006, 17-19 October Eldorado Hotel, |
Architecture, uses,
and future of Cell Broadband Engine(TM)* technology
H. Peter Hofstee
Cell BE Chief Scientist
IBM Systems and Technology Group,
This talk will very briefly summarize the major decisions that have led to the Cell BE architecture. We will show both the chip and IBM systems roadmap for Cell, and discuss how Cell BE based systems can maintain their competitive edge over time. Next we will discuss some application examples and their performance. The talk will end with some observations on hybrid systems and approaches to programming such systems.
* Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc.
The Impact of
Multicore on Math Software and Exploiting Single Precision Computing to Obtain
Double Precision Results
Jack Dongarra
Innovative Computing Laboratory; Computer Science Dept,
Recent versions of microprocessors exhibit performance characteristics for 32 bit floating point arithmetic (single precision) that is substantially higher than 64 bit floating point arithmetic (double precision). Examples include the Intel's Pentium IV and M processors, AMD's Opteron architectures and the IBM's Cell Broad Engine processor. When working in single precision, floating point operations can be performed up to two times faster on the Pentium and up to ten times faster on the Cell over double precision. The performance enhancements in these architectures are derived by accessing extensions to the basic architecture, such as SSE2 in the case of the Pentium and the vector functions on the IBM Cell. The motivation for this talk is to exploit single precision operations whenever possible and resort to double precision at critical stages while attempting to provide the full double precision results. The results described here are fairly general and can be applied to various problems in linear algebra such as solving large sparse systems, using direct or iterative methods and some eigenvalue problems.
Performance Analysis, Modeling and Tuning at Scale
Katherine Yelick
U.C. Berkeley and Lawrence Berkeley National Laboratory
The petascale computing systems that are promised over the next few years offer enormous potential to computational science and engineering communities, along with new challenges in the degree and kinds of parallelism that need to be exploited for high performance. To fully utilize the system capabilities, programmers will have to take advantage of multicore processors, heterogeneous processor accelerators, and latency tolerance in the memory and network system, in addition to the ever-increasing degree of parallelism between computational nodes. The Berkeley Institute for Performance Studies, a collaboration between U.C. Berkeley and Lawrence Berkeley National Laboratory, has research efforts in the architectural design, analysis, modeling and tuning of scientific computations. In this talk I will describe some of the major efforts in understanding system bottlenecks through microbenchmarking and analysis, as well as large-scale application studies to see how computational methods map onto various architectures. I will also describe some of our efforts in automatic performance optimization and the use of novel architectural features to address some of the fundamental limitations of high end computing systems, such as memory and network latencies, as well as effects of system scale on the performance and predictability of the interconnect networks.
On the other hand, it is not too difficult to tune programs I will describe work by members of the Berkeley Institute for Performance Studies, including Leonid Oliker, Erich Strohmaier, Hongzhang Shan, John Shalf, James Demmel, Parry Husbands, Rich Vuduc, Sam Williams, Shoaib Kamil, Kaushik Datta, and Rajesh Nishtala.
Reducing HPC network requirements – exploiting overlap between Computation and Communication
Jose Sancho
Performance and Architecture Lab, Los Alamos National
The design and implementation of a high performance communication network are critical factors in determining the performance and cost-effectiveness of a large-scale computing system. One promising technique for extracting maximum application performance given limited network resources is based on overlapping computation with communication, which partially or entirely hides communication delays. In this talk, first we describe the method developed to analyze the potential communication and computation overlap in terms of 'dependent' computation on communication data, and also that which it is computation 'independent' off. And second, we explore the potential overlap for some large-scale production scientific codes for various network profiles. The experimental results obtained for these codes running on a large cluster containing 1,024 processors indicate that reduced network bandwidths and/or increased latencies can be tolerated without negatively impacting application performance. This allows for a potentially significant relaxation of network requirements, and thus suggesting that lower cost networks could be used to build more cost-effective HPC systems in the future if overlapping is exploited in the applications.
It's not too hot for
penguins: a status check of Linux on the BlueGenes
George Almasi
IBM TJ
In this talk I will describe recent efforts at IBM TJ Watson to make the Linux O/S suitable for running high performance jobs on BlueGene. Building on the achievements of the Colony project (LLNL) and ZeptoOS (ANL), we focused on porting Linux to the BlueGene compute nodes, implementing function shipping and other performance tweaks to achieve performance comparable to that of the lightweight Compute Node Kernel.
Experiences
with Modeling the performance of the Krak Hydrodynamics Application
Kevin J. Barker
Performance and Architecture Lab, Los Alamos National
In this talk, we present an analytic performance model of a
large-scale hydrodynamics code developed at Los Alamos National Laboratory and
describe the modeling team’s experience with its development. This modeling
work is part of an ongoing effort to develop models and modeling techniques for
large-scale codes and systems of interest to
Towards the
Incorporation of Dynamic Adaptation into Operating Systems: Adaptive Disk I/O
Patricia J. Teller
In the context of the DAiSES (Dynamic Adaptability in Support of Extreme Scale) research project, we are investigating ways to incorporate adaptation into operating systems, either by varying parameter values or policies at runtime. In response to changing workload characteristics or requirements, operating system (OS) adaptation is meant to dynamically “customize” the OS in an attempt to provide “best” service, based on predefined criteria, for the active workload. Current DAiSES research activities focus on three adaptation targets: disk scheduling, virtual memory management, and file I/O. Thus far, disk scheduling has received most of our attention, and will be the focus of this talk. The questions that we are trying to address in this research include the following:
- What is the best metric to use to fairly allocate disk resources?
- How should multiple different data requirements be satisfied?
- Can I/O performance be predicted and can performance isolation be ensured?
- Is the community measuring I/O performance correctly?
- Is programming for I/O performance necessary?
As you will see, although we partially answer these questions, the answers give rise to yet more questions!
From Timing Programs
to Programmers: Measuring HPC Productivity
Jeff Hollingsworth
A case study in
performance tuning
John Feo,
Cray Incorportated, San Diego Supercomputer Center, University of California, San Diego, La Jolla, California 92093-0505
Performance tuning is a complex requirement of large scale computing. It involves many interrelated issues such as algorithm design, programming language, compilers, runtime systems, and architecture. Tuning for parallel computer systems with thousands of processors and deep memory heirarchies is an almost impossible task. It's difficult to see how peta-scale systems of the same design will make the problem easier.
On the other hand, it is not too difficult to tune programs written for shared-memory parallel systems that use parallelism rather than deep memory hierarchies to hide latencies. In this talk I present a case study of optimizing programs for such a machine. While it is unlikely that future peta-scale systems of this class will retain the simplicity of current systems, at least we begin with a problem we can solve
Scalable Performance
Analysis of Large-Scale Parallel Applications
Brian Wylie
John von Neumann
Institute for Computing (NIC), Forschungszentrum Jülich GmbH, D-52425 Jülich,
Germany
The SCALASCA project is developing a new generation of tools
for scalable performance analysis of large-scale parallel applications, building
on extensive experience gained with the KOJAK toolset developed by
Forschungszentrum Juelich and the
Scalable System
Measurement and Performance Analysis: Recent Progress
Robert J. Fowler
Director of HPC Research, Renaissance Computing Institute, The University of North Carolina at Chapel Hill, 100 Europa Dr, Suite 540, Chapel hill, NC 27517
We will describe recent work at RENCI and
The Red Storm System:
Architecture, System Update and Performance Analysis
Doug Doerfler
Sandia National Laboratory,
The Red Storm Architecture is the basis of the first instantiation of the the Red Storm System and in general the Cray XT3 product line. The Red Storm System has been in use for over year, so it is now possible to analyze the performance of a real application workload and contrast it to other HPC architectures. The Red Storm System has also recently undergone a significant upgrade of its processors and interconnect. In this talk we will give a brief overview of the Red Storm Architecture, the Red Storm System in it's original and upgraded configurations, and an analysis of its performance on actual NNSA ASC applications.