The attendees list includes all authors (even though they may not be attending), speakers, artists, etc.

View the full conference website here:
IEEE Cluster 2013 Conference

Sunday, September 22
 

7:00pm

IEEE Cluster 2013 Office
Sunday September 22, 2013 7:00pm - 9:00pm
09th Floor - Room 911 120 W. Market St, Indianapolis, IN

7:00pm

Registration
Sunday September 22, 2013 7:00pm - 9:00pm
Hotel Lobby 120 W. Market St, Indianapolis, IN
 
Monday, September 23
 

6:30am

IEEE Cluster 2013 Office
Monday September 23, 2013 6:30am - 9:00pm
09th Floor - Room 911 120 W. Market St, Indianapolis, IN

6:30am

Registration
Monday September 23, 2013 6:30am - 9:00pm
Hotel Lobby 120 W. Market St, Indianapolis, IN

8:00am

Breakfast
Monday September 23, 2013 8:00am - 9:00am
Tutorial Rooms (Hilton) 120 W. Market St, Indianapolis, IN

8:30am

1/2 day Tutorial - A Beginner's Guide to Scientific Visualization Using VisIt 2.6.0
Limited Capacity seats available

Hands-on (accounts will be provided on the Clemson Cluster): This tutorial provides an introduction to visualization by exploring the underlying principles used in scientific visualization. The visualization process is presented as a tool for knowledge discovery, for gaining insight, and for making better-informed decisions when analyzing data. Hands-on exercises using VisIt 2.6.0 are designed to serve both those who wish to participate using their own laptop and those who wish to observe and ask questions. Draft Agenda


Monday September 23, 2013 8:30am - 12:00pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

8:30am

1/2 day Tutorial - Globus Online: Scalable Research Data Management Infrastructure for Campuses and High-Performance Computing Facilities
Limited Capacity seats available

Hands-on. The rapid growth of data in scientific research endeavors is placing massive demands on campus computing centers and high-performance computing (HPC) facilities. Computing facilities must provide robust data services built on high-performance infrastructure, while continuing to scale as needs increase. Traditional research data management (RDM) solutions are typically difficult to use and error-prone, and the underlying networking and security infrastructure is often complex and inflexible, resulting in user frustration and sub-optimal use of resources.

An approach that is increasingly common in HPC facilities includes software-as-a-service (SaaS) solutions like Globus Online for moving, syncing, and sharing large data sets. The SaaS approach allows HPC resource owners and systems administrators to deliver enhanced RDM services to end users at optimal quality of service, while minimizing the administrative and operations overhead associated with traditional software. Usage has grown rapidly, with more than 9,500 registered users and over 17 petabytes moved. Globus Online’s reliable file transfer, combined with the recently announced data sharing service, is key functionality for bridging between campus and external resources, and is enabling scientists to more easily scale their research work flows.

Tutorial attendees will explore the challenges such facilities face in delivering scalable RDM solutions. They will be introduced to the RDM functions of Globus Online, learn how other resource owners are using Globus Online, and have the opportunity for hands-on interaction with the service at various levels of technical depth. Draft Agenda


Monday September 23, 2013 8:30am - 12:00pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

8:30am

1/2 day Tutorial - GPU computing with CUDA C/C++
Limited Capacity seats available

Hands-on (planned): Participants will learn the fundamentals of GPU programming using CUDA C/C++. An overview of GPU architecture will be given, and attendees will learn fundamental programming optimizations. Profiling tools will be used to guide performance optimization, and the use of GPU-accelerated libraries will be demonstrated.
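
For readers new to the topic, a minimal CUDA C vector-add sketch of the kind typically used as a first example in introductory GPU tutorials (illustrative only, not taken from the tutorial materials) gives a flavor of the model: a kernel is written once, launched across thousands of threads, and data is copied explicitly between host and device memory.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Each thread adds one element: the canonical first CUDA kernel.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        cudaMalloc((void **)&da, bytes);
        cudaMalloc((void **)&db, bytes);
        cudaMalloc((void **)&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(da, db, dc, n);   // one thread per element
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

        printf("c[0] = %f\n", hc[0]);                 // expect 3.0
        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }

Compiled with nvcc, profiling a kernel like this (occupancy, memory throughput, and so on) is the kind of exercise the optimization portion of the tutorial targets.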


Monday September 23, 2013 8:30am - 12:00pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

8:30am

Full day Tutorial - Programming for the Intel Xeon Phi (MIC)
Limited Capacity seats available

Hands-on: An interactive, full-day tutorial on how to use the new Intel Xeon Phi coprocessor, also known as the MIC, in high performance computing environments. This will be a course for intermediate to advanced users; the audience is expected to be familiar with C or Fortran, OpenMP, and MPI.

While the MIC is x86 based, and capable of running most user codes with little porting effort, the MIC architecture has significant features that differ from those of present x86 CPUs, and optimal performance requires an understanding of the possible execution models and details of the architecture. The tutorial will be divided into four sections: Introduction to the MIC Architecture; Native Execution and Optimization; Offload Execution; and Symmetric Execution. In each section users will spend approximately half of their time experimenting with guided hands-on exercises, as outlined in the Draft Agenda.
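
As a flavor of the offload execution model covered in one of the four sections, the sketch below (an illustration that assumes Intel's compiler and its offload pragmas, not code from the tutorial itself) runs an OpenMP loop on the coprocessor and moves the arrays across PCIe with in/out clauses:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    int main(void) {
        float *a = (float *)malloc(N * sizeof(float));
        float *b = (float *)malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) a[i] = (float)i;

        /* Offload the next statement to the Xeon Phi; the in/out clauses
           copy N floats each way. Requires the Intel compiler. */
        #pragma offload target(mic) in(a:length(N)) out(b:length(N))
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = 2.0f * a[i];

        printf("b[42] = %f\n", b[42]);   /* expect 84.0 */
        free(a);
        free(b);
        return 0;
    }

Native execution, by contrast, cross-compiles the whole program for the coprocessor and runs it there directly, with no offload pragmas at all.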


Monday September 23, 2013 8:30am - 4:30pm
07th Floor - Circle City 07 (Hilton) 120 W. Market St, Indianapolis, IN

10:00am

Morning Break
Monday September 23, 2013 10:00am - 10:30am
Tutorial Rooms (Hilton) 120 W. Market St, Indianapolis, IN

12:00pm

Lunch
Monday September 23, 2013 12:00pm - 1:00pm
Vincennes Room - 2nd Floor (Hilton) 120 W. Market St, Indianapolis, IN

1:00pm

1/2 day Tutorial - GPU computing with OpenACC
Limited Capacity seats available

Hands-on (planned): Participants will learn the fundamentals of GPU programming using OpenACC directives. An overview of compiler directives will be given, and attendees will learn the OpenACC memory model and how to manage data movement and computation efficiently based on performance analysis.
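
To give a sense of the directive style taught here, the following minimal SAXPY-like sketch (illustrative only, not the tutorial's own code) annotates an ordinary C loop so the compiler can generate the GPU kernel, with the data clauses expressing the movement between host and device:

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* copyin/copy describe the data movement; the compiler builds the kernel. */
        #pragma acc data copyin(x[0:N]) copy(y[0:N])
        {
            #pragma acc parallel loop
            for (int i = 0; i < N; i++)
                y[i] = 2.0f * x[i] + y[i];   /* y = a*x + y with a = 2 */
        }

        printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
        return 0;
    }

Keeping data resident on the device across multiple loops via the data region, rather than copying around every kernel, is exactly the kind of decision the performance-analysis part of the session motivates.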


Monday September 23, 2013 1:00pm - 4:30pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

1:00pm

1/2 day Tutorial - Hadoop on Cray Cluster Solutions Systems and Introduction to MapReduce Programming and Generalizing MapReduce as a Unified Cloud and HPC Runtime
Limited Capacity seats available

Hadoop is frequently the first thing that comes to mind when "Big Data" is being discussed. Arguably the most common Map/Reduce software system available, its use with "Scientific Big Data" (SBD) frequently spawns questions regarding its use on HPC systems. In this tutorial we provide an overview of using Hadoop 2 with YARN - from multiple distributions - on Cray Cluster Solutions systems. Topics to be discussed include simple software installation and configuration on the Cray CS300 cluster supercomputer, performance tuning opportunities, and suggested cluster system configurations. The use of Hadoop's Distributed File System (HDFS) with SBD will be discussed, including topics ranging from choices in storage technologies to strategies for using Hadoop with existing SBD information and storage formats. In the final hour, Judy Qiu from Indiana University will conclude the Hadoop tutorial: Many scientific applications are data intensive. It is estimated that organizations with high-end computing infrastructures and data centers are doubling the amount of data that they archive every year. Twister extends Hadoop MapReduce to enable HPC-Cloud interoperability. We show how to apply Twister to support large-scale iterative computations that are common in many important data mining and machine learning applications.
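
Hadoop itself is written in Java, but its Streaming interface lets any executable that reads lines on stdin and emits tab-separated key/value pairs on stdout act as a mapper or reducer. As a rough illustration of the MapReduce programming model discussed in the tutorial (not material from the tutorial itself), a word-count mapper can be as small as:

    #include <stdio.h>
    #include <ctype.h>

    /* Hadoop Streaming mapper: read text on stdin, emit "word<TAB>1" per word.
       A companion reducer would sum the counts for each key. */
    int main(void) {
        int c, in_word = 0;
        while ((c = getchar()) != EOF) {
            if (isspace(c)) {
                if (in_word) { printf("\t1\n"); in_word = 0; }
            } else {
                putchar(tolower(c));
                in_word = 1;
            }
        }
        if (in_word) printf("\t1\n");
        return 0;
    }

Both executables are handed to the Hadoop Streaming jar via its -mapper and -reducer options; the framework handles splitting the input, shuffling the keys, and gathering the reduced output.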


Monday September 23, 2013 1:00pm - 4:30pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

1:00pm

1/2 day Tutorial - Parallel I/O - for Reading and Writing Large Files in Parallel
Limited Capacity seats available

Hands-on (planned): A workstation or personal laptop (with SSH access) and access to Stampede will be required. Draft Agenda
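
As a taste of the material such a tutorial typically covers (a sketch under the assumption of MPI-IO, not the tutorial's own exercises), each MPI rank below writes its own block of a single shared file with a collective call, which lets the MPI library aggregate the requests into large, well-aligned operations:

    #include <mpi.h>
    #include <stdio.h>

    #define COUNT 1024   /* integers written by each rank */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int buf[COUNT];
        for (int i = 0; i < COUNT; i++) buf[i] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes at its own offset; the _all variant makes the
           write collective so the I/O layer can coordinate across ranks. */
        MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(int);
        MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }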


Monday September 23, 2013 1:00pm - 4:30pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

2:30pm

Afternoon Break
Monday September 23, 2013 2:30pm - 3:00pm
Tutorial Rooms (Hilton) 120 W. Market St, Indianapolis, IN

6:00pm

Student Dinner and Tour of IU PTI Research Technologies Advanced Visualization Lab facilities
Meet in the Hilton Hotel Lobby at 5:30 to walk over to the ICTC. Students will be introduced to their mentors and will have the opportunity to see the visualization facilities and network with their peers.


Monday September 23, 2013 6:00pm - 8:00pm
IUPUI Informatics & Communications Technology Complex (ICTC) 535 W. Michigan St., Indianapolis, IN
 
Tuesday, September 24
 

6:30am

IEEE Cluster 2013 Office
Tuesday September 24, 2013 6:30am - 9:00pm
09th Floor - Room 911 120 W. Market St, Indianapolis, IN

8:00am

Breakfast

Tuesday September 24, 2013 8:00am - 9:00am
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

9:00am

Opening Keynote - David E. Keyes, King Abdullah University of Science and Technology (KAUST)
Keyes is the founding dean of KAUST's division of Computer, Electrical, and Mathematical Sciences and Engineering. He is also an adjunct professor in applied physics and applied mathematics at Columbia University, and an affiliate of several laboratories of the US Department of Energy. Keyes' work focuses on the algorithmic interface between parallel computing and the numerical analysis of partial differential equations, with an emphasis on scalable solvers for emerging extreme architectures that require drastic reductions in communication and synchronization. For his algorithmic influence on scientific simulations, Keyes has been recognized as a Fellow of the Society for Industrial and Applied Mathematics (SIAM) and of the American Mathematical Society. Keyes' other honors include the IEEE Computer Society's Sidney Fernbach Award, the Association for Computing Machinery's Gordon Bell Prize, and the 2011 SIAM Prize for Distinguished Service to the Profession.


Tuesday September 24, 2013 9:00am - 10:25am
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

10:30am

Morning Break

Tuesday September 24, 2013 10:30am - 10:55am
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

11:00am

A-Cache: Resolving Cache Interference for Distributed Storage with Mixed Workloads
Distributed key-value stores employ large main memory caches to mitigate the high costs of disk access. A challenge for such caches is that large scale distributed stores simultaneously face multiple workloads, often with drastically different characteristics. Interference between such competing workloads leads to performance degradation through inefficient use of the main memory cache. This paper diagnoses the cache interference seen for representative workloads and then develops A-Cache, an adaptive set of main memory caching methods for distributed key-value stores. Focused on read performance for common workload patterns, A-Cache leads to throughput improvements of up to 150% for competing data-intensive applications running on server class machines.


Tuesday September 24, 2013 11:00am - 11:25am
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

11:00am

Co-processing SPMD Computation on CPUs and GPUs cluster
Heterogeneous parallel systems with multiple processors and accelerators are becoming ubiquitous due to better cost-performance and energy efficiency. These heterogeneous processor architectures have different instruction sets and are optimized for either task-latency or throughput purposes. Challenges arise in programmability and performance when running SPMD tasks on heterogeneous devices. To meet these challenges, we implemented a parallel runtime system used to co-process SPMD computation on CPU and GPU clusters. Furthermore, we propose an analytic model to automatically schedule SPMD tasks on heterogeneous clusters. Our analytic model is derived from the roofline model, and therefore it can be applied to a wide range of SPMD applications and hardware devices. The experimental results of the C-means, GMM, and GEMV applications show good speedup in practical heterogeneous cluster environments.


Tuesday September 24, 2013 11:00am - 11:25am
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

11:00am

Communication and topology-aware load balancing in Charm++ with TreeMatch
Programming multicore or manycore architectures is a hard challenge, particularly if one wants to fully take advantage of their computing power. Moreover, a hierarchical topology implies that communication performance is heterogeneous, and this characteristic should also be exploited. We developed two load balancers for Charm++ that take both aspects into account, depending on whether the application is compute-bound or communication-bound. This work is based on our TreeMatch library, which computes process placement in order to reduce an application's communication costs based on the hardware topology. We show that the proposed load-balancing schemes manage to improve the execution times for the two aforementioned classes of parallel applications.


Tuesday September 24, 2013 11:00am - 11:25am
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

11:30am

Design of Network Topology Aware Scheduling Services for Large InfiniBand Clusters
The goal of any scheduler is to satisfy users' demands for computation and to achieve good performance in overall system utilization by efficiently assigning jobs to resources. However, current state-of-the-art scheduling techniques do not intelligently balance node allocation based on the total bandwidth available between switches, which leads to oversubscription. Additionally, poor placement of processes can lead to network congestion and poor performance. In this paper, we explore the design of a network-topology-aware plugin for the SLURM job scheduler for modern InfiniBand based clusters. We present designs to enhance the performance of applications with varying communication characteristics. Through our techniques, we are able to considerably reduce the amount of network contention observed during the Alltoall / FFT operations. The results of our experimental evaluation indicate that our proposed technique is able to deliver up to a 9% improvement in the communication time of P3DFFT at 512 processes. We also see that our techniques are able to increase the performance of microbenchmarks that rely on point-to-point operations by up to 40% for all message sizes. Our techniques were also able to improve the throughput of a typical supercomputing system by up to 8%.


Tuesday September 24, 2013 11:30am - 11:55am
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

11:30am

V2-Code: A New Non-MDS Array Code with Optimal Reconstruction Performance in RAID-6
RAID-6 is widely used to tolerate concurrent failures of any two disks in both disk arrays and storage clusters. Numerous erasure codes have been developed to implement RAID-6, of which MDS Codes are popular. Due to the limitation of parity generating schemes used in MDS codes, RAID-6-based storage systems suffer from low reconstruction performance. To address this issue, we propose a new class of XOR-based RAID-6 code (i.e., V2-Code), which delivers better reconstruction performance than the MDS RAID-6 code at low storage efficiency cost. V2-Code, a very simple yet flexible Non-MDS vertical code, can be easily implemented in storage systems. V2-Code’s unique features include (1) lowest density, (2) steady length of parity chain, and (3) well balanced computation. We perform theoretical analysis and evaluation of the coding scheme under various configurations. The results show that V2-Code is a well-established RAID-6 code that outperforms both X-Code and Code-M in terms of reconstruction time. V2-Code can speed up the reconstruction time of X-Code by a factor of up to 3.31 and 1.79 under single disk failure and double disk failures, respectively.


Tuesday September 24, 2013 11:30am - 11:55am
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

11:30am

PRESENTER UNAVAILABLE: Lit: A High Performance Massive Data Computing Framework Based on CPU/GPU Cluster
Big data processing is receiving a significant amount of interest as an important technology to reveal the information behind the data, such as trends and characteristics. MapReduce is considered the most efficient distributed parallel data processing framework. However, some high-end applications, especially some scientific analyses, have both data-intensive and computation-intensive features. Current big data processing techniques like Hadoop are not designed for computation-intensive applications and thus have insufficient computational power. In this paper, we present Lit, a high performance massive data computing framework based on a CPU/GPU cluster. Lit integrates GPUs with Hadoop to improve the computational power of each node in the cluster. Since the architecture and programming model of the GPU is different from the CPU, Lit provides an annotation-based approach to automatically generate CUDA code from Hadoop code. Lit hides the complexity of programming on a CPU/GPU cluster by providing an extended compiler and optimizer. To utilize the simplified programming, scalability, and fault tolerance benefits of Hadoop and combine them with the high performance computation power of GPUs, Lit extends Hadoop with a GPUClassloader to detect the GPU, generate and compile CUDA code, and invoke the shared library. Our experimental results show that Lit can achieve an average speedup of 1x to 3x on three typical applications over Hadoop.


Tuesday September 24, 2013 11:30am - 11:55am
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

12:00pm

Lunch

Tuesday September 24, 2013 12:00pm - 1:30pm
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

1:30pm

Plenary Talk - Jessie J. Walker, University of Arkansas at Pine Bluff
Jessie Walker is Coordinator of Computer Science at the University of Arkansas at Pine Bluff, a small HBCU located in the Arkansas Delta. Over the last six years he has helped HBCUs and teaching-oriented institutions within Arkansas to leverage HPC resources and training as a core component of undergraduate education. He has also helped to develop a unique organization within Arkansas known as the Arkansas Minority Cyberinfrastructure Training, Education Consortium (AMC-TEC), with the major goal of empowering HBCUs and teaching-oriented institutions in Arkansas to acquire and utilize cyberinfrastructure resources, both locally and nationally, as an essential element of their undergraduate curriculum. As a result of these activities, HBCUs within Arkansas have developed curricula that integrate new, innovative undergraduate courses in bioinformatics, computational sciences, data analytics, simulation/modeling, and digital humanities. A recent result of these activities has been the development of computational communal research/education labs at HBCU campuses in Arkansas, and the recent acquisition of an HPC and visualization center at the University of Arkansas at Pine Bluff.


Tuesday September 24, 2013 1:30pm - 2:25pm
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

2:30pm

GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters
Modern GPUs are powerful high-core-count processors, which are no longer used solely for graphics applications, but are also employed to accelerate computationally intensive general-purpose tasks. For utmost performance, GPUs are distributed throughout the cluster to process parallel programs. In fact, many recent high-performance systems in the TOP500 list are such heterogeneous architectures. Despite being highly effective processing units, GPUs on different hosts are incapable of communicating without assistance from a CPU. As a result, communication between distributed GPUs suffers from unnecessary overhead, introduced by switching control flow from GPUs to CPUs and vice versa. Most communication libraries even require intermediate copies from GPU memory to host memory. This overhead in particular penalizes small data movements and synchronization operations, reduces efficiency and limits scalability. In this work we introduce Global GPU Address Spaces (GGAS) to facilitate direct communication between distributed GPUs without CPU involvement. Avoiding context switches and unnecessary copying dramatically reduces communication overhead. We evaluate our approach using a variety of workloads including low-level latency and bandwidth benchmarks, basic synchronization primitives like barriers, and a stencil computation as an example application. We see performance benefits of up to 2x for basic benchmarks and up to 1.67x for stencil computations.


Tuesday September 24, 2013 2:30pm - 2:55pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

2:30pm

Mercury: Enabling Remote Procedure Call for High-Performance Computing
Remote procedure call (RPC) is a technique that has been largely adopted by distributed services. This technique, now more and more used in the context of high-performance computing (HPC), allows the execution of routines to be delegated to remote nodes, which can be set aside and dedicated to specific tasks. However, existing RPC frameworks assume a socket-based network interface (usually on top of TCP/IP), which is not appropriate for HPC systems, because this API does not typically map well to the native network transport used on those systems, resulting in lower network performance. In addition, existing RPC frameworks often do not support handling large data arguments, such as those found in read or write calls. We present in this paper an asynchronous RPC interface, called Mercury, specifically designed for use in HPC systems. The interface allows asynchronous transfer of parameters and execution requests and provides direct support of large data arguments. Mercury is generic in order to allow any function call to be shipped. Additionally, the network implementation is abstracted, allowing easy porting to future systems and efficient use of existing native transport mechanisms.


Tuesday September 24, 2013 2:30pm - 2:55pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

2:30pm

PRESENTER UNAVAILABLE: I/O Scheduling for Solid State Devices in Virtual Machines
Solid State Devices (SSDs) are supplementing and gradually replacing traditional mechanical hard drives to become the mainstream storage devices, offering better performance and lower power consumption. However, the disk I/O schedulers designed for traditional disks do not consider the characteristics of SSDs. Additionally, in a virtualized environment a virtual machine's I/O requests are scheduled twice, in the guest and host OSes, and the request latency characteristics observed at the two places are quite different. Based on the characteristics of SSDs, we design an adaptive I/O scheduler that gives read requests higher priority, and we further analyze the best possible combination of I/O schedulers for the guest and host OSes. The experimental results show that the average delay of read requests with our adaptive scheduler is reduced by about 11.55 percent compared to an I/O scheduler with a fixed dispatch ratio of read and write requests. Meanwhile, when the guest and the host both adopt our scheduler, its average delay outperforms the others by about 11.13 percent.


Tuesday September 24, 2013 2:30pm - 2:55pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

3:00pm

Checkpoint-Restart for a Network of Virtual Machines
The ability to easily deploy parallel computations on the Cloud is becoming ever more important. The first uniform mechanism for checkpointing a network of virtual machines is described. This is important for the parallel versions of common productivity software. Potential examples of parallelism include Simulink for MATLAB, parallel R for the R statistical modelling language, parallel_blast.py for the BLAST bioinformatics software, IPython.parallel for Python, and GNU parallel for parallel shells. The checkpoint mechanism is implemented as a plugin in the DMTCP checkpoint-restart package. It operates on KVM/QEMU, and has also been adapted to Lguest and pure user-space QEMU. The plugin is surprisingly compact, comprising just 400 lines of code to checkpoint a single virtual machine, and 200 lines of code for a plugin to support saving and restoring network state. Incremental checkpoints of the associated virtual filesystem are accommodated through the Btrfs filesystem. Experiments demonstrate checkpoint times of a fraction of a second by using forked checkpointing, mmap-based restart, and incremental Btrfs-based snapshots.


Tuesday September 24, 2013 3:00pm - 3:25pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

3:00pm

Oncilla: A GAS Runtime for Efficient Resource Allocation and Data Movement in Accelerated Clusters
Accelerated and in-core implementations of Big Data applications typically require large amounts of host and accelerator memory as well as efficient mechanisms for transferring data to and from accelerators in heterogeneous clusters. Scheduling for heterogeneous CPU and GPU clusters has been investigated in depth in the high-performance computing (HPC) and cloud computing arenas, but there has been less emphasis on the management of the cluster resources that are required to schedule applications across multiple nodes and devices. Previous approaches to address this resource management problem have focused on either using low-performance software layers or on adapting complex data movement techniques from the HPC arena, which reduces performance and creates barriers for migrating applications to new heterogeneous cluster architectures. This work proposes a new system architecture for cluster resource allocation and data movement built around the concept of managed Global Address Spaces (GAS), or dynamically aggregated memory regions that span multiple nodes. We propose a software layer called Oncilla that uses a simple runtime and API to take advantage of non-coherent hardware support for GAS. The Oncilla runtime is evaluated using two different high-performance networks for microkernels representative of the TPC-H data warehousing benchmark, and this runtime enables a reduction in runtime of up to 81%, on average, when compared with standard disk-based data storage techniques. The use of the Oncilla API is also evaluated for a simple breadth-first search (BFS) benchmark to demonstrate how existing applications can incorporate support for managed GAS.


Tuesday September 24, 2013 3:00pm - 3:25pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

3:00pm

Influence of InfiniBand FDR on the Performance of Remote GPU Virtualization
The use of GPUs to accelerate general-purpose scientific and engineering applications is mainstream nowadays, but their adoption in current high performance computing clusters is primarily impaired by acquisition costs and power consumption. Therefore, the benefits of sharing a reduced number of GPUs among all the nodes of a cluster are overwhelming for many applications. This approach, usually referred to as remote GPU virtualization, aims at reducing the number of GPUs present in a cluster while increasing their utilization rate. The performance of the interconnection network is key to achieving reasonable performance results when using remote GPU virtualization. In this line, several networking technologies with throughput comparable to that of PCI Express have appeared recently. In this paper we analyze the influence of InfiniBand FDR on the performance of remote GPU virtualization, comparing its effect for a variety of GPU-accelerated applications against other networking technologies, such as InfiniBand QDR or Gigabit Ethernet. Given the severe limitations of freely available remote GPU virtualization solutions, the rCUDA framework is used as the case study for this analysis.


Tuesday September 24, 2013 3:00pm - 3:25pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

3:30pm

Afternoon Break

Tuesday September 24, 2013 3:30pm - 4:00pm
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

4:00pm

EDR: An Energy-Aware Runtime Load Distribution System for Data-Intensive Applications in the Cloud
Data centers account for a growing percentage of US power consumption. Energy efficiency is now a first-class design constraint for the data centers that support cloud services. Service providers must distribute their data efficiently across multiple data centers. This includes creation of data replicas that provide multiple copies of data for efficient access. However, selecting replicas to maximize performance while minimizing energy waste is an open problem. State-of-the-art replica selection approaches either do not address energy, lack scalability, and/or are vulnerable to crashes due to use of a centralized coordinator. Therefore, we propose, develop and evaluate a simple cost-oriented decentralized replica selection system named EDR, implemented with two distributed optimization algorithms. We demonstrate experimentally the cost differences in various replica selection scenarios and show that our novel approach is as fast as the best available decentralized approach, DONAR, while additionally considering dynamic energy costs. We show that an average of 12% savings on total system energy costs can be achieved by using EDR for several data-intensive applications.


Tuesday September 24, 2013 4:00pm - 4:25pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

4:00pm

FlexQuery: An Online Query System for Interactive Remote Visual Data Exploration at Large Scale
The remote visual exploration of live data generated by scientific simulations is useful for scientific discovery, performance monitoring, and online validation of simulation results. Online visualization methods are challenged, however, by the continued growth in the volume of simulation output data that has to be transferred from its source – the simulation running on the high end machine – to where it is analyzed, visualized, and displayed. A specific challenge in this context is the limited communication bandwidth between data source(s) and sinks. Prior work has offered data reduction capabilities, but such work does not address the common scenario in which scientists make multiple simultaneous and different queries about the data being produced. This paper considers the general case in which science users are interested in different (sub)sets of the data produced by a high end simulation. We offer the FlexQuery online data query system that can deploy and execute data queries (e.g., those used for purposes of data visualization) 'along' the I/O and analytics pipelines. FlexQuery carefully extends such analytics pipelines, using online performance monitoring and data location tracking, to realize data queries in ways that minimize additional data movement and offer low latency in data query execution. Using a real-world scientific application – the Maya astrophysics code and its analytics workflow – we demonstrate FlexQuery's ability to dynamically deploy queries for low-latency remote data visualization.


Tuesday September 24, 2013 4:00pm - 4:25pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

4:00pm

BoF: YARN - A Tale of Adventures Beyond Hadoop 1
The latest release of Apache Hadoop has been adopted by many but information on how to install, configure, deploy, and tune Hadoop 2 - including YARN - is still sparse. We will begin the discussion with an overview of experiences with Apache Hadoop 2 on Cray Cluster Solutions systems, sharing details of how it was installed, configured, and tuned (an ongoing effort). Topics of discussion will include configuration variations, benefits and drawbacks, tuning opportunities, I/O configuration options, etc.


Tuesday September 24, 2013 4:00pm - 4:55pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

4:30pm

Thermal Aware Automated Load Balancing for HPC Applications
As we move towards the exascale era, power and energy become major challenges. Some supercomputers draw more than 10 megawatts, leading to high energy bills. A significant portion of this energy is spent on cooling. In this paper, we propose an adaptive control system that minimizes the cooling energy by using Dynamic Voltage and Frequency Scaling to control the temperature and by performing load balancing. This framework, which is part of the adaptive runtime system, monitors the system and application characteristics and triggers mechanisms to limit the temperature. It also performs load balancing whenever imbalance is detected and load balancing is beneficial. We demonstrate, using a set of applications and benchmarks, that the proposed framework can control the temperature of the cores effectively and reduce the timing penalty automatically without any support from the user.
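
The DVFS control that such a framework relies on is exposed on Linux through the cpufreq sysfs interface; the sketch below (an illustration of the underlying knob only, not the authors' runtime system) caps one core's frequency from user space:

    #include <stdio.h>

    /* Cap the maximum frequency of one core via the Linux cpufreq sysfs
       interface. Requires root; the value is in kHz and must be one the
       driver supports. */
    static int set_max_freq_khz(int cpu, long khz) {
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);
        FILE *f = fopen(path, "w");
        if (!f) return -1;
        fprintf(f, "%ld\n", khz);
        fclose(f);
        return 0;
    }

    int main(void) {
        /* Example: limit core 0 to 1.6 GHz to hold its temperature down. */
        if (set_max_freq_khz(0, 1600000) != 0)
            perror("scaling_max_freq");
        return 0;
    }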


Tuesday September 24, 2013 4:30pm - 4:55pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

4:30pm

K MapReduce: A Scalable Tool for Data-Processing and Search/Ensemble Applications on Large-Scale Supercomputers
KMR (K MapReduce) is a high-performance MapReduce system for the MPI environment, targeting large-scale supercomputers such as the K computer. Its objectives are to ease programming of data processing and to achieve efficiency by utilizing the plentiful memory available in large-scale supercomputers. KMR shuffles key-value pairs in a highly scalable way using log-step message-passing algorithms. Multi-threaded mapping and reducing allow KMR to achieve further efficiency on modern multi-core machines. Sorting is used extensively inside shuffling and reducing, and is optimized in KMR by using packed fixed-length keys instead of raw variable-length keys. Besides the MapReduce operations, KMR provides routines for collective file reading with affinity-aware optimizations. This paper presents results of experimental performance studies of KMR on the K computer. Our affinity-aware file loading improves performance by about 42% compared with a non-optimized implementation. We also show how KMR can be used to implement real-world scientific applications, including meta-genome search and replica-exchange molecular dynamics.


Tuesday September 24, 2013 4:30pm - 4:55pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

6:00pm

Dinner Reception with Poster Sessions and Visualization Showcase with Big Robot Ensemble
Big Robot Ensemble. The Indianapolis-based trio Big Robot creates live, video-enriched art and music, interweaving aesthetic expression with computer interactivity and networked technology. Integrating audio/video design, live percussion, and electronics, the group creates a multi-dimensional performance that explores the cross points of virtual and physical gesture, sound, and space. Big Robot employs interactive processes such as motion tracking, acoustic instrument sampling, audio processing, integration of real-time video, and the use of telecommunications software/devices. The group began in 2007 when Scott Deal, Michael Drews, and Jordan Munson, each newly relocated to Indianapolis, Indiana, met and began building ideas based on their artistic interests. While each member of the group plays a unique role, together they possess very similar musical influences of rock, jazz, electronic, and new music that have shaped the artistic viewpoint of the group.
Big Robot performs in live venues as well as telematically over research-grade, high-bandwidth Internet from their studio. The group has presented concerts and residencies throughout the United States since 2009. Their self-titled debut DVD was released in June 2013. Big Robot is a collection of works composed by the group that integrates music and videography with computer interactivity.

For more information about the group and to view videos of their performances visit: http://www.bigrobot.org.

Artists

Big Robot Ensemble

Jordan Munson

Lecturer in Music at IUPUI, Munson holds a Bachelor of Music degree in percussion from the University of Kentucky and a Master of Science in Music Technology degree from Indiana University School of Music at IUPUI. As a percussionist, he has expanded his performance repertoire to include experimental electronic work and contemporary percussion literature. He often performs solo literature, mostly in the form of composed improvisation with a focus on live...

Michael Drews

Drews is a composer of contemporary acoustic and electronic music and Assistant Professor of Music at Indiana University-Purdue University Indianapolis (IUPUI). His music explores unconventional narrative strategies and the use of interactive music technology to expand traditional ideas of musical performance and creativity. Drews' compositions have been performed in Europe, South America, and throughout the United States. Broken Symmetry for oboe, piano, and...

Scott Deal

Deal has performed throughout North America, Asia, and Europe. He has premiered solo, chamber, and mixed-media works, and can be heard on the Albany, Centaur, Cold Blue and SCI labels. Sequenza 21 wrote that he "presents a riveting performance". Deal's recordings have been described as "soaring, shimmering explorations of resplendent mood and incredible scale" and "sublimely performed". His recent recording of John...


Tuesday September 24, 2013 6:00pm - 8:30pm
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

6:00pm

Poster Session with Big Robot Ensemble
#03 Towards High-Performance and Cost-Effective Distributed Storage Systems with Information Dispersal Algorithms - Dongfang Zhao and Ioan Raicu, Student Poster (Grad/PhD);

#06 BASE: Benchmark Analysis Software for Energy-efficient Solutions in Large-scale Storage Systems - Tseng-Yi Chen, Hsin-Wen Wei, Ying-Jie Chen, Tsan-Sheng Hsu and Wei-Kuan Shih, Student Poster (Grad/PhD);

#24 A Synthetic Bursty Workload Generation Method for Web 2.0 Benchmark - Jianwei Yin, Hanwei Chen, Xingjian Lu and Xinkui Zhao;

#31 AptStore: Dynamic Storage Management for Hadoop - Krishnaraj Ravindranathan, Aleksandr Khasymski, Ali Butt, Sameer Tiwari and Milind Bhandarkar, Student Poster (Grad/PhD);

#35 Rockhopper, a True HPC System Built with Cloud Concepts - Richard Knepper, Barbara Hallock, Craig Stewart, Matthew Link and Matthew Jacobs

#37 Optimizations on the Parallel Virtual File System Implementation Integrated with Object-Based Storage Devices - Cengiz Karakoyunlu and John Chandy, Student Poster (Grad/PhD);

#51 Counting Sort for the Live Migration of Virtual Machines - Qingxin Zou, Zhiyu Hao, Xu Cui, Xiaochun Yun and Yongzheng Zhang, Student Poster (Grad/PhD);

#53 Capturing Inter-Application Interference on Clusters - Aamer Shah, Felix Wolf, Sergey Zhumatiy and Vladimir Voevodin;

#59 On Transactional Memory Concurrency Control in Distributed Real-Time Programs - Sachin Hirve, Aaron Lindsay, Binoy Ravindran and Roberto Palmieri, Student Poster (Grad/PhD);

#64 Improving Performance and Energy Efficiency of Matrix Multiplication via Pipeline Broadcast - Li Tan, Longxiang Chen, Zizhong Chen, Ziliang Zong, Rong Ge and Dong Li, Student Poster (Grad/PhD);

#101 MOLAR: A Cost-Efficient, High-Performance Hybrid Storage Cache - Yi Liu, Xiongzi Ge, Xiaoxia Huang and David H.C. Du, Student Poster (Grad/PhD);

#117 Nekkloud: A Software Environment for High-order Finite Element Analysis on Clusters and Clouds - Jeremy Cohen, David Moxey, Chris Cantwell, Pavel Burovskiy, John Darlington and Spencer J. Sherwin;

#129 Unified and Efficient HEC Storage System with a Working-Set based Reorganization Scheme - Junjie Chen and Yong Chen, Student Poster (Grad/PhD);

#137 An Object Interface Storage Node for Clustered File Systems - Orko Momin and John A. Chandy, Student Poster (Grad/PhD);

#154 The Oklahoma PetaStore: Big Data on a Small Budget - Patrick Calhoun, David Akin, Joshua Alexander, Brett Zimmerman, Brandon George and Henry Neeman;

#173 Parallelizing Windowed Stream Joins in a Shared-Nothing Cluster - Abhirup Chakraborty and Ajit Singh;

#192 Integrating Deadline-Modification SCAN Algorithm to Xen-based Cloud Platform - Tseng-Yi Chen, Hsin-Wen Wei, Ying-Jie Chen, Wei-Kuan Shih and Tsan-Sheng Hsu;

#194 Twitter Bootstrap and AngularJS – Frontend Frameworks to expedite Science Gateway development - Viknes Balasubramanee, Chathuri Wimalasena, Raminderjeet Singh, Marlon Pierce, Student Poster;

#195 Parallelization of software pipelines using the mpififo tool - Nathan Weeks, Marina Kraeva and Glenn Luecke;

#196 - Runtime System Design of Decoupled Execution Paradigm for Data-Intensive High-End Computing - Kun Feng, Yanlong Yin, Chao Chen, Hassan Eslami, Xian-He Sun, Yong Chen, Rajeev Thakur and William Gropp, Student Poster (Grad/PhD);

#197 ECG Identification of Arrhythmias by using an Associative Petri Net - Dong-Her Shih, Hsiu-Sen Chiang and Ming-Hung Shih;

#198 Automotive Big Data - Tim Barrett, Graham Lenes, Ken Kennedy, Philipp Lix and Amy Apon, Student Poster;

#199 ConHA: An SOA-based API Gateway for Consolidating Heterogeneous HA Clusters - Mingyu Li, Qian Zhang, Hanyue Chu, Xiaohui Hu and Fanjiang Xu, Student Poster;

#203 On Service Migration in the Cloud to Facilitate Mobile Accesses - Yang Wang and Wei Shi;

#204 - Model-Driven Multisite Workflow Scheduling - Ketan Maheshwari, Eun-Sung Jung, Jiayuan Meng, Venkatram Vishwanath, Rajkumar Kettimuthu;

#205 Zput: a speedy data uploading approach for the Hadoop Distributed File System - Youwei Wang, Weiping Wang, Can Ma and Dan Meng, Student Poster;

#208 LittleFe - The High Performance Computing Education Appliance - Charles Peck, Ivan Babic, Jennifer Houchins, Mohammad Mobeen Ludin, Skylar Thompson, Aaron Weeden, Kristin Muterspaw and Elena Sergienko, Student Poster;

#213 Understanding the Performance of Stencil Computations on Intel's Xeon Phi - Joshua Peraza, Ananta Tiwari, Michael Laurenzano, Laura Carrington, William A. Ward and Roy Campbell


Tuesday September 24, 2013 6:00pm - 8:30pm
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

6:00pm

Visualization Showcase with Big Robot Ensemble
1) Daily Regional Weather Forecasts in Support of Vortex2. Quan Zhou, Beth Plale, Keith Danielson, Robert Ping, Janae Cummings and Alan Mauro. The Vortex2 campaign, funded by the National Science Foundation, was a 6-week effort in Spring 2010 to position instruments near mesoscale severe storms as they were forming. The Data To Insight Center of Indiana University executed short-term regional weather forecasts each day of the 6-week campaign. The visualization unfolds the daily location of a regional weather forecast done in support of the Vortex2 field effort that took place May-June 2010, using the World Wide Telescope (WWT) Tool - a web 2.0 visualization software environment. Moreover, this visualization uses time series and three-dimensional geospatial visualization techniques to display spatial distribution and temporal relations for the Vortex2 data set. The video employs voice-over and text to convey the importance of a forecast in relation to actual weather.

2) Gallery of images created in live leaves by chloroplast movements. Margaret Dolinsky and Roger Hangarter. Our objective is to showcase an art/science collaboration in which aesthetic experience is the vehicle to examine photosynthesis in the context of its central role in life on Earth. This collaboration has resulted in a collection of high resolution images and time-lapse movies that reveal the process of chloroplast movements. The images portray the movements as they occur from the level of a single cell to the whole leaves. The resultant images display how the subcellular changes affect the optical properties of leaves, which illustrates how chloroplast positioning maximizes photosynthesis. This presentation will give visitors the opportunity to examine the results of how chloroplasts act as living pixels that move in response to light to render the art in a living canvas. The exhibition will display a number of high resolution art images created in living leaves to illustrate the dynamic biology of green plants.

3) High Performance Computing for Designing Groundwater Remediation Systems. Loren Shawn Matott, Camden Reslink, Christine Baxter, Beth Hymiak, Shirmin Aziz, Adrian Levesque and Martins Innus. This research involves the design of cost-effective systems to safeguard the nation's groundwater supplies from contaminated sites. What looks like a colorful whirlpool in the image is actually a remediation cost-surface containing many peaks and valleys. Valley locations outside of the red area are sub-optimal and easily entrap computer algorithms which seek to identify the best design point on the cost surface. In this way the image dramatically illustrates the problem of "artificial minima" which bedevils engineers tasked with cleaning up solvent-contaminated groundwater.

4) Molecular Simulations of the Dynamic Properties of Wild Type and Mutated 14-3-3σ Proteins. Albert William, Michael Boyles, David Reagan, Jing-Yuan Liu and Divya Neelagiri. Protein-protein interactions are important for biological functions. Aberrant interaction events can cause diseases such as cancer and diabetes, yet how proteins recognize each other and form stable complexes is not fully understood. In this work, we used 14-3-3σ as a model and introduced small changes that alter the binding affinity of the protein chains. We then investigated the dynamic properties of the interfacial cores of the wild type and mutant 14-3-3σ by performing water explicit molecular dynamics simulations. We observed a highly packed interfacial core that has a low water exchange rate in the wild type 14-3-3σ but not in the F25G mutant. This suggests that the properties of the interfacial core are critical to specific protein-protein interactions and that the interfacial core may serve as the nucleation seeding site for strong hydrophobic interactions, which have been recognized as the driving force for protein association. The outcome of this study will help us understand the principal question of how proteins recognize each other. The molecular dynamics simulations were computed using Indiana University supercomputing resources. Visualizations were produced using the Visual Molecular Dynamics (VMD) software. Artistic enhancements and final video production were accomplished using Autodesk Maya and Adobe After Effects.

5) Places & Spaces: Mapping Science. Katy Börner and Todd Theriault. For centuries, cartographic maps of earth and water have guided human exploration. They have marked the border between the known and the unknown, firing the imagination and fueling the desire for new knowledge and new experience. Over time, geographic maps have become more accurate, more sophisticated, but the thirst for discovery, along with the need for maps to guide our travels, remains undiminished.

Today, our opportunities for discovery reside less in physical places than in abstract spaces. The sea of information is one such space, and it is ever growing, ever changing. Search engines can retrieve facts from this ocean of data, but they cannot answer larger questions about the seascape as a whole: How big is this ocean? How can we navigate to the useful islands of knowledge? How is knowledge interlinked on a global scale? In which areas is it worth investing time, effort, and resources?

Drawing from across cultures and across scholarly disciplines, the Places & Spaces: Mapping Science exhibit demonstrates the power of maps to address these vital questions about the contours and content of human knowledge. Created by leading figures in the natural, physical, and social sciences, scientometrics, visual arts, social and science policymaking, and the humanities, the maps in Places & Spaces allow us to better grasp the abstract contexts, relationships, and dynamism of human systems and collective intelligence.


Now entering its ninth year, the exhibit has traced the evolution of science maps, featuring the best examples of knowledge domain mapping, novel location-based cartographies, data visualizations, and science-inspired art works. Individually and as a whole, the maps of Places & Spaces allow data to tell stories which both the scientist and the layperson can understand and appreciate.

6) Visualization of Globular Star Clusters. David Reagan, William Sherman, Enrico Vesperini, Anna Lisa Varri and Chris Eller. Dr. Vesperini uses IU’s supercomputers to simulate the formation and dynamical evolution of globular star clusters. Indiana University’s Advanced Visualization Lab utilized the open-source application ParaView to create visual representations of three such simulations. The first shows the gravitational collapse of a globular cluster. The second follows the evolution of two stellar populations in a globular cluster. The third shows the evolution of a rotating stellar system with a central toroidal (or donut-shaped) structure.

7) Visualization of Nuclear Pasta. David Reagan, Andre S. Schneider, Charles J. Horowitz, Joseph Hughto, Don K. Berry, Eric A. Wernert, and Chris Eller. Some massive stars die in giant supernova explosions that squeeze all of the empty space out of atoms until their nuclei start to touch and interact in complex ways to form a neutron star 100 trillion times denser than water. Dr. Horowitz and his group use IU’s supercomputers to simulate these events, where nuclei merge into spaghetti- and lasagna-like structures called nuclear pasta. Indiana University’s Advanced Visualization Lab utilized the open-source application ParaView to create stereoscopic visualizations which allow the researchers to study the formation of these intricate structures and explore their properties.

8) Visualization of the Buffalo Inner and Outer Harbor. Martins Innus, Adrian Levesque, Ayla Abyad, Jacob Brubaker, and Hans Baumgartner. We will describe our experience developing an interactive visual application which supports the study of viable Buffalo Harbor Bridge locations and alternatives, as well as helping advance the project through an Environmental Impact Statement (EIS), by providing stakeholders and inte


Tuesday September 24, 2013 6:00pm - 8:30pm
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN
 
Wednesday, September 25
 

8:00am

Breakfast

Wednesday September 25, 2013 8:00am - 9:00am
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

8:00am

IEEE Cluster 2013 Office
Wednesday September 25, 2013 8:00am - 6:00pm
09th Floor - Room 911 120 W. Market St, Indianapolis, IN

9:00am

Plenary Talk - Designing, Deploying and Enabling Science on Stampede
The Texas Advanced Computing Center (TACC) designed the Stampede supercomputing system to be one of the most powerful production clusters in the world, and added an innovative new technology, Intel Xeon Phi coprocessors, for even more computational power. However, Stampede was designed for more than raw performance: it presents a comprehensive science environment to researchers, with a variety of integrated capabilities. Both its scale and its comprehensive capabilities benefit from the continuing advances of x86/Linux cluster technologies, and have helped make Stampede the most widely used petascale computing system in the world. This presentation will describe the design decisions that went into Stampede, discuss the deployment challenges leading up to production, and showcase many of the research projects in the ‘stampede of science’ now being enabled by this world-class cluster.


Wednesday September 25, 2013 9:00am - 9:55am
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

10:00am

Morning Break

Wednesday September 25, 2013 10:00am - 10:25am
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

10:30am

Using Clusters in Undergraduate Research: Distributed Animation Rendering, Photo Processing, and Image Transcoding
With distributed and parallel computing becoming increasingly important in both industrial and scientific endeavors, it is imperative that students are introduced to the challenges and methods of high performance and high throughput computing. Because these topics are often absent in standard undergraduate computer science curriculums, it is necessary to encourage and support independent research and study involving distributed and parallel computing. In this paper, we present three undergraduate research projects that utilize distributed computing clusters: animation rendering, photo processing, and image transcoding. We describe the challenges faced in each project, examine the solutions developed by the students, and then evaluate the performance and behavior of each system. At the end, we reflect on our experience using distributed clusters in undergraduate research and offer six general guidelines for mentoring and pursuing distributed computing research projects with undergraduates. Overall, these projects effectively promote skills in high performance and throughput computing while enhancing the undergraduate educational experience.


Wednesday September 25, 2013 10:30am - 10:55am
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

10:30am

Optimizing Power Allocation to CPU and Memory Subsystems in Overprovisioned HPC Systems
Energy consumption and power draw pose two major challenges to the HPC community for designing larger systems. Present day HPC systems consume as much as 10MW of electricity and this is fast becoming a bottleneck. Although energy bills will significantly increase with machine size, power consumption is a hard constraint that must be addressed. Intel's Running Average Power Limit (RAPL) toolkit is a recent feature that enables power capping of CPU and memory subsystems on modern hardware. In this paper, we use RAPL to evaluate the possibility of improving execution time efficiency of an application by capping power while adding more nodes. We profile the strong scaling of an application using different power caps for both CPU and memory subsystems. Our proposed interpolation scheme uses an application profile to optimize the number of nodes and the distribution of power between CPU and memory subsystems to minimize execution time under a strict power budget. We validate these estimates by running experiments on a 20-node Sandy Bridge cluster. Our experimental results closely match the model estimates and show speedups greater than 1.47X for all applications compared to not capping CPU and memory power. We demonstrate that the quality of solution that our interpolation scheme provides matches very closely to results obtained via exhaustive profiling.
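
On Linux, RAPL is reachable through the msr driver (or, on newer kernels, the powercap sysfs tree). The sketch below (illustrative only, not the authors' tooling) reads the package energy counter using Intel's documented MSR_RAPL_POWER_UNIT (0x606) and MSR_PKG_ENERGY_STATUS (0x611) registers; sampling the counter twice and dividing by the elapsed time gives average package power, and the companion power-limit register is what RAPL-based capping writes:

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Read one model-specific register via the Linux msr driver
       (requires root and the msr kernel module). */
    static uint64_t read_msr(int fd, uint32_t reg) {
        uint64_t val = 0;
        pread(fd, &val, sizeof(val), reg);
        return val;
    }

    int main(void) {
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

        /* Energy unit: 1 / 2^bits[12:8] joules (MSR_RAPL_POWER_UNIT, 0x606). */
        uint64_t units = read_msr(fd, 0x606);
        double joules_per_tick = 1.0 / (double)(1 << ((units >> 8) & 0x1f));

        /* Package energy consumed so far (MSR_PKG_ENERGY_STATUS, 0x611). */
        uint64_t raw = read_msr(fd, 0x611) & 0xffffffffULL;
        printf("package energy counter: %.3f J\n", raw * joules_per_tick);

        close(fd);
        return 0;
    }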


Wednesday September 25, 2013 10:30am - 10:55am
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

10:30am

Highly Optimized Full GPU-Acceleration of Non-hydrostatic Weather Model SCALE-LES
SCALE-LES is a non-hydrostatic weather model developed at RIKEN, Japan. It is intended to be a global high-resolution model that will scale to exascale systems. This paper introduces the full GPU acceleration of all SCALE-LES modules. Moreover, the paper demonstrates strategies for handling the unique challenges of accelerating SCALE-LES on GPUs. The proposed acceleration is important for identifying the expectations and requirements of scaling SCALE-LES, and similar real-world applications, into the exascale era. The GPU implementation includes the optimized GPU acceleration of SCALE-LES for a single GPU with both CUDA Fortran and OpenACC. It also includes scaling SCALE-LES to GPU-accelerated clusters. The results and analysis show how the optimization strategies affect the performance gain in SCALE-LES when moving from conventional CPU clusters towards GPU-powered clusters.


Wednesday September 25, 2013 10:30am - 10:55am
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

11:00am

Using a Shared, Remote Cluster for Teaching HPC
Production clusters are a common environment for programming assignments in courses involving High Performance Computing, but they can present challenges, especially for courses at institutions remote from these resources. Through the Oklahoma Cyberinfrastructure Initiative, the University of Oklahoma has been providing its centrally managed clusters for use by courses statewide. This paper explores mechanisms for, and advocates in favor of, using this shared, remote resource for teaching.

Speakers

Henry Neeman

OU Information Technology
Henry Neeman is founding Director of the OU Supercomputing Center for Education & Research (OSCER), Assistant Vice President for Information Technology - Research Strategy Advisor, Associate Professor of Engineering, and Adjunct Associate Professor in Computer Science at the University of Oklahoma (OU). He received his BS in Computer Science and his BA in Statistics from the State University of New York at Buffalo in 1987, his MS in CS from...

Authors
Henry Neeman (OU Information Technology)

Wednesday September 25, 2013 11:00am - 11:25am
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

11:00am

A Case of System-Wide Power Management for Scientific Applications
The advance of high-performance computing systems towards exascale will be constrained by the systems' energy consumption levels. Large numbers of processing components, memory, interconnects, and storage components must all be considered to achieve exascale performance within a targeted energy bound. While application-aware power allocation schemes for computing resources are well studied, a portable and scalable budget-constrained power management scheme for scientific applications on exascale systems is still required. Execution activities within scientific applications can be categorized as CPU-bound, I/O-bound, and communication-bound. Such activities tend to be clustered into 'phases', offering opportunities to manage their power consumption separately. Our experiments have demonstrated that the performance and energy consumption of these phases are affected differently by CPU frequency, offering an opportunity to fine-tune CPU frequency with minimal impact on total execution time but significant savings in energy consumption. By exploiting this opportunity, we present a phase-aware hierarchical power management framework that can opportunistically deliver good tradeoffs between system power consumption and application performance under a power budget. Our hierarchical power management framework consists of two main techniques: Phase-Aware CPU Frequency Scaling (PAFS) and opportunistic provisioning for power-constrained performance optimization. We have performed a systematic evaluation using both simulations and representative scientific applications on real systems. Our results show that our techniques can achieve 4.3%-17% better energy efficiency for large-scale scientific applications.
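As a rough illustration of phase-aware frequency scaling (not the paper's PAFS implementation), the sketch below assumes a Linux cpufreq sysfs interface with the 'userspace' governor available and writable (typically requiring root), and lowers the clock only for phases that are not CPU-bound. The phase names and frequency table are hypothetical.

# Minimal phase-aware DVFS sketch (not the paper's PAFS framework). It assumes a
# Linux cpufreq sysfs interface and the 'userspace' governor; root privileges
# are typically required to write these files.
import glob

def set_cpu_khz(freq_khz):
    """Request a fixed frequency for every CPU via the userspace governor."""
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq"):
        try:
            with open(path + "/scaling_governor", "w") as f:
                f.write("userspace\n")
            with open(path + "/scaling_setspeed", "w") as f:
                f.write(f"{freq_khz}\n")
        except OSError as e:
            print(f"skipping {path}: {e}")

# Hypothetical per-phase frequency table: I/O- and communication-bound phases
# tolerate a lower clock with little impact on total run time.
PHASE_FREQ_KHZ = {"compute": 2_600_000, "io": 1_200_000, "comm": 1_600_000}

def run_phase(name, work):
    set_cpu_khz(PHASE_FREQ_KHZ[name])        # scale down before a non-CPU-bound phase
    work()                                   # ... run the phase ...
    set_cpu_khz(PHASE_FREQ_KHZ["compute"])   # restore the nominal frequency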


Wednesday September 25, 2013 11:00am - 11:25am
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

11:00am

Accelerating Subsurface Transport Simulation on Heterogeneous Clusters
Reactive transport numerical models simulate chemical and microbiological reactions that occur along a flowpath. These models have to compute reactions for a large number of locations. They solve the set of ordinary differential equations (ODEs) that describes the reactions at each location through the Newton-Raphson technique. This technique involves computing a Jacobian matrix and a residual vector for each set of equations, and then iteratively solving the linearized system through Gaussian elimination and LU decomposition until convergence. STOMP, a well-known subsurface flow simulation tool, employs matrices with sizes on the order of 100x100 elements and, for numerical accuracy, LU factorization with full pivoting instead of the faster partial pivoting. Modern high performance computing systems are heterogeneous machines whose nodes integrate both CPUs and GPUs, exposing unprecedented amounts of parallelism. To exploit all their computational power, applications must use both types of processing elements. For the case of subsurface flow simulation, this mainly requires implementing efficient batched LU-based solvers and identifying efficient solutions for load balancing among the different processors of the system.
In this paper we discuss two approaches that allow scaling STOMP's performance on heterogeneous clusters. We first identify the challenges in implementing batched LU-based solvers for small matrices on GPUs, and propose an implementation that fulfills STOMP's requirements. We compare this implementation to other existing solutions. We then combine the batched GPU solver with an OpenMP-based CPU solver, and present an adaptive load balancer that dynamically distributes the linear systems to be solved between the two components inside a node. We show how these approaches, integrated into the full application, provide speedups of 6 to 7 times on large problems, executed on up to 16 nodes of a cluster with two AMD Opteron 6272 processors and a Tesla M2090 per node.
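For readers unfamiliar with what a batched solver must do per system, the following NumPy sketch factorizes each small matrix with LU and full pivoting and then solves it. It is a CPU illustration of the per-matrix work only, with made-up inputs, not the GPU-batched implementation discussed in the paper.

import numpy as np

def lu_full_pivot(A):
    """LU with full pivoting on a copy of A: returns (LU, row_perm, col_perm) with P A Q = L U."""
    A = A.copy()
    n = A.shape[0]
    p = np.arange(n)   # row permutation
    q = np.arange(n)   # column permutation
    for k in range(n - 1):
        # Pick the largest remaining entry as the pivot (full pivoting).
        sub = np.abs(A[k:, k:])
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        i += k; j += k
        A[[k, i], :] = A[[i, k], :]; p[[k, i]] = p[[i, k]]
        A[:, [k, j]] = A[:, [j, k]]; q[[k, j]] = q[[j, k]]
        # Gaussian elimination below the pivot.
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A, p, q

def solve(LU, p, q, b):
    """Solve A x = b given P A Q = L U."""
    n = LU.shape[0]
    y = b[p].astype(float)
    for k in range(n):                      # forward substitution (unit lower triangle)
        y[k] -= LU[k, :k] @ y[:k]
    z = np.zeros(n)
    for k in range(n - 1, -1, -1):          # back substitution
        z[k] = (y[k] - LU[k, k+1:] @ z[k+1:]) / LU[k, k]
    x = np.empty(n)
    x[q] = z                                # undo the column permutation
    return x

# A "batch" of small Jacobian-like systems, solved one after another on the CPU;
# a GPU batched solver would run many of these factorizations concurrently.
rng = np.random.default_rng(0)
batch_A = rng.standard_normal((4, 100, 100)) + 100 * np.eye(100)
batch_b = rng.standard_normal((4, 100))
for A, b in zip(batch_A, batch_b):
    LU, p, q = lu_full_pivot(A)
    x = solve(LU, p, q, b)
    assert np.allclose(A @ x, b)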


Wednesday September 25, 2013 11:00am - 11:25am
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

11:30am

Teaching undergraduates using local virtual clusters
We describe five years of experience using Beowulf clusters of virtual machines in teaching and research. Each virtual node in each undergraduate student-managed cluster is hosted on a standard classroom/laboratory machine, otherwise dedicated to general computing in support of a CS curriculum. Highlighted applications include teaching uses ranging from introductory CS to advanced projects, research efforts including classical analysis of cluster performance and map-reduce computations in social-networking research, and numerous interdisciplinary applications in the natural sciences and other fields. Recent developments and next steps, such as portable design and heterogeneous computing, are discussed.


Wednesday September 25, 2013 11:30am - 11:55am
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

11:30am

A Scalable and Portable Approach to Accelerate Hybrid HPL on Heterogeneous CPU-GPU Clusters
Accelerating High-Performance Linpack (HPL) on heterogeneous clusters with multi-core CPUs and GPUs has attracted considerable attention from the High Performance Computing community. It is becoming common for large-scale clusters to have GPUs on only a subset of nodes in order to limit system costs. The major challenge for HPL in this case is to efficiently take advantage of all the CPU and GPU resources available on a cluster. In this paper, we present a novel two-level workload partitioning approach for HPL that distributes workload based on the compute power of CPU/GPU nodes across the cluster. Our approach also handles multi-GPU configurations. Unlike earlier approaches for heterogeneous clusters with CPU and GPU nodes, our design takes advantage of asynchronous kernel launches and CUDA copies to overlap computation and CPU-GPU data movement. It uses techniques such as process grid reordering to reduce MPI communication and contention while ensuring load balance across nodes. Our experimental results using 32 GPU and 128 CPU nodes of Oakley, a Top500 cluster at the Ohio Supercomputer Center, show that our proposed approach can achieve more than 80% of the combined actual peak performance of the CPU and GPU nodes. This represents a 47% and 63% increase over the HPL performance that can be reported using only CPU or only GPU nodes, respectively.
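The first level of such a partitioning, distributing work in proportion to measured node speed, can be sketched as follows; the node names, speed ratios, and row counts are purely illustrative and are not taken from the paper.

# Sketch of compute-power-proportional workload partitioning (illustrative only;
# not the paper's HPL code). Relative node "speeds" would in practice come from
# measured DGEMM throughput on CPU-only and CPU+GPU nodes.
def partition_rows(total_rows, node_speeds):
    """Split a row count across nodes in proportion to their measured speed."""
    total_speed = sum(node_speeds.values())
    rows = {n: int(total_rows * s / total_speed) for n, s in node_speeds.items()}
    # Hand the rounding remainder to the fastest node.
    fastest = max(node_speeds, key=node_speeds.get)
    rows[fastest] += total_rows - sum(rows.values())
    return rows

# Hypothetical cluster: 4 CPU-only nodes and 2 nodes with one GPU each.
speeds = {"cpu0": 1.0, "cpu1": 1.0, "cpu2": 1.0, "cpu3": 1.0, "gpu0": 3.5, "gpu1": 3.5}
print(partition_rows(total_rows=40960, node_speeds=speeds))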


Wednesday September 25, 2013 11:30am - 11:55am
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

11:30am

Distributed Resource Exchange: Virtualized Resource Management for SR-IOV InfiniBand Clusters
The commoditization of high performance interconnects, like 40+ Gbps InfiniBand, and the emergence of low-overhead I/O virtualization solutions based on SR-IOV are enabling the proliferation of such fabrics in virtualized datacenters and cloud computing platforms. As a result, such platforms are better equipped to execute workloads with diverse I/O requirements, ranging from throughput-intensive applications, such as 'big data' analytics, to latency-sensitive applications, such as online applications with strict response-time guarantees. Improvements are also seen for the virtualization infrastructures used in datacenter settings, where high virtualized I/O performance supported by high-end fabrics enables more applications to be configured and deployed in multiple VMs, or VM ensembles (VMEs), distributed and communicating across multiple datacenter nodes. A challenge for I/O-intensive VM ensembles is the efficient management of the virtualized I/O and compute resources they share with other consolidated applications, particularly in light of VME-level SLA requirements like those pertaining to low or predictable end-to-end latencies for applications comprised of sets of interacting services.

This paper addresses this challenge by presenting a management solution able to consider such SLA requirements, by supporting diverse SLA-aware policies, such as those maintaining bounded SLA guarantees for all VMEs, or those that minimize the impact of misbehaving VMEs. The management solution, termed Distributed Resource Exchange (DRX), borrows techniques from microeconomics, and uses online resource pricing methods to provide mechanisms for distributed and coordinated resource management. DRX and its mechanisms allow policies to be deployed on such a cluster in order to provide SLA guarantees to some applications by charging all the interfering VMEs 'equally' or based on the 'hurt', i.e., the amount of I/O performed by the VMEs. While these mechanisms are general, our implementation specifically targets SR-IOV-based fabrics like InfiniBand and the KVM hypervisor. Our experimental evaluation consists of workloads representative of data-analytics, transactional, and parallel benchmarks. The results demonstrate the feasibility of DRX and its utility in maintaining SLAs for transactional applications. We also show that the impact on the interfering workloads remains within acceptable bounds for certain policies.


Wednesday September 25, 2013 11:30am - 11:55am
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

12:00pm

Lunch

Wednesday September 25, 2013 12:00pm - 1:25pm
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

12:30pm

BoF: HPC Cluster Interconnect Topologies: Is Fat Tree the right Answer
In this presentation, we will begin the discussion with an overview of cluster configuration variations, from Gigabit Ethernet and 40 GigE to FDR InfiniBand and proprietary interconnects like Quadrics, Myrinet, and Aries. We will also discuss the cost increase of components that do not give us more compute speed but are part of the infrastructure costs associated with a complete end-to-end system. At the end of our session we will brainstorm: when buying a new cluster, do we need a full fat tree interconnect topology, or is there something more cost effective?


Wednesday September 25, 2013 12:30pm - 1:25pm
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

1:30pm

Distributed Data Provenance for Large-Scale Data-Intensive Computing
It has become increasingly important to capture and understand the origins and derivation of data (its provenance). A key issue in evaluating the feasibility of data provenance is its performance, overheads, and scalability. In this paper, we explore the feasibility of a general metadata storage and management layer for parallel file systems, in which metadata includes both file operations and provenance metadata. We experimentally investigate the design optimality---whether provenance metadata should be loosely coupled or tightly integrated with a file metadata storage system. We consider two systems that have applied similar distributed concepts to metadata management, but each focusing on a single kind of metadata: (i) FusionFS, which implements distributed file metadata management based on distributed hash tables, and (ii) SPADE, which uses a graph database to store audited provenance data and provides a distributed module for querying provenance. Our results on a 32-node cluster show that FusionFS+SPADE is a promising prototype with negligible provenance overhead and promise to scale to petascale and beyond. Furthermore, FusionFS with its own storage layer for provenance capture is able to scale up to 1K nodes on a Blue Gene/P supercomputer.


Wednesday September 25, 2013 1:30pm - 1:55pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

1:30pm

GPU-Accelerated Scalable Solver for Banded Linear Systems
Solving banded linear systems efficiently is important to many scientific and engineering applications. Current solvers achieve good scalability only on linear systems that can be partitioned into independent subsystems. In this paper, we present a GPU-based, scalable Bi-Conjugate Gradient Stabilized solver that can be used to solve a wide range of banded linear systems. We utilize a row-oriented matrix decomposition method to divide the banded linear system into several correlated sub-systems and solve them on multiple GPUs collaboratively. We design a number of GPU and MPI optimizations to speed up inter-GPU and inter-machine communications. We evaluate the solver on the Poisson equation and the advection-diffusion equation as well as several other banded linear systems. The solver achieves a speedup of more than 21 times when scaling from 6 to 192 GPUs on XSEDE's Keeneland supercomputer and, because of its small communication overhead, can scale up to 32 GPUs on Amazon EC2 with its relatively slow Ethernet network.


Wednesday September 25, 2013 1:30pm - 1:55pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

1:30pm

Developing Communication-aware Service Placement Frameworks in the Cloud Economy
In a Cloud system, a number of services are often deployed with each service hosted by a collection of Virtual Machines (VMs). The services may interact with each other, and the interaction patterns may be dynamic, varying according to the system information at runtime. These factors make it challenging to determine the amount of resources required to deliver a desired level of QoS for each service. In this paper, we present a method to determine a sufficient number of VMs for the interacting Cloud services. The proposed method borrows ideas from the Leontief Open Production Model in economics. Further, this paper develops a communication-aware strategy to place the VMs on Physical Machines (PMs), aiming to minimize the communication costs incurred by the service interactions. The developed communication-aware placement strategy is formalized in such a way that it does not need to know the specific communication pattern between individual VMs. A genetic algorithm is developed to find a VM-to-PM placement with low communication costs. Simulation experiments have been conducted to evaluate the performance of the developed communication-aware placement framework. The results show that, compared with a placement framework aiming to use the minimal number of PMs to host the VMs, the proposed communication-aware framework is able to reduce the communication cost significantly with only a very small increase in PM usage.
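The Leontief open production model computes the gross output x needed to satisfy a final demand d given inter-sector coefficients A, via x = (I - A)^(-1) d. A minimal sketch of how that can map onto sizing interacting services is shown below; the interaction coefficients, request rates, and per-VM capacities are hypothetical, and the mapping is only an illustration of the idea, not the paper's exact formulation.

# Leontief open production model, x = (I - A)^(-1) d, applied to sizing
# interacting services (illustrative mapping, not the paper's formulation).
import numpy as np

# A[i][j]: requests that service i issues to service j per request it serves
# (hypothetical interaction coefficients). The model needs the spectral radius
# of A to be below 1 for a meaningful non-negative solution.
A = np.array([[0.0, 0.3, 0.1],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
# External (user-facing) demand, in requests per second, on each service.
d = np.array([800.0, 200.0, 50.0])

# Total load each service must sustain, counting internally generated requests:
# x[j] = d[j] + sum_i A[i][j] * x[i]  =>  x = (I - A^T)^(-1) d
x = np.linalg.solve(np.eye(3) - A.T, d)

# Hypothetical per-VM capacities (requests per second); round up to whole VMs.
capacity_per_vm = np.array([120.0, 150.0, 400.0])
vms = np.ceil(x / capacity_per_vm).astype(int)
print(dict(zip(["frontend", "app", "db"], vms)))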


Wednesday September 25, 2013 1:30pm - 1:55pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

2:00pm

A Parallel Optimization Method for Stencil Computation on the Domain that is Bigger than Memory Capacity of GPUs
The problem size of stencil computations on GPU clusters is limited by the memory capacity of the GPUs, which is typically smaller than that of host memories. This paper proposes and evaluates a parallel optimization method for stencil computation that achieves scalability, problem sizes larger than the memory capacity of the GPUs, and high performance. It uses 2D decomposition to achieve scalability across GPUs, and it enables a larger sub-domain on each GPU to support larger problem sizes. It applies a temporal blocking method to improve the memory access locality of the stencil computation and reuses previous results to reduce redundant computation and obtain higher performance. Evaluation of stencil simulations on a 3D domain shows that our new method for 7-point and 19-point stencils on GPUs achieves good scalability, performing 1.45 times and 1.72 times better than other methods on average.
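The core trade-off of temporal blocking, exchanging halo data once every T steps at the price of some redundant edge computation, can be seen in a toy 1-D, 3-point stencil (a simplification of the paper's 7-point and 19-point 3-D GPU setting); the domain size, block bounds, and coefficients below are illustrative only.

# Temporal blocking on a toy 1-D 3-point stencil. A sub-domain fetches a halo of
# width T and advances T time steps locally before communicating again, trading
# a little redundant computation near the edges for far fewer data exchanges.
import numpy as np

def step(u):
    """One Jacobi-style 3-point update on the interior of u."""
    v = u.copy()
    v[1:-1] = 0.25 * u[:-2] + 0.5 * u[1:-1] + 0.25 * u[2:]
    return v

def blocked_update(global_u, lo, hi, T):
    """Advance cells [lo, hi) of global_u by T steps using one halo fetch of width T."""
    ext = global_u[max(lo - T, 0):min(hi + T, len(global_u))].copy()
    for _ in range(T):
        ext = step(ext)          # edge cells of ext go stale, but the centre
                                 # region [lo, hi) stays exact for T steps
    off = lo - max(lo - T, 0)
    return ext[off:off + (hi - lo)]

# Verify against a plain step-by-step sweep over the whole domain.
u = np.random.default_rng(1).random(64)
T = 4
reference = u.copy()
for _ in range(T):
    reference = step(reference)
mid = blocked_update(u, 16, 48, T)
assert np.allclose(mid, reference[16:48])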


Wednesday September 25, 2013 2:00pm - 2:25pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

2:00pm

Distance-Aware Virtual Cluster Performance Optimization: A Hadoop Case Study
Cloud computing and big data are two important developing trends in information technology. However, data-intensive computing faces challenges when running on virtual machines in the cloud, owing to competition for virtualized resources and complex network communication. The network becomes one of the most notorious bottlenecks, which motivates strategies to lower communication and transmission costs in virtual clusters.
In this paper, we present a novel cluster performance optimization strategy named vClusterOpt. vClusterOpt finds centralized subgraphs of the node graph and chooses the node with the shortest logical distance to the others as the kernel node of each subgraph, in order to reduce inter-machine communication and transmission cost in a virtual cluster. To calculate logical distance accurately, we define two kinds of logical distance: Logical Communication Distance (LCD) and Logical Transmission Distance (LTD). The VM with the shortest LCD to the others is used as the communication kernel node, which bears the most communication stress, while the VM with the shortest LTD is treated as the transmission kernel node, which bears the most data transmission stress. We choose benchmarks running on Hadoop as representatives of data-intensive computing services to demonstrate the effectiveness of our approach. Experiments show that an average 20% performance improvement can be obtained with our distance-aware virtual cluster optimization strategy.


Wednesday September 25, 2013 2:00pm - 2:25pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

2:00pm

Fast Data Analysis with Integrated Statistical Metadata in Scientific Datasets
Scientific data formats and libraries, such as HDF5, ADIOS, and NetCDF, have been used widely in many data-intensive applications. These libraries have their own special file formats and I/O functions to provide efficient access to large datasets. Recent studies have started to utilize indexing, subsetting, and data reorganization to manage increasingly large datasets. In this work, we present an approach to boost data analysis performance, namely Fast Analysis with Statistical Metadata (FASM), via data subsetting and the integration of a small amount of statistics into the original datasets. The added statistical information describes the data shape and provides knowledge of the data distribution; the original I/O libraries can therefore utilize these statistical metadata to perform fast queries and analyses. Various subsetting schemes can affect the access pattern and the I/O performance. We present a comparison study of different subsetting schemes by focusing on three dominant factors: the shape, the concurrency, and the locality. The added statistical metadata slightly increase the original data size, and we evaluate this cost and trade-off as well. This work is the first study that utilizes statistical metadata with various subsetting schemes to perform fast queries and analyses on large datasets. The proposed FASM approach is currently evaluated with PnetCDF on Lustre file systems, but can also be implemented with other scientific libraries. FASM can potentially lead to new dataset designs and can have an impact on big data analysis.
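A toy illustration of the underlying idea, keeping per-block minimum/maximum values and using them to skip blocks during a value-range query, is sketched below; it is not the PnetCDF-based FASM implementation, and the block size and data are made up.

# Toy value-range queries with per-block statistical metadata (min/max per block).
import numpy as np

BLOCK = 1024

def build_metadata(data):
    """Record min/max for each fixed-size block of a 1-D dataset."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    return np.array([(b.min(), b.max()) for b in blocks])

def query(data, meta, lo, hi):
    """Return values in [lo, hi], reading only blocks whose range can overlap."""
    hits, blocks_read = [], 0
    for k, (bmin, bmax) in enumerate(meta):
        if bmax < lo or bmin > hi:
            continue                     # statistics prove this block is irrelevant
        blocks_read += 1                 # in a real system this is the expensive I/O
        b = data[k * BLOCK:(k + 1) * BLOCK]
        hits.append(b[(b >= lo) & (b <= hi)])
    return (np.concatenate(hits) if hits else np.array([])), blocks_read

data = np.sort(np.random.default_rng(2).normal(size=1_000_000))  # value-clustered data
meta = build_metadata(data)
values, touched = query(data, meta, 1.5, 2.0)
print(f"read {touched} of {len(meta)} blocks, matched {values.size} values")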


Wednesday September 25, 2013 2:00pm - 2:25pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

2:30pm

Force-directed Geographical Load Balancing and Scheduling for Batch Jobs in Distributed Datacenters
This work focuses on the load balancing and scheduling problem for batch jobs in a cloud system comprised of geographically dispersed, heterogeneous datacenters. Each batch job is modeled using a directed acyclic graph of heterogeneous tasks. Load balancing and scheduling of batch jobs with loose deadlines reduce operational cost in the cloud system thanks to the availability of renewable energy sources at datacenter sites and time-of-use-dependent energy pricing from utility companies. A solution to the load balancing and scheduling problem based on the force-directed scheduling approach is presented that considers the online application workload and the limited resource and peak power capacity in each datacenter. The simulation results demonstrate a significant operational cost decrease (up to 40%) using the proposed algorithm with respect to a greedy solution.


Wednesday September 25, 2013 2:30pm - 2:55pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

2:30pm

Active-Learning-based Surrogate Models for Empirical Performance Tuning
Performance models have a profound impact on hardware-software codesign, architectural explorations, and performance tuning of scientific applications. Developing algebraic performance models is becoming an increasingly challenging task. In such situations, a statistical surrogate-based performance model, fitted to a small number of input-output points obtained from empirical evaluation on the target machine, provides a range of benefits. Accurate surrogates can emulate the output of the expensive empirical evaluation at new inputs and therefore can be used to test and/or aid search, compiler, and autotuning algorithms. We present an iterative parallel algorithm that builds surrogate performance models for scientific kernels and workloads on single-core, multicore, and multinode architectures. We tailor to our parallel environment an active learning heuristic popular in the literature on the sequential design of computer experiments, in order to identify the code variants whose evaluations have the best potential to improve the surrogate. We use the proposed approach in a number of case studies to illustrate its effectiveness.
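A deliberately simplified active-learning loop is sketched below to show the overall shape of such an approach; it replaces the paper's statistical surrogate and design-of-experiments heuristic with a nearest-neighbour predictor and a distance-based acquisition rule, and the kernel being "tuned" is synthetic.

# Simplified active-learning loop for building a performance surrogate
# (illustrative stand-in only, not the paper's method).
import numpy as np

rng = np.random.default_rng(3)

def run_kernel(x):
    """Stand-in for an expensive empirical evaluation of a code variant x."""
    return float(np.sin(3 * x[0]) + (x[1] - 0.5) ** 2 + 0.01 * rng.standard_normal())

candidates = rng.random((500, 2))          # unevaluated code variants / configurations
X = candidates[:5].copy()                  # small initial design
y = np.array([run_kernel(x) for x in X])

def predict(q):
    """Nearest-neighbour surrogate: predicted time = time of the closest evaluated point."""
    d = np.linalg.norm(X - q, axis=1)
    return y[np.argmin(d)], d.min()        # (prediction, crude uncertainty proxy)

for _ in range(20):                        # active-learning iterations
    # Acquisition: evaluate the candidate farthest from all evaluated points,
    # i.e. the one whose evaluation should improve the surrogate the most.
    scores = [predict(q)[1] for q in candidates]
    nxt = candidates[int(np.argmax(scores))]
    X = np.vstack([X, nxt])
    y = np.append(y, run_kernel(nxt))

best = X[np.argmin(y)]
print("fastest variant found so far:", best, "with measured time:", y.min())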


Wednesday September 25, 2013 2:30pm - 2:55pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

2:30pm

Expediting Scientific Data Analysis with Transparent Reorganization of Data
Data producers typically optimize the layout of data files to minimize the write time. In most cases, data analysis tasks read these files with access patterns different from the write patterns, causing poor read performance. In this paper, we introduce Scientific Data Services (SDS), a framework for bridging the performance gap between writing and reading scientific data. SDS reorganizes data to match the read patterns of analysis tasks and enables transparent data reads from the reorganized data. We implemented an HDF5 Virtual Object Layer (VOL) plugin to redirect HDF5 dataset read calls to the reorganized data. To demonstrate the effectiveness of SDS, we applied two parallel data organization techniques: a sort-based organization on plasma physics data and a transpose-based organization on mass spectrometry imaging data. We also extended the HDF5 data access API to allow selection of data based on their values through a query interface, called SDS Query. We evaluated the execution time of accessing various subsets of data through the existing HDF5 read API and SDS Query. We show that reading the reorganized data using SDS is up to 55X faster than reading the original data.


Wednesday September 25, 2013 2:30pm - 2:55pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

3:00pm

Afternoon Break
Wednesday September 25, 2013 3:00pm - 3:25pm
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

3:30pm

Application Power Profiling on IBM Blue Gene/Q
The power consumption of state-of-the-art supercomputers, because of their complexity and unpredictable workloads, is extremely difficult to estimate. Accurate and precise measurements, as are now possible with the latest generation of supercomputers, are therefore a welcome addition to the landscape. Only recently have end users been afforded the ability to access the power consumption of their applications. However, just because it is possible for end users to obtain this data does not mean doing so is a trivial task. This emerging class of data is therefore not only understudied, but also not fully understood.

In this paper, we provide a detailed power consumption analysis of (micro)benchmarks running on Argonne's latest generation of IBM Blue Gene supercomputers, Mira, a Blue Gene/Q system. The analysis is done utilizing our power monitoring library, MonEQ, built on the IBM-provided Environmental Monitoring (EMON) API. We describe the importance of sub-second polling of various power domains and the implications it presents. To this end, previously well-understood applications will now have new facets of potential analysis.


Wednesday September 25, 2013 3:30pm - 3:55pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

3:30pm

JUMMP: Job Uninterrupted Maneuverable MapReduce Platform
In this paper, we present JUMMP, the Job Uninterrupted Maneuverable MapReduce Platform, an automated scheduling platform that provides a customized Hadoop environment within a batch-scheduled cluster. JUMMP enables an interactive pseudo-persistent MapReduce platform within the existing administrative structure of an academic high performance computing center by "jumping" between nodes with minimal administrative effort. Jumping is implemented by synchronizing the stopping and starting of daemon processes on different nodes in the cluster. Our experimental evaluation shows that JUMMP can be as efficient as a persistent Hadoop cluster on dedicated computing resources, depending on the jump time. Additionally, we show that the cluster remains stable, with good performance, in the presence of jumps that occur as frequently as the average length of the reduce tasks of the currently executing MapReduce job. JUMMP provides an attractive solution for academic institutions that desire to integrate Hadoop into their current computing environment within their financial, technical, and administrative constraints.


Wednesday September 25, 2013 3:30pm - 3:55pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

3:30pm

HPC Runtime Support for Fast and Power Efficient Locking and Synchronization
As compute nodes increase in parallelism, existing intra-node locking and synchronization primitives need to be scalable, fast, and power efficient. Most parallel runtime systems try to find a balance between these properties during synchronization by fine-tuned spin-waiting and yielding the processor to the OS. Unfortunately, the code path followed by the OS to put the processor into a lower power state for idling almost always includes the interrupt processing path. This introduces an unnecessary overhead for both the waiting tasks and the task waking them up. In this work we investigate a pair of x86-specific instructions, MONITOR and MWAIT, that can be used to build these primitives with the desired performance and power-efficiency properties. This pair of instructions allows a processor to quickly pause execution until another one wakes it up with a single memory store, avoiding the overhead of switching to the idle thread of the OS for the waiting task and of sending IPIs for the waking task. We implement a locking primitive using these instructions and evaluate its effectiveness in OpenMP at low to high scales. In these tests we have seen performance improvements of up to 38x, a power reduction of 10% at 64 cores, and very good scaling. With these results as motivation, we propose that other high-core-count processors include this type of instruction and make it available from userspace.


Wednesday September 25, 2013 3:30pm - 3:55pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

4:00pm

Dynamic Slot Allocation Technique for MapReduce Clusters (unable to attend, Video/Skype Presentation)
MapReduce is a popular parallel computing paradigm for large-scale data processing in clusters and data centers. However, slot utilization can be low, especially when the Hadoop Fair Scheduler is used, due to the pre-allocation of slots between map and reduce tasks and the strict order of map tasks followed by reduce tasks in a typical MapReduce environment. To address this problem, we propose to allow slots to be dynamically (re)allocated to either map or reduce tasks depending on their actual requirements. Specifically, we propose two types of Dynamic Hadoop Fair Scheduler (DHFS), for two different levels of fairness (i.e., cluster and pool level). The experimental results show that the proposed DHFS can improve system performance significantly (by 32%~55% for a single job and 44%~68% for multiple jobs) while guaranteeing fairness.


Wednesday September 25, 2013 4:00pm - 4:25pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

4:00pm

A Cost-Aware Region-Level Data Placement Scheme for Hybrid Parallel I/O Systems
Parallel I/O systems represent the most commonly used engineering solution to mitigate the performance mismatch between CPUs and disks; however, parallel I/O systems are application dependent and may not work well for certain data access requests. Newly emerged solid state drives (SSDs) are able to deliver better performance but incur a high monetary cost. While SSDs cannot always replace HDDs, the hybrid SSD-HDD approach uniquely addresses common performance issues in parallel I/O systems. The performance of a hybrid SSD-HDD architecture depends on the utilization of the SSD and the scheduling of data placement. In this paper, we propose a cost-aware region-level (CARL) data placement scheme for hybrid parallel I/O systems. CARL divides large files into several small regions and selectively places regions with high access cost onto SSD-based file servers, where the region costs are calculated according to data access patterns. We have implemented CARL under MPI-IO and the PVFS2 parallel file system environment. Experimental results with representative benchmarks show that CARL is both feasible and able to improve I/O performance significantly.
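A toy version of region-level, cost-aware placement might look like the following, where a region's cost is simply its projected HDD-versus-SSD time saving given an observed access count; the constants and access pattern are made up, and the real scheme derives costs from traced data access patterns.

# Toy region-level, cost-aware placement in the spirit of CARL (illustrative only).
REGION_MB = 64

def place_regions(access_counts, ssd_capacity_mb,
                  hdd_cost_ms_per_access=8.0, ssd_cost_ms_per_access=0.2):
    """Put the regions with the largest projected HDD-vs-SSD saving on the SSD."""
    savings = {
        region: n_acc * (hdd_cost_ms_per_access - ssd_cost_ms_per_access)
        for region, n_acc in access_counts.items()
    }
    on_ssd, used = set(), 0
    for region in sorted(savings, key=savings.get, reverse=True):
        if used + REGION_MB > ssd_capacity_mb:
            break
        on_ssd.add(region)
        used += REGION_MB
    return on_ssd

# Hypothetical access pattern: region index -> number of accesses observed.
pattern = {0: 900, 1: 15, 2: 610, 3: 4, 4: 45, 5: 720}
print(place_regions(pattern, ssd_capacity_mb=192))   # -> the three hottest regions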


Wednesday September 25, 2013 4:00pm - 4:25pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

4:00pm

A New Design of RDMA-based Small Message Channels for InfiniBand Clusters
We propose a novel design for RDMA-based small message channels that significantly improves upon the MVAPICH design. First, we develop a technique that eliminates persistent buffer association, a scheme used in MVAPICH that not only results in significant memory requirements, but also imposes restrictions on memory management. Building upon this technique, we propose a novel shared RDMA-based small message channel design that allows MPI processes on the same SMP node to share small message channels, which greatly reduces the number of small message channels needed for an MPI program on clusters with SMP nodes. Our techniques considerably improve scalability and reduce memory requirements in comparison to MVAPICH, allowing RDMA-based small message channels to be used by a much larger number of MPI processes. The experimental results demonstrate that our techniques achieve these improvements without adding noticeable overheads or sacrificing the performance benefits of RDMA in practice.


Wednesday September 25, 2013 4:00pm - 4:25pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

4:30pm

Write Bandwidth Optimization of Online Erasure Code Based Cluster File System
As data volumes grow from big to huge in many science labs and data centers, more and more data owners are choosing erasure-code-based storage to reduce storage cost. However, online erasure-code-based cluster file systems still have not been applied widely because of write bottlenecks in data encoding and data placement. We propose two optimizations to address these bottlenecks. We propose a Partition Encoding policy to accelerate the encoding arithmetic through SIMD extensions and to overlap data encoding with data committing. We devise an Adaptive Placement policy to provide incremental expansion and high availability, as well as good scalability. The experimental results in our prototype ECFS show that the aggregate write bandwidth can be improved by 42%, while keeping the storage in a more balanced state.
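A rough sketch of partitioned, parity-style encoding is shown below, with NumPy's vectorized XOR standing in for the SIMD kernels; it is not the ECFS code, and a production erasure code (e.g., Reed-Solomon) involves more than single parity.

# Illustrative single-parity encoding over a partitioned stripe.
import numpy as np

K = 4                      # data blocks per stripe
BLOCK_BYTES = 1 << 20      # 1 MiB blocks

def encode_partition(partition):
    """Split a partition into K data blocks and compute one XOR parity block."""
    data = np.frombuffer(partition, dtype=np.uint8).reshape(K, BLOCK_BYTES)
    parity = np.bitwise_xor.reduce(data, axis=0)
    return data, parity

def commit(data, parity):
    """Stand-in for writing the K+1 blocks to file servers."""
    pass

payload = np.random.default_rng(4).integers(0, 256, K * BLOCK_BYTES, dtype=np.uint8)
stripe = payload.tobytes()

# "Partition Encoding": encode the next partition while the previous one commits
# (shown sequentially here; a real implementation overlaps these with threads or
# asynchronous I/O).
data, parity = encode_partition(stripe)
commit(data, parity)

# Sanity check: any single lost data block is recoverable from parity + the rest.
lost = 2
recovered = np.bitwise_xor.reduce(
    np.vstack([np.delete(data, lost, axis=0), parity[None]]), axis=0)
assert np.array_equal(recovered, data[lost])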


Wednesday September 25, 2013 4:30pm - 4:55pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

4:30pm

Insight and Reduction of MapReduce Stragglers in Heterogeneous Environment
Speculative and clone execution are existing techniques to overcome the problems of task stragglers and performance degradation in heterogeneous clusters for big data processing. In this paper, we propose an alternative approach to solving these problems based on the results of profiling analysis and the relations among system parameters. Our approach dynamically adjusts the number of task slots on each node to match the processing power of the node, according to the current task progress rate and resource utilization. It contrasts with the existing techniques by attempting to prevent task stragglers from occurring in the first place through maintaining a balance between resource supply and demand. We have implemented this method in the Hadoop MapReduce platform, and the TPC-H benchmark results show that it achieves 20-30% performance improvement and 35-88% straggler reduction compared with existing techniques.


Wednesday September 25, 2013 4:30pm - 4:55pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

4:30pm

Optimizing Blocking and Nonblocking Reduction Operations for Multicore Systems: Hierarchical Design and Implementation
Many scientific simulations using the Message Passing Interface (MPI) programming model are sensitive to the performance and scalability of reduction collective operations such as MPI_Allreduce and MPI_Reduce. These operations are the most widely used abstractions to perform mathematical operations over all processes that are part of the simulation. In this work, we propose a hierarchical design to implement the reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for the various communication mechanisms in the system, 2) providing the ability to configure the depth of the hierarchy to match the system architecture, and 3) providing the ability to independently progress each level of this hierarchy. Using this design, we implement the MPI_Allreduce and MPI_Reduce operations (and their nonblocking variants MPI_Iallreduce and MPI_Ireduce) for all message sizes, and evaluate them on multiple architectures including InfiniBand and Cray XT5. We leverage and enhance our existing infrastructure, Cheetah, a framework for implementing hierarchical collective operations, to implement these reductions.

The experimental results show that the Cheetah reduction operations outperform production-grade MPI implementations such as Open MPI default, Cray MPI, and MVAPICH2, demonstrating their efficiency, flexibility, and portability. On InfiniBand systems, with a microbenchmark, 512-process Cheetah nonblocking Allreduce and Reduce achieve speedups of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms Cray MPI by 145%. The evaluation with an application kernel, a Conjugate Gradient solver, shows that the Cheetah reductions speed up the total time to solution by 195%, demonstrating the potential benefits for scientific simulations.
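A minimal two-level reduction in the spirit of such hierarchical designs can be written with mpi4py, assuming mpi4py (with MPI-3 communicator splitting) is available and that one sub-communicator per shared-memory node is a reasonable first level; this is only a sketch of the idea, not the Cheetah implementation.

# Two-level (intra-node, then inter-node) allreduce sketch with mpi4py.
# Run with e.g.:  mpiexec -n 16 python hier_allreduce.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# Level 1: group ranks that share a node (shared-memory split); rank 0 is the leader.
node = comm.Split_type(MPI.COMM_TYPE_SHARED)
# Level 2: a communicator containing only the node leaders.
leaders = comm.Split(0 if node.rank == 0 else MPI.UNDEFINED, comm.rank)

def hier_allreduce(sendbuf):
    recvbuf = np.empty_like(sendbuf)
    # Reduce onto the node leader using fast intra-node communication.
    node.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0)
    # Leaders combine the per-node partial results across the network.
    if node.rank == 0:
        leaders.Allreduce(MPI.IN_PLACE, recvbuf, op=MPI.SUM)
    # Every rank on the node gets the final result from its leader.
    node.Bcast(recvbuf, root=0)
    return recvbuf

x = np.full(4, comm.rank, dtype=np.float64)
result = hier_allreduce(x)
assert np.allclose(result, sum(range(comm.size)))   # same answer as comm.Allreduce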


Wednesday September 25, 2013 4:30pm - 4:55pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

5:00pm

Making Work Queue Cluster-Friendly for Data Intensive Scientific Applications
Researchers with large-scale data-intensive applications often wish to scale up applications to run on multiple clusters, employing a middleware layer for resource management across clusters. However, at the very largest scales, such middleware is often "unfriendly" to individual clusters, which are usually designed to support communication within the cluster, not outside of it. To address this problem we have modified the Work Queue master-worker application framework to support a hierarchical configuration that more closely matches the physical architecture of existing clusters. Using a synthetic application we explore the properties of the system and evaluate its performance under multiple configurations, with varying worker reliability, network capabilities, and data requirements. We show that by matching the software and hardware architectures more closely we can gain both a modest improvement in runtime and a dramatic reduction in network footprint at the master. We then run a scalable molecular dynamics application (AWE) to examine the impact of hierarchy on performance, cost and efficiency for real scientific applications and see a 96% reduction in network footprint, making it much more palatable to system operators and opening the possibility of increasing the application scale by another order of magnitude or more.


Wednesday September 25, 2013 5:00pm - 5:25pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

5:00pm

Streamer: A Distributed Framework for Incremental Closeness Centrality Computation
Networks are commonly used to model traffic patterns, social interactions, or the link structure of web pages. The nodes in a network do not all possess the same characteristics: some nodes are naturally more connected and some nodes can be more important. Closeness centrality (CC) is a global metric that quantifies how important a given node is in the network. When the network is dynamic and keeps changing, the relative importance of the nodes also changes. The best known algorithm for computing CC scores is not practical for recomputing them from scratch after each modification. In this paper, we propose Streamer, a distributed-memory framework for incrementally maintaining the closeness centrality scores of a network upon changes. It leverages pipeline parallelism, takes NUMA effects into account, and proves to be scalable. The design we propose makes CC maintenance within dynamic networks feasible in real time.


Wednesday September 25, 2013 5:00pm - 5:25pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

5:00pm

Scaling of Many-Task Computing Approaches in Python on Cluster Supercomputers
We compare three packages for performing many-task computing in Python: mpi4py, IPython Parallel, and Celery. Each of these implementations is based on a different message passing standard, which means that, indirectly, we are comparing the Message Passing Interface (MPI)---the current standard for distributed computing---with two emerging technologies, the Advanced Message Queuing Protocol (AMQP) and ZeroMQ. We describe these packages in detail and compare their features as applied to task-based parallel computing on a cluster, including a scaling study using over 12,000 cores and several thousand tasks. Our results suggest that these new, fault-tolerant technologies have a clear place in cluster computing, and that no single technique is the obvious choice. To the best of our knowledge, this is the first time that IPython Parallel and Celery have been run at this scale.
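For readers new to the many-task style being compared, a minimal dynamic master/worker task farm in mpi4py looks roughly like the following; it is an illustration only, not the authors' benchmark harness, and the task function and task count are placeholders.

# Minimal dynamic master/worker task farm with mpi4py.
# Run with e.g.:  mpiexec -n 8 python taskfarm.py
from mpi4py import MPI

TAG_WORK, TAG_STOP = 1, 2

def task(x):
    return x * x          # stand-in for a real independent task

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

if rank == 0:
    tasks = list(range(1000))
    results, active = [], size - 1
    status = MPI.Status()
    # Prime every worker with one task, then keep feeding whoever finishes first.
    for w in range(1, size):
        if tasks:
            comm.send(tasks.pop(), dest=w, tag=TAG_WORK)
        else:
            comm.send(None, dest=w, tag=TAG_STOP)
            active -= 1
    while active:
        res = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        results.append(res)
        w = status.Get_source()
        if tasks:
            comm.send(tasks.pop(), dest=w, tag=TAG_WORK)
        else:
            comm.send(None, dest=w, tag=TAG_STOP)
            active -= 1
    print(f"collected {len(results)} results")
else:
    status = MPI.Status()
    while True:
        x = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        comm.send(task(x), dest=0, tag=TAG_WORK)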


Wednesday September 25, 2013 5:00pm - 5:25pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

5:30pm

Excursions and Dinner on Own
Wednesday September 25, 2013 5:30pm - 6:00pm
X Locations Indianapolis

5:30pm

Steering Committee Meeting
Wednesday September 25, 2013 5:30pm - 7:00pm
16th Floor - Circle City 16 (Hilton) 120 W. Market St, Indianapolis, IN

7:00pm

Steering Committee Dinner
Wednesday September 25, 2013 7:00pm - 8:30pm
Restaurant at the Hilton - 120 West Market 120 W. Market St, Indianapolis, IN
 
Thursday, September 26
 

8:00am

Breakfast

Thursday September 26, 2013 8:00am - 9:00am
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

8:00am

IEEE Cluster 2013 Office
Thursday September 26, 2013 8:00am - 7:00pm
09th Floor - Room 911 120 W. Market St, Indianapolis, IN


9:00am

Panel: Clouds and Clusters – the Walmart effect?
The rapid development of and massive investments in cloud methodologies, and their market penetration as opposed to clusters, make it timely to analyze the merits of these related but different modalities of delivering cyber-services. The assembled panel of experts will provide their thoughts on the current state and future of these paradigms in different application domains. Clouds promise to deliver cost savings and flexibility, while clusters continue to deliver high QoS; the question at hand is to identify the value propositions of both in a thoughtful manner so we do not replicate "urban blight" in the ITverse.


Thursday September 26, 2013 9:00am - 10:30am
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

10:30am

Morning Break

Thursday September 26, 2013 10:30am - 11:00am
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

11:00am

Concurrent Panel: Research Data Alliance
This panel will present ongoing development and organizational work being done by the Research Data Alliance (RDA) (http://rd-alliance.org). The Research Data Alliance is a new international organization, supported by the European Commission, the Australian Commonwealth Government, and the United States Government, that aims to implement the technology, practice, and connections that make data work across barriers of government policy and/or discipline-specific standards. The Research Data Alliance aims to accelerate and facilitate research data sharing and exchange. In this panel we will feature the current thrusts of the new RDA Working Groups that focus on Big Data Analytics, Data Type Registries, Persistent Identifiers for Data, Data Terminology, and Data Policy. Additionally, the panelists and moderator will speak about the current status of the Research Data Alliance in its start-up mode and will update attendees on work that will be on the agenda for the upcoming RDA Second Plenary meeting in Washington, DC, held the week prior to the Cluster 2013 Conference.


Thursday September 26, 2013 11:00am - 12:00pm
08th Floor - Circle City 08 (Hilton) 120 W. Market St, Indianapolis, IN

11:00am

Concurrent Panel: The many pathways of Campus Bridging - Campus Bridging stories from four different instances.
Campus Bridging narrows the gap between researchers and national-scale cyberinfrastructure by implementing technologies that make these resources appear proximal to the researcher and easy to use. These efforts include making data transfer and management simpler and more transparent, creating simpler job submission protocols, and making it easier to join resources together. We hope to discuss the technologies that make it easier for researchers to transition analyses from the lab or office to regional and national infrastructure, success stories from those who have made progress in Campus Bridging, and areas where the gap between local and national resources can be overcome. Our panel brings together users and creators of XSEDE Campus Bridging technologies: Ian Foster of the Globus Online project, Tom Bishop, who uses Genesis II to manage data and jobs on XSEDE resources, Scott Teige, who has used the Open Science Grid-XSEDE job submission portal, and Marcus Alfred, who has used the XSEDE Campus Bridging cluster software packages.


Thursday September 26, 2013 11:00am - 12:00pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

12:00pm

Lunch

Thursday September 26, 2013 12:00pm - 12:45pm
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN


1:00pm

Closing Keynote - Thomas Sterling, Indiana University Pervasive Technology Institute
Best known as the "father of Beowulf," Sterling developed groundbreaking research that dramatically reduced the cost and increased the accessibility of supercomputers. Sterling has performed applied research in parallel computing system structures, semantics and operation -- in industry, government labs and higher education. In 1997, he and his collaborators received the Gordon Bell Prize. Currently, Sterling's research focuses on the ParalleX execution model for extreme scale computing, with the goal of devising a new model of computation to guide the development of next-generation exascale computing systems. ParalleX is the conceptual centerpiece of the XPRESS project, sponsored by the US Department of Energy Office of Science X-stack program. Sterling holds six patents, and is co-author of six books.


Thursday September 26, 2013 1:00pm - 2:00pm
09th Floor - Victory Ballroom (Hilton) 120 W. Market St, Indianapolis, IN

2:00pm

IEEE Cluster 2014 Transition Meeting - Closed Meeting
Limited Capacity seats available

Thursday September 26, 2013 2:00pm - 3:30pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

2:00pm

Excursions and Dinner on Own
Thursday September 26, 2013 2:00pm - 7:00pm
X Locations Indianapolis
 
Friday, September 27
 

7:00am

IEEE Cluster 2013 Office
Friday September 27, 2013 7:00am - 6:00pm
09th Floor - Room 911 120 W. Market St, Indianapolis, IN


8:00am

Breakfast
Friday September 27, 2013 8:00am - 9:00am
Corydon Room - 2nd Floor (Hilton) 120 W. Market St, Indianapolis, IN

8:15am

Workshop: Science Gateway Institute Workshop
Limited Capacity seats available

(From http://sciencegateways.org/upcoming-events/ieee-workshop-september-2013/) This workshop (and its proceedings and publications) provides a means of exchanging ideas, building on the successful Gateway Computing Environments series (2005-2011, http://www.collab-ogce.org). Proceedings will be published through IEEE digital publications, and extended versions of papers will be published in a special journal issue of Concurrency and Computation: Practice and Experience. This special issue will include extended papers from the June 2013 International Workshop on Science Gateways (http://www.amiando.com/iwsg2013.html). IWSG is now in its fifth year. The majority of the speakers will be authors of peer-reviewed papers accepted for the workshop. Speakers will present for approximately 20 minutes, leaving ample time for Q&A and group discussion. In addition, the workshop will feature an invited speaker to bring in a thoughtful outside perspective and an industry round table panel to highlight developments in the rapidly changing gateway technologies. Post-workshop, we envision an informal opportunity to demonstrate gateways and discuss techniques at a reception. For further details or with questions, email us at info@sciencegateways.org.


Friday September 27, 2013 8:15am - 4:15pm
15th Floor - Circle City 15 (Hilton) 120 W. Market St, Indianapolis, IN

9:00am

Workshop: 5th Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS)
Limited Capacity seats available

http://www.mcs.anl.gov/events/workshops/iasds13/ High-performance computing simulations and large scientific experiments generate tens of terabytes of data, and these data sizes grow each year. Existing systems for storing, managing, and analyzing data are being pushed to their limits by these applications, and new techniques are necessary to enable efficient data processing for future simulations and experiments. This workshop will provide a forum for engineers and scientists to present and discuss their most recent work related to the storage, management, and analysis of data for scientific workloads. Emphasis will be placed on forward-looking approaches to tackle the challenges of storage at extreme scale or to provide better abstractions for use in scientific workloads.

Friday September 27, 2013 9:00am - 5:00pm
12th Floor - Circle City 12 (Hilton) 120 W. Market St, Indianapolis, IN

9:00am

Workshop: LittleFe and the Bootable Cluster CD - An Education and Outreach Appliance
Limited Capacity seats available

Many institutions have little to no access to parallel computing platforms for in-class computational science or parallel and distributed computing education. Key concepts, motivated by science, are taught more effectively and memorably on an actual parallel computing platform. LittleFe is a complete six-node, Beowulf-style portable cluster built with multicore CPUs and GPGPU chipsets, supporting shared-memory and distributed-memory parallelism, GPGPU parallelism, and hybrid models. By leveraging the Bootable Cluster CD project and curriculum modules, LittleFe is an affordable, powerful, and ready-to-run computational science, parallel programming, and distributed computing educational appliance. The workshop will focus on use-cases for LittleFe in education, outreach, and training, and on developing curriculum modules for the LittleFe/BCCD platform. Because LittleFe is used in both pre-college and college settings, we plan to have sufficient workshop staff to divide the participants by educational level when appropriate, so that each group can focus on techniques and resources specific to its educational level.


Friday September 27, 2013 9:00am - 5:00pm
07th Floor - Circle City 07 (Hilton) 120 W. Market St, Indianapolis, IN

10:30am

Morning Break
Friday September 27, 2013 10:30am - 11:00am
Workshop Rooms 120 W. Market St, Indianapolis, IN

12:30pm

Lunch
Friday September 27, 2013 12:30pm - 1:30pm
Corydon Room - 2nd Floor (Hilton) 120 W. Market St, Indianapolis, IN

3:30pm

Afternoon Break
Friday September 27, 2013 3:30pm - 4:00pm
Workshop Rooms 120 W. Market St, Indianapolis, IN

5:00pm

Excursions and Dinner on Own
Friday September 27, 2013 5:00pm - 5:30pm
X Locations Indianapolis