ADVANCED RUNTIME SYSTEMS AND COMPILER TOOLS
HPX-5 Runtime System
HPX-5 (High Performance ParalleX) is an open-source, portable, performance-oriented runtime system developed at CREST (Indiana University). It is CREST’s current implementation of the ParalleX execution model. HPX-5 provides a distributed programming model that allows programs to run unmodified on systems ranging from a single SMP to large clusters and supercomputers with thousands of nodes.
PARALLEX EXECUTION MODEL
Semantics for the ParalleX Execution Model
The ParalleX Execution Model attempts to address the underlying sources of performance degradation (e.g., latency, overhead, and starvation) and the difficulties of programmer productivity (e.g., explicit locality management and scheduling, performance tuning, fragmented memory, and synchronous global barriers) to dramatically enhance the broad effectiveness of parallel processing for high-end computing. ParalleX has two significant benefits: its underlying structure and design, which create the basis for highly scalable software applications running at the exaops level, and a formal description of the model that is sufficiently precise to permit semantic analysis. This means that we can prove that the ParalleX model assures completion of calculations, avoids race conditions, and so on. The current version of the semantic analysis of ParalleX is available online at ftp://www.cs.indiana.edu/pub/techreports/TR726.pdf
For more information on ParalleX in general, see https://pdfs.semanticscholar.org/c4dd/4dff3b7e03a372eaee00dba074f3ebf4a4dc.pdf. A new paper fully describing the ParalleX Execution Model is in preparation at this time.
Dynamic Adaptive System for Hierarchical Multipole Methods (DASHMM)
Multipole methods contribute to a broad range of end-user science applications, extending from molecular dynamics to galaxy formation. Many of these applications describe very dynamic physical processes, both in their time dependence and in their range of relevant spatial scales. However, conventional implementations of multipole methods are essentially static in nature, leading to computational inefficiencies. The Dynamic Adaptive System for Hierarchical Multipole Methods (DASHMM) will employ dynamic adaptive execution methods to provide a scalable and efficient multipole method library that is easy to use. This project is sponsored by the National Science Foundation.
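The hierarchical idea behind multipole methods can be sketched in a few lines. The following is an illustrative toy, not DASHMM's actual API or algorithms: a 1-D adaptive tree that approximates well-separated groups of sources by a single aggregate term, trading a small, controlled error for a large reduction in work. The names and the accuracy parameter `theta` are assumptions for this sketch.

```python
import random

def direct_potential(targets, sources):
    # O(n*m) pairwise baseline: phi(x) = sum_j m_j / |x - y_j|
    return [sum(m / abs(x - y) for y, m in sources if y != x) for x in targets]

class Node:
    """One cell of an adaptive 1-D source tree."""
    def __init__(self, points, lo, hi):
        self.lo, self.hi = lo, hi
        self.center = 0.5 * (lo + hi)
        self.mass = sum(m for _, m in points)   # aggregate (monopole) term
        self.points = points
        self.children = []
        if len(points) > 4:                     # refine where sources cluster
            mid = self.center
            left = [(y, m) for y, m in points if y < mid]
            right = [(y, m) for y, m in points if y >= mid]
            if left and right:
                self.children = [Node(left, lo, mid), Node(right, mid, hi)]

def evaluate(node, x, theta=0.5):
    # Multipole acceptance criterion: if the cell is small relative to its
    # distance from x, replace its sources by the total mass at the cell
    # center (a center-of-mass expansion would be more accurate).
    size = node.hi - node.lo
    dist = abs(x - node.center)
    if dist > 0 and size / dist < theta:
        return node.mass / dist
    if not node.children:
        return sum(m / abs(x - y) for y, m in node.points if y != x)
    return sum(evaluate(c, x, theta) for c in node.children)

random.seed(0)
sources = [(random.random(), 1.0) for _ in range(500)]
root = Node(sources, 0.0, 1.0)
approx = evaluate(root, 3.0)
exact = direct_potential([3.0], sources)[0]
# approx agrees with exact up to the multipole truncation error
```

Lowering `theta` tightens the acceptance criterion, shifting work back toward the direct sum in exchange for accuracy, which is the basic accuracy/cost dial in any multipole code.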
Method and Apparatus for 3-D Facial Recognition
Existing commercial 2-D facial recognition algorithms operate with high accuracy under conditions where the subject is looking more or less directly at the camera and where the lighting conditions are good. For commercial surveillance camera operators, however, these conditions are rarely satisfied, leading to high false-negative rates during deployment. Furthermore, conventional low-light and nighttime facial imagery acquisition relies on either reflection-dominated infrared bands with active illumination (near infrared or short-wave infrared) or emission-dominated infrared bands without active illumination (mid-wave infrared or long-wave infrared). Commercial infrared image comparison against visible facial image databases suffers even higher false-negative rates in the presence of large facial pose variation than visible image comparison alone.
To fully integrate infrared information as well as to address false negatives in the presence of large facial pose variation, this work implements and explores an Indiana University patented algorithm for 3-D multispectral facial recognition.
Simulations of Binary Star Mergers
This work is supported by: Anderson, M. (Co-PI). (Walter Ligon, Brigham Young University, PI). Collaborative Research: Compact Binary Mergers in the Advanced LIGO Era. NSF Award #1607390; $640,000 total, IU subaward $74,251; 09/01/16- 08/31/19.
Binary neutron star mergers exhibit a rich phenomenology, including the effects of matter at nuclear densities, neutrino cooling, and electromagnetic effects. Our work on such systems shows a tight connection between different possible observables and the neutron star equation of state (EOS). The EOS imprints subtle differences on the expected gravitational wave (GW) signals prior to merger, and these differences become much more significant during merger and afterward. For example, softer equations of state lead to mergers that occur at higher frequencies than for stiffer equations of state. Neutrino production and the composition of ejected material are also strongly dependent on the EOS, and these processes have a strong impact on electromagnetic signals from radioactive decay. An analysis of the neutrino production indicates that EOS yielding more compact stars (softer EOS) produce the largest neutrino luminosity with the highest average neutrino energy. Electromagnetic emission is also likely to be significant and may encode information about the binary in its late stages prior to merger. In particular, binary parameters together with the strength of the magnetospheric fields will help determine the amount and direction of the emission. In addition, the strength of the emission post-merger may also be influenced by the initial form, relative strength, and rearrangement of the interior magnetic field configuration. Gravitational waves are essentially unscattered between emission and detection thereby giving direct information about the innermost engines that power these highly energetic astrophysical phenomena. Detecting and observing such waves will open a new perspective on understanding these systems. An even richer realization of the science that can be achieved will come when these observations are combined with other electromagnetic and neutrino observations.
Center for Shock Wave-Processing of Advanced Reactive Materials
This work is funded by: Swany, M. (IU Subcontract PI). K. Matous (PI) Notre Dame. Center for Shock Wave-processing of Advanced Reactive Materials (C-SWARM). $1,600,000 total, $400,000 IU Subaward; 07/01/13- 05/30/18.
CREST is conducting research, in collaboration with the University of Notre Dame and Purdue University, on a parallel multiscale and multiphysics computational framework for predictive science, using models that are verified and validated with uncertainty quantification on future high-performance Exascale computer platforms. The goal is to predict the behavior of heterogeneous materials, specifically the dynamics of their shock-induced chemo-thermo-mechanical transformations and the resulting material properties. Through adaptive Exascale simulations, we aspire to predict conditions for the synthesis of novel materials-by-design and to provide prognoses of the non-equilibrium structures that will form under shock wave processing.
Graduate Student Research of Timur Gilmanov. (Advisor: Thomas Sterling). Project Title: Lower Bound Resource Requirements for Machine Intelligence
Brief summary of the project: Machine intelligence (MI) has seen a great deal of attention and significant advancement over the past two decades. In spite of these advancements, however, truly intelligent machine behavior operating in real time remains unachieved. First, what machine intelligence should be is still unknown, except in some special cases. Second, delivering full machine intelligence is beyond the capabilities of today’s cutting-edge high-performance computing machines. One important aspect of systems possessing machine intelligence is their resource requirements, and the limitations that today’s and future machines could impose on meeting those requirements. The present research is concerned with this aspect of machine intelligence and investigates ways of estimating such requirements. The intellectual merit of this work is the introduction of a model for machine intelligence. As work on this project is still ongoing, this model and its subsequent implementation will make it possible to answer questions such as: “What are the functional elements of a machine-intelligent system?” and “What are their usage incident rates and costs?” Combined in a performance model, the results obtained by answering these questions, expressed via a set of defined metrics, will make it possible to establish an estimate of the resource requirements for machine intelligence. The broader impact of this research is the ability to estimate the scale and complexity of the future machines that would be required to perform tasks in machine intelligence and artificial intelligence, thereby making it possible to determine the hardware complexities and costs associated with particular MI tasks.
Graduate Student Research of Buddhika Chamith Kahawitage Don. (Advisor: Ryan Newton). Project Title: Lightweight Runtime Binary Instrumentation Techniques and Applications
Brief summary of the project: This project develops new lightweight binary instrumentation techniques for x86 binaries. Runtime instrumentation is heavily used in application monitoring, debugging, and JIT compilers, and low-overhead, lightweight instrumentation is essential for reducing the observer effect in these settings. With these new techniques we aim to improve the state of the art in applications that rely on runtime instrumentation of native binaries, broadly enabling better application visibility and more efficient applications.
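As a rough source-level analogue of what runtime instrumentation does (real binary instrumentation patches machine code in a running x86 process rather than wrapping functions), the sketch below records per-call timing for a function. The wrapper's own bookkeeping cost is precisely the observer effect the project aims to minimize; all names here are illustrative.

```python
import functools
import time

def instrument(fn, log):
    # Wrap fn so every call appends (name, duration) to log. The timing
    # and list-append work done by the wrapper is overhead the original
    # program never paid -- the "observer effect".
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            log.append((fn.__name__, time.perf_counter() - t0))
    return wrapper

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

log = []
fib_instr = instrument(fib, log)   # note: the inner recursive calls still
fib_instr(10)                      # hit the uninstrumented global fib
assert log and log[0][0] == "fib"
```

A binary-level tool would intercept the recursive calls as well, which is exactly why its per-event overhead must be tiny to keep the measured program's behavior representative.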
Graduate Student Research of Bibrak Qamar Chandio. (Advisor: Thomas Sterling). Project Title: Asynchronous Graph Processing Using Message Driven Systems
Brief summary of the project: High-performance computing hardware offers a large amount of parallelism. This is evident in architectures equipped with modern multicore processors and accelerators such as GPUs. As this hardware parallelism grows, so do the gaps between current programming and execution models. One such gap is introduced by the Bulk Synchronous Parallel (BSP) model, which limits the granularity of parallelism. To alleviate this, one promising exploration space is fine-grain event-driven execution models, such as ParalleX. This work explores graphs, structures that inherently contain a large amount of parallelism. Asynchronously processing graphs using event-driven execution models exposes this inherent fine-grain parallelism. This work will implement current and new asynchronous graph processing algorithms on HPX+. The knowledge acquired will further help improve HPX+ and future hardware designs.
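The contrast with BSP can be illustrated with a toy message-driven traversal. In the sketch below (illustrative only, not HPX+ code), distance messages are consumed as soon as they are available, with no global barrier separating BFS levels the way a BSP superstep would:

```python
from collections import deque

def async_bfs(adj, source):
    # Message-driven relaxation: a message (v, d) means "distance d has
    # reached vertex v". Each message is processed independently as soon
    # as it is dequeued; there is no level-by-level synchronization.
    dist = {v: float("inf") for v in adj}
    inbox = deque([(source, 0)])
    while inbox:
        v, d = inbox.popleft()
        if d < dist[v]:                 # act only on improving messages
            dist[v] = d
            for w in adj[v]:            # emit messages to neighbors
                inbox.append((w, d + 1))
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
assert async_bfs(adj, 0) == {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

Because each message carries enough context to be handled on its own, messages could be processed in any order (or concurrently on many workers) and still converge to the same distances, which is the property an event-driven runtime exploits.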
EDUCATION AND WORKFORCE DEVELOPMENT
HPC Education through Formal On-Line On-Demand Curriculum
This project, sponsored by the NSF, will dramatically increase the accessibility of HPC education for this nation’s students, independent of socioeconomic status or demographics, and provide an enhanced skilled workforce to strengthen US capabilities in computational science, engineering, and HPC system administration. The project will make possible the near-term development and dissemination of a new on-demand course in HPC for national distribution. This asynchronous course is realized through advanced Internet-based online services and on-demand video lectures with supporting instructional materials.
This work has the recent book by Sterling, Anderson, and Brodowicz as its foundation:
Sterling, Thomas, Matthew Anderson, and Maciej Brodowicz. 2018. High Performance Computing: Modern Systems and Practices, 1st Edition. Morgan Kaufmann, Cambridge, MA. ISBN-13: 978-0124201583. Available from Amazon at https://www.amazon.com/High-Performance-Computing-Systems-Practices/dp/012420158X/ref=sr_1_1?ie=UTF8&qid=1519140912&sr=8-1&keywords=sterling+high+performance+computing+anderson+brodowicz
Related, openly licensed materials are online at https://tinyurl.com/jgphfn4
CONTINUUM COMPUTER ARCHITECTURE
Continuum Computer Architecture (CCA) is a class of parallel computing architectures that incorporate a large number of identical, small logical structures. While the computational capabilities of the individual structures are limited, machines composed of them can perform complex operations as an emergent property of the system. A simple and long-known example of CCA is the cellular automaton. The project focuses on the development of a specific instance of this architecture class called Simultac, which is a scalable, general-purpose non-von Neumann computer. Simultac is expected to achieve two orders of magnitude better aggregate performance than conventional processors in the same area of silicon, while providing over 95% better energy efficiency and inherent resilience to hardware faults. The intellectual merit of the proposed research lies in the exploration, understanding, and possible innovation in parallel computer architecture. The goal of the research is to develop the class of Continuum Computer Architecture to deliver extremely high parallelism at the end of Moore’s Law near nano-scale semiconductor technology, and to demonstrate cross-cutting methodologies to employ it for general-purpose parallel computing. The new cellular architecture will incorporate local operational rules that, in highly replicated arrays, will exhibit global general-purpose parallel computation as an emergent behavioral property; a new and potentially promising approach if successful. Key problems to be pursued are: global address space for the cellular structure memory, local cellular control, nearest-neighbor interoperability, lightweight synchronization mechanisms, broadly distributed medium- and coarse-grained parallelism, and cross-array message-driven communication for data and task migration.
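The cellular-automaton analogy can be made concrete with elementary rule 110, a nearest-neighbor update rule known to be Turing-complete. The sketch below is illustrative only and unrelated to the actual Simultac design; it shows how each cell consults only its immediate neighbors, yet the array as a whole can in principle perform general computation as emergent behavior:

```python
def step(cells, rule=110):
    # Each cell's next state is a pure function of its 3-cell neighborhood
    # (left, self, right) on a ring. The 8-bit rule number encodes the
    # lookup table: bit b of `rule` is the output for neighborhood value b.
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] << 2 | cells[i] << 1 | cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

# A single live cell seeds the characteristic rule-110 pattern growth.
cells = [0] * 31 + [1]
for _ in range(10):
    cells = step(cells)
```

Purely local rules plus massive replication is the CCA premise; the research questions in the paragraph above (global addressing, synchronization, message-driven migration) are what it takes to make such an array programmable for general-purpose work rather than a fixed automaton.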
The broader impacts of this project, if successful, will be a significant influence on the design of energy-efficient large-scale computers, potentially paving an economically viable way to exascale capability and beyond. As the architecture is co-designed with the next-generation runtime system based on the ParalleX execution model, it will serve as an exemplar and focused case study of the hardware features necessary to support dynamic adaptive execution environments. These in turn permit substantial increases in the execution efficiency of applications that traditionally underperform on conventional platforms, such as simulations on irregular and adaptively reconfigured mesh structures, sparse solvers, time-varying graph problems, and high-throughput streaming operations. Efficiency improvements in any of these applications are expected to catalyze dramatic breakthroughs in science, engineering, and industry. Primary components of the design of the execution hardware and the related software stack will be disseminated through conference papers and journal articles. Selected aspects of the research will also be incorporated into the introductory course on High Performance Computing currently taught at IU.
NETWORKING AND NETWORK APPLICATIONS
GEMINI: Instrumentation and Measurement for the Global Environment for Network Innovations (GENI)
GENI is designed for network experimentation, making measurement critical. GEMINI will provide the extensive instrumentation needed for collecting, analyzing, and sharing real network measurements from potentially groundbreaking GENI experiments.
Indiana University researchers have partnered with the University of Kentucky and Internet2 on the $1.3 million project, which will build on the success of perfSONAR, an internationally deployed network monitoring infrastructure.
Led by CREST’s Martin Swany, GEMINI is housed at IU and is one of only two Instrumentation and Measurement (I&M) awards, among the largest GENI efforts to date.
GENI home page: www.geni.net
GENI Project wiki: http://groups.geni.net/geni/wiki
perfSONAR: Performance focused Service Oriented Network monitoring ARchitecture
perfSONAR is an international collaboration for network monitoring. Collaborators include ESnet, GÉANT2, RNP (Rede Nacional de Ensino e Pesquisa) in Brazil, and Internet2. perfSONAR is an infrastructure for network performance monitoring that facilitates solving end-to-end performance problems on paths crossing several networks and enables network-aware applications. It contains a set of services delivering performance measurements in a federated environment. These services act as an intermediate layer between the performance measurement tools and the diagnostic or visualization applications. This layer is aimed at gathering and exchanging performance measurements between networks using well-defined protocols, and it allows for easy retrieval of the same metrics from multiple administrative domains. perfSONAR is a service-oriented architecture: a set of elementary functions has been isolated, each of which can be provided by different entities called services, and all of these services communicate with each other using well-defined protocols. perfSONAR has three contexts:
- A consortium of organizations seeking to build network performance middleware that is interoperable across multiple networks and useful for intra- and inter-network analysis. One of the main goals is to make it easier to solve end-to-end performance problems on paths crossing several networks.
- A protocol. It assumes a set of roles (the various service types), defines the protocol standard (syntax and semantics) by which they communicate, and allows anyone to write a service playing one of those roles. The protocol is based on SOAP XML messages conforming to the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) schema definitions.
- Several interoperable software packages (implementations of the various services) that together form a performance middleware framework. These packages are developed by different partners. Some parts of the software are “more important” than others because their goal is to ensure interoperability between domains (e.g., the Lookup Service and the Authentication Service). Different subsets of the software are important to each partner, with a great deal of overlap. The services act as an intermediate layer between the performance measurement tools and the diagnostic or visualization applications.
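To make the metadata/data split of these messages concrete, the sketch below builds a small measurement message in roughly the NM-WG style. The element and attribute names, and the namespace URI, are illustrative approximations for this sketch, not the normative schema:

```python
import xml.etree.ElementTree as ET

# Base NM-WG namespace, assumed here for illustration.
NMWG = "http://ggf.org/ns/nmwg/base/2.0/"
ET.register_namespace("nmwg", NMWG)

def make_message(metadata_id, src, dst, values):
    # A metadata block describes *what* was measured (here, an endpoint
    # pair); a data block carries the results and refers back to the
    # metadata via metadataIdRef -- the basic NM-WG message shape.
    msg = ET.Element(f"{{{NMWG}}}message")
    meta = ET.SubElement(msg, f"{{{NMWG}}}metadata", id=metadata_id)
    subj = ET.SubElement(meta, f"{{{NMWG}}}subject")
    ET.SubElement(subj, f"{{{NMWG}}}endPointPair", src=src, dst=dst)
    data = ET.SubElement(msg, f"{{{NMWG}}}data", metadataIdRef=metadata_id)
    for t, v in values:
        ET.SubElement(data, f"{{{NMWG}}}datum", timeValue=str(t), value=str(v))
    return msg

msg = make_message("meta1", "hostA", "hostB", [(1520000000, "9.3")])
print(ET.tostring(msg, encoding="unicode"))
```

Separating the description of a measurement from its values is what lets one domain's visualization tool retrieve and interpret another domain's data without prior coordination.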
PROGRAMMING LANGUAGES AND COMPILERS
SPX: Collaborative Research: Multi-Grain Compilers for Parallel Builds at Every Scale
This work is funded by: Newton, R. (PI). SPX: Collaborative Research: Eat your Wheaties: Multi-Grain Compilers for Parallel Builds at Every Scale. NSF Award #1725679; $400,000; 07/01/17- 06/30/21.
Modern software development practices at companies such as Google and Facebook have led to compilation, the process of transforming source programs into executable programs, becoming a significant, time-consuming, resource-intensive process. Unfortunately, even state-of-the-art compilers and build systems do a poor job of exploiting emerging, high-performance, highly parallel hardware, so software development is hampered by the still-slow process of compilation. This project aims to develop new techniques to speed up compilation. The intellectual merits are the design of new compiler internals, algorithms, and schedulers that enable compilers to take advantage of modern hardware capabilities. The project’s broader significance and importance are that the process of compilation undergirds virtually every aspect of modern software, and hence modern life: speeding up compilation enables any type of software to be developed more quickly, providing new features to users and squashing potentially catastrophic bugs faster.
The project revolves around three main thrusts. First, the PIs are developing new representations for compiler internals that better fit the memory hierarchy of modern machines, eschewing pointer-based representations for dense representations. They are designing techniques to allow programmers to write their compiler passes at a high level while automatically converting them to use the dense representation. Second, the PIs are designing new algorithms to optimize compiler passes. These are transformations of internal compiler algorithms to promote locality (by combining passes that operate on similar portions of a program) and to enhance parallelism (by eliminating unnecessary synchronization between passes). Finally, the PIs are creating new scheduling techniques to allow the new highly-parallel compiler algorithms to be effectively mapped to the parallel and distributed hardware on which modern build systems execute. These research thrusts are being carried out using a new research compiler, Gibbon.
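The pointer-free idea in the first thrust can be sketched with a toy expression tree: serialize it preorder into one contiguous buffer, then traverse it with a cursor instead of chasing pointers. This is only an illustration of the dense-representation concept, not Gibbon's actual encoding:

```python
# Tags for a tiny expression language: Leaf(int) | Node(left, right)
LEAF, NODE = 0, 1

def serialize(tree, buf):
    # Preorder, contiguous encoding: a tag byte, then either an 8-byte
    # integer payload (Leaf) or the two children in sequence (Node).
    if isinstance(tree, int):
        buf.append(LEAF)
        buf.extend(tree.to_bytes(8, "little", signed=True))
    else:
        buf.append(NODE)
        serialize(tree[0], buf)
        serialize(tree[1], buf)

def sum_leaves(buf, pos=0):
    # Walk the dense buffer with a cursor; returns (sum, next position).
    # Sequential byte access replaces the pointer dereferences a boxed
    # tree would need, which is far friendlier to caches and prefetchers.
    if buf[pos] == LEAF:
        return int.from_bytes(buf[pos + 1:pos + 9], "little", signed=True), pos + 9
    left, pos = sum_leaves(buf, pos + 1)
    right, pos = sum_leaves(buf, pos)
    return left + right, pos

tree = ((1, 2), (3, (4, 5)))   # Node(Node(1,2), Node(3, Node(4,5)))
buf = bytearray()
serialize(tree, buf)
total, _ = sum_leaves(buf)
assert total == 15
```

Writing passes against such a buffer by hand is awkward, which is why the first thrust pairs the dense layout with a way for programmers to write passes at a high level and have them compiled down to cursor-style traversals automatically.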
NSF Award Abstract: https://nsf.gov/awardsearch/showAward?AWD_ID=1725679
CAREER: Towards Practical Deterministic Parallel Languages
This work is supported by: Newton, R. (PI). CAREER: Towards Practical Deterministic Parallel Languages. NSF Award #1453508; $535,043; 02/01/15- 01/31/20.
Parallel multicore processors have become ubiquitous, but parallel programming has not. This gap implies that many everyday programs do not fully use the hardware on which they run. The problem persists because traditional parallel programming approaches are high-risk: a parallel program can yield inconsistent answers, or even crash, due to unpredictable interactions between simultaneous tasks. Certain classes of programs, however, admit strong mathematical guarantees that they will behave the same in spite of parallel execution; that is, they enable deterministic parallel programming. Functional programming, extended with “LVars” (shared-state data structures that support commuting operations), is one such model. While this theoretical model has been proven deterministic, significant questions remain regarding practical aspects such as efficiency and scalability. This research addresses those questions by developing new LVar data structures and scaling them to larger distributed-memory machines. The intellectual merits are in the development of novel algorithms that support parallel programming. Further, the LVar model provides a new lens through which to view problems in parallel programming, which can lead to downstream discoveries. The project’s broader significance and importance are (1) its potential to lower the cost and risk of parallel programming and (2) its educational goal: to employ deterministic parallel programming in introductory programming courses, both at the university level and in K-12 education. Changing how programming is taught may be necessary for leveraging hardware parallelism to become a normal and unexceptional part of writing software.
Three specific technical challenges are addressed in this research. First, LVars traditionally require more storage space over time, because “delete” operations do not commute with others. Semantically, the state-space of each LVar forms a join semi-lattice and all modifications must move the state “upwards” monotonically. Nevertheless, this project investigates new ways that LVars can free memory, using a concept of Saturating LVars. Second, this research seeks to formalize the relationship of LVar-based parallel programs to their purely functional counterparts, characterizing asymptotic performance advantages. Finally, this project explores the scalability of LVar-based programming abstractions in a distributed memory setting, where they share similarities with recent distributed programming constructs such as concurrent replicated data structures.
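A minimal sketch of the LVar discipline (illustrative Python, not the actual Haskell library): writes are restricted to lattice joins (here, set union, so state only grows) and reads are blocking threshold reads, which together make the observed answer independent of thread interleaving:

```python
import threading

class SetLVar:
    # A toy LVar over the powerset lattice. State changes only by joining
    # in new elements (set union is the lattice join; there are no
    # deletes, matching the monotonic "upwards" movement described above).
    def __init__(self):
        self._state = set()
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            self._state |= {item}          # monotonic join
            self._cond.notify_all()

    def get_threshold(self, lower_bound):
        # Block until the state has risen past lower_bound, then return
        # the threshold itself -- never the raw intermediate state, so
        # the result is deterministic regardless of put ordering.
        with self._cond:
            self._cond.wait_for(lambda: lower_bound <= self._state)
            return frozenset(lower_bound)

lv = SetLVar()
threads = [threading.Thread(target=lv.put, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
result = lv.get_threshold({0, 1, 2, 3})    # blocks until all puts land
for t in threads:
    t.join()
assert result == frozenset({0, 1, 2, 3})
```

The Saturating-LVars question in the paragraph above asks, in these terms, when a grow-only structure like `_state` can nonetheless reclaim memory, since a plain delete would break the join-only discipline.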
NSF Award Abstract: https://nsf.gov/awardsearch/showAward?AWD_ID=1453508