HtComp
Current trends in computer architecture are increasingly moving towards heterogeneous platforms, i.e. systems made of different computational units: general-purpose processors, special-purpose units such as digital signal processors (DSPs) and graphics processing units (GPUs), co-processors, and custom acceleration logic, i.e. application-specific circuits, often implemented on field-programmable gate arrays (FPGAs). In recent years, several innovative companies, including Convey, Maxeler, SRC, and Nimbix, have introduced FPGA-based heterogeneous platforms used in a wide range of applications, e.g. medical image processing, bioinformatics, genomics research, and DNA sequence search, with speedups in the range of 10x to 100x. While the potential for improved performance lies essentially in platform heterogeneity, programming such next-generation machines is extremely difficult, as it requires architecture-specific code, increasing the complexity and decreasing the portability of applications across different machines. For FPGA-based hardware acceleration the programmability challenges are even tougher, since in principle they require the developer to be a highly skilled hardware designer familiar with low-level hardware description languages such as VHDL or Verilog. We collectively refer to these challenges as the programmability wall. Failing to tackle this wall means missing the opportunity to fully exploit the computational power offered by future heterogeneous platforms, as access to it would be restricted to a limited elite of highly skilled parallel programmers and hardware designers, excluding the vast majority of potential HPC users: biologists, engineers, computational chemists, and so on.
HtComp: Innovating current programming approaches to make heterogeneous computing easier
An essential objective of HtComp is to bring the development of heterogeneous computing applications within the expertise of general parallel programmers, possibly coming from any scientific or industrial field, including scientific computing and engineering. HtComp will introduce innovative design flows for next-generation heterogeneous HPC platforms, creating an easily accessible entry point to the development of parallel applications based on FPGA hardware accelerators paired with multi-core CPUs and GPUs. The following figure summarizes the high-level technical approach and strategic actions that will be taken by the HtComp project towards this objective.
Project organization
The start-up project will span 22 months and will be organized in 7 Work Packages (WPs). The extended H2020 project is anticipated to take the form of a transnational research initiative targeting a cooperation call in the framework of Horizon 2020 (similar to the FP7 Cooperation programme).
In the following we list the main actions planned in the HtComp proposal, indicating in parentheses the WP each action is mapped to.
State of the art (WP1). HtComp will build on a composite range of competences. These come from various research areas, e.g. compilers, automated parallelization, best practices for HPC programming, computer architecture, and high-level synthesis for the automated translation of high-level languages into hardware designs, in addition to domain-specific knowledge of the targeted application fields. Currently, these approaches are mostly isolated from each other, although they may converge to provide new paths to heterogeneous computing.
Identifying the application domains (WP2 and WP6). To drive the definition of sound technical innovations, at the level of both computing architecture and programming paradigms, the project will be essentially based on a careful analysis of the application domains that may benefit the most from dedicated hardware acceleration. The primary application fields that will be targeted will include Bioinformatics and Cryptanalysis. The study of these application fields will fundamentally rely on the multidisciplinary competences covered by the team. Additional application domains will include Computational Chemistry, Computational Finance, and Geophysics and Seismic applications. For these domains, the presence of external advisors in the area of scientific and high-performance computing will be essential.
Parallelism models and languages (WP1 and WP3). The formalism developed by the HtComp project will allow the developer to express parallelism in C/C++ high-level code at three different levels: explicitly, implicitly, and transparently. Various models of parallelism will be available: data-flow parallelism enclosed in portions of OpenMP parallel code; Single-Instruction-Multiple-Threads (SIMT) parallelism and block-level parallelism (i.e. groups of threads to be executed on a single streaming multiprocessor); task-level parallelism; and fine-grain instruction-level parallelism extracted from a control/data-flow graph (CDFG) representation of the software portion. The programmer will also be provided with suitable directives (pragmas) to implicitly drive the compiler in the parallelization step, in line with the Static Control Parts (SCoP) model. Furthermore, the programmer will be able to include constructs, mostly compiler directives/pragmas, to specify data movement and the type of parallelism.
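To make the idea concrete, the snippet below shows the kind of explicitly annotated C/C++ input the formalism is meant to accept. It uses only a standard OpenMP directive; HtComp's own pragmas are not specified in this document, so none are invented here.

```cpp
#include <cstddef>
#include <vector>

// A data-parallel loop annotated with a standard OpenMP directive:
// the style of explicitly parallel C/C++ code targeted by HtComp.
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
    std::vector<float> r(x.size());
    // Explicit data-level parallelism over independent iterations.
    #pragma omp parallel for
    for (std::size_t i = 0; i < x.size(); ++i)
        r[i] = a * x[i] + y[i];
    return r;
}
```

Each iteration is independent, so the same loop is also a candidate for SIMT execution on a GPU or for a fully unrolled datapath on an FPGA.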
Architecture model (WP4). The project will survey and classify emerging heterogeneous architectures (e.g. Convey, Maxeler, SRC, Nimbix). It will also define and prototype a reference architecture to evaluate the interplay between the low-level machine and the programming level. To support a uniform, portable model of parallelism, HtComp will define a Machine Description Format (MDF) used to represent the relevant details of a specific underlying architecture, e.g. the available degree of physical parallelism and the organization of the memory subsystem.
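The MDF itself is not specified in this document; the struct below is a purely hypothetical sketch of the kind of information such a description might capture (field and type names are illustrative, not part of any published format).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of an HtComp Machine Description Format (MDF)
// entry: the relevant details of an underlying architecture, such as
// the degree of physical parallelism and the memory organization.
struct MemoryLevel {
    std::string name;        // e.g. "on-chip BRAM", "device DDR"
    std::size_t size_bytes;  // capacity of this level
    bool software_managed;   // software-managed cache vs. hardware cache
};

struct ComputeUnit {
    std::string kind;        // e.g. "cpu-core", "gpu-sm", "fpga-region"
    unsigned parallel_lanes; // available degree of physical parallelism
};

struct MachineDescription {
    std::vector<ComputeUnit> units;
    std::vector<MemoryLevel> memory_hierarchy;
};
```

A compiler back-end could consult such a description to pick, for a given loop, the unit with enough parallel lanes and a memory level large enough to hold the working set.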
Memory architecture (WP3). The formalism exposed by the HtComp methodology will allow the programmer to specify data distribution through a hierarchical memory model capturing the recurrent characteristics of the heterogeneous architectures targeted by the project, particularly GPU- and FPGA-based ones. Furthermore, the formalism will provide a few directives for controlling software-managed cache memories. The distribution and movement of data through the memory infrastructure can be specified explicitly by the programmer or partly managed automatically by means of static/dynamic data distribution mechanisms and a dynamic interconnect implemented on reconfigurable hardware, building on preliminary results already achieved by the PI's group.
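One existing, standard way to express the explicit data movement described above is the OpenMP `target` construct with `map` clauses; HtComp's own memory directives are not specified here, so this only illustrates the annotation style.

```cpp
#include <cstddef>
#include <vector>

// Explicit host-to-accelerator data movement expressed with standard
// OpenMP 4.x directives: the array section is copied to the device,
// scaled there, and copied back. Without an offload-capable compiler
// the region simply runs on the host.
void scale_on_device(std::vector<float>& v, float factor) {
    float* p = v.data();
    std::size_t n = v.size();
    #pragma omp target map(tofrom: p[0:n])
    for (std::size_t i = 0; i < n; ++i)
        p[i] *= factor;
}
```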
On-chip interconnect generation (WP5). HtComp will also investigate an automated approach to exploring the design space of the interconnect within the FPGA chip, which is a multi-core system in its own right. The approach will rely on the clustering of the communicating nodes and on scheduling algorithms based on either genetic algorithms or ad-hoc heuristics, in line with preliminary results already presented by the PI's group.
Code transformation (WP3 and WP5). The generation of code for heterogeneous platforms will be based on two complementary approaches. The extraction of instruction- and dataflow-level parallelism will rely on high-level synthesis techniques used to translate high-level C/C++ OpenMP code into hardware description languages like VHDL and Verilog. A different level of code transformation explored by HtComp will target the exploitation of thread/block- and data-level parallelism across all the types of computing units. The input formalism will provide suitable directives to explicitly mark regions of code compliant with the Static Control Parts (SCoP) model, enabling the use of polyhedral tools (e.g. clan, Graphite in gcc 4.2) to track dependencies automatically. The project will also explore the adoption of techniques to automatically detect SCoP regions.
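As an illustration, the loop nest below satisfies the SCoP conditions: affine loop bounds, affine array subscripts, and no data-dependent control flow. Polyhedral tools such as clan or Graphite can extract exact dependence information from a region like this automatically.

```cpp
// A SCoP-compliant kernel: all loop bounds and array subscripts are
// affine functions of the loop indices and the parameter n, so a
// polyhedral tool can compute its dependences exactly.
void matmul(const float* A, const float* B, float* C, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            C[i * n + j] = 0.0f;
            for (int k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
        }
}
```

By contrast, a loop whose bound or subscript depends on runtime data (e.g. `A[idx[i]]`) falls outside the SCoP model and cannot be analyzed this way.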
Code simulation and debug (WP5). HtComp will study innovative approaches to co-simulation, co-emulation, and debugging. The framework will support functional, software-in-the-loop, and hardware-in-the-loop simulation approaches, also relying on recent initiatives such as the Standard Co-Emulation Modeling Interface (SCE-MI) introduced by Accellera. Debugging will be supported by code annotations and flags that will instruct the HLS tool to generate instrumented hardware description code for tracking at runtime the current state of execution of the hardware block and linking this state to software debug information.
Evaluation and benchmarking (WP6). The evaluation will target popular software packages commonly used by scientists and domain experts in the various fields considered by HtComp: the Burrows-Wheeler Aligner (BWA), Velvet, and implementations of the Smith-Waterman algorithm in the field of Bioinformatics; custom tools and packages in the field of Cryptanalysis; BigDFT, Q-CHEM, and others in the field of Computational Chemistry; parallel implementations of Monte Carlo simulation for Computational Finance; and various RTM (reverse time migration) packages for Geophysics and Seismic applications.
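Among the benchmarks named above, Smith-Waterman is a classic FPGA-acceleration target. The sketch below gives a minimal version of its core recurrence (local alignment score with a linear gap penalty); the scoring parameters are illustrative defaults, not values prescribed by HtComp or the cited packages.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Minimal Smith-Waterman local-alignment score with a linear gap
// penalty: H[i][j] is the best score of an alignment ending at
// a[i-1], b[j-1], clamped at zero to allow local restarts.
int smith_waterman_score(const std::string& a, const std::string& b,
                         int match = 2, int mismatch = -1, int gap = -1) {
    std::vector<std::vector<int>> H(a.size() + 1,
                                    std::vector<int>(b.size() + 1, 0));
    int best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i)
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int diag = H[i - 1][j - 1] +
                       (a[i - 1] == b[j - 1] ? match : mismatch);
            H[i][j] = std::max({0, diag,
                                H[i - 1][j] + gap,
                                H[i][j - 1] + gap});
            best = std::max(best, H[i][j]);
        }
    return best;
}
```

The anti-diagonal independence of this recurrence is precisely what makes it map well onto systolic FPGA implementations.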
Key actions to push the new approaches to the user community (WP6 and WP7). Due to the multi-disciplinary nature of the HtComp project and the direct impact of the proposed approaches on final users, the proposal puts much emphasis on technology transfer to industry. This will include participation in workshops, symposia, and exhibitions; the organization of workshops and brainstorming sessions; the distribution of open reference code; links with important initiatives at the European level, such as PRACE (Partnership for Advanced Computing in Europe) and the Human Brain Project, specifically two of its sub-projects, the Brain Simulation Platform and the High Performance Computing Platform; as well as the monitoring of initiatives from private companies, such as the Maxeler University Program (MAX-UP).
The final outcome of the HtComp project will consist of a set of methodologies, approaches, and experimental tools forming a prototypical toolchain for the assisted development of heterogeneous code: starting from familiar C/C++ high-level code, possibly annotated with parallelization directives, HtComp will enable the automated generation of parallelized heterogeneous code including application-specific, FPGA-based hardware accelerators.