• High Level Synthesis

Accelerating - HLS

HLS (High-Level Synthesis Compiler) is a compilation tool that allows us to create a component (hardware/HDL) from a high-level programming language (in this case, C++). This tool greatly facilitates development and abstracts hardware for software, but you still need hardware knowledge to use it.

Intel

The Intelยฎ HLS Compiler is a high-level synthesis (HLS) tool that takes in untimed C++ as input and generates production-quality register transfer level (RTL) code that is optimized for Intelยฎ FPGAs. This tool accelerates verification time over RTL by raising the abstraction level for FPGA hardware design. Models developed in C++ are typically verified orders of magnitude faster than RTL.

CentOS

Warning

I was only able to get it working on CentOS 6; my solution was to run a docker with CentOS and install the dependencies there. I run HLS via the docker CLI.

Note

To make things easier, we will provide a pre-configured docker image. Check with your instructor on how to get it.

HLS

We will generate a component that applies an offset (proc) to an image. For this, the component will have two Avalon memory access interfaces (AVALON-MM): on the first interface, we will access the original image, and on the other, we will write the processed image.

Our hardware will have the following format:

    |-----|         AXI
    | ARM | ===========================
    |-----|            |             |
                       |             |
                   |-------|     |-------|
                   |  Min  |     |  Mout |
                   |-------|     |-------|
           AVALON-MM   |             ^
                       V             |
                   |-------|         |
                   |  Proc |---------- AVALON-MM
                   | (HLS) |
                   |-------|  
  • Min: FPGA memory where we will save the original image
  • Mout: FPGA memory where we will save the processed image
  • TH: Peripheral created by HLS

For this, we will use a specific HLS syntax that, as in C, defines what type of interface will be used in the component (remember the Avalon interfaces, memory mapped, and streaming).

HLS allows us to validate the code on two distinct layers: the first is by compiling the same code that will be synthesized for the x86 architecture, with this we can validate the algorithm much faster, the second is by generating the HDL of the component and simulating it via ModelSim, all of this is done transparently and automatically by the tool.

Note

Simulating the hardware is costly in terms of processing time and computational power, it should be the last thing to be done before using the component on the hardware. Validate by compiling for x86 and then simulating.

Offset

The function to be accelerated is as follows (imgOffSet):

#define OFFSET 50

typedef ihc::mm_master<unsigned char, ihc::aspace<1>,
                       ihc::awidth<32>,
                       ihc::dwidth<8> > Master1;
typedef ihc::mm_master<unsigned char, ihc::aspace<2>,
                       ihc::awidth<32>,
                       ihc::dwidth<8> > Master2;




// just for a NxN image
inline uint pxToMem(uint x, uint y, uint N){
  return(x+y*N);
}

// px + OFFSET
hls_avalon_slave_component
component void imgOffSet(Master1& imgIn,
                         Master2& imgOut,
                         hls_avalon_slave_register_argument int offSet,
                         hls_avalon_slave_register_argument int N
                         ){
    for(int y=0; y < N; y++){
      #pragma unroll 8
      for (int x=0; x < N; x++){
        int px = pxToMem(x,y,N);
        unsigned int tpx = ((unsigned int) imgIn[px])+offSet;
        if(tpx > 255)
          imgOut[px]= 255;
        else
          imgOut[px]= tpx;
        }
    }
}

Note that the imgOffSet function has four arguments: imgIn, imgOut, offSet, and N. The first two are memory pointers, which are respectively where the component will read the image and where it will write the image. The offSet and N arguments are the value to be applied as an offset to the px and the size of the image in pxs, these arguments are of the type hls_avalon_slave_register_argument, which will be converted to a register bank.

In addition to these inputs and outputs, for each interface of the type mm_master, HLS will create one more conduit, which will be the address offset at which it should access the data (for the function, the address 0 is relative). And two more conduits, one to control the start of processing (function call/ call) and another to inform about the processing status (return).

imgIn , imgOut

The first two arguments are of the type ihc::mm_master< unsigned char, which means they will be translated to a bus of the type Avalon and that they should be treated as unsigned char.

  • ihc::aspace<n>: is a unique identifier of the bus (1,2,3,4,...)
  • ihc::awidth<32>: Defines the size of the address bus, in this case, 32 bits
  • ihc::dwidth<8>: Defines the size of the data bus, in this case, 8 (reading 8 bits)
  • There are other bus configurations that can be made in this declaration: latency/ waitrequest/ burst/ (, ihc::latency<0>, ihc::maxburst<8>, ihc::waitrequest<true>)...

pxToMem()

To facilitate development, the function pxToMem(x,y,N) translates an access to px by address in the matrix to the memory address of the px.

printf()

This function will be removed when the function is compiled for hardware; it is only available for simulation and testing.

offSet, n

We need to remember that we are creating a component that will resolve a C code, and the way we can pass arguments to a component is by creating an internal memory, which we normally call a register bank, and giving functionality to them. This is how the offSet and n parameters will be created. When the component is generated, a memory will be initialized and addresses will be reserved for offSet and n, as in the following example:

/******************************************************************************/
/* Memory Map Summary                                                         */
/******************************************************************************/

/*
  Register  | Access  |   Register Contents      | Description
  Address   |         |      (64-bits)           | 
------------|---------|--------------------------|-----------------------------
        0x0 |       R |         {reserved[62:0], |     Read the busy status of
            |         |               busy[0:0]} |               the component
            |         |                          |  0 - the component is ready
            |         |                          |       to accept a new start
            |         |                          |    1 - the component cannot
            |         |                          |          accept a new start
------------|---------|--------------------------|-----------------------------
        0x8 |       W |         {reserved[62:0], |  Write 1 to signal start to
            |         |              start[0:0]} |               the component
------------|---------|--------------------------|-----------------------------
       0x10 |     R/W |         {reserved[62:0], |      0 - Disable interrupt,
            |         |   interrupt_enable[0:0]} |        1 - Enable interrupt
------------|---------|--------------------------|-----------------------------
       0x18 |  R/Wclr |         {reserved[61:0], | Signals component completion
            |         |               done[0:0], |       done is read-only and
            |         |   interrupt_status[0:0]} | interrupt_status is write 1
            |         |                          |                    to clear
------------|---------|--------------------------|-----------------------------
       0x20 |     R/W |         {reserved[31:0], |             Argument offSet
            |         |            offSet[31:0]} |                            
------------|---------|--------------------------|-----------------------------
       0x28 |     R/W |         {reserved[31:0], |                  Argument N
            |         |                 N[31:0]} |                            
*/

main.c

In order to validate the project, we must create a main function (which will not be compiled for the hardware). In this function, we open an image file in the .pgm format ("in.pgm") and generate another image file with the original image processed ("out.pgm"). In order to validate the component to be generated ( offSetImg() ) we must allocate two continuous memory regions (in[M_SIZE] and out[M_SIZE) that will be used as the component's input (simulating the AVALON bus).

int main(void) {

  int N = IMG_W;
  int M_SIZE = N*N;

  // create memorys
  unsigned char in[M_SIZE];
  unsigned char out[M_SIZE];
  memset(out,0,sizeof(out));

  /* -------------------------- */
  /* reading img to mem */
  /* -------------------------- */
  printf("loading img\n");
  readImgPgm(IMG_IN, in, M_SIZE);

  /* -------------------------- */
  /* create fake memorys components*/
  /* -------------------------- */
  Master1 mm_in(in, M_SIZE);
  Master2 mm_out(out, M_SIZE);

  /* -------------------------- */
  /* process with kernel */
  /* -------------------------- */
  printf("kernel\n");
  imgOffSet(mm_in, mm_out, N);

  /* -------------------------- */
  /* img out */
  /* -------------------------- */
  printf("outputing \n");
  writeImgPgm(IMG_OUT, out)

  return 0;
}

Note

When we execute the imgOffSet function on our hardware, it will not be as simple as just a function call.

Testing (x86)

Note

Must be done on centos (docker)

To test, we will compile our project for x86 (it will not be hardware) and validate if our logic is correct. If it works, we compile for hardware.

To compile, just use the i++ compiler as in the following example:

$ i++ image.cpp -march=x86-64 -o image_x86

And test the generated program:

$ ./image_x86

The result should be the beautiful img.ppm photo of your instructor, processed with an offset (out.ppm):

Tip

To generate a ppm type image you can use Gimp

Note

This execution is like compiling with gcc, it only serves to validate logic

input output
img.pgm image (binary)
image.cpp out.pgm

Accelerating on the FPGA

To accelerate on the FPGA, we will compile the application again, but now with the -march=CycloneV flag which represents our FPGA

$ i++ image.cpp -march=CycloneV -o image-CycloneV

Note

This may take a long time, what it will do is:

  1. Generate an HDL from your function
  2. Create a component for the Platform Designer
input output
img.pgm image-CycloneV.prj (folder)

image-CycloneV.prj (folder)

If you notice in the project folder, there should be a new folder: image-CycloneV.prj, with the following content:

  • components: Folder with the created component (to be used in the Platform designer)
  • quartus: Quartus project folder used to compile the component, we will not use this
  • report: Folder with reports generated by the tool (html)
  • report: Folder to simulate the project

testing

Now we can test our application using the hardware created by HLS, for that just execute the new binary created when we compiled for the CycloneV architecture.

$ ./image-CycloneV

Warning

This will take a long time! On the monster lab Architecture, it took over 1 hour!

This simulation is performed on ModelSim! At the hardware level. The result will be as expected when we embed on the FPGA. With this simulation, we can check for rounding errors, memory access, among others.

Tipa

The image out-CycloneV.pgm that is in the project folder, is the result of this simulation.

report

HLS generates a report of the hardware compilation, it can be found in: reports/report.html. An interesting report to analyze is the Loops analysis, which demonstrates the program loops:

Optimizing

We can apply several parallelization techniques in the software that will impact the created hardware (area and performance), in the HLS manual (Intel High Level Synthesis Compiler: Reference Manual) there is documentation that describes each of the techniques.

We will use the Loop Unrolling one, which allows us to execute a loop in parallel:

 #pragma unroll <N>
 for (int i = 0; i < M; ++i) {
     // Some useful work
 }

Tip

N is the number of loops to be executed in //.

We will parallelize the line scanning by 8 executions in parallel, for that add in the for that scans the line (x):

 for(int y=0; y < N; y++){
      #pragma unroll 8
      for (int x=0; x < N; x++){

Creating a hardware

Now with the component created, we need to add it to the hardware, this will be done via Platform Design. To facilitate development, we will use the example hw project from Terasic: DE10_Standard_FB and modify inserting the component and two memories, as indicated below: