Image Sensor Processing Multistream Pipeline - 2024.1 English

Vitis Libraries

Release Date: 2024-05-30
Version: 2024.1 English

The ISP multistream pipeline lets you process input from multiple streams using a single ISP instance. The current multistream pipeline processes four streams in round-robin order, with input type XF_16UC1 and output type XF_8UC3 (RGB). After color conversion from the RGB to the YUV color space, the output type is XF_16UC1 (YUYV).

This ISP pipeline includes the following 19 modules:

  • Extract Exposure Frames: The Extract Exposure Frames module returns the Short Exposure Frame and Long Exposure Frame from the input frame using the Digital overlap parameter.
  • HDR Merge: The HDR Merge module generates an HDR image from a set of frames captured with different exposures. Image sensors usually have limited dynamic range, so it is difficult to obtain an HDR image from a single capture. Frames are therefore collected from the sensor with different exposure times, and HDR Merge combines those exposure frames into one HDR frame.
  • HDR Decompand: This module decompands, or decompresses, piecewise linear (PWL) companded data. Companding is performed in image sensors that cannot transmit data at a high bit width. The module supports Bayer raw data with four-knee-point PWL mapping, and equations are provided for 12-bit to 16-bit conversion.
  • RGBIR to Bayer (RGBIR): This module converts the input image with R, G, B, IR pixel data into a standard Bayer pattern image along with a full IR data image.
  • Auto Exposure Compensation (AEC): This module automatically attempts to correct the exposure level of the captured image and also improves the contrast of the image.
  • Black Level Correction (BLC): This module corrects the black and white levels of the overall image. An uncorrected black level whitens the dark regions of the image and causes a perceived loss of overall contrast.
  • Bad Pixel Correction (BPC): This module removes defective (bad) pixels of an image sensor that result from manufacturing faults or from variations in pixel voltage levels caused by temperature or exposure.
  • Degamma: This module linearizes the input from the sensor to facilitate ISP processing that operates in the linear domain.
  • Lens Shading Correction (LSC): This module corrects the darkening toward the edge of the image caused by camera lens limitations. This darkening effect is also known as vignetting.
  • Gain Control: This module improves the overall brightness of the image.
  • Demosaicing: This module reconstructs RGB pixels from the input Bayer image (RGGB, BGGR, GBRG, GRBG).
  • Auto White Balance (AWB): This module improves color balance of the image by using image statistics.
  • Color Correction Matrix (CCM): This module converts the input image color format to output image color format using the Color Correction Matrix provided by the user (CCM_TYPE).
  • Quantization & Dithering (QnD): This module is a tone-mapper that dithers the input image using the Floyd-Steinberg dithering method. Floyd-Steinberg dithering is commonly used by image manipulation software; for example, when an image is converted to GIF format, each pixel intensity value is quantized to 8 bits, that is, 256 colors.
  • Global Tone Mapping (GTM): This module is a tone-mapper that reduces the dynamic range from higher range to display range using tone mapping.
  • Local Tone Mapping (LTM): This module is a tone-mapper that takes pixel neighbor statistics into account and produces images with more contrast and brightness.
  • Gamma Correction: This module improves the overall brightness of the image.
  • 3DLUT: The 3D LUT module operates on three independent parameters. This drastically increases the number of mapped indexes to value pairs. For example, a combination of 3 individual 1D LUTs can map 2^n * 3 values where n is the bit depth, whereas a 3D LUT processing 3 channels will have 2^n * 2^n * 2^n possible values.
  • Color Space Conversion (CSC): The CSC module converts an RGB image to a YUV422 (YUYV) image for HDMI display. RGB2YUYV produces a Y value for every pixel and U and V values for alternating pixels.
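
The RGB-to-YUYV step described for the CSC module can be illustrated with a small host-side sketch. This is not the library's kernel. The BT.601 full-range coefficients, the helper names rgb_to_yuv and pack_yuyv_row, and the choice to place luma in the low byte of each 16-bit word are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper types; not part of the Vitis Vision API.
struct Rgb { uint8_t r, g, b; };
struct Yuv { uint8_t y, u, v; };

// Round to nearest and clamp to the 8-bit range.
static uint8_t clamp_u8(double x) {
    return static_cast<uint8_t>(std::min(255.0, std::max(0.0, std::round(x))));
}

// Assumed BT.601 full-range RGB -> YUV conversion.
Yuv rgb_to_yuv(const Rgb& p) {
    Yuv o;
    o.y = clamp_u8(0.299 * p.r + 0.587 * p.g + 0.114 * p.b);
    o.u = clamp_u8(-0.168736 * p.r - 0.331264 * p.g + 0.5 * p.b + 128.0);
    o.v = clamp_u8(0.5 * p.r - 0.418688 * p.g - 0.081312 * p.b + 128.0);
    return o;
}

// Pack one row as YUYV: every pixel keeps Y, even pixels carry U, odd pixels carry V.
// Assumed word layout: Y in bits 7..0, chroma sample in bits 15..8.
std::vector<uint16_t> pack_yuyv_row(const std::vector<Rgb>& row) {
    assert(row.size() % 2 == 0 && "YUYV needs an even number of pixels per row");
    std::vector<uint16_t> out(row.size());
    for (std::size_t x = 0; x < row.size(); ++x) {
        const Yuv c = rgb_to_yuv(row[x]);
        const uint8_t chroma = (x % 2 == 0) ? c.u : c.v;
        out[x] = static_cast<uint16_t>(c.y | (chroma << 8));
    }
    return out;
}
```

Each output element is one 16-bit value per pixel, which is consistent with the XF_16UC1 output type mentioned in the overview; the exact byte order used by the library should be checked against its own reference code.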

ISP Multistream Diagram

[Figure: ISP multistream diagram (ISP_Multi_Diagram.PNG)]

Parameter Descriptions

Table 247 Runtime Parameter

Parameter           Description
------------------  -----------
dcp_params_16to12   Parameters for converting the input image bit depth from 16 bits to 12 bits.
dcp_params_12to16   Parameters for converting the input image bit depth from 12 bits to 16 bits.
R_IR_C1_wgts        5x5 weights to calculate R at the IR location for constellation 1.
R_IR_C2_wgts        5x5 weights to calculate R at the IR location for constellation 2.
B_at_R_wgts         5x5 weights to calculate B at the R location.
IR_at_R_wgts        3x3 weights to calculate IR at the R location.
IR_at_B_wgts        3x3 weights to calculate IR at the B location.
sub_wgts            Weights for the weighted subtraction of the IR image from the RGB image: sub_wgts[0] -> G pixel, sub_wgts[1] -> R pixel, sub_wgts[2] -> B pixel, sub_wgts[3] -> calculated B pixel.
wr_hls              Lookup table of weight values. The weights LUT is computed on the host and passed as an input to the function.
array_params        Parameters packed into one array for the multistream pipeline.
gamma_lut           Lookup table of gamma values. The first 256 values are for R, the next 256 for G, and the last 256 for B.
dgam_params         Array containing the upper limit, slope, and intercept of the linear equations for the red, green, and blue channels.
c1                  Parameter used in tone mapping to retain detail in bright areas.
c2                  Efficiency factor, ranging from 0.5 to 1 based on the dynamic range of the output device.
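
The gamma_lut layout described in the runtime parameter table (256 R values, then 256 G values, then 256 B values) can be sketched on the host as follows. The helper name build_gamma_lut and the power-law curve are assumptions for illustration; the source only specifies the layout, not how the values are generated.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical host-side helper: fills one stream's 256*3 table laid out as
// R[0..255], G[0..255], B[0..255].
std::array<uint8_t, 256 * 3> build_gamma_lut(float gamma_r, float gamma_g, float gamma_b) {
    assert(gamma_r > 0.0f && gamma_g > 0.0f && gamma_b > 0.0f);
    const float gammas[3] = {gamma_r, gamma_g, gamma_b};
    std::array<uint8_t, 256 * 3> lut{};
    for (int ch = 0; ch < 3; ++ch) {
        for (int i = 0; i < 256; ++i) {
            // Assumed curve: out = 255 * (in / 255)^(1 / gamma)
            const float v = 255.0f * std::pow(i / 255.0f, 1.0f / gammas[ch]);
            lut[ch * 256 + i] = static_cast<uint8_t>(std::lround(v));
        }
    }
    return lut;
}
```

One such table would be needed per stream, since the accelerator takes gamma_lut as a [NUM_STREAMS][256 * 3] array.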

Table 248 Compile Time Parameter

Parameter     Description
------------  -----------
XF_HEIGHT     Maximum height of the input and output image.
XF_WIDTH      Maximum width of the input and output image.
XF_SRC_T      Input pixel type. The supported pixel width is 16.
NUM_STREAMS   Total number of streams.
STRM1_ROWS    Maximum number of rows to be processed for stream 1 in one burst.
STRM2_ROWS    Maximum number of rows to be processed for stream 2 in one burst.
STRM3_ROWS    Maximum number of rows to be processed for stream 3 in one burst.
STRM4_ROWS    Maximum number of rows to be processed for stream 4 in one burst.
NUM_SLICES    Number of slices processed in each stream.
BLOCK_WIDTH   Maximum block width the image is divided into. This can be any positive integer greater than or equal to 32 and less than the input image width.
BLOCK_HEIGHT  Maximum block height the image is divided into. This can be any positive integer greater than or equal to 32 and less than the input image height.
XF_NPPC       Number of pixels processed per cycle.
NO_EXPS       Number of exposure frames to be merged in the module.
W_B_SIZE      Array size for storing the wr_hls weight values. W_B_SIZE should be 2^(bit depth).
FILTERSIZE1   Filter size for RGB pixels.
FILTERSIZE2   Filter size for IR pixels.
DGAMMA_KP     Configurable number of knee points in degamma.
SQLUTDIM      Squared value of the maximum dimension of the input LUT.
LUTDIM        33x33 dimension of the input LUT.
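
The constraints stated in the compile-time parameter table can be captured as compile-time checks. The sketch below uses constexpr constants with illustrative values only; BIT_DEPTH and block_dim_ok are hypothetical names introduced for the example, and a real design would take its values from its own configuration header.

```cpp
// Illustrative values only; none of these are recommended settings.
constexpr int XF_HEIGHT = 1080;
constexpr int XF_WIDTH = 1920;
constexpr int BIT_DEPTH = 10;              // hypothetical: bit depth used to size wr_hls
constexpr int W_B_SIZE = 1 << BIT_DEPTH;   // table: W_B_SIZE should be 2^(bit depth)
constexpr int BLOCK_WIDTH = 32;
constexpr int BLOCK_HEIGHT = 32;
constexpr int LUTDIM = 33;
constexpr int SQLUTDIM = LUTDIM * LUTDIM;  // table: squared value of the LUT dimension

// Table: block size must be >= 32 and less than the image dimension.
constexpr bool block_dim_ok(int block, int image) {
    return block >= 32 && block < image;
}

static_assert(block_dim_ok(BLOCK_WIDTH, XF_WIDTH), "BLOCK_WIDTH out of range");
static_assert(block_dim_ok(BLOCK_HEIGHT, XF_HEIGHT), "BLOCK_HEIGHT out of range");
static_assert(W_B_SIZE == 1024, "W_B_SIZE must equal 2^BIT_DEPTH");
static_assert(SQLUTDIM == 1089, "SQLUTDIM must equal LUTDIM squared");

// Total number of entries in a LUTDIM x LUTDIM x LUTDIM 3D LUT.
constexpr int lut3d_entries() {
    return SQLUTDIM * LUTDIM;
}
```

Expressing the limits as static_assert makes an invalid combination fail at build time rather than during synthesis or simulation.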

Table 249 Descriptions of array_params

Parameter    Description
-----------  -----------
rgain        Gain value for the red channel.
bgain        Gain value for the blue channel.
ggain        Gain value for the green channel.
pawb         Percentage of top and bottom pixels ignored while computing min and max, to improve quality.
bayer_p      Bayer format of the RAW input image.
black_level  Black level value used to adjust the overall brightness of the image.
height       Number of rows in the image (image height).
width        Number of columns in the image (image width).
blk_height   Actual block height.
blk_width    Actual block width.
lut_dim      Dimension of the input LUT.
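
A host program has to fill one array_params row per stream. The sketch below assumes that the array index order follows the row order of the array_params table; this agrees with the accelerator code, which reads height from index 6 and width from index 7, but the remaining positions are an inference. The enumerator names and make_stream_params are hypothetical, and the values written are placeholders.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical index names; order assumed to match the table rows.
enum ArrayParamIdx : int {
    AP_RGAIN = 0,
    AP_BGAIN,
    AP_GGAIN,
    AP_PAWB,
    AP_BAYER_P,
    AP_BLACK_LEVEL,
    AP_HEIGHT,      // index 6, read by the accel as the stream height
    AP_WIDTH,       // index 7, read by the accel as the stream width
    AP_BLK_HEIGHT,
    AP_BLK_WIDTH,
    AP_LUT_DIM,
    AP_COUNT        // 11 entries, matching array_params[NUM_STREAMS][11]
};

using StreamParams = std::array<uint16_t, AP_COUNT>;

// Builds the parameter row for one stream. Fields not set here stay zero
// and must be filled with sensor-specific values in a real testbench.
StreamParams make_stream_params(uint16_t height, uint16_t width) {
    assert(height > 0 && width > 0);
    StreamParams p{};
    p[AP_HEIGHT] = height;
    p[AP_WIDTH] = width;
    p[AP_BLK_HEIGHT] = 32;  // placeholder
    p[AP_BLK_WIDTH] = 32;   // placeholder
    p[AP_LUT_DIM] = 33;     // placeholder
    return p;
}
```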

Table 250 Compile time flags

Parameter       Description
--------------  -----------
USE_HDR_FUSION  Flag to enable or disable the HDR fusion module.
USE_GTM         Flag to enable or disable the GTM module.
USE_LTM         Flag to enable or disable the LTM module.
USE_QND         Flag to enable or disable the QND module.
USE_RGBIR       Flag to enable or disable the RGBIR module.
USE_3DLUT       Flag to enable or disable the 3DLUT module.
USE_DEGAMMA     Flag to enable or disable the Degamma module.
USE_AEC         Flag to enable or disable the AEC module.

The following example demonstrates the top-level ISP pipeline:

void ISPPipeline_accel(ap_uint<INPUT_PTR_WIDTH>* img_inp1,
               ap_uint<INPUT_PTR_WIDTH>* img_inp2,
               ap_uint<INPUT_PTR_WIDTH>* img_inp3,
               ap_uint<INPUT_PTR_WIDTH>* img_inp4,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out1,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out2,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out3,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out4,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out_ir1,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out_ir2,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out_ir3,
               ap_uint<OUTPUT_PTR_WIDTH>* img_out_ir4,
               short wr_hls[NUM_STREAMS][NO_EXPS * XF_NPPC * W_B_SIZE],
               int dcp_params_12to16[NUM_STREAMS][3][4][3],
               char R_IR_C1_wgts[NUM_STREAMS][25],
               char R_IR_C2_wgts[NUM_STREAMS][25],
               char B_at_R_wgts[NUM_STREAMS][25],
               char IR_at_R_wgts[NUM_STREAMS][9],
               char IR_at_B_wgts[NUM_STREAMS][9],
               char sub_wgts[NUM_STREAMS][4],
               ap_ufixed<32, 18> dgam_params[NUM_STREAMS][3][DGAMMA_KP][3],
               float c1[NUM_STREAMS],
               float c2[NUM_STREAMS],
               unsigned short array_params[NUM_STREAMS][11],
               unsigned char gamma_lut[NUM_STREAMS][256 * 3],
               ap_uint<LUT_PTR_WIDTH>* lut1,
               ap_uint<LUT_PTR_WIDTH>* lut2,
               ap_uint<LUT_PTR_WIDTH>* lut3,
               ap_uint<LUT_PTR_WIDTH>* lut4) {
// clang-format off
#pragma HLS INTERFACE m_axi     port=img_inp1             offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi     port=img_inp2             offset=slave bundle=gmem2
#pragma HLS INTERFACE m_axi     port=img_inp3             offset=slave bundle=gmem3
#pragma HLS INTERFACE m_axi     port=img_inp4             offset=slave bundle=gmem4
#pragma HLS INTERFACE m_axi     port=img_out1             offset=slave bundle=gmem5
#pragma HLS INTERFACE m_axi     port=img_out2             offset=slave bundle=gmem6
#pragma HLS INTERFACE m_axi     port=img_out3             offset=slave bundle=gmem7
#pragma HLS INTERFACE m_axi     port=img_out4             offset=slave bundle=gmem8

#pragma HLS INTERFACE m_axi     port=img_out_ir1          offset=slave bundle=gmem9
#pragma HLS INTERFACE m_axi     port=img_out_ir2          offset=slave bundle=gmem10
#pragma HLS INTERFACE m_axi     port=img_out_ir3          offset=slave bundle=gmem11
#pragma HLS INTERFACE m_axi     port=img_out_ir4          offset=slave bundle=gmem12
#pragma HLS INTERFACE m_axi     port=wr_hls               offset=slave bundle=gmem13
#pragma HLS INTERFACE m_axi     port=dcp_params_12to16    offset=slave bundle=gmem14
#pragma HLS INTERFACE m_axi     port=R_IR_C1_wgts         offset=slave bundle=gmem15
#pragma HLS INTERFACE m_axi     port=R_IR_C2_wgts         offset=slave bundle=gmem16
#pragma HLS INTERFACE m_axi     port=B_at_R_wgts          offset=slave bundle=gmem17
#pragma HLS INTERFACE m_axi     port=IR_at_R_wgts         offset=slave bundle=gmem18
#pragma HLS INTERFACE m_axi     port=IR_at_B_wgts         offset=slave bundle=gmem19
#pragma HLS INTERFACE m_axi     port=sub_wgts             offset=slave bundle=gmem20
#pragma HLS INTERFACE m_axi     port=dgam_params          offset=slave bundle=gmem21
#pragma HLS INTERFACE m_axi     port=c1                   offset=slave bundle=gmem22
#pragma HLS INTERFACE m_axi     port=c2                   offset=slave bundle=gmem23
#pragma HLS INTERFACE m_axi     port=array_params         offset=slave bundle=gmem24
#pragma HLS INTERFACE m_axi     port=gamma_lut            offset=slave bundle=gmem25
#pragma HLS INTERFACE m_axi     port=lut1                 offset=slave bundle=gmem26
#pragma HLS INTERFACE m_axi     port=lut2                 offset=slave bundle=gmem27
#pragma HLS INTERFACE m_axi     port=lut3                 offset=slave bundle=gmem28
#pragma HLS INTERFACE m_axi     port=lut4                 offset=slave bundle=gmem29
   // clang-format on

   struct ispparams_config params[NUM_STREAMS];

   uint32_t tot_rows = 0;
   int rem_rows[NUM_STREAMS];
   static short wr_hls_tmp[NUM_STREAMS][NO_EXPS * XF_NPPC * W_B_SIZE];
   static unsigned char gamma_lut_tmp[NUM_STREAMS][256 * 3];
   static float c1_tmp[NUM_STREAMS], c2_tmp[NUM_STREAMS];
   static ap_ufixed<32, 18> dgam_params_tmp[NUM_STREAMS][3][DGAMMA_KP][3];
   static int dcp_params_12to16_tmp[NUM_STREAMS][3][4][3];
   static char R_IR_C1_wgts_tmp[NUM_STREAMS][25], R_IR_C2_wgts_tmp[NUM_STREAMS][25],
               B_at_R_wgts_tmp[NUM_STREAMS][25], IR_at_R_wgts_tmp[NUM_STREAMS][9],
               IR_at_B_wgts_tmp[NUM_STREAMS][9], sub_wgts_tmp[NUM_STREAMS][4];

   unsigned short height_arr[NUM_STREAMS], width_arr[NUM_STREAMS];
   constexpr int dg_parms_c1 = 3;
   constexpr int dg_parms_c2 = 3;
   constexpr int dcp_parms1 = 3;
   constexpr int dcp_parms2 = 4;
   constexpr int dcp_parms3 = 3;
DEGAMMA_PARAMS_LOOP:
   for (int n = 0; n < NUM_STREAMS; n++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
      // clang-format on

      for (int i = 0; i < dg_parms_c1; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=dg_parms_c1 max=dg_parms_c1
        // clang-format on
        for(int j=0; j<DGAMMA_KP; j++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=DGAMMA_KP max=DGAMMA_KP
          // clang-format on
          for(int k=0; k<dg_parms_c2; k++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=dg_parms_c2 max=dg_parms_c2
            // clang-format on
            dgam_params_tmp[n][i][j][k] = dgam_params[n][i][j][k];
            }
         }
        }
    }

DECOMPAND_PARAMS_LOOP:
   for(int n=0; n<NUM_STREAMS; n++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
      // clang-format on

      for (int i = 0; i < dcp_parms1; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=dcp_parms1 max=dcp_parms1
        // clang-format on
        for(int j=0; j<dcp_parms2; j++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=dcp_parms2 max=dcp_parms2
          // clang-format on
          for(int k=0; k<dcp_parms3; k++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=dcp_parms3 max=dcp_parms3
            // clang-format on
            dcp_params_12to16_tmp[n][i][j][k] = dcp_params_12to16[n][i][j][k];
          }
        }
      }
   }


C1_C2_INIT_LOOP:
   for(int i=0; i < NUM_STREAMS; i++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
      // clang-format on
      c1_tmp[i]=c1[i];
      c2_tmp[i]=c2[i];

}
   constexpr int R_B_count=25, IR_count=9, sub_count=4;

RGBIR_INIT_LOOP_1:
   for(int n=0; n < NUM_STREAMS; n++){

   // clang-format off
   #pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
      // clang-format on

      for (int i = 0; i < R_B_count; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=R_B_count max=R_B_count
      // clang-format on

      R_IR_C1_wgts_tmp[n][i] = R_IR_C1_wgts[n][i];
      R_IR_C2_wgts_tmp[n][i] = R_IR_C2_wgts[n][i];
      B_at_R_wgts_tmp[n][i]  = B_at_R_wgts[n][i];
      }
   }

RGBIR_INIT_LOOP_2:
  for(int n=0; n < NUM_STREAMS; n++){

// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
   // clang-format on

    for (int i = 0; i < IR_count; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=IR_count max=IR_count
      // clang-format on

      IR_at_R_wgts_tmp[n][i] = IR_at_R_wgts[n][i];
      IR_at_B_wgts_tmp[n][i] = IR_at_B_wgts[n][i];
    }
  }

RGBIR_INIT_LOOP_3:
   for(int n=0; n < NUM_STREAMS; n++){

// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
      // clang-format on

      for (int i = 0; i < sub_count; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=sub_count max=sub_count
        // clang-format on

        sub_wgts_tmp[n][i] = sub_wgts[n][i];
      }
   }

ARRAY_PARAMS_LOOP:
   for (int i = 0; i < NUM_STREAMS; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=1 max=NUM_STREAMS
      // clang-format on

      height_arr[i] = array_params[i][6];
      width_arr[i] = array_params[i][7];
      height_arr[i] = height_arr[i] * RD_MULT;
      tot_rows = tot_rows + height_arr[i];
      rem_rows[i] = height_arr[i];
   }
   constexpr int glut_TC = 256 * 3;

GAMMA_LUT_LOOP:
   for (int n = 0; n < NUM_STREAMS; n++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
      // clang-format on
      for(int i=0; i < glut_TC; i++){
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=glut_TC max=glut_TC
        // clang-format on

        gamma_lut_tmp[n][i] = gamma_lut[n][i];

      }
   }

WR_HLS_INIT_LOOP:
   for(int n =0; n < NUM_STREAMS; n++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NUM_STREAMS max=NUM_STREAMS
   // clang-format on
      for (int k = 0; k < XF_NPPC; k++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=XF_NPPC max=XF_NPPC
        // clang-format on
        for (int i = 0; i < NO_EXPS; i++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=NO_EXPS max=NO_EXPS
          // clang-format on
          for (int j = 0; j < (W_B_SIZE); j++) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=W_B_SIZE max=W_B_SIZE
            // clang-format on
            wr_hls_tmp[n][(i + k * NO_EXPS) * W_B_SIZE + j] = wr_hls[n][(i + k * NO_EXPS) * W_B_SIZE + j];
          }
        }
      }
   }

   const uint16_t pt[NUM_STREAMS] = {STRM1_ROWS, STRM2_ROWS, STRM3_ROWS, STRM4_ROWS};
   uint16_t max = STRM1_ROWS;
   for (int i = 1; i < NUM_STREAMS; i++) {
      if (pt[i] > max) max = pt[i];
   }

   const uint16_t TC = tot_rows / max;
   uint32_t addrbound, wr_addrbound, num_rows;

   int strm_id = 0, stream_idx = 0, slice_idx = 0;
   bool eof_awb[NUM_STREAMS] = {0};
   bool eof_tm[NUM_STREAMS] = {0};
   bool eof_aec[NUM_STREAMS] = {0};

   uint32_t rd_offset1 = 0, rd_offset2 = 0, rd_offset3 = 0, rd_offset4 = 0;
   uint32_t wr_offset1 = 0, wr_offset2 = 0, wr_offset3 = 0, wr_offset4 = 0;

TOTAL_ROWS_LOOP:
   for (int r = 0; r < tot_rows;) {
// clang-format off
#pragma HLS LOOP_TRIPCOUNT min=(XF_HEIGHT/STRM_HEIGHT)*NUM_STREAMS max=(XF_HEIGHT/STRM_HEIGHT)*NUM_STREAMS
      // clang-format on

// Compute no.of rows to process
     if (rem_rows[stream_idx] / RD_MULT > pt[stream_idx]) { // Check number for remaining rows of 1 interleaved image
       num_rows = pt[stream_idx];
       eof_awb[stream_idx] = 0; // 1 interleaved image/stream is not done
       eof_tm[stream_idx] = 0;
       eof_aec[stream_idx] = 0;
     } else {
       num_rows = rem_rows[stream_idx] / RD_MULT;
       eof_awb[stream_idx] = 1; // 1 interleaved image/stream done
       eof_tm[stream_idx] = 1;
       eof_aec[stream_idx] = 1;
     }

     strm_id = stream_idx;

     if (stream_idx == 0 && num_rows > 0) {
     Streampipeline(img_inp1 + rd_offset1, img_out1 + wr_offset1, img_out_ir1 + wr_offset1, lut1, num_rows,
                   height_arr[stream_idx], width_arr[stream_idx], STRM1_ROWS, dgam_params_tmp, hist0_awb,
                   hist1_awb, igain_0, igain_1, flag_awb, eof_awb, array_params, gamma_lut_tmp, wr_hls_tmp,
                   R_IR_C1_wgts_tmp, R_IR_C2_wgts_tmp, B_at_R_wgts_tmp, IR_at_R_wgts_tmp, IR_at_B_wgts_tmp,
                   sub_wgts_tmp, dcp_params_12to16_tmp, hist0_aec, hist1_aec, flag_aec, eof_aec, omin_r, omax_r,
                   omin_w, omax_w, mean1, mean2, L_max1, L_max2, L_min1, L_min2, c1_tmp, c2_tmp, flag_tm,
                   eof_tm, stream_idx, slice_idx);
     rd_offset1 += (RD_MULT * num_rows * ((width_arr[stream_idx] + RD_ADD) >> XF_BITSHIFT(XF_NPPC))) / 4;
     wr_offset1 += (num_rows * (width_arr[stream_idx] >> XF_BITSHIFT(XF_NPPC))) / 4;

     } else if (stream_idx == 1 && num_rows > 0) {
     Streampipeline(img_inp2 + rd_offset2, img_out2 + wr_offset2, img_out_ir2 + wr_offset2, lut2, num_rows,
                   height_arr[stream_idx], width_arr[stream_idx], STRM2_ROWS, dgam_params_tmp, hist0_awb,
                   hist1_awb, igain_0, igain_1, flag_awb, eof_awb, array_params, gamma_lut_tmp, wr_hls_tmp,
                   R_IR_C1_wgts_tmp, R_IR_C2_wgts_tmp, B_at_R_wgts_tmp, IR_at_R_wgts_tmp, IR_at_B_wgts_tmp,
                   sub_wgts_tmp, dcp_params_12to16_tmp, hist0_aec, hist1_aec, flag_aec, eof_aec, omin_r, omax_r,
                   omin_w, omax_w, mean1, mean2, L_max1, L_max2, L_min1, L_min2, c1_tmp, c2_tmp, flag_tm,
                   eof_tm, stream_idx, slice_idx);

     rd_offset2 += (RD_MULT * num_rows * ((width_arr[stream_idx] + RD_ADD) >> XF_BITSHIFT(XF_NPPC))) / 4;
     wr_offset2 += (num_rows * (width_arr[stream_idx] >> XF_BITSHIFT(XF_NPPC))) / 4;

     } else if (stream_idx == 2 && num_rows > 0) {
     Streampipeline(img_inp3 + rd_offset3, img_out3 + wr_offset3, img_out_ir3 + wr_offset3, lut3, num_rows,
                   height_arr[stream_idx], width_arr[stream_idx], STRM3_ROWS, dgam_params_tmp, hist0_awb,
                   hist1_awb, igain_0, igain_1, flag_awb, eof_awb, array_params, gamma_lut_tmp, wr_hls_tmp,
                   R_IR_C1_wgts_tmp, R_IR_C2_wgts_tmp, B_at_R_wgts_tmp, IR_at_R_wgts_tmp, IR_at_B_wgts_tmp,
                   sub_wgts_tmp, dcp_params_12to16_tmp, hist0_aec, hist1_aec, flag_aec, eof_aec, omin_r, omax_r,
                   omin_w, omax_w, mean1, mean2, L_max1, L_max2, L_min1, L_min2, c1_tmp, c2_tmp, flag_tm,
                   eof_tm, stream_idx, slice_idx);
     rd_offset3 += (RD_MULT * num_rows * ((width_arr[stream_idx] + RD_ADD) >> XF_BITSHIFT(XF_NPPC))) / 4;
     wr_offset3 += (num_rows * (width_arr[stream_idx] >> XF_BITSHIFT(XF_NPPC))) / 4;

     } else if (stream_idx == 3 && num_rows > 0) {
     Streampipeline(img_inp4 + rd_offset4, img_out4 + wr_offset4, img_out_ir4 + wr_offset4, lut4, num_rows,
                   height_arr[stream_idx], width_arr[stream_idx], STRM4_ROWS, dgam_params_tmp, hist0_awb,
                   hist1_awb, igain_0, igain_1, flag_awb, eof_awb, array_params, gamma_lut_tmp, wr_hls_tmp,
                   R_IR_C1_wgts_tmp, R_IR_C2_wgts_tmp, B_at_R_wgts_tmp, IR_at_R_wgts_tmp, IR_at_B_wgts_tmp,
                   sub_wgts_tmp, dcp_params_12to16_tmp, hist0_aec, hist1_aec, flag_aec, eof_aec, omin_r, omax_r,
                   omin_w, omax_w, mean1, mean2, L_max1, L_max2, L_min1, L_min2, c1_tmp, c2_tmp, flag_tm,
                   eof_tm, stream_idx, slice_idx);

     rd_offset4 += (RD_MULT * num_rows * ((width_arr[stream_idx] + RD_ADD) >> XF_BITSHIFT(XF_NPPC))) / 4;
     wr_offset4 += (num_rows * (width_arr[stream_idx] >> XF_BITSHIFT(XF_NPPC))) / 4;
     }
     // Update remaining rows to process
     rem_rows[stream_idx] = rem_rows[stream_idx] - num_rows * RD_MULT;

     // Next stream selection
     if (stream_idx == NUM_STREAMS - 1) {
       stream_idx = 0;
       slice_idx++;

     } else {
       stream_idx++;
     }

     // Update total rows to process
     r += num_rows * RD_MULT;
   } // TOTAL_ROWS_LOOP

 return;

}
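
The round-robin selection performed in TOTAL_ROWS_LOOP can be reproduced with a small host-side simulation. This is a simplified sketch: it ignores the RD_MULT scaling and the read/write address offsets, and the names Burst and schedule_bursts are hypothetical. It only shows which stream and slice each burst belongs to and how many rows it covers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One call to the per-stream pipeline: which stream, which slice, how many rows.
struct Burst {
    int stream;
    int slice;
    uint32_t rows;
};

// heights[i]        : image height of stream i
// rows_per_burst[i] : maximum rows per burst for stream i (STRMx_ROWS in the accel)
std::vector<Burst> schedule_bursts(const std::vector<uint32_t>& heights,
                                   const std::vector<uint32_t>& rows_per_burst) {
    assert(heights.size() == rows_per_burst.size());
    for (uint32_t r : rows_per_burst) assert(r > 0);

    std::vector<uint32_t> rem = heights;  // remaining rows per stream
    uint32_t total = 0;
    for (uint32_t h : heights) total += h;

    std::vector<Burst> plan;
    const int n = static_cast<int>(heights.size());
    int stream = 0, slice = 0;
    for (uint32_t done = 0; done < total;) {
        // Process a full burst, or whatever is left of this stream's frame.
        const uint32_t rows = std::min(rem[stream], rows_per_burst[stream]);
        if (rows > 0) plan.push_back({stream, slice, rows});
        rem[stream] -= rows;
        done += rows;
        // Next stream; a new slice starts after the last stream has been visited.
        if (stream == n - 1) {
            stream = 0;
            ++slice;
        } else {
            ++stream;
        }
    }
    return plan;
}
```

For two streams of heights 4 and 2 with two rows per burst, the plan visits stream 0, then stream 1, then stream 0 again in the next slice. Streams that finish early are skipped, as in the accel where the pipeline is only called when num_rows is greater than zero.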

Create and Launch Kernel in the Testbench:

The auto exposure module needs two frames to populate its histogram and produce correct results. Auto white balance, GTM, and the other tone-mapping functions each need one extra frame to populate their parameters before those parameters can be applied to produce a correct image. In the example below, four iterations are needed because the AEC, AWB, and LTM modules are selected.

// Create a kernel:
OCL_CHECK(err, cl::Kernel kernel(program, "ISPPipeline_accel", &err));

for (int i = 0; i < 4; i++) {

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inVec_Weights,  // buffer on the FPGA
                                    CL_TRUE,                 // blocking call
                                    0,                       // buffer offset in bytes
                                    vec_weight_size_bytes,   // Size in bytes
                                    wr_hls));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_decompand_params,  // buffer on the FPGA
                                    CL_TRUE,                  // blocking call
                                    0,                        // buffer offset in bytes
                                    dcp_params_in_size_bytes, // Size in bytes
                                    dcp_params_12to16));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_R_IR_C1,        // buffer on the FPGA
                                    CL_TRUE,               // blocking call
                                    0,                     // buffer offset in bytes
                                    filter1_in_size_bytes, // Size in bytes
                                    R_IR_C1_wgts));
  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_R_IR_C2,        // buffer on the FPGA
                                    CL_TRUE,               // blocking call
                                    0,                     // buffer offset in bytes
                                    filter1_in_size_bytes, // Size in bytes
                                    R_IR_C2_wgts));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_B_at_R,         // buffer on the FPGA
                                    CL_TRUE,               // blocking call
                                    0,                     // buffer offset in bytes
                                    filter1_in_size_bytes, // Size in bytes
                                    B_at_R_wgts));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_IR_at_R,        // buffer on the FPGA
                                    CL_TRUE,               // blocking call
                                    0,                     // buffer offset in bytes
                                    filter2_in_size_bytes, // Size in bytes
                                    IR_at_R_wgts));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_IR_at_B,        // buffer on the FPGA
                                    CL_TRUE,               // blocking call
                                    0,                     // buffer offset in bytes
                                    filter2_in_size_bytes, // Size in bytes
                                    IR_at_B_wgts));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_sub_wgts,        // buffer on the FPGA
                                    CL_TRUE,                // blocking call
                                    0,                      // buffer offset in bytes
                                    sub_wgts_in_size_bytes, // Size in bytes
                                    sub_wgts));
  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_dgam_params,        // buffer on the FPGA
                                    CL_TRUE,                   // blocking call
                                    0,                         // buffer offset in bytes
                                    dgam_params_in_size_bytes, // Size in bytes
                                    dgam_params));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_c1,     // buffer on the FPGA
                                    CL_TRUE,       // blocking call
                                    0,             // buffer offset in bytes
                                    c1_size_bytes, // Size in bytes
                                    c1));
  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_c2,     // buffer on the FPGA
                                    CL_TRUE,       // blocking call
                                    0,             // buffer offset in bytes
                                    c2_size_bytes, // Size in bytes
                                    c2));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_array,     // buffer on the FPGA
                                    CL_TRUE,            // blocking call
                                    0,                  // buffer offset in bytes
                                    array_size_bytes,   // Size in bytes
                                    array_params));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inVec,      // buffer on the FPGA
                                    CL_TRUE,             // blocking call
                                    0,                   // buffer offset in bytes
                                    vec_in_size_bytes,   // Size in bytes
                                    gamma_lut));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inLut1,     // buffer on the FPGA
                                    CL_TRUE,           // blocking call
                                    0,                 // buffer offset in bytes
                                    lut_in_size_bytes, // Size in bytes
                                    casted_lut1,       // Pointer to the data to copy
                                    nullptr));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inLut2,     // buffer on the FPGA
                                    CL_TRUE,           // blocking call
                                    0,                 // buffer offset in bytes
                                    lut_in_size_bytes, // Size in bytes
                                    casted_lut2,       // Pointer to the data to copy
                                    nullptr));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inLut3,     // buffer on the FPGA
                                    CL_TRUE,           // blocking call
                                    0,                 // buffer offset in bytes
                                    lut_in_size_bytes, // Size in bytes
                                    casted_lut3,       // Pointer to the data to copy
                                    nullptr));

  OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inLut4,     // buffer on the FPGA
                                    CL_TRUE,           // blocking call
                                    0,                 // buffer offset in bytes
                                    lut_in_size_bytes, // Size in bytes
                                    casted_lut4,       // Pointer to the data to copy
                                    nullptr));


  if(HDR_FUSION) {
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage1, CL_TRUE, 0, image_in_size_bytes, interleaved_img1.data));
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage2, CL_TRUE, 0, image_in_size_bytes, interleaved_img2.data));
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage3, CL_TRUE, 0, image_in_size_bytes, interleaved_img3.data));
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage4, CL_TRUE, 0, image_in_size_bytes, interleaved_img4.data));

  }
  else {
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage1, CL_TRUE, 0, image_in_size_bytes, out_img1_12bit.data));
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage2, CL_TRUE, 0, image_in_size_bytes, out_img1_12bit.data));
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage3, CL_TRUE, 0, image_in_size_bytes, out_img1_12bit.data));
    OCL_CHECK(err, q.enqueueWriteBuffer(buffer_inImage4, CL_TRUE, 0, image_in_size_bytes, out_img1_12bit.data));
  }

  // Profiling Objects
  cl_ulong start = 0;
  cl_ulong end = 0;
  double diff_prof = 0.0f;
  cl::Event event_sp;

  // Launch the kernel
  OCL_CHECK(err, err = q.enqueueTask(kernel, NULL, &event_sp));
  clWaitForEvents(1, (const cl_event*)&event_sp);

  event_sp.getProfilingInfo(CL_PROFILING_COMMAND_START, &start);
  event_sp.getProfilingInfo(CL_PROFILING_COMMAND_END, &end);
  diff_prof = end - start;
  std::cout << (diff_prof / 1000000) << "ms" << std::endl;
  // Copying Device result data to Host memory
  q.enqueueReadBuffer(buffer_outImage1, CL_TRUE, 0, image_out_size_bytes, out_img1.data);
  q.enqueueReadBuffer(buffer_outImage2, CL_TRUE, 0, image_out_size_bytes, out_img2.data);
  q.enqueueReadBuffer(buffer_outImage3, CL_TRUE, 0, image_out_size_bytes, out_img3.data);
  q.enqueueReadBuffer(buffer_outImage4, CL_TRUE, 0, image_out_size_bytes, out_img4.data);

  if (USE_RGBIR) {
    q.enqueueReadBuffer(buffer_IRoutImage1, CL_TRUE, 0, image_out_ir_size_bytes, out_img_ir1.data);
    q.enqueueReadBuffer(buffer_IRoutImage2, CL_TRUE, 0, image_out_ir_size_bytes, out_img_ir2.data);
    q.enqueueReadBuffer(buffer_IRoutImage3, CL_TRUE, 0, image_out_ir_size_bytes, out_img_ir3.data);
    q.enqueueReadBuffer(buffer_IRoutImage4, CL_TRUE, 0, image_out_ir_size_bytes, out_img_ir4.data);
  }
}

Resource Utilization

The following table summarizes the resource utilization of the ISP multistream pipeline generated using the Vitis HLS 2024.1 tool on a ZCU102 board.

Table 251 ISP Multistream Resource Utilization Summary

Operating Mode  Operating Frequency (MHz)  Utilization Estimate
                                           BRAM   DSP  CLB Registers  CLB LUT
--------------  -------------------------  -----  ---  -------------  -------
1 Pixel         150                        209.5  325  60142          63718

Performance Estimate

The following table summarizes the performance of the ISP multistream pipeline in 1-pixel mode, as generated using the Vitis HLS 2024.1 tool on a ZCU102 board.

The estimated average latency is obtained by running the accelerator for four iterations. The input to the accelerator is a 12-bit non-linearized full-HD (1920x1080) image.

Table 252 ISP Multistream Performance Estimate Summary

Operating Mode               Latency Estimate: Average latency (ms)
---------------------------  --------------------------------------
1 pixel operation (150 MHz)  62.742
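
As a rough sanity check on the reported figure, the latency can be compared with an idealized bound. The sketch below assumes that one accelerator call processes four full-HD streams and that each pixel costs exactly one clock cycle in 1-pixel mode; neither assumption is stated explicitly for this measurement, and ideal_latency_ms is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Idealized lower bound on the time for one accelerator call, in milliseconds,
// assuming every pixel of every stream takes one clock cycle per NPPC lane.
double ideal_latency_ms(uint32_t width, uint32_t height, uint32_t streams,
                        uint32_t nppc, double clock_mhz) {
    assert(nppc > 0 && clock_mhz > 0.0);
    const double cycles = static_cast<double>(width) * height * streams / nppc;
    return cycles / (clock_mhz * 1e3);  // MHz * 1e3 = cycles per millisecond
}
```

With 1920x1080 pixels, four streams, one pixel per cycle, and a 150 MHz clock, the bound is about 55.3 ms, which is below the reported 62.742 ms average. Under these assumptions, the difference would be attributable to pipeline fill, memory access, and per-burst overheads.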