Demystifying the MAIX KPU: Hand-Writing Your First Network Layer

In this article we introduce the principles behind the KPU, the most important part of the Sipeed MAIX's AI processing, and then implement the simplest possible one-layer neural network by hand.

What Is the KPU, and Why Do We Need One?

The KPU (Neural Network Processor, sometimes expanded as Knowledge Processing Unit) is the core of the MAIX's AI processing. So how does the KPU handle AI workloads? First, what we currently (2019Q1) call "AI algorithms" are mostly neural network models of various architectures, such as VGG, ResNet, Inception, Xception, SqueezeNet, MobileNet, etc.

Why not compute these networks on an ordinary CPU/MCU? Because for most applications, the amount of computation is simply too large. Take a 640×480 RGB image: if the first layer applies 16 3×3 convolution kernels per color channel, that layer alone requires 640×480×3×16 ≈ 15M convolutions. One 3×3 convolution costs 9 multiply-accumulates, and each MAC needs roughly 3 cycles per operand load (two loads), 1 cycle for the multiply, 1 for the add, 1 for the loop-counter compare, and 1 for the branch, so one convolution takes about 9×(3+3+1+1+1+1) = 90 cycles. One layer therefore costs about 15M×90 = 1.35G cycles! Rounding down to 1G cycles, a 100 MHz STM32 would need ~10 s per layer, and even a 1 GHz Cortex-A7 would need ~1 s per layer! A practical network typically has more than 10 layers, so an unoptimized CPU would take seconds or even minutes per inference. In short, running neural networks on a plain CPU/MCU is far too slow to be practical.

Neural network workloads split into a training side and an inference side. For the heavy compute of training we already have NVIDIA's high-performance GPUs. Inference, however, usually runs on consumer or industrial end devices, i.e. AIoT, where size and power budgets are tight. We therefore need a dedicated acceleration module for inference, and that is exactly where the KPU comes in!
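The back-of-the-envelope estimate above can be written out directly. A minimal sketch in C; the per-MAC cycle costs are the article's rough assumptions, not measured values:

```
#include <stdio.h>

// Rough cycle estimate for one conv layer on a scalar CPU.
// The per-MAC cycle breakdown is an assumption, as in the text above.
int main(void)
{
    const long w = 640, h = 480, in_ch = 3, kernels = 16;
    const long convs = w * h * in_ch * kernels;        // ~15M 3x3 convolutions
    const long cycles_per_mac = 3 + 3 + 1 + 1 + 1 + 1; // loads, mul, add, cmp, branch
    const long cycles = convs * 9 * cycles_per_mac;    // 9 MACs per 3x3 conv

    printf("convs: %ld, cycles: %ld (~%.1f s @100MHz)\n",
           convs, cycles, cycles / 100e6);
    return 0;
}
```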

KPU Architecture Basics

Let's review the basic operations of a classical neural network:

1. Convolution: 1×1, 3×3, 5×5 and larger kernels
2. Batch Normalization
3. Activation
4. Pooling
5. Matrix operations: matrix multiply and add

Basic network architectures need only operations 1-4. Newer architectures such as ResNet add a variable to the convolution result, which requires the fifth operation, general matrix math.

For the MAIX's main chip, the K210, the first four operations - convolution, batch normalization, activation, and pooling - are accelerated in hardware, but general matrix operations are not, which limits the network structures it can run. For networks that need extra operations, the user must manually insert CPU-processed layers after the hardware finishes the basic operations, which lowers the frame rate; we therefore recommend optimizing your network down to the basic form. Fortunately, the second generation of the chip will support general matrix computation and harden more network-structure types into silicon.

In the KPU, the four basic operations above are not separate acceleration modules but one fused module, which avoids the overhead of CPU intervention at the cost of some operational flexibility. From the standalone SDK demos and the Model Compiler we reconstructed the block diagram of the KPU acceleration module shown below; the figure speaks for itself.
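Conceptually, the fused module behaves like the following per-output-pixel pipeline. This is only a software sketch of the dataflow as we infer it from the SDK, not the hardware's actual implementation; the function name is an illustrative placeholder:

```
#include <stdint.h>

// Illustrative dataflow of the fused KPU pipeline (placeholder name).
// For each layer the hardware streams: conv -> batchnorm -> activation -> pool.
int16_t kpu_pixel_pipeline(const int16_t *window, const int16_t *weights,
                           int n, /* kernel taps, e.g. 9 for 3x3 */
                           int64_t norm_mul, int64_t norm_add, int norm_shift)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)           // convolution: multiply-accumulate
        acc += (int32_t)window[i] * weights[i];

    // batch normalization: y = (x*norm_mul)>>norm_shift + norm_add
    acc = ((acc * norm_mul) >> norm_shift) + norm_add;

    // activation: in hardware, looked up from a 16-segment piecewise-linear
    // table (shown later); pooling then runs over the activated outputs.
    return (int16_t)acc;
}
```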

KPU Register Configuration

The chip vendor has not published a register manual, so we extracted the register definitions from kpu.c, kpu.h, and the Model Compiler. The KPU registers are configured through the kpu_layer_argument_t structure. We'll analyze gencode_output.c from the kpu demo in the standalone SDK demos (https://github.com/kendryte/kendryte-standalone-demo/blob/master/kpu/gencode_output.c).

```
//Layer parameter list, 16 layers in total
kpu_layer_argument_t la[] __attribute__((aligned(128))) = {
// Layer 0
{
 .kernel_offset.data = {
  .coef_row_offset = 0,		//fixed at 0
  .coef_column_offset = 0	//fixed at 0
 },
 .image_addr.data = {		//image input/output addresses: one at the front of KPU RAM, one at the back; they swap each layer so the next layer needs no copy.
  .image_dst_addr = (uint64_t)0x6980,	//image output address, int((0 if idx & 1 else (img_ram_size - img_output_size)) / 64)
  .image_src_addr = (uint64_t)0x0		//image load address
 },
 .kernel_calc_type_cfg.data = {
  .load_act = 1,			//enable the activation function; must be enabled (the hardware requires it), otherwise the output is all zeros
  .active_addr = 0,			//load address of the activation parameters, initialized in kpu_task_init to the activation polyline table
  .row_switch_addr = 0x5,	//units occupied by one image row, one unit = 64 bytes: ceil(width/64)=ceil(320/64)=5
  .channel_switch_addr = 0x4b0,			//units occupied by one channel: row_switch_addr*height=5*240=1200=0x4b0
  .coef_size = 0,			//fixed at 0
  .coef_group = 1			//number of groups computed at once; one unit is 64 bytes,
							//so width>32: set to 1; width 17~32: set to 2; width<=16: set to 4
 },
 .interrupt_enabe.data = {
  .depth_wise_layer = 0,	//ordinary convolution layer, set to 0
  .ram_flag = 0,			//fixed at 0
  .int_en = 0,				//interrupt disabled
  .full_add = 0				//fixed at 0
 },
 .dma_parameter.data = {	//DMA transfer parameters
  .dma_total_byte = 307199,		//this layer outputs 16 channels: 19200*16 = 307200 bytes (register holds size minus 1)
  .send_data_out = 0,			//send the result out via DMA (0 here: the result stays in KPU RAM)
  .channel_byte_num = 19199		//bytes per output channel; after the 2x2 pooling that follows, the size is 160*120=19200 (minus 1)
 },
 .conv_value.data = {		//convolution scaling, y = (x*arg_x)>>shr_x
  .arg_x = 0x809179,		//24-bit multiply parameter
  .arg_w = 0x0,
  .shr_x = 8,				//4-bit shift parameter
  .shr_w = 0
 },
 .conv_value2.data = {		//arg_add = kernel_size * kernel_size * bw_div_sw * bx_div_sx = 3x3x?x?
  .arg_add = 0
 },
 .write_back_cfg.data = {	//write-back configuration
  .wb_row_switch_addr = 0x3,		//ceil(160/64)=3
  .wb_channel_switch_addr = 0x168,	//120*3=360=0x168
  .wb_group = 1						//output row width>32, set to 1
 },
 .image_size.data = {	//input 320*240, output 160*120 (fields hold size minus 1)
  .o_col_high = 0x77,
  .i_col_high = 0xef,
  .i_row_wid = 0x13f,
  .o_row_wid = 0x9f
 },
 .kernel_pool_type_cfg.data = {
  .bypass_conv = 0,		//the hardware cannot skip convolution, fixed at 0
  .pad_value = 0x0,		//pad the border with 0
  .load_para = 1,		//the hardware cannot skip batch normalization, fixed at 1
  .pad_type = 0,		//use the pad value
  .kernel_type = 1,		//1 for 3x3, 0 for 1x1
  .pool_type = 1,		//pooling type: 2x2 max pooling with stride 2
  .dma_burst_size = 15,	//DMA burst size, 16 bytes; the generator script fixes this at 16
  .bwsx_base_addr = 0,	//base address of the batch normalization table, initialized in kpu_task_init
  .first_stride = 0		//0 when the image height does not exceed 255; heights up to 512 are supported via this flag
 },
 .image_channel_num.data = {
  .o_ch_num_coef = 0xf,	//channels computable with one parameter load, 16 channels (minus 1): 4K / per-channel kernel bytes
						//o_ch_num_coef = math.floor(weight_buffer_size / o_ch_weights_size_pad)
  .i_ch_num = 0x2,		//input channels, 3 for RGB (minus 1)
  .o_ch_num = 0xf		//output channels, 16 (minus 1)
 },
 .kernel_load_cfg.data = {
  .load_time = 0,		//kernel load count; under 72KB, loaded only once
  .para_size = 864,		//kernel parameter size 864 bytes: 864 = 3 (RGB) * 9 (3x3) * 2 (bytes) * 16
  .para_start_addr = 0,	//start address
  .load_coor = 1		//allow loading the convolution kernels
 }
},
   // ... end of layer 0 parameters; layers 1-15 follow ...
};
```
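Several of the fields above follow directly from the image geometry. A small helper that captures the 64-byte-unit rules from the comments (the function and struct names are ours, not the SDK's):

```
#include <stdint.h>

// Derive the KPU RAM layout fields from image geometry, following the
// 64-byte-unit rules in the comments above. Names are ours, not the SDK's.
typedef struct {
    uint32_t row_switch_addr;     // ceil(width / 64)
    uint32_t channel_switch_addr; // row_switch_addr * height
    uint32_t coef_group;          // 1, 2 or 4 depending on width
} kpu_layout_t;

static kpu_layout_t kpu_layout(uint32_t width, uint32_t height)
{
    kpu_layout_t l;
    l.row_switch_addr = (width + 63) / 64;
    l.channel_switch_addr = l.row_switch_addr * height;
    l.coef_group = (width > 32) ? 1 : (width > 16) ? 2 : 4;
    return l;
}
// kpu_layout(320, 240) -> {5, 0x4b0, 1}, matching layer 0 above.
```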

Some structure members above are left unfilled; they are populated in the KPU initialization function:

```
kpu_task_t* kpu_task_init(kpu_task_t* task){
 la[0].kernel_pool_type_cfg.data.bwsx_base_addr = (uint64_t)&bwsx_base_addr_0;	//initialize the batch normalization table
 la[0].kernel_calc_type_cfg.data.active_addr = (uint64_t)&active_addr_0;		//initialize the activation table
 la[0].kernel_load_cfg.data.para_start_addr = (uint64_t)&para_start_addr_0; 	//initialize the kernel parameter load address
 ……	//16 layers of parameters in total, computed layer by layer
 task->layers = la;
 task->layers_length = sizeof(la)/sizeof(la[0]);	//16 layers
 task->eight_bit_mode = 0;					//16-bit mode
 task->output_scale = 0.12349300010531557;	//output scale and bias
 task->output_bias = -13.528212547302246;
 return task;
}
```
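output_scale and output_bias map the raw fixed-point result back to real values. A minimal sketch of the dequantization we assume the demo applies when reading results (the exact post-processing lives in the demo's output handling):

```
// Assumed dequantization of a raw fixed-point KPU output value.
float kpu_dequant(int raw, float output_scale, float output_bias)
{
    return raw * output_scale + output_bias;
}
```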

As you can see, this is where the batch normalization table, the activation table, and the kernel parameter load address are initialized.

The activation function breakpoint table, 16 segments:
```
//y=(uint8_t)((((uint64_t)(x - x_start) * y_mul) >> shift) + bias);
kpu_activate_table_t active_addr_0 __attribute__((aligned(128))) = {
.activate_para = { //shift_number 8bit, y_mul 16bit, x_start 36bit, 8+16+36=60, pack into 64bit reg
{.data = {.shift_number=0, .y_mul=0, .x_start=0x800000000 }},
{.data = {.shift_number=39, .y_mul=29167, .x_start=0xfe1a77234 }},
{.data = {.shift_number=39, .y_mul=29167, .x_start=0xff4c0c897 }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0xfffffafbb }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0xc90319 }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0x2b1f223 }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0x49ae12d }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0x683d037 }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0x86cbf41 }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0xa55ae4b }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0xc3e9d54 }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0xe278c5e }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0x10107b68 }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0x11f96a72 }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0x13e2597c }},
{.data = {.shift_number=35, .y_mul=18229, .x_start=0x15cb4886 }}
},
.activate_para_bias0.data = { //bias 8bit, 8 biases packed into one 64-bit reg
.result_bias = {0,0,17,27,34,51,68,85}
},
.activate_para_bias1.data = {
.result_bias = {102,119,136,153,170,187,204,221}
}
};
```

The batch normalization table, 16 channels:
```
//y = (x*norm_mul)>>norm_shift + norm_add
//the generator hard-codes norm_shift to 15
kpu_batchnorm_argument_t bwsx_base_addr_0[] __attribute__((aligned(128))) = {
{.batchnorm.data = {.norm_mul = 0x4c407, .norm_add = 0x23523f0, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0x79774, .norm_add = 0x493a3e, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0x4bd72, .norm_add = 0xf58bae, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0x10a7ae, .norm_add = 0x99cf06, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0xe1ea4, .norm_add = 0x289634, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0x150a0, .norm_add = 0x2428afc, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0xa72e4, .norm_add = 0xffd850ff, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0x7b54b, .norm_add = 0x71a3b5, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0x1cb84b, .norm_add = 0x13fef34, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0x1b8a86, .norm_add = 0x342b07, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0x5dd03, .norm_add = 0x965b43, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0xb2607, .norm_add = 0x259e2c0, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0xa1abb, .norm_add = 0x1b68398, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0x25a89, .norm_add = 0x202e81c, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0x54d31, .norm_add = 0x61c1e20, .norm_shift = 15}},
{.batchnorm.data = {.norm_mul = 0x62b56, .norm_add = 0x6cd3fc, .norm_shift = 15}}
};
```
The convolution kernel parameter table:

```
uint16_t para_start_addr_0[] __attribute__((aligned(128))) = {
0x51d4, 0x560f, 0x4496, 0x555b, 0x5119, 0x5a03, 0x566f, 0x53c6, 0x498f, 0xb5ef, 0xbf72, 0xa7ab, 0x9d7e, 0x9035, 0xa15d, 0x8e32, 0x9507, 0x85d2, 0x70b1, 0x806f, 0x79c0, 0x8b4d, 0x98fe, 0x95ee, 0x9c96, 0x9bfc, 0x9f36, 0xdb30, 0x33ef, 0x6032, 0xebe6, 0x39d3, 0x633b, 0xd744, 0x4194, 0x6707, 0xcb4e, 0x34ba, 0x7687, 0xdfb0, 0x30bb, 0x7927, 0xb97d, 0x40d3, 0x7fe4, 0xb72b, 0x523d, 0x7104, 0xc994, 0x50be, 0x70e3, 0xb16a, 0x58dd, 0x6914, 0x8afb, 0x7f23, 0x7e6f, 0x7fdc, 0x4bf7, 0x7835, 0x80bf, 0x7dc3, 0x7ba0, 0x70db,
0x774a, 0x7f8f, 0x791c, 0x5f55, 0x82b8, 0x8066, 0x83f0, 0x820b, 0x825d, 0x8649, 0x7df9, 0x7a0e, 0x558a, 0x8ae2, 0x7f27, 0x7f64, 0x79a9, 0x615e, 0x6635, 0x65f2, 0x824f, 0x816a, 0x8680, 0x98e6, 0x9884, 0x933f, 0x680a, 0x6a0d, 0x6b9e, 0x9035, 0x87a4, 0x8779, 0x87f4, 0x8c33, 0x84bb, 0x6415, 0x7002, 0x6db9, 0x99cc, 0x8e8d, 0x9150, 0x8556, 0x8298, 0x82e6, 0x872e, 0x7ff5, 0x7c8a, 0x81e7, 0x4df1, 0xadaf, 0xb520, 0xc1b9, 0x0, 0x8093, 0x812b, 0x82d4, 0x7b23, 0x53f7, 0xb5e5, 0xa308, 0xc0fc, 0xd2e, 0x7f08, 0x8090,
0x7ac9, 0x7b27, 0x5049, 0xb1f0, 0xa683, 0xc544, 0x1633, 0x73b7, 0x6d6e, 0x7597, 0x7b5c, 0x71c0, 0x7b5d, 0x7561, 0x7153, 0x7ec1, 0x74af, 0x6acf, 0x7898, 0x7ee8, 0x73be, 0x7e1a, 0x856e, 0x7fe0, 0x8b5d, 0x78f3, 0x77b6, 0x7fd6, 0x77d0, 0x73c8, 0x8384, 0x70ab, 0x7638, 0x8448, 0x5e13, 0x41d6, 0x5742, 0xd6fd, 0xf185, 0xd8ff, 0x52ac, 0x3afd, 0x531b, 0x674c, 0x4db2, 0x5a31, 0xc677, 0xe222, 0xbd9b, 0x64ce, 0x494b, 0x5a67, 0x82e9, 0x721e, 0x7b5b, 0xae49, 0xbedb, 0xac77, 0x5161, 0x41bb, 0x56f4, 0xb5e4, 0xb0e6, 0x942f,
0x8681, 0x8714, 0x8395, 0x4160, 0x4763, 0x5e49, 0xbae2, 0xb877, 0x940d, 0x9473, 0x9238, 0x91d7, 0x3023, 0x33e8, 0x56ec, 0xa9d7, 0xa6de, 0x8f28, 0x94c0, 0x9261, 0x8ba5, 0x452b, 0x4c9c, 0x5ad7, 0x93df, 0x80e4, 0x685c, 0x887f, 0x85e8, 0x5ae7, 0x6a0a, 0x715e, 0xb7fb, 0x8c45, 0x7f99, 0x6077, 0x8768, 0x8bed, 0x6308, 0x70c2, 0x72cf, 0xb400, 0x7731, 0x7b42, 0x76eb, 0x7f80, 0x899d, 0x68f0, 0x7aec, 0x7948, 0xa766, 0x6cf7, 0x9a9c, 0x848c, 0x8f6a, 0x8f23, 0x64ce, 0x9288, 0x6d6e, 0x779b, 0x6d4b, 0x986d, 0x81ce, 0x9b3c,
0x8ee0, 0x64bb, 0x8cda, 0x5922, 0x6a11, 0x596b, 0x9142, 0x86e6, 0x9107, 0x95c2, 0x7b8a, 0x9113, 0x73df, 0x6fc0, 0x4482, 0x5aef, 0xddf4, 0x43b3, 0x39a5, 0xffff, 0x43db, 0x4dc9, 0xe663, 0x50eb, 0x5bea, 0xd0a1, 0x5395, 0x42ce, 0xeb37, 0x5f02, 0x54b9, 0xc84f, 0x4b78, 0x697c, 0xc693, 0x5686, 0x4e78, 0xdd55, 0x53c2, 0x6351, 0xc0fe, 0x8eb1, 0x817c, 0x7590, 0x7a66, 0x7168, 0x74f3, 0x7d86, 0x6f2d, 0x8b15, 0x7f21, 0x80a5, 0x6c26, 0x7561, 0x7661, 0x726d, 0x8272, 0x7d32, 0x87e9, 0x90a0, 0x85e5, 0x7229, 0x7ff5, 0x7c3c,
0x7095, 0x83f7, 0x7424, 0x7eac, 0x81b8, 0x7245, 0xa0b1, 0x777e, 0x73e2, 0x74b5, 0x7f83, 0x73c2, 0x68b1, 0x85b2, 0x715e, 0x957b, 0x83d2, 0x7c75, 0x71d2, 0x8525, 0x830d, 0x6fc2, 0x76f8, 0x7454, 0x8f1f, 0x7cbb, 0x7867, 0x714e, 0x82bb, 0x80af, 0x705a, 0x4ef2, 0x492d, 0x487b, 0x5ed4, 0x5c4a, 0x60f8, 0x9158, 0x8a70, 0x90a5, 0x6cdd, 0x7c1d, 0x78a6, 0x71fe, 0x6fae, 0x680d, 0x59e7, 0x4e69, 0x6926, 0xafcb, 0xbffc, 0xbaa5, 0xb21c, 0xbaa3, 0xa6f3, 0x98f3, 0x9715, 0x96ff, 0x823e, 0x80ce, 0x77d4, 0x80c3, 0x74d0, 0x6a80,
0x8556, 0x6202, 0x7250, 0x860a, 0x8417, 0x8168, 0x892b, 0x7612, 0x6c7b, 0x8c31, 0x6669, 0x7b0f, 0x7f76, 0x835f, 0x7188, 0x842f, 0x7e1c, 0x7227, 0x7ef1, 0x678d, 0x7b64, 0x4bbd, 0x37fa, 0x4cf3, 0xa1cf, 0x819b, 0x699b, 0xc2c3, 0xc53e, 0x94da, 0x5049, 0x354e, 0x553e, 0xa78b, 0x8ccc, 0x647e, 0xba65, 0xbd12, 0x8b34, 0x4b5b, 0x35b1, 0x4562, 0xa49e, 0x8aec, 0x703c, 0xbb96, 0xc214, 0xa3f5};
```

Generating the Three Key Parameter Sets

### conv_value
The convolution scaling parameters. We train networks in floating point, but the K210 only supports int16 fixed-point computation, so the values must first be converted. The basic method is to estimate the dynamic range and scale accordingly; see static void conv_float2u16(float* data, uint16_t* data_u16, int len) in the kpu_conv example.
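A minimal sketch of that conversion, assuming the simplest scheme (estimate the range, scale into 16 bits around the 0x8000 midpoint, which matches how the kernel table above clusters around 0x8000); the real conv_float2u16 in the example may differ in details:

```
#include <math.h>
#include <stdint.h>

// Quantize float weights to uint16 by estimating the dynamic range and
// scaling around the 0x8000 midpoint. A sketch, not the example's exact code.
static void conv_float2u16_sketch(const float *data, uint16_t *data_u16, int len)
{
    float max_abs = 0.0f;
    for (int i = 0; i < len; i++)          // estimate the dynamic range
        if (fabsf(data[i]) > max_abs) max_abs = fabsf(data[i]);
    if (max_abs == 0.0f) max_abs = 1.0f;   // avoid dividing by zero

    float scale = 32767.0f / max_abs;      // map [-max, max] to [-32767, 32767]
    for (int i = 0; i < len; i++)          // shift to unsigned, 0x8000 as zero
        data_u16[i] = (uint16_t)(data[i] * scale + 32768.0f);
}
```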
### kpu_batchnorm_argument_t
The batch normalization parameters come from the Model Compiler's K210_layer.py: bn_mean, bn_var, bn_gamma, bn_beta, bn_epsilon = bn_mean_var_gamma_beta_epsilon
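Folding the standard BN formula y = gamma*(x-mean)/sqrt(var+eps) + beta into the hardware's y = (x*norm_mul)>>norm_shift + norm_add gives the following sketch, with norm_shift hard-coded to 15 as in the generated table; the rounding details are our assumption:

```
#include <math.h>
#include <stdint.h>

// Fold BN parameters into the KPU's fixed-point form:
//   y = (x * norm_mul) >> 15 + norm_add
// norm_mul carries gamma/sqrt(var+eps) in Q15; norm_add carries the offset.
static void bn_fold_sketch(float mean, float var, float gamma, float beta,
                           float eps, int32_t *norm_mul, int32_t *norm_add)
{
    float scale = gamma / sqrtf(var + eps);
    *norm_mul = (int32_t)roundf(scale * 32768.0f);     // Q15 fixed point
    *norm_add = (int32_t)roundf(beta - mean * scale);  // applied after the shift
}
```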
### kpu_activate_table_t
The activation function is approximated by a 16-segment piecewise-linear fit; generate the table to match whatever activation your network needs.
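For instance, to fit an arbitrary activation f(x) you can pick 16 breakpoints and solve each segment for the hardware form y = (((x - x_start) * y_mul) >> shift) + bias shown in the table above. A rough sketch under those assumptions (fixed shift, no overflow or fixed-point input handling):

```
#include <math.h>
#include <stdint.h>

// Fit one linear segment [x0, x1] of activation f to the hardware form
//   y = (((x - x_start) * y_mul) >> shift) + bias
// Illustrative sketch only: fixed shift, no range/overflow handling.
static void fit_segment_sketch(float (*f)(float), float x0, float x1,
                               int shift, int64_t *x_start, int32_t *y_mul,
                               uint8_t *bias)
{
    float slope = (f(x1) - f(x0)) / (x1 - x0);
    *x_start = (int64_t)x0;                          // 36-bit field in hardware
    *y_mul   = (int32_t)roundf(slope * (float)(1ULL << shift)); // 16-bit in hardware
    *bias    = (uint8_t)roundf(f(x0));               // 8-bit output offset
}
```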

Hand-Writing Your First Network Layer

Having read through the code above, let's hand-write the simplest possible one-layer network: a demo that runs a 3×3 convolution over a W×H RGB image. The code is published at https://github.com/sipeed/LicheeDan_K210_examples/src/kpu_conv

Pay attention to how the kernel parameters are ordered and how the output is produced for multi-channel input (a reference-model sketch follows this list):

Kernels 0~2 convolve with input channels 0~2 respectively, and the results are summed to produce output channel 0.

Kernels 3~5 convolve with input channels 0~2 respectively, and the results are summed to produce output channel 1.

And so on. Every input channel image must be padded to a 64-byte row width.
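A software reference model of this layout and ordering, assuming the 64-byte row stride described above (all names here are ours):

```
#include <stdint.h>
#include <string.h>

#define ALIGN64(x) (((x) + 63) & ~63)  // 64-byte row stride required by the KPU

// Copy a tightly-packed channel into KPU layout with 64-byte padded rows.
static void pad_channel(const uint8_t *src, uint8_t *dst, int w, int h)
{
    int stride = ALIGN64(w);
    for (int y = 0; y < h; y++) {
        memcpy(dst + y * stride, src + y * w, w);
        memset(dst + y * stride + w, 0, stride - w);  // zero the padding
    }
}

// Reference 3x3 convolution: kernel (3*oc + ic) convolves input channel ic,
// and the three per-channel results are summed into output channel oc.
static int32_t conv3x3_at(const uint8_t *in[3], const int16_t k[][9],
                          int oc, int x, int y, int stride)
{
    int32_t acc = 0;
    for (int ic = 0; ic < 3; ic++)
        for (int ky = 0; ky < 3; ky++)
            for (int kx = 0; kx < 3; kx++)
                acc += in[ic][(y + ky) * stride + (x + kx)]
                       * k[3 * oc + ic][3 * ky + kx];
    return acc;
}
```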

Also note that the input image width must be >= 4 and the input image height must be >= 2.

Finally, with the CPU at 400 MHz and the KPU at 400 MHz, we measured the time to compute one conv layer: for an RGB QVGA image, 0.2M ticks on the CPU side and 0.18M ticks on the KPU side.

That is, one network layer computes in under 1 ms, for a rate of up to 1000 fps.

Discussion

https://bbs.sipeed.com/t/topic/502
