Deep Learning Optimization in Practice on Mobile. Huang Wenbo (Guigu), Meili United Group
Company overview: Meili United Group is a fashion consumer platform dedicated to serving women, founded on June 15, 2016. Its products and services include Mogujie, Meilishuo, uni, 锐鲨, and MOGU STATION, covering every area of fashion consumption and meeting the daily fashion-content and shopping needs of female users across age groups, spending power, and aesthetic tastes.
Key figures: 120,000+ fashion influencers; 10,000,000+ daily active users; 95%+ female users; 200,000,000+ registered users; 20,000,000,000+ transaction volume; 95%+ mobile users.
Agenda: Background and status quo; Model compression and design; Mobile engineering practice; Summary.
01 Background and Status Quo
Deep learning: from the cloud to edge computing
Why does Mogujie invest in deep learning optimization? On the server: cut training and inference time; save GPU resources and power. On mobile: real-time response; run locally to reduce server load; protect user privacy.
CNN Basics
Challenge. Deep learning: networks keep getting deeper and more accurate, but models keep getting bigger, demanding more storage, more computation, and more energy. Mobile devices: limited memory, limited compute, limited power budget.
02 模型压缩与设计
Model Compression Pruning Quantization Huffman Encoding
Pruning (weight level): prune individual weights to obtain sparse connections. Han et al., Learning both weights and connections for efficient neural networks, NIPS 2015
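Magnitude-based weight pruning can be sketched in a few lines. This is an illustrative NumPy sketch of the idea only; the helper name `prune_weights` and the exact threshold rule are ours, not the paper's code:

```python
import numpy as np

def prune_weights(w, sparsity=0.5):
    """Zero out the smallest-magnitude weights so that a `sparsity`
    fraction of entries become zero; the mask is kept for retraining."""
    flat = np.abs(w).ravel()
    k = int(len(flat) * sparsity)
    if k == 0:
        return w.copy(), np.ones(w.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask, mask

# Example: a 4x4 weight matrix pruned to 50% sparsity
w = np.arange(1.0, 17.0).reshape(4, 4)
pruned, mask = prune_weights(w, sparsity=0.5)
```

In the real pipeline, pruning and retraining alternate so the surviving weights can compensate for the removed ones.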
Pruning (channel level): prune and retrain iteratively. Li et al., Pruning filters for efficient ConvNets, ICLR 2017
Pruning (channel level): prune with L1 regularization. Liu et al., Learning efficient convolutional networks through network slimming, ICCV 2017
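In the network-slimming style of channel pruning, an L1 penalty during training drives many BatchNorm scale factors toward zero, and channels are then ranked by their magnitude. A minimal sketch of the selection step (the function name and `keep_ratio` parameter are ours for illustration):

```python
import numpy as np

def select_channels(gamma, keep_ratio=0.5):
    """Rank channels by the magnitude of their BN scale factor
    (trained with an L1 penalty) and keep the top fraction."""
    n_keep = max(1, int(len(gamma) * keep_ratio))
    order = np.argsort(-np.abs(gamma))
    return np.sort(order[:n_keep])  # indices of surviving channels

# Example: 8 channels; L1 training drove half the gammas near zero
gamma = np.array([0.9, 0.01, 0.7, 0.02, 0.005, 0.8, 0.03, 0.6])
kept = select_channels(gamma, keep_ratio=0.5)
```

The surviving indices determine which filters (and the matching input channels of the next layer) are copied into the slimmer network before fine-tuning.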
Quantization. Han et al., Deep Compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, ICLR 2016
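Deep Compression's trained quantization clusters the weights and stores each weight as a small index into a shared codebook. A minimal 1-D k-means sketch of the idea (linear min-to-max centroid initialization is one of the schemes the paper discusses; the function name is ours):

```python
import numpy as np

def kmeans_quantize(w, n_clusters=4, n_iter=20):
    """Cluster weights with 1-D k-means; each weight is replaced by a
    small integer index into a shared codebook of centroids."""
    flat = w.ravel().astype(np.float64)
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iter):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            if np.any(idx == c):            # skip empty clusters
                centroids[c] = flat[idx == c].mean()
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return idx.reshape(w.shape), centroids  # indices + codebook

w = np.array([[0.1, 0.11, 0.9], [0.89, -0.5, -0.52]])
indices, codebook = kmeans_quantize(w, n_clusters=3)
dequant = codebook[indices]  # reconstructed (quantized) weights
```

With 2-bit indices replacing 32-bit floats, the weight storage shrinks by roughly 16x before Huffman coding even starts; in the real method the centroids are also fine-tuned by gradient descent.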
Huffman Encoding. Han et al., Deep Compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, ICLR 2016
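After quantization the cluster indices have a skewed distribution, so Huffman coding gives frequent indices shorter bitstrings. A self-contained sketch using only the standard library (illustrative, not the paper's implementation):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code table {symbol: bitstring} from a sequence."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # heap entries: (frequency, tiebreak, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Skewed index stream: frequent symbols get shorter codes
stream = [0] * 8 + [1] * 4 + [2] * 2 + [3]
table = huffman_code(stream)
```

Here the encoded stream costs 25 bits instead of the 30 bits a fixed 2-bit code would need, and the saving grows with skew.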
Summary of model compression. Pruning yields fewer channels: channel-level pruning with iterative retraining, or channel-level pruning with L1 regularization.
Smaller CNN architecture design: SqueezeNet, MobileNet, ShuffleNet
SqueezeNet Fire module: Input (64 ch) → 1x1 conv squeeze (16 ch) → 1x1 conv expand (64 ch) and 3x3 conv expand (64 ch) in parallel → Concat/Eltwise → Output (128 ch). Iandola et al., SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv 2016
MobileNets. Howard et al., MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv 2017
ShuffleNet. Zhang et al., ShuffleNet: An extremely efficient convolutional neural network for mobile devices, arXiv 2017
Our practice: overall performance of pruning ResNet-50 on ImageNet

Model        Strategy                 Top-1   Top-5    Model Size
Original     -                        75%     92.27%   98M
Pruned-50    Pruning                  72.5%   90.9%    49M
Pruned-Q-50  Pruning + Quantization   72.4%   90.6%    15M
Our practice: performance of pruning ResNet-34 on our dataset (2,319 categories, 12M samples)

Model      Top-1    Top-5   Inference Time   Model Size
Original   48.92%   82.2%   96ms             86M
Pruned-64  48.27%   81.5%   45ms             31M
Our practice: ParseNet with 18 classes (backbone: MobileNet)

Model      mIoU   Pixel-Level Accuracy   Model Size
ParseNet   56%    93.5%                  13M
03 Mobile Engineering Practice
Division of work between server and mobile: training on the server, inference on the device.
DL frameworks. Training: Caffe, Caffe2, MXNet, TensorFlow, Torch. Mobile inference: NCNN, MDL, CoreML, TensorFlow Lite.
From training to inference: fold the Convolution + BN pair into a single Convolution, leaving Convolution + ReLU.
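Folding BatchNorm into the preceding convolution is a pure weight transform: since BN applies `gamma * (y - mean) / sqrt(var + eps) + beta` per output channel, the scale can be absorbed into the conv weights and the shift into the bias. A NumPy sketch assuming weights in (out, in, kh, kw) layout:

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm into the preceding convolution's weights/bias."""
    scale = gamma / np.sqrt(var + eps)          # per output channel
    w_folded = w * scale.reshape(-1, 1, 1, 1)   # weights: (out, in, kh, kw)
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded

# Tiny check on a 1x1 conv with 2 output channels, 3 input channels
w = np.ones((2, 3, 1, 1)); b = np.zeros(2)
gamma = np.array([2.0, 0.5]); beta = np.array([0.1, -0.1])
mean = np.array([1.0, 0.0]); var = np.array([4.0, 1.0])
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
```

After folding, the BN layer disappears from the inference graph, saving one memory pass per conv block.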
Optimizing convolution: direct convolution vs. im2col-based convolution (a 5x5 input with a 3x3 kernel unrolls into a 25x9 patch matrix multiplied by a 9x1 kernel vector).
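The im2col trick can be sketched for the single-channel 5x5 / 3x3 case: each receptive field becomes one row, so the whole convolution is a single matrix multiply (a minimal NumPy sketch, not a production implementation):

```python
import numpy as np

def im2col(x, kh, kw, pad=1):
    """Unroll each kh x kw patch of a single-channel image into a row,
    so convolution becomes one matrix multiply (patches @ kernel)."""
    xp = np.pad(x, pad)
    oh = xp.shape[0] - kh + 1
    ow = xp.shape[1] - kw + 1
    rows = [xp[i:i + kh, j:j + kw].ravel()
            for i in range(oh) for j in range(ow)]
    return np.stack(rows)  # shape: (oh * ow, kh * kw)

# The slide's example: 5x5 input, 3x3 kernel, "same" padding
x = np.arange(25, dtype=np.float64).reshape(5, 5)
k = np.ones((3, 3)) / 9.0                  # a 3x3 box filter
patches = im2col(x, 3, 3, pad=1)           # 25 x 9 patch matrix
y = (patches @ k.ravel()).reshape(5, 5)    # times 9x1 -> 25 outputs
```

The payoff is that the multiply can go through a highly tuned GEMM routine; the cost is the duplicated memory for overlapping patches, which is exactly what MEC (next slide) attacks.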
Optimizing convolution. Cho et al., MEC: Memory-efficient convolution for deep neural network, ICML 2017
Fixed-point quantization of floating-point ops: Input (float) → Quantize (with min/max) → 8-bit → QuantizedRelu (with min/max) → 8-bit → Dequantize (with min/max) → Output (float).
How else can convolution evolve? No optimization algorithm, however clever, beats a direct hardware implementation. Generic convolution vs. specialized convolution.
Android deep learning frameworks: NCNN vs. MDL (MobileNet on Huawei P9)

Framework   Single-thread   Four-thread   Memory
NCNN        370ms           200ms         25M
MDL         360ms           190ms         30M

TensorFlow Lite: quantized MobileNet 85ms, float MobileNet 400ms.
DL on iOS: CoreML has limited extensibility and is not well suited to deploying new algorithms; it also requires iOS 11+.
MPSCNN: makes full use of the GPU without contending for the CPU; developing new layers with Metal is convenient. Tips: use half-precision arithmetic; weights are stored in NHWC format.
MPSCNN MPSImage layout: a 9-channel CNN image with width 3 and height 2 is stored as an array of texture slices (Slice0, Slice1, Slice2), four channels packed into each RGBA slice.
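The channel-to-slice mapping behind this layout is simple arithmetic: channels are packed four at a time into RGBA texture slices, and the last slice is padded. A tiny illustrative sketch (helper names are ours, not the MPS API):

```python
def mpsimage_location(channel, n_channels):
    """Return (slice_index, lane) for a channel in an MPSImage-style
    layout, where channels are packed four per RGBA texture slice."""
    assert 0 <= channel < n_channels
    return channel // 4, channel % 4

def num_slices(n_channels):
    """E.g. a 9-channel image needs ceil(9 / 4) = 3 slices."""
    return (n_channels + 3) // 4
```

This is why custom Metal kernels (next slide) index the output with both a 2-D pixel position and a slice index.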
Metal Performance Shader: a custom element-wise sum kernel

kernel void eltwisesum_array(
    texture2d_array<half, access::sample> intexture1 [[texture(0)]],
    texture2d_array<half, access::sample> intexture2 [[texture(1)]],
    texture2d_array<half, access::write> outtexture [[texture(2)]],
    ushort3 gid [[thread_position_in_grid]])
{
    // Skip threads that fall outside the output texture
    if (gid.x >= outtexture.get_width() ||
        gid.y >= outtexture.get_height() ||
        gid.z >= outtexture.get_array_size())
        return;
    constexpr sampler s(coord::pixel, filter::nearest, address::clamp_to_zero);
    const ushort2 pos = gid.xy;
    const ushort slice = gid.z;
    half4 in[2];
    in[0] = intexture1.sample(s, float2(pos.x, pos.y), slice);
    in[1] = intexture2.sample(s, float2(pos.x, pos.y), slice);
    // Accumulate in float, then write back as half
    float4 out = float4(in[0]) + float4(in[1]);
    outtexture.write(half4(out), gid.xy, gid.z);
}
MPSCNN vs. NCNN on iPhone 6s

Framework   Time
NCNN        110ms
MPSCNN      45ms
How to create a new framework: optimize the inference network structure; multi-threading; GPU acceleration; SIMD instruction acceleration; memory-layout optimization (NCHW → NHWC); fixed-point quantization of floating-point ops.
Mogu Deep Learning Toolkit: built for professionals; highly flexible; high cohesion and low coupling; easy to use.
Mogu Deep Learning Toolkit: supports classification, detection, and segmentation; layer lifecycle: create layer → init layer → inference.
Mogu DL Toolkit-Example

class MobileNet {
public:
    Input input;
    Convolution fc7;
    MobileNet();
    int Init(const char* modelpath);
    int infer(mat& input, mat& output);
private:
    Convolution conv1_s2;
    ReLU relu1;
    ConvolutionDepthWise conv2_1_dw;
    ReLU relu2_1_dw;
    Convolution conv2_1_s1;
    ReLU relu2_1_s1;
    ConvolutionDepthWise conv2_2_dw;
    // ...
};
Demo
Demo
Summary: two families of model compression; mobile optimization practice; the Mogu DL Toolkit; applying deep learning optimization in Mogujie's business.
Acknowledgments: thanks to every member of the deep learning optimization group in Mogujie's image algorithms team for their joint effort!
Thanks!
Please follow the Mogujie Tech Blog (蘑菇街技术博客) and Meili United Data Technology (美丽联合数据技术) official WeChat accounts.