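Teddy Cup National Data Mining Challenge www.tipdm.org
The 4th Teddy Cup National Data Mining Challenge: Outstanding Work
Title: A Printed-Text OCR System Based on Deep Learning and Language Models
Award: Grand Prize, with the Enterprise-Sponsored Award
Institution: South China Normal University
Team members: Su Jianlin, Zeng Yuting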
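Abstract (Chinese version; the Chinese text is lost). Surviving items: the numbers 2016 5 15 (apparently a date); a CNN recognizer trained on 1.4 million samples, with 99.7% training accuracy, 92.1% test accuracy, and about 90% accuracy on samples with 15% noise; and the Viterbi algorithm.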
Abstract

In this article we design a series of algorithms to extract features and locate text, train a character recognition system with a convolutional neural network, and then use a language model to improve the recognition results. Together these steps form a complete OCR (Optical Character Recognition) system.

For feature extraction, we propose an approach that works better than the traditional one based on boundary detection and dilation-erosion. Starting from a few basic assumptions, we obtain clean text features through grey-level clustering, layer decomposition, noise reduction, and related steps. These features serve both text positioning (step II) and text recognition (step III).

For text positioning, we merge the feature patches by neighbor searching to obtain the features of single lines of text, and then use a statistical method to cut each line into single characters. Our results show that this works well even when Chinese and English are mixed in one line.

For optical recognition, we build a single-character model with a convolutional neural network and train it on 1.4 million samples that we generated ourselves. The resulting model reaches 99.7% training accuracy and 92.1% test accuracy, and still about 90% accuracy on samples with 15% noise.

Finally, we use a language model to improve the result. We estimate the transition probability matrix from hundreds of thousands of WeChat articles and use the Viterbi algorithm to find the optimal result dynamically.

Combining the above, we obtain a complete OCR system, and the results show that it works well for printed-text recognition.

Keywords: OCR, feature extraction, text positioning, CNN, deep learning, language model
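1
[Section text lost.] The surviving terms are Optical Character Recognition (OCR) and two existing OCR products, ABBYY FineReader and Google's Tesseract OCR.

2
2.1
[A numbered list of six items; item text lost.]

2.2
Figure 1: The system pipeline, in four stages: feature extraction (clustering and decomposition; denoising and pooling), text positioning (fragment integration; single-character cutting), optical recognition (sample construction; model training; testing and validation), and language model (transition probabilities; dynamic programming). The grouping of sub-steps under stages is inferred from the order of the surviving labels.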
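2.3
The system runs on CentOS 7 with Python 2.7, using Numpy, SciPy, Pandas, Pillow, Keras, and Theano. Section 5.4 is cross-referenced here. [Remaining text lost.]

3
[Section text lost; it discusses OCR and cites [1].]
Figure 2: [caption lost]
Figure 3: [caption lost]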
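3.1
An image of size $m \times n$ is represented as a matrix $M$ with $m$ rows and $n$ columns. An RGB image has 3 channels (Figure 4(a)); it is converted to grayscale by

\[ Y = 0.299R + 0.587G + 0.114B \tag{1} \]

A power transform is then applied to every element $x$ of $M$,

\[ x \mapsto x^{r} \tag{2} \]

with $r = 2$, and the result is mapped back to the range [0, 255]:

\[ x \mapsto \frac{x - M_{\min}}{M_{\max} - M_{\min}} \times 255 \tag{3} \]

where $M_{\max}$ and $M_{\min}$ are the largest and smallest elements of $M$. The result is shown in Figure 4(b).

Figure 4: (a) [caption lost] (b) [caption lost]

3.2
[Text lost. It gives two numbered points; the first mentions the values 40, 254 and 255, and the text refers to the range [0, 255].]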
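The preprocessing of Section 3.1 can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes equation (2) is the power transform $x \mapsto x^r$ with $r = 2$ and equation (3) is a min-max rescale to [0, 255], which is how the surviving fragments read. The function names and the nested-list image format are hypothetical, and the handling of a completely flat image is our own convention.

```python
def to_gray(rgb_rows):
    """Equation (1): weighted sum of the R, G and B channels."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_rows]


def power_and_rescale(gray_rows, r=2):
    """Equations (2) and (3), as read here: x -> x**r, then rescale to [0, 255]."""
    powered = [[x ** r for x in row] for row in gray_rows]
    lo = min(min(row) for row in powered)
    hi = max(max(row) for row in powered)
    if hi == lo:
        # Flat image: nothing to stretch (our own convention, not from the paper).
        return [[0.0 for _ in row] for row in powered]
    scale = 255.0 / (hi - lo)
    return [[(x - lo) * scale for x in row] for row in powered]


# A tiny 2x2 RGB image given as rows of (R, G, B) tuples.
image = [[(0, 0, 0), (255, 255, 255)],
         [(255, 0, 0), (0, 0, 255)]]
gray = to_gray(image)
stretched = power_and_rescale(gray)
```

Squaring before rescaling pushes mid-grey values towards the dark end, which widens the gap between dark text and a light background.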
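[Text lost; it names KMeans and MeanShift.]

3.2.1 Kernel density estimation
Figure 5: [caption lost; a plot with horizontal axis from 0 to 250 and vertical axis from 0.00 to 0.07]

Kernel density estimation, due to Rosenblatt and Parzen [2], estimates a density from samples $x_1, \dots, x_n$ as

\[ \hat{p}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \tag{4} \]

where $K(x)$ is the kernel function and $h$ is the bandwidth. With $h = 1$ and

\[ K(x) = \begin{cases} 1, & x = 0 \\ 0, & x \neq 0 \end{cases} \tag{5} \]

equation (4) reduces to the fraction of samples equal to $x$. Replacing (5) with a smooth kernel spreads each sample $x_i$ over its neighbourhood, by an amount controlled by $h$. We use the Gaussian kernel

\[ K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^{2}/2} \tag{6} \]

The surviving text on the choice of $h$ mentions the Scott rule and the value 0.2. The result is shown in Figure 6.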
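A minimal sketch of equations (4) and (6) in plain Python follows. The bandwidth helper is an assumption: it reads the fragment "scott ... 0.2" as 0.2 times the Scott bandwidth, and it takes the one-dimensional Scott bandwidth to be $\sigma n^{-1/5}$, the convention used by `scipy.stats.gaussian_kde`. All function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Equation (6): the standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(samples, h):
    """Equation (4): returns the estimated density as a function of x."""
    n = len(samples)

    def p_hat(x):
        return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)

    return p_hat


def scott_bandwidth(samples, factor=0.2):
    """Assumed reading of the text: factor * sigma * n**(-1/5)."""
    n = len(samples)
    mean = sum(samples) / float(n)
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return factor * math.sqrt(var) * n ** (-1.0 / 5.0)


# Grey values of a toy image: a dark cluster and a light cluster.
grey_values = [10, 10, 12, 11, 200, 202, 201, 199]
density = kde(grey_values, scott_bandwidth(grey_values))
curve = [density(x) for x in range(256)]  # one value per grey level
```

Evaluating the density on the 256 grey levels gives a smooth curve of the kind Figures 5 and 6 appear to plot; its peaks mark the dominant grey levels.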
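3.2.2
Figure 6: [caption lost; a plot with horizontal axis from 0 to 250 and vertical axis from 0.00 to 0.07]

The density in Figure 6 has extrema at x = 10, 57, 97, 123, 154 and at x = 25, 71, 121, 142; the two lists interleave, as alternating local maxima and local minima would. [Text lost.] The image is decomposed into 5 layers, each a binary image with values 1 and 0 (Figure 7).

Figure 7: The 5 layers. (a) Layer 1 (b) Layer 2 (c) Layer 3 (d) Layer 4 (e) Layer 5

[Text lost; it refers back to Section 2.1.]

3.3
[Text lost; it refers back to Section 2.1.]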
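One way to turn the density curve into layers is sketched below. It assumes that the second list of values in Section 3.2.2 (25, 71, 121, 142) holds the local minima and that each layer collects the pixels whose grey value falls between two consecutive minima; neither is confirmed by the surviving text. The function names are hypothetical, and sending a pixel that sits exactly on a minimum to the upper layer is our own convention.

```python
from bisect import bisect_right


def local_extrema(values):
    """Indices of local maxima and minima of a sampled curve."""
    maxima, minima = [], []
    for i in range(1, len(values) - 1):
        if values[i - 1] < values[i] >= values[i + 1]:
            maxima.append(i)
        elif values[i - 1] > values[i] <= values[i + 1]:
            minima.append(i)
    return maxima, minima


def split_layers(gray_rows, minima):
    """One binary layer per interval between consecutive minima."""
    layers = [[[0] * len(row) for row in gray_rows]
              for _ in range(len(minima) + 1)]
    for i, row in enumerate(gray_rows):
        for j, x in enumerate(row):
            # bisect_right gives the index of the interval that contains x.
            layers[bisect_right(minima, x)][i][j] = 1
    return layers


# Four minima give five layers, matching the count in the text.
layers = split_layers([[10, 57], [123, 154]], [25, 71, 121, 142])
```

Every pixel lands in exactly one layer, so the layers sum to an all-ones image.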
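3.3.1
[Text lost; it refers to 8-neighbour adjacency and to Figures 8 and 9.]

Figure 8: [caption lost; an illustration of the 8 neighbours of a pixel, with values 1 and 0]

3.3.2
Figure 9: [caption lost]

[Text lost; it compares Figure 7(e) with Figure 7(d).] Equation (7) defines a ratio whose terms are lost; the text associates it with the interval [0.1, 0.9]. The 5 layers after this step are shown in Figure 10.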
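The 8-neighbour search that Section 3.3.1 refers to can be illustrated with a breadth-first labelling of connected regions in a binary layer. This is a generic sketch under the assumption that the section groups pixels of value 1 that touch horizontally, vertically or diagonally; the function name and data layout are hypothetical.

```python
from collections import deque

# Offsets of the 8 neighbours of a pixel.
NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1),
                (0, -1),           (0, 1),
                (1, -1),  (1, 0),  (1, 1)]


def connected_regions(binary_rows):
    """Return a list of regions, each a list of (row, col) pixels of value 1."""
    height = len(binary_rows)
    width = len(binary_rows[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions = []
    for i in range(height):
        for j in range(width):
            if binary_rows[i][j] != 1 or seen[i][j]:
                continue
            seen[i][j] = True
            queue = deque([(i, j)])
            region = []
            while queue:
                a, b = queue.popleft()
                region.append((a, b))
                for da, db in NEIGHBOURS_8:
                    na, nb = a + da, b + db
                    if (0 <= na < height and 0 <= nb < width
                            and binary_rows[na][nb] == 1 and not seen[na][nb]):
                        seen[na][nb] = True
                        queue.append((na, nb))
            regions.append(region)
    return regions


layer = [[1, 0, 0],
         [0, 1, 0],
         [0, 0, 0],
         [1, 0, 0]]
regions = connected_regions(layer)
```

With 8-connectivity the two diagonal pixels at the top form one region; with 4-connectivity they would be two.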
Figure 10: [caption lost apart from the interval [0.1, 0.9]] (a) Layer 1 (b) Layer 2 (c) Layer 3 (d) Layer 4 (e) Layer 5
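3.3.3
[Text lost; it discusses the 5 layers and gives two numbered points. The result is shown in Figure 11.]

Figure 11: [caption lost]

3.3.4
[Text lost.] Equation (8) defines a quantity whose terms are lost; the text associates it with the value 16.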
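[Text lost; it again mentions the value 16.] Equation (9) is a condition of the form [expression lost] < 16, with outcomes 1 and 0. [Text lost; it mentions Google Tesseract OCR.]

3.3.5
[Text lost; it mentions the number 9 and Figure 12.]

Figure 12: [caption lost]

[Text lost; it gives the value 0.75, approximately $\pi/4$.]

4
[Text lost; it gives two numbered points.]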
4.1
[Text lost; it refers to Figure 13.]

Figure 13: [caption lost]

4.1.1
[Text lost.]

4.1.2
Figure 14: [caption lost; two rectangles S and S′ with labelled corner points]
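As in Figure 14, let a rectangle $S$ have opposite corners $(x, y)$ and $(z, w)$, so that its centre is $\left(\frac{x+z}{2}, \frac{y+w}{2}\right)$. For two rectangles $S$ and $S'$, the offset between their centres is

\[ (x_c, y_c) = \left( \left| \frac{x+z}{2} - \frac{x'+z'}{2} \right|,\ \left| \frac{y+w}{2} - \frac{y'+w'}{2} \right| \right) \tag{10} \]

and the distance between the centres is $\sqrt{x_c^2 + y_c^2}$. This ignores the size of the rectangles, so $(x_c, y_c)$ is corrected by subtracting the half-widths and half-heights of both rectangles:

\[ (x'_c, y'_c) = \left( x_c - \frac{z - x}{2} - \frac{z' - x'}{2},\ y_c - \frac{w - y}{2} - \frac{w' - y'}{2} \right) \tag{11} \]

\[ d(S, S') = \sqrt{[\max(x'_c, 0)]^2 + [\max(y'_c, 0)]^2} \tag{12} \]

The absolute values in (10) make the result independent of the order of $S$ and $S'$; the extracted text is ambiguous about which corner letters belong to which axis, and the reading above is the one consistent with the centre formula.

4.1.3
[Text lost; the result is shown in Figure 15.]

Figure 15: [caption lost]

4.2
[Text lost.]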
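The distance of equations (10) to (12) is short enough to check by hand. The sketch below assumes each rectangle is given as a tuple of opposite corners `(x, y, z, w)` with `x <= z` and `y <= w`; the function name and the tuple layout are hypothetical.

```python
import math


def rect_distance(s, t):
    """Gap between two axis-aligned rectangles, following equations (10)-(12)."""
    x, y, z, w = s
    x2, y2, z2, w2 = t
    # Equation (10): absolute offset between the two centres.
    xc = abs((x + z) / 2.0 - (x2 + z2) / 2.0)
    yc = abs((y + w) / 2.0 - (y2 + w2) / 2.0)
    # Equation (11): subtract both half-widths and both half-heights.
    gap_x = xc - (z - x) / 2.0 - (z2 - x2) / 2.0
    gap_y = yc - (w - y) / 2.0 - (w2 - y2) / 2.0
    # Equation (12): negative gaps mean overlap along that axis and count as 0.
    return math.hypot(max(gap_x, 0.0), max(gap_y, 0.0))


side_by_side = rect_distance((0, 0, 2, 2), (5, 0, 7, 2))   # horizontal gap only
diagonal = rect_distance((0, 0, 1, 1), (4, 5, 6, 7))       # gaps on both axes
overlapping = rect_distance((0, 0, 4, 4), (2, 2, 6, 6))    # rectangles intersect
```

Unlike the centre-to-centre distance, this measure is 0 for touching or overlapping rectangles regardless of their size, which is what a merge criterion for neighbouring patches needs.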
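4.2.1
[Text lost; it refers to Figure 16(a) and Figure 16(b).]

Figure 16: (a) [caption lost] (b) [caption lost]

4.2.2
[Text lost; it mentions the values 15 and 0.]

4.2.3
[Text lost; it gives three numbered rules that involve the thresholds 1.2 and 60%, and cross-references Section 7.1.]

5
5.1
[Text lost.]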
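[Text lost; it gives three numbered points, the second of which notes that accuracy above 99% is attainable on MNIST, and the third of which concerns OCR.]

5.2
The training samples were generated as follows:
1. Image size: 48 × 48.
2. Character set: 3000 Chinese characters, the 26 Latin letters and the 10 digits, 3062 classes in total (the total implies that the letters are counted in both cases).
3. Fonts: 45.
4. Font sizes: 5 (46 to 50).
5. Variants: 2 per sample (the surviving text mentions 5%).

In total, 3062 × 45 × 5 × 2 = 1377900 samples.

5.3
The model is adapted from a network for MNIST, whose input images are 28 × 28. The MNIST network is shown in Figure 17; it reaches about 99% accuracy. [Text lost.] Our task has 3062 classes rather than 10.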
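Figure 17: The network for MNIST.
Input image, 28x28
-> Convolution layer 1: 32 kernels of 3x3, ReLU activation, 2x2 max pooling, Dropout 0.25
-> Convolution layer 2: 32 kernels of 3x3, ReLU activation, 2x2 max pooling, Dropout 0.25
-> Fully connected hidden layer: 128 neurons, ReLU activation, Dropout 0.5
-> softmax layer: 10 neurons

Figure 18: The network used in this work.
Input image, 48x48
-> Convolution layer 1: 64 kernels of 4x4, ReLU activation, 2x2 max pooling, Dropout 0.25
-> Convolution layer 2: 64 kernels of 4x4, ReLU activation, 2x2 max pooling, Dropout 0.25
-> Fully connected hidden layer: 1024 neurons, ReLU activation, Dropout 0.5
-> softmax layer: 3062 neurons

The input grows from 28x28 to 48x48, and the resulting network is shown in Figure 18. The activation function is ReLU,

\[ \mathrm{ReLU}(x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases} \tag{13} \]

in place of sigmoid or tanh [3][4], and Dropout [5] is applied after each block. [Text lost.] Hidden layer sizes of 512, 1024, 2048, 4096 and 8192 were tried, and 1024 was chosen. [Text lost; it compares 512 and 1024.]

5.4
The model was trained on CentOS 7 (24-core CPU, 96G of memory, GTX960 GPU) with Python 2.7, Keras and Theano, using the GPU.[1] The optimizer is Adam with a batch size of 1024. [Text lost; the surviving numbers 30 and 700 appear to describe the length of training.]

[1] Footnote text lost; it mentions Tensorflow.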
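To make the two architectures in Figures 17 and 18 concrete, the sketch below computes the feature-map size after the two convolution blocks and the number of trainable parameters. It is not a training script. It assumes a single-channel input, stride-1 convolutions without padding, and non-overlapping 2x2 pooling, none of which the figures state; the function names are hypothetical.

```python
def conv_out(size, kernel):
    """Output side length of a stride-1 convolution without padding (assumed)."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output side length of non-overlapping pooling (assumed)."""
    return size // pool


def describe(input_size, n_filters, kernel, hidden, n_classes):
    """Return (final feature-map side, flattened length, parameter count)."""
    size, channels, params = input_size, 1, 0
    for _ in range(2):  # two convolution blocks, as in Figures 17 and 18
        params += kernel * kernel * channels * n_filters + n_filters
        size = pool_out(conv_out(size, kernel))
        channels = n_filters
    flat = size * size * channels
    params += flat * hidden + hidden          # fully connected hidden layer
    params += hidden * n_classes + n_classes  # softmax layer
    return size, flat, params


mnist_net = describe(28, n_filters=32, kernel=3, hidden=128, n_classes=10)
ocr_net = describe(48, n_filters=64, kernel=4, hidden=1024, n_classes=3062)
```

Under these assumptions most of the parameters sit in the two dense layers, so the choice of hidden layer size discussed in the text dominates the model size.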
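[Text lost; the training curves are shown in Figure 19.]

5.5
Figure 19: Loss and Acc during training. [Plot with horizontal axis from 0 to 30, a Loss axis from 0.0 to 3.0 and an Acc axis up to 1.0; the rest of the caption is lost.]

[Text lost; it mentions Google's OCR engine Tesseract.]

5.5.1
The training results are given in Table 1.

Table 1: [headers and caption lost; the surviving values are 99.70%, 140, 99.85% and 15]

[Text lost; it describes the 99.7% figure as state of the art and carries footnote 2.]

[2] Footnote text lost; it mentions the font Arial Unicode MS.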
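5.5.2
The model was tested on 5 fonts (the surviving numbers are 30620 and 153100, which matches 30620 samples per font and 153100 in total). The average accuracy is 92.11%; per-font results are in Table 2.

Table 2: Test accuracy per font (5% noise). Font names lost.
Font 1    Font 2    Font 3    Font 4    Font 5
82.83%    92.15%    92.65%    99.95%    92.97%

With the noise raised to 15% (on 48 × 48 images), the results are in Table 3.

Table 3: Test accuracy per font (15% noise). Font names lost.
Font 1    Font 2    Font 3    Font 4    Font 5
78.14%    85.34%    88.17%    99.81%    86.52%

The average accuracy is 87.59%, close to 90%. [Text lost.]

6
[Text lost; it concerns OCR.]

6.1
[Text lost.]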
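The two averages quoted in Section 5.5.2 can be checked against the per-font values in Tables 2 and 3. The check assumes that every font contributes the same number of test samples, so that the overall accuracy is the plain mean of the five values.

```python
acc_noise_5 = [82.83, 92.15, 92.65, 99.95, 92.97]    # Table 2, in percent
acc_noise_15 = [78.14, 85.34, 88.17, 99.81, 86.52]   # Table 3, in percent


def mean(values):
    return sum(values) / float(len(values))


mean_5 = mean(acc_noise_5)
mean_15 = mean(acc_noise_15)
```

The first mean reproduces the 92.11% in the text; the second comes out just under 87.6%, which agrees with the quoted 87.59% if that figure was truncated rather than rounded.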
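[Worked example; the candidate characters themselves are lost.] For the first character region $s_1$, the recognizer returns two candidates with probabilities $W(s_1)$ = 0.99996 and 0.00004. For the second region $s_2$, it returns three candidates with probabilities $W(s_2)$ = 0.87838, 0.12148 and 0.00012. Let $P(s_1 \to s_2)$ denote the probability that $s_1$ is followed by $s_2$, estimated from corpus counts. The first candidate for $s_1$ occurs 145001 times in the corpus and the second 1980 times; the surviving pair counts and the resulting probabilities are:

0 / 145001 = 0
12426 / 145001 = 0.08570
7 / 145001 = 0.00005
0 / 1980 = 0
18 / 1980 = 0.00909

Figure 20: [caption lost; a graph of the candidates for $s_1$ and $s_2$ with their probabilities and the transition probabilities between them]

The best combination of $s_1, s_2$ is the one that maximises

\[ f = W(s_1) P(s_1 \to s_2) W(s_2) \tag{14} \]

6.2
In general, as in Figure 21, for $n$ regions $s_1, s_2, \dots, s_n$ we maximise

\[ f = W(s_1) P(s_1 \to s_2) W(s_2) P(s_2 \to s_3) W(s_3) \cdots W(s_{n-1}) P(s_{n-1} \to s_n) W(s_n) \tag{15} \]

[6]. [Text lost; it raises two numbered questions: (1) how to obtain $P(s_i \to s_{i+1})$, and (2) given $P(s_i \to s_{i+1})$, how to find the sequence that maximises $f$.]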
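The two-character example can be replayed by scoring every candidate pair with equation (14). The candidate characters are lost, so the labels `a1`, `a2`, `b1`, `b2`, `b3` are hypothetical, and the assignment of the pair counts to specific pairs is an assumption: the surviving numbers only fix which first candidate each count belongs to.

```python
# Recognizer probabilities from the text; the labels are placeholders.
W1 = {"a1": 0.99996, "a2": 0.00004}
W2 = {"b1": 0.87838, "b2": 0.12148, "b3": 0.00012}

# Corpus counts from the text; which second candidate each pair count
# belongs to is assumed here.
COUNT = {"a1": 145001, "a2": 1980}
PAIR_COUNT = {("a1", "b2"): 12426, ("a1", "b3"): 7, ("a2", "b3"): 18}


def transition(s1, s2):
    """P(s1 -> s2) as a ratio of counts; unseen pairs get probability 0."""
    return PAIR_COUNT.get((s1, s2), 0) / float(COUNT[s1])


def score(s1, s2):
    """Equation (14)."""
    return W1[s1] * transition(s1, s2) * W2[s2]


pairs = [(s1, s2) for s1 in W1 for s2 in W2]
best = max(pairs, key=lambda pair: score(*pair))
```

Under this assignment the winning pair uses the second-ranked candidate for $s_2$: the recognizer alone prefers `b1`, but that pair never occurs in the corpus, so the language model overrides it.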
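Figure 21: [caption lost; a trellis over four regions (first, second, third, fourth) with candidates W11, W12, W13; W21, W22, W23; W31, W32, W33; W41, W42, W43]

6.2.1
Let $\#s_i$ be the number of occurrences of $s_i$ in the corpus and $\#(s_i, s_{i+1})$ the number of times $s_i$ is immediately followed by $s_{i+1}$. Then

\[ P(s_i \to s_{i+1}) = \frac{\#(s_i, s_{i+1})}{\#s_i} \tag{16} \]

With 3062 characters, this gives a 3062 × 3062 matrix. Many pairs never occur in the corpus, so $\#(s_i, s_{i+1}) = 0$ and $P(s_i \to s_{i+1}) = 0$. [Text lost; it describes replacing the zero counts using a parameter $\alpha$, and mentions the numbers 3 and 160.]

6.2.2 Viterbi
The optimal sequence $s_1, s_2, \dots, s_n$ is found with the Viterbi algorithm [6], which we implemented in Python. [Text lost; it explains that the choice of $s_i$ interacts with its neighbours only through $P(s_{i-1} \to s_i)$ and $P(s_i \to s_{i+1})$, so that for each candidate $s_i$ only the best path ending in it needs to be kept.] The complexity of Viterbi is $O(n l^2)$, where $l$ is the number of candidates for each $s_i$ and $n$ is the number of regions. [Text lost; it mentions the number 3 and the symbol T.]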
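A compact Viterbi decoder for equation (15) is sketched below. It is an illustration of the standard algorithm, not the authors' implementation. The smoothing helper is an assumption: it replaces a zero pair count with a small constant `alpha`, which is one plausible reading of the lost passage, and the default value 0.5 is invented. Scores are kept as logarithms so that long products do not underflow, which requires every probability passed in to be positive.

```python
import math


def make_transition(pair_count, count, alpha=0.5):
    """P(a -> b) from counts; zero pair counts are replaced by alpha (assumed)."""
    def transition(a, b):
        return max(pair_count.get((a, b), 0), alpha) / float(count[a])
    return transition


def viterbi(candidates, transition):
    """candidates: one dict {char: W} per region. Returns the best sequence."""
    # best[c] = (log score of the best path ending in c, that path)
    best = {c: (math.log(w), [c]) for c, w in candidates[0].items()}
    for layer in candidates[1:]:
        new_best = {}
        for c, w in layer.items():
            # Pick the predecessor that maximises the score of a path ending in c.
            prev, (prev_score, path) = max(
                best.items(),
                key=lambda item: item[1][0] + math.log(transition(item[0], c)))
            total = prev_score + math.log(transition(prev, c)) + math.log(w)
            new_best[c] = (total, path + [c])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]


# Hypothetical labels; the numbers are those of the two-region example.
candidates = [{"a1": 0.99996, "a2": 0.00004},
              {"b1": 0.87838, "b2": 0.12148, "b3": 0.00012}]
transition = make_transition(
    {("a1", "b2"): 12426, ("a1", "b3"): 7, ("a2", "b3"): 18},
    {"a1": 145001, "a2": 1980})
decoded = viterbi(candidates, transition)
```

Each region compares every candidate with every candidate of the previous region, which gives the $O(n l^2)$ cost stated in the text; keeping only a few top candidates per region keeps $l$ small.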
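6.3
[Text lost.]

Table 4: [contents lost; the caption mentions Viterbi]

7
7.1
[Text lost; it discusses the OCR system in two numbered points, mentions box, the values 1, 2 and 0, and the value 0.84.]

7.2
[Text lost; it concerns OCR.]

7.3
[Text lost; it mentions h.]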
[Text lost; it mentions CNN+LSTM.]
References
[1] [Title lost; the surviving words are Gabor, BP and 2007.]
[2] https://zh.wikipedia.org/zh-cn/
[3] Xavier Glorot, Antoine Bordes, Yoshua Bengio. Deep Sparse Rectifier Neural Networks.
[4] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks.
[5] Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
[6] [Title lost; the surviving number is 3.]
[7] [Title lost; the surviving number is 26.]