Title page for etd-0224120-134542
Title
類神經網路加速器之軟硬體系統整合
Software and Hardware System Integration of Neural Network Accelerator
Department
Year, semester
Language
Degree
Number of pages
85
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2020-02-27
Date of Submission
2020-03-24
Keywords
Software and hardware integration, AXI bus protocol, deep neural network (DNN), hardware acceleration, FPGA, machine learning
Statistics
This thesis/dissertation has been browsed 5,635 times and downloaded 0 times.
Chinese Abstract (translated)
Most existing theses on neural-network hardware acceleration examine the internals of the hardware in depth but say little about software/hardware integration or system-platform construction. The aims of this thesis are to integrate a neural-network hardware accelerator with an embedded system without sacrificing much of the hardware's performance, and to develop a software library that is easy for users to operate. This thesis builds on an existing hardware architecture; to integrate that hardware with the SoC on an FPGA, the architecture is extended and modified, including a DMA engine designed around the AXI bus protocol and an intermediate controller that manages the expanded architecture. Ubuntu is used as the operating system on the FPGA, and the software/hardware integration is implemented in C++ on Linux. It comprises DNN model parsing; the study and design of an instruction-set architecture; a DRAM utilization analysis and mapping flow; a Linux driver that realizes the communication interface between the CPU and the hardware; and an instruction generator that controls the hardware according to the different computation layers and dataflows.
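As a rough illustration of the instruction-generation idea described above, the following minimal C++ sketch packs layer parameters into a fixed-width instruction word. It is hypothetical throughout: the field layout, widths, opcodes, and names are invented for this example and do not reflect the thesis's actual instruction-set architecture (Section 5.3).

#include <cstdint>
#include <cstdio>

// Hypothetical 64-bit instruction word for a DNN accelerator.
// Illustrative layout only:
//   [63:58] opcode   [57:54] layer type
//   [53:32] input DRAM base (22 bits)
//   [31:10] output DRAM base (22 bits)
//   [9:0]   tile count
enum class Op : uint64_t { LOAD = 1, COMPUTE = 2, STORE = 3 };
enum class Layer : uint64_t { CONV = 0, POOL = 1, FC = 2 };

constexpr uint64_t encode(Op op, Layer layer, uint64_t in_base,
                          uint64_t out_base, uint64_t tiles) {
    return (static_cast<uint64_t>(op)    << 58) |
           (static_cast<uint64_t>(layer) << 54) |
           ((in_base  & 0x3FFFFF) << 32) |
           ((out_base & 0x3FFFFF) << 10) |
           (tiles & 0x3FF);
}

int main() {
    // One COMPUTE instruction for a convolution layer tile.
    uint64_t inst = encode(Op::COMPUTE, Layer::CONV,
                           /*in_base=*/0x1000, /*out_base=*/0x8000,
                           /*tiles=*/16);
    std::printf("inst = 0x%016llx\n",
                static_cast<unsigned long long>(inst));
}

An instruction generator in this style would walk the parsed DNN model layer by layer and emit one or more such words per layer, which the driver then streams to the accelerator.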
Abstract
Most related journal articles and theses explore how to design efficient hardware accelerators for deep neural networks, but rarely consider the issue of hardware and software integration. This thesis aims to retain most of the hardware's performance, with only slight degradation, after integration with an embedded system, and to design the corresponding software API for the hardware. To integrate the SoC environment on the FPGA with the existing DNN hardware accelerator, this thesis extends the hardware architecture, including a DMA engine based on the AXI bus protocol and an inter-controller for managing the hardware. Ubuntu serves as the FPGA's operating system, and the required software is implemented in C++ on Linux. The software and hardware system integration covers DNN model analysis, instruction-set design, a DRAM usage analysis and mapping flow, a Linux driver for communication between the CPU and the hardware, and an instruction generator that drives the hardware according to the different layer types and dataflow architectures.
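To make the CPU-accelerator communication path concrete, here is a minimal user-space sketch of how a Linux driver like the one described might be exercised from C++. The device node name, ioctl request codes, and buffer size are assumptions for illustration only; they are not the thesis's actual driver interface (Section 5.5).

#include <fcntl.h>      // open
#include <sys/ioctl.h>  // ioctl
#include <sys/mman.h>   // mmap, munmap
#include <unistd.h>     // close
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical ioctl request codes; a real driver defines these in its
// own UAPI header (names and values here are placeholders).
#define ACC_IOC_START_DMA 0x4001
#define ACC_IOC_WAIT_IRQ  0x4002

int main() {
    // Hypothetical device node created by the accelerator driver.
    int fd = open("/dev/dnn_accel", O_RDWR);
    if (fd < 0) { std::perror("open"); return 1; }

    // Map a driver-allocated, DMA-capable buffer into user space so the
    // CPU can write instructions and feature maps without extra copies.
    const std::size_t kBufSize = 4096;
    void* buf = mmap(nullptr, kBufSize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    std::memset(buf, 0, kBufSize);                   // stage instructions/data
    if (ioctl(fd, ACC_IOC_START_DMA, kBufSize) < 0)  // start the AXI DMA transfer
        std::perror("ioctl(START_DMA)");
    if (ioctl(fd, ACC_IOC_WAIT_IRQ) < 0)             // block until the completion interrupt
        std::perror("ioctl(WAIT_IRQ)");

    munmap(buf, kBufSize);
    close(fd);
    return 0;
}

This mirrors the split the abstract describes: the driver owns DMA buffer management and interrupt handling in kernel space, while the user-space library parses the model, generates instructions, and triggers transfers.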
Table of Contents
Thesis Certification
Thesis Publication Authorization
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
1.1  Motivation
1.2  Contributions
1.3  Thesis Organization
Chapter 2  Background and Related Work
2.1  Neural Network Frameworks
2.2  Neural Network Models
2.2.1  AlexNet [1]
2.2.2  VGG [10]
2.2.3  GoogLeNet (Inception-V1~V4) [11-14]
2.2.4  ResNet, ResNeXt [15, 16]
2.2.5  MobileNet-V1~V2 [17, 18]
2.2.6  Comparison Table
2.2.7  YOLO Series [19-21]
2.3  Neural Network Hardware Accelerators
2.3.1  Cambricon [22]
2.3.2  FP-DNN [23]
2.3.3  Angel-Eye [5, 6]
2.3.4  Thinker [24]
2.3.5  Bit-Fusion [25]
2.3.6  NVDLA [26]
Chapter 3  Software, Hardware, and SoC Overview and Classification
3.1  Analysis and Comparison of Instruction-Set Architectures
3.1.1  Direct-Control-Oriented Instruction Sets
3.1.2  RISC-Oriented Instruction Sets
3.1.3  IR (Intermediate Representation)-Oriented Instruction Sets
3.1.4  Summary
3.2  Existing Hardware Architecture
3.2.1  Data Reuse and Parallelism
3.2.2  Hardware Architecture
3.2.3  Existing Specifications and Functions
3.3  AXI Bus Protocol [32, 33]
Chapter 4  Enhanced Hardware Design
4.1  Reshape (with Padding)
4.2  DMA (AXI-Full Master Wrapper)
4.3  Inter-Controller
Chapter 5  Software and Hardware System Integration
5.1  Overview
5.2  DNN Model Parsing
5.3  Instruction-Set Architecture Design
5.4  DRAM Usage Analysis and Mapping Flow
5.5  Linux Driver
5.5.1  DMA
5.5.2  Interrupt
5.6  Porting Common DNN Layers
5.7  Image Pre- and Post-Processing
5.8  Usage Flow of the Integrated Software/Hardware System
Chapter 6  Experimental Results and Comparison
6.1  Experimental Environment and Methodology
6.2  Cell-Based Design Flow
6.3  FPGA
6.4  Comparison with Prior Work
6.4.1  Cell-Based
6.4.2  FPGA
Chapter 7  Conclusions and Future Work
7.1  Conclusions
7.2  Future Work
Appendix A
References
References
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255.
[3] Y. Chen, J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 367-379.
[4] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," in IEEE Journal of Solid-State Circuits vol. 52, 2017, pp. 127-138.
[5] K. Guo et al., "Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems vol. 37, 2018, pp. 35-47.
[6] J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2016, pp. 26-35.
[7] Y. Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding," in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 675-678.
[8] M. Abadi et al., "TensorFlow: a system for large-scale machine learning," in Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation, 2016, pp. 265-283.
[9] A. Paszke et al., "Automatic differentiation in PyTorch," in NIPS 2017 Autodiff Workshop, 2017.
[10] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in International Conference on Learning Representations, 2015.
[11] C. Szegedy et al., "Going deeper with convolutions," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9.
[12] S. Ioffe and C. Szegedy, "Batch normalization: accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, 2015, pp. 448-456.
[13] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818-2826.
[14] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-ResNet and the impact of residual connections on learning," in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017, pp. 4278-4284.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[16] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated Residual Transformations for Deep Neural Networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5987-5995.
[17] A. G. Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," in arXiv e-prints, 2017.
[18] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510-4520.
[19] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788.
[20] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6517-6525.
[21] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," in arXiv e-prints, 2018.
[22] S. Liu et al., "Cambricon: An Instruction Set Architecture for Neural Networks," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 393-405.
[23] Y. Guan et al., "FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates," in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017, pp. 152-159.
[24] S. Yin et al., "A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications," in IEEE Journal of Solid-State Circuits vol. 53, 2018, pp. 968-982.
[25] H. Sharma et al., "Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018, pp. 764-775.
[26] "Nvdla deep learning accelerator." [Online]. Available: http://nvdla.org/
[27] S. Yin et al., "An Energy-Efficient Reconfigurable Processor for Binary-and Ternary-Weight Neural Networks With Flexible Data Bit Width," in IEEE Journal of Solid-State Circuits vol. 54, 2019, pp. 1120-1136.
[28] N. Rotem et al., "Glow: Graph Lowering Compiler Techniques for Neural Networks," in arXiv e-prints, 2018.
[29] T. Chen et al., "TVM: an automated end-to-end optimizing compiler for deep learning," in Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation, 2018, pp. 579-594.
[30] W.-F. Lin et al., "ONNC: A Compilation Framework Connecting ONNX to Proprietary Deep Learning Accelerators," in IEEE International Conference on Artificial Intelligence Circuits and Systems, 2019.
[31] 陳育鴻, "Hardware Design and Implementation of a Highly Reconfigurable Multi-Bit-Precision Convolutional Neural Network," National Sun Yat-sen University, Kaohsiung, 2019.
[32] Xilinx. "AXI Reference Guide." [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/ug761_axi_reference_guide.pdf
[33] Arm. "AMBA® AXI and ACE Protocol Specification." [Online]. Available: https://static.docs.arm.com/ihi0022/g/IHI0022G_amba_axi_protocol_spec.pdf
[34] B. Moons and M. Verhelst, "An Energy-Efficient Precision-Scalable ConvNet Processor in 40-nm CMOS," in IEEE Journal of Solid-State Circuits vol. 52, 2017, pp. 903-914.
[35] D. S. Miller, R. Henderson, and J. Jelinek. "Dynamic DMA mapping Guide." [Online]. Available: https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
[36] J. Corbet, A. Rubini, and G. Kroah-Hartman, Linux Device Drivers, 3rd ed., Chinese translation by 林長毅, Taipei: O'Reilly, 2006.
[37] S. Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 243-254.
[38] D. Shin, J. Lee, J. Lee, J. Lee, and H. Yoo, "DNPU: An Energy-Efficient Deep-Learning Processor with Heterogeneous Multi-Core Architecture," in IEEE Micro vol. 38, 2018, pp. 85-93.
[39] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H. Yoo, "UNPU: An Energy-Efficient Deep Neural Network Accelerator With Fully Variable Weight Bit Precision," in IEEE Journal of Solid-State Circuits vol. 54, 2019, pp. 173-185.
Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: user-defined release date
Available:
Campus: available for download from 2025-03-24
Off-campus: available for download from 2025-03-24

Printed copies
Public-access information for printed theses is relatively complete for academic year 102 (2013-14) and later. For printed theses from academic year 101 or earlier, please contact the printed-thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
Available: 2025-03-24
