
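University of Science and Technology of China
Confidential
Doctoral Dissertation
Research on Memory Access and Synchronization Optimization of System Software in Multicore Environments
Author: Lin Chuanwen
Discipline: Computer Application Technology
Supervisors: Researcher Wu Manqing, Professor Gu Naijie
Completion date: May 2012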

University of Science and Technology of China
Secret
A Dissertation for the Doctor's Degree
The System Software Optimization of Memory Access and Synchronization in Multicore Environment
Author's Name: Lin Chuanwen
Speciality: Computer Application Technology
Supervisor: Researcher Wu Manqing, Professor Gu Naijie
Finished time: May 2012

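University of Science and Technology of China: Declaration of Originality of the Dissertation

I declare that the dissertation submitted here is the result of research work I carried out under the guidance of my supervisors. Except where specifically marked and acknowledged, the dissertation contains no research results previously published or written by others. The contributions made to this research by colleagues who worked with me have been clearly stated in the dissertation.

Author's signature:
Date:

University of Science and Technology of China: Authorization for Use of the Dissertation

As one of the conditions for applying for the degree, the copyright holder of the dissertation authorizes the University of Science and Technology of China to hold partial rights to use the dissertation. Specifically, the university has the right, under the relevant regulations, to submit photocopies and electronic versions of the dissertation to the relevant national departments or institutions; to allow the dissertation to be consulted and borrowed; to include the dissertation in relevant databases for retrieval; and to preserve and compile the dissertation by photocopying, reduced-size printing, scanning, or other means of reproduction. The content of the electronic document I submit is consistent with that of the paper copy. A confidential dissertation is also subject to these provisions after declassification.

Public / Confidential ( years)
Author's signature:
Supervisor's signature:
Date:
Date: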

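Abstract (Chinese)

With its combined advantages in performance and power consumption, the multicore architecture has become the mainstream structure of microprocessors. A multicore architecture raises processor performance by integrating multiple processor cores on a single chip. The computing power of a multicore processor has to be exploited through parallel programs. However, constraints such as memory access and synchronization make it hard for parallel applications to use that computing power fully. The latency and bandwidth of the memory system can hardly match the computing power of a multicore processor, and inefficient synchronization mechanisms often cause some processor cores to wait and stall, which lowers the utilization of the processor's computing resources. To address these problems, system software needs to be optimized for memory access and synchronization so that it makes full use of the hardware resources of multicore processors and improves the run-time performance of parallel applications.

For the memory access and synchronization optimization of system software, this dissertation studies memory access optimization in the compiler and the Java virtual machine, and synchronization optimization in the Java virtual machine. The main work and innovations are as follows:

1) SIMD compiler optimization for clustered digital signal processors. For digital signal processing applications, this dissertation proposes a SIMD compiler optimization framework for clustered digital signal processors. The main work includes: a SIMD instruction identification algorithm based on memory access instructions, designed around the characteristics of digital signal processing applications; an instruction cluster assignment algorithm and a register allocation algorithm based on SIMD instructions, designed around the characteristics of SIMD instructions on a clustered architecture; and an implementation of these algorithms in the BWDSP100 compiler. The experimental results show that the proposed SIMD optimization identifies and generates efficient SIMD code on the clustered architecture and greatly improves the bandwidth utilization and run-time performance of applications on the BWDSP100 processor.

2) Dynamic cache locking optimization in the Java virtual machine. Based on the calling patterns of compiled methods, this dissertation gives a dynamic cache locking optimization method for the Java virtual machine. The main work includes: analyzing the calling patterns of compiled methods in the Java virtual machine to obtain their active time periods, average size, and memory distribution; and, based on these patterns, performing dynamic cache locking in the Java virtual machine so that active compiled methods are locked in the cache. The experimental results show that the proposed method needs to lock only a small memory region in the cache to obtain a large cache performance improvement while the Java virtual machine is running.

3) Read-only lock optimization in the Java virtual machine. Based on the characteristics of read-only critical sections, this dissertation proposes a read-only lock optimization framework for the Java virtual machine. The main work includes: an algorithm for identifying read-only critical sections in the just-in-time compiler; a lightweight read-only lock optimization algorithm based on the LL/SC synchronization instructions of the MIPS architecture; and a heavyweight read-only lock optimization algorithm. The lightweight algorithm reduces the overhead of synchronization operations when there is no thread contention. The heavyweight algorithm allows multiple threads to enter a read-only critical section at the same time, which improves the performance of synchronization operations under thread contention. The experimental results show that the proposed read-only lock optimization greatly reduces the overhead of entering and exiting read-only critical sections and improves the synchronization performance of the Java virtual machine.

Based on the hardware and software platforms of domestic processors, this dissertation has obtained a series of valuable results on the memory access and synchronization optimization of system software. These results effectively improve the performance of application software on domestic processors and thereby promote the marketization of domestic processor chips.

Keywords: multicore processor, clustered architecture, compiler optimization, single instruction multiple data (SIMD), Java virtual machine, cache locking, just-in-time compiler, read-only lock, synchronization mechanism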

Abstract

With its advantages in performance and power, the multicore architecture has become the mainstream structure of microprocessors. A multicore architecture integrates multiple processor cores on a single chip to improve processor performance. Parallel programs are needed to take advantage of the computing power that multicore processors provide. However, the constraints of memory access and synchronization make it difficult for parallel programs to exploit multicore processors fully. The latency and bandwidth of the memory system can hardly match the computing performance of multicore processors, and inefficient synchronization mechanisms often leave some processor cores waiting, which reduces the resource utilization of the multicore processor. To overcome these problems, system software must be optimized for memory access and synchronization, so that it takes full advantage of the hardware resources provided by multicore processors and improves the performance of parallel applications.

For the optimization of memory access and synchronization in system software, this thesis focuses on memory access optimization in the compiler and the Java Virtual Machine (JVM), and on synchronization optimization in the JVM. The main work and innovations are as follows:

1) SIMD optimization for clustered VLIW DSP. For digital signal processing applications, this thesis proposes a SIMD compiler optimization framework for clustered DSPs. The main work includes: a SIMD instruction identification algorithm based on memory access instructions, designed for the features of digital signal processing applications; a new cluster assignment algorithm and a register allocation algorithm, designed for the features of SIMD instructions on the cluster architecture; and an implementation of these algorithms in the BWDSP100 compiler. The experimental results show that these SIMD optimization methods identify and generate efficient SIMD code on the clustered DSP, and greatly improve the bandwidth utilization and performance of applications on the BWDSP100 processor.

2) Dynamic cache locking optimization in the Java Virtual Machine. Based on the calling pattern of compiled methods, a dynamic cache locking optimization algorithm for the JVM is presented. The main work includes: analyzing the calling pattern of the compiled methods in the JVM to obtain their calling distribution pattern, average size, and memory distribution; and, based on these patterns, implementing dynamic cache locking in the JVM to lock the active compiled methods in the cache. The experimental results show that this cache locking method improves the run-time cache performance of the JVM while locking only a small memory area in the cache.

3) Read-only lock optimization in the Java Virtual Machine. Based on the features of read-only critical sections, this thesis presents a read-only lock optimization framework for the JVM. The main work includes: a recognition algorithm for read-only critical sections in the JIT compiler; a lightweight read-only lock optimization algorithm based on the LL/SC synchronization instructions; and a heavyweight read-only lock optimization algorithm. The lightweight read-only lock optimization reduces the overhead of synchronization operations when there is no contention between threads. The heavyweight read-only lock optimization allows multiple threads to access a read-only critical section simultaneously when several threads compete at the same time. The experimental results show that the read-only lock optimization method significantly reduces the overhead incurred when threads enter and exit read-only critical sections, and improves the synchronization performance of the JVM.

Based on the hardware and software platforms of domestic processors, this thesis achieves some valuable innovations in the memory access and synchronization optimization of system software. These innovations effectively improve the performance of applications on domestic processors and promote the marketization of domestic processors.
Key Words: Multicore Processors, Cluster Architecture, Compiler Optimization, SIMD, Java Virtual Machine, Cache Locking, Just-In-Time Compiler, Read-Only Lock, Synchronization Mechanism
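The idea behind the heavyweight read-only lock summarized in item 3, that several threads may be inside a read-only critical section at once while writers stay exclusive, can be illustrated at the library level. The sketch below is not the thesis's JVM-internal algorithm, which modifies the virtual machine's own monitor implementation. It uses the standard `java.util.concurrent.locks.ReentrantReadWriteLock` as a stand-in, and the class name `ReadOnlySectionDemo` and its methods are hypothetical names chosen for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; illustrates shared read-only sections with a
// library read-write lock, not the thesis's in-JVM implementation.
class ReadOnlySectionDemo {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private int value = 0;

    // Read-only critical section: takes the shared (read) lock.
    int read() {
        lock.readLock().lock();
        try {
            return value;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Writing critical section: takes the exclusive (write) lock.
    void increment() {
        lock.writeLock().lock();
        try {
            value++;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Returns true if two reader threads were inside the read-only
    // section at the same time, false otherwise (including on interrupt).
    static boolean readersOverlap() {
        ReadOnlySectionDemo demo = new ReadOnlySectionDemo();
        CountDownLatch bothInside = new CountDownLatch(2);
        boolean[] sawPeer = new boolean[2];
        Thread[] readers = new Thread[2];
        for (int i = 0; i < 2; i++) {
            final int id = i;
            readers[i] = new Thread(() -> {
                demo.lock.readLock().lock();
                try {
                    // Signal "I am inside", then wait for the other reader
                    // to arrive while still holding the read lock.
                    bothInside.countDown();
                    sawPeer[id] = bothInside.await(5, TimeUnit.SECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    demo.lock.readLock().unlock();
                }
            });
            readers[i].start();
        }
        try {
            for (Thread t : readers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return sawPeer[0] && sawPeer[1];
    }

    public static void main(String[] args) {
        ReadOnlySectionDemo demo = new ReadOnlySectionDemo();
        demo.increment();
        System.out.println("value = " + demo.read());
        System.out.println("readers overlapped: " + readersOverlap());
    }
}
```

With an ordinary exclusive lock, the second reader could not enter until the first had left, so the overlap check would time out. The shared read lock is what lets both readers be inside together.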

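Contents

Abstract
Chapter 1 Introduction
  1.1 Research Background
    1.1.1 Multicore Processors
    1.1.2 Memory Systems of Multicore Processors
    1.1.3 Inter-Core Synchronization in Multicore Processors
  1.2 Related Research in China and Abroad
    1.2.1 Research on Memory Access Optimization in System Software
    1.2.2 Research on Synchronization Mechanism Optimization in System Software
  1.3 Research Content of This Dissertation
  1.4 Organization of This Dissertation
Chapter 2 The BWDSP100 Compiler and Loongson Java Virtual Machine Research Platforms
  2.1 Introduction
  2.2 BWDSP100 Architecture
  2.3 The BWDSP100 Compiler
    2.3.1 The IMPACT Compiler
    2.3.2 Development of the BWDSP100 Compiler
  2.4 Architecture of the Loongson High-Performance Processor
    2.4.1 Basic Structure of the GS464 Processor Core
    2.4.2 Basic Structure of the Loongson 3 Quad-Core Processor
  2.5 The Loongson Java Virtual Machine
    2.5.1 The OpenJDK Java Virtual Machine
    2.5.2 Development of the Loongson Java Virtual Machine
  2.6 Benchmarks Used in This Dissertation
  2.7 Chapter Summary
Chapter 3 SIMD Compiler Optimization for Clustered VLIW DSPs
  3.1 Introduction
  3.2 Related Work
  3.3 Clustered SIMD Instructions of BWDSP100
  3.4 SIMD Instruction Identification Algorithm Based on Memory Access Instructions
    3.4.1 Loop Detection
    3.4.2 Loop Unrolling
    3.4.3 Conventional Optimizations
    3.4.4 Variable Renaming
    3.4.5 Expansion of Loop Invariants and Accumulator Variables
    3.4.6 Synthesizing SIMD Instructions
    3.4.7 Worked Example
  3.5 Cluster Assignment Algorithm Based on SIMD Instructions
  3.6 Register Allocation Algorithm Based on SIMD Instructions
  3.7 Experimental Results
    3.7.1 Determining the Loop Unrolling Factor
    3.7.2 Effect of the SIMD Optimization
  3.8 Chapter Summary
Chapter 4 Dynamic Cache Locking Optimization in the Java Virtual Machine
  4.1 Introduction
  4.2 Related Work
    4.2.1 Cache Locking Optimization for Improving Program Determinism
    4.2.2 Cache Locking Optimization for Improving Program Performance
  4.3 The Cache Locking Mechanism of Loongson 3A
  4.4 The Just-In-Time Compilation System of the Java Virtual Machine
  4.5 Calling Patterns of Compiled Methods in the Java Virtual Machine
    4.5.1 Call Distribution of Compiled Methods
    4.5.2 Size of Compiled Methods
    4.5.3 Memory Distribution of Compiled Methods
  4.6 Dynamic Cache Locking Optimization Algorithm in the Java Virtual Machine
  4.7 Experimental Results and Analysis
    4.7.1 Improvement in the Run-Time Cache Hit Rate of the Java Virtual Machine
    4.7.2 Improvement in the Run-Time Performance of the Java Virtual Machine
  4.8 Chapter Summary
Chapter 5 Read-Only Lock Optimization in the Java Virtual Machine
  5.1 Introduction
  5.2 Related Work
  5.3 Overview of the Locking Mechanism of the HotSpot Java Virtual Machine
  5.4 Read-Only Critical Section Identification Algorithm in the Just-In-Time Compiler
  5.5 Read-Only Lock Optimization Methods in the Java Virtual Machine
    5.5.1 Read-Only Lock Optimization Framework of the Java Virtual Machine
    5.5.2 Lightweight Read-Only Lock Optimization Algorithm Based on LL/SC Synchronization Instructions
    5.5.3 Heavyweight Read-Only Lock Optimization Algorithm in the Java Virtual Machine
  5.6 Experimental Results and Analysis
    5.6.1 Performance Improvement for Single-Threaded Java Programs
    5.6.2 Performance Improvement for Multithreaded Java Programs
    5.6.3 Performance Comparison with the OpenJDK Read-Write Lock
    5.6.4 Performance Improvement on the SPECjvm2008 Benchmarks
  5.7 Chapter Summary
Chapter 6 Conclusion
  6.1 Introduction
  6.2 Summary of the Work
  6.3 Main Innovations
  6.4 Future Work
References
Acknowledgements
Publications and Research Achievements During Doctoral Study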

目录图目录图 1.1 处理器中晶体管数量的增长曲线... 2 图 1.2 处理器与存储器的性能差距... 4 图 1.3 存储器系统的层次结构... 4 图 1.4 龙芯 4 核架构的 cache 层次结构... 5 图 1.5 Intel 4 核 Sandy Bridge 架构的 cache 层次结构... 5 图 1.6 英特尔的单芯片云计算机架构... 7 图 2.1 BWDSP100 的基本结构... 16 图 2.2 IMPACT 编译器的基本框架... 18 图 2.3 BWDSP100 编译器的代码生成模块... 20 图 2.4 GS464 处理器核的基本结构... 23 图 2.5 龙芯 3A 处理器的基本结构... 24 图 2.6 HotSpot 虚拟机的基本体系结构... 25 图 2.7 HotSpot 虚拟机的执行引擎基本结构... 26 图 3.1 BWDSP100 的 SIMD 汇编代码段... 34 图 3.2 BWDSP100 中 SIMD 指令的执行过程... 35 图 3.3 BWSIMD 算法构架... 35 图 3.4 SIMD 指令识别的实例... 39 图 3.5 SIMD_RegAlloc 算法流程... 42 图 3.6 convolution 实验结果... 44 图 3.7 dot_product 实验结果... 44 图 3.8 DSPstone 实验结果... 45 图 4.1 Java 虚拟机的即时编译系统... 52 图 4.2 Java 虚拟机中编译方法的执行频率... 54 图 4.3 Java 虚拟机中编译方法的本地代码段大小... 55 图 4.4 Java 虚拟机运行时编译方法的内存分布... 56 图 4.5 Java 虚拟机中的动态锁 cache 优化算法... 57 图 4.6 动态锁 cache 优化前后直接读内存操作对比... 59 图 4.7 动态锁 cache 优化前后 SPECjvm2008 的性能对比... 60 图 5.1 HotSpot 中标记字的状态... 63 图 5.2 HotSpot 虚拟机中的同步状态转换... 64 图 5.3 只读临界区识别算法... 66 图 5.4 HotSpot 虚拟机锁操作的层次结构... 67 图 5.5 Java 虚拟机只读锁操作的层次结构... 68 图 5.6 轻量级锁只读锁申请... 69 图 5.7 轻量级锁只读锁释放... 69 图 5.8 重量级锁只读优化算法... 71 图 5.9 重量级锁只读优化算法... 71 图 5.10 单线程 Java 测试用例运行时间对比... 72 图 5.11 只读锁优化前后多线程 Java 测试用例的性能加速比... 73 图 5.12 Java 虚拟机只读锁与 Openjdk 读写锁的性能加速比... 74 VIII

目录图 5.13 SPECjvm2008 中锁操作的频率... 75 图 5.14 只读锁优化前后 SPECjvm2008 性能对比... 76 IX

List of Tables

Table 2.1 Functional description of the DSPstone benchmarks
Table 2.2 Functional description of the SPECjvm2008 benchmarks
Table 4.1 Lock window register group of the L2 cache