Model-Driven Level 3 BLAS Performance Optimization on Loongson 3A Processor | |
Zhang Xianyi ; Wang Qian ; Zhang Yunquan | |
2012 | |
Conference name | 18th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2012 |
Conference date | December 17, 2012 - December 19, 2012 |
Conference location | Singapore, Singapore |
Keywords | Cache memory; Computer systems; Microprocessor chips |
Pages | 684-691 |
Abstract | Every mainstream processor vendor provides an optimized BLAS implementation for its CPU, as BLAS is a fundamental math library in scientific computing. The Loongson 3A CPU is a general-purpose 64-bit MIPS64 quad-core processor, developed by the Institute of Computing Technology, Chinese Academy of Sciences. To date, there has not been a sufficiently optimized BLAS on the Loongson 3A CPU. The purpose of this research is to optimize level 3 BLAS performance on the Loongson 3A CPU. We analyzed the Loongson 3A architecture and built a performance model that identifies L1 data cache misses as the key bottleneck, which differs from level 3 BLAS optimization on mainstream x86 CPUs. Therefore, we employed a variety of methods to avoid L1 cache misses in single-thread optimization, including cache and register blocking, the Loongson 3A 128-bit memory accessing extension instructions, software prefetching, and single-precision floating-point SIMD instructions. Furthermore, we improved parallel performance by reducing bank conflicts among multiple threads in the shared L2 cache. We created an open-source BLAS project, OpenBLAS, to demonstrate the performance improvement on the Loongson 3A quad-core processor. © 2012 IEEE. |
Indexed by | EI |
Proceedings | Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS |
Language | English |
ISSN | 1521-9097 |
ISBN | 9780769549033 |
Content type | Conference paper |
Source URL | [http://ir.iscas.ac.cn/handle/311060/15910] |
Collection | Institute of Software_Institute of Software Library_Conference Papers |
Recommended citation (GB/T 7714) | Zhang Xianyi, Wang Qian, Zhang Yunquan. Model-driven level 3 BLAS performance optimization on Loongson 3A processor[C]. In: 18th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2012. Singapore, Singapore. December 17, 2012 - December 19, 2012. |
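
The abstract names cache and register blocking among the techniques used to avoid L1 data cache misses. As a rough illustration only, the sketch below shows what those two levels of blocking look like for a plain double-precision matrix multiply in portable C. It is not the OpenBLAS kernel from the paper: the function name `dgemm_blocked`, the row-major layout, and the tile sizes `MC`, `NC`, `KC`, `MR`, `NR` are all assumptions chosen for readability, and the Loongson 3A 128-bit memory access instructions, software prefetching, and SIMD are not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile sizes; the paper derives its own from a performance model. */
enum { MC = 32, NC = 32, KC = 32, MR = 4, NR = 4 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Computes C += A * B for row-major A (m x k), B (k x n), C (m x n). */
void dgemm_blocked(size_t m, size_t n, size_t k,
                   const double *A, const double *B, double *C)
{
    assert(A && B && C);

    /* Cache blocking: walk the matrices in MC x KC and KC x NC tiles so the
     * data touched by the inner loops is small enough to stay cache-resident. */
    for (size_t i0 = 0; i0 < m; i0 += MC) {
        size_t i_end = min_sz(i0 + MC, m);
        for (size_t p0 = 0; p0 < k; p0 += KC) {
            size_t p_end = min_sz(p0 + KC, k);
            for (size_t j0 = 0; j0 < n; j0 += NC) {
                size_t j_end = min_sz(j0 + NC, n);

                /* Register blocking: each MR x NR patch of C is accumulated
                 * in a small local array that a compiler can keep in registers. */
                for (size_t i = i0; i < i_end; i += MR) {
                    size_t mr = min_sz(MR, i_end - i);
                    for (size_t j = j0; j < j_end; j += NR) {
                        size_t nr = min_sz(NR, j_end - j);
                        double acc[MR][NR] = {{0.0}};

                        for (size_t p = p0; p < p_end; ++p) {
                            for (size_t ii = 0; ii < mr; ++ii) {
                                double a = A[(i + ii) * k + p];
                                for (size_t jj = 0; jj < nr; ++jj)
                                    acc[ii][jj] += a * B[p * n + j + jj];
                            }
                        }

                        /* Write the finished patch back to C once per tile. */
                        for (size_t ii = 0; ii < mr; ++ii)
                            for (size_t jj = 0; jj < nr; ++jj)
                                C[(i + ii) * n + j + jj] += acc[ii][jj];
                    }
                }
            }
        }
    }
}
```

Each element of the `acc` patch is reused across the whole `p` loop, so C is loaded and stored once per tile rather than once per multiply-add; tuned kernels typically also pack the A and B tiles into contiguous buffers, which this sketch omits.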