Notes on Building ORNL SHOC (CUDA & OpenCL)

ruilinruirui

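The result first: on my machine the CUDA build failed, while the OpenCL build succeeded. If you have built the CUDA version successfully, I would be glad to compare notes.
That said, this probably does not affect my benchmarking of NVIDIA GPUs and clusters much.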

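1. About ORNL

Oak Ridge National Laboratory (ORNL) is the largest science and energy research laboratory of the U.S. Department of Energy. It was established in 1943 and is now managed jointly by the University of Tennessee and the Battelle Memorial Institute. In the 1950s and 1960s ORNL worked mainly on nuclear energy, physics, and the life sciences. After the Department of Energy was created in the 1970s, its research program expanded to energy production, transmission, and conservation.

Today ORNL's mission is to carry out basic and applied research and development, providing innovative knowledge and technology that strengthen U.S. leadership in key areas of science, increase the availability of clean energy, restore and protect the environment, and contribute to national security. ORNL is an international leader in many scientific fields. Its work centers on six areas: neutron science, energy, high-performance computing, complex biological systems, advanced materials, and national security.

The site I follow belongs to one of its groups, the ORNL Future Technologies Group: http://ft.ornl.gov/doku/ft/start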

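2. About ORNL SHOC

Home page: http://ft.ornl.gov/doku/shoc/start

SHOC stands for the Scalable HeterOgeneous Computing Benchmark Suite.
SHOC is a collection of benchmark programs for testing the performance and stability of systems, as well as the software written for them. The systems in question use non-traditional architectures for general-purpose computing.
That sounds convoluted, but in practice today it means GPU systems and programs written in CUDA and OpenCL, on a single machine or on a cluster.

A finished SHOC build contains a CUDA version and an OpenCL version of the benchmarks.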

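3. Features

Multiple benchmark programs written in both OpenCL and CUDA
Cluster-level parallelism with MPI
Node-level parallelism for multiple GPUs per node
Easy to use, with clear result reports (Excel-compatible)
Stability tests that can be used for resiliency testing of large clusters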

4. Benchmark programs

The SHOC benchmarks fall into two categories: stress tests and performance tests.

The stress tests use computationally demanding kernels to determine whether an OpenCL device has bad memory, insufficient cooling, or other component problems.

The performance tests are divided by complexity, and by the kind of device capability they exercise, into level 0, level 1, and level 2, somewhat like the levels of the BLAS API. The official descriptions are:

Level 0: Very low level device characteristics (so-called “feeds and speeds”) such as bandwidth across the bus connecting the GPU to the host or peak floating point operations per second
Level 1: Device performance for low-level operations such as vector dot products and sorting operations
Level 2: Device performance for real application kernels

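5. Build notes

The whole process is described at http://ft.ornl.gov/doku/shoc/gettingstarted. It is not complicated, and far simpler than building CMAQ.
What follows is partly translated from that page and partly in my own words.

Step 1: Unpack the source

Nothing special here.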

tar -xvzf shoc-1.1.0.tar.gz

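Step 2: Configure the build environment

Here you choose whether to build only the CUDA version, only the OpenCL version, or both, and whether to build the MPI versions as well.
Most of us are on Linux; for other systems, use the matching template script in the config directory and edit it if needed. Usually no changes are required.

For example, to build on Linux I run: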

sh ./config/conf-linux-openmpi.sh

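configure searches the default paths to detect whether CUDA, OpenCL, or both are installed. The MPI compilers default to mpicc, mpic++, and so on.

This step normally goes through without problems.

If it does not, add the --with-opencl option,
or add CPPFLAGS="-I/usr/local/cuda/include", because configure sometimes prints a warning such as
configure: WARNING: cuda.h: present but cannot be compiled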
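
Before hard-coding the include path, it can help to confirm where cuda.h actually lives. The following is a minimal sketch, not part of SHOC: cuda_cppflags is a hypothetical helper that prints an -I flag for the first candidate directory containing cuda.h. The candidate directories you pass are assumptions about your own installation.

```shell
# Hypothetical helper (not part of SHOC): print "-I<dir>" for the first
# directory in the argument list that actually contains cuda.h.
cuda_cppflags() {
    for dir in "$@"; do
        if [ -f "$dir/cuda.h" ]; then
            printf '%s\n' "-I$dir"
            return 0
        fi
    done
    return 1   # cuda.h was not found in any candidate directory
}
```

For example, cuda_cppflags /usr/local/cuda/include prints -I/usr/local/cuda/include if the header is there, and that value can then be supplied as CPPFLAGS in whatever way your configure invocation accepts it.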

Step 3: Build the benchmarks

Run make:

make

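On my Linux server the build failed with this error:

nvcc fatal : redefinition of argument 'optimize'

I spent a long time on it without finding a fix and eventually gave up on the CUDA build.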
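
One plausible cause, not verified on this build, is that nvcc received more than one optimization flag (for example both -O2 and -O3), which it rejects with this message. If that is what happens, the flag list handed to nvcc needs to carry a single -O option. A minimal sketch of that clean-up, using a hypothetical helper keep_last_opt_level that is not part of SHOC:

```shell
# Hypothetical helper (not part of SHOC): drop every -O<level> flag from
# the argument list except the last one, which is appended at the end.
# All other flags keep their original order.
keep_last_opt_level() {
    last=""
    out=""
    for flag in "$@"; do
        case "$flag" in
            -O*) last="$flag" ;;               # remember it, emit later
            *)   out="${out:+$out }$flag" ;;   # pass other flags through
        esac
    done
    if [ -n "$last" ]; then
        out="${out:+$out }$last"
    fi
    printf '%s\n' "$out"
}
```

Calling keep_last_opt_level -g -O2 -O3 prints -g -O3. Where the duplicated flag comes from in the SHOC makefiles is something I have not traced, so treat this only as a way to test the hypothesis.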

After the build, even if make did not succeed completely, you should see:

cd bin
ls
EP Serial TP

and also:

ls ./Serial/
CUDA OpenCL

ls ./Serial/OpenCL/
BusSpeedDownload  FFT       Reduction  SGEMM  Stability
BusSpeedReadback  MaxFlops  S3D        Sort   Stencil2D
DeviceMemory      MD        Scan       Spmv   Triad
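
When make only partly succeeds, counting what ended up in each directory is a quick way to see which API was built. A minimal sketch, assuming the bin/Serial/CUDA and bin/Serial/OpenCL layout shown above; count_built is a hypothetical helper, not part of SHOC:

```shell
# Hypothetical helper (not part of SHOC): for each API directory under
# <bindir>/Serial, print the number of entries it contains, or "missing"
# if the directory does not exist.
count_built() {
    bindir=$1
    for api in CUDA OpenCL; do
        dir="$bindir/Serial/$api"
        if [ -d "$dir" ]; then
            n=$(ls -1 "$dir" | wc -l | tr -d ' ')
            printf '%s %s\n' "$api" "$n"
        else
            printf '%s missing\n' "$api"
        fi
    done
}
```

Run it from the SHOC top-level directory as count_built bin. A CUDA count of 0 next to a non-zero OpenCL count matches the situation described in this post.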

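Step 4: Run the benchmarks

The official recommendation is to use the Perl driver script in the tools directory:

cd tools

The MPI programs must already be on your PATH by this point. Really, this should have been set up much earlier: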

$ export PATH=$PATH:/path/to/mpi/bin/dir
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/mpi/lib/dir
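
A quick check that the tools really are reachable after exporting these variables can save a confusing failure later. A minimal sketch using the POSIX command -v builtin; missing_commands is a hypothetical helper name:

```shell
# Hypothetical helper: print the name of every given command that cannot
# be found on the current PATH. Prints nothing if all are available.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}
```

For this build, missing_commands mpicc mpic++ perl should print nothing; any name it does print still needs to be added to PATH.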

Because my CUDA build failed, the only option I can pass is -opencl.

You can also set the problem size class (1-4), which means:

1 - CPUs / Debugging

2 - Mobile/Integrated GPUs

3 - Discrete GPUs (e.g. GeForce or Radeon series)

4 - HPC-Focused or Large Memory GPUs (e.g. Tesla or Firestream Series)
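
If the driver is called from a wrapper script, it is worth rejecting an invalid size class before starting a long run. A minimal sketch: shoc_driver_cmd is a hypothetical helper that only prints the command line, using the -opencl, -s, and -d options that appear in this post. It does not run anything itself.

```shell
# Hypothetical helper: print the driver.pl command line for a given size
# class and device id, refusing size classes outside 1-4.
shoc_driver_cmd() {
    size=$1
    device=$2
    case "$size" in
        1|2|3|4) ;;
        *) echo "size class must be 1-4, got: $size" >&2
           return 1 ;;
    esac
    printf 'perl driver.pl -opencl -s %s -d %s\n' "$size" "$device"
}
```

shoc_driver_cmd 4 0 prints perl driver.pl -opencl -s 4 -d 0, the size class 4 run on device 0 used for the GTX 580 in this post.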

Finally, here is a test run from my machine. Apart from the build problems, getting this far should not take long.

[michaelchen@xi03 tools]$ perl driver.pl -opencl -s 4 -d 0
--- Welcome To The SHOC Benchmark Suite version 1.1.1 ---
Hostname: xi03.clustertech.com
Number of available devices: 1
Device 0: GeForce GTX 580
Specified 1 device IDs: 0
Using size class: 4

--- Starting Benchmarks ---
Running benchmark BusSpeedDownload
result for bspeed_download: 5.9849 GB/sec
Running benchmark BusSpeedReadback
result for bspeed_readback: 6.6484 GB/sec
Running benchmark MaxFlops
result for maxspflops: 1550.7800 GFLOPS
result for maxdpflops: 197.6300 GFLOPS
Running benchmark DeviceMemory
result for gmem_readbw: 168.0100 GB/s
result for gmem_readbw_strided: 21.1776 GB/s
result for gmem_writebw: 161.5090 GB/s
result for gmem_writebw_strided: 8.5248 GB/s
result for lmem_readbw: 571.8110 GB/s
result for lmem_writebw: 738.6410 GB/s
result for tex_readbw: 138.8840 GB/sec
Running benchmark KernelCompile
result for ocl_kernel: 0.0014 sec
Running benchmark QueueDelay
result for ocl_queue: 0.0083 ms
Running benchmark FFT
result for fft_sp: 91.4614 GFLOPS
result for fft_sp_pcie: 24.5719 GFLOPS
result for ifft_sp: 87.9637 GFLOPS
result for ifft_sp_pcie: 24.3122 GFLOPS
result for fft_dp: 33.3871 GFLOPS
result for fft_dp_pcie: 11.1825 GFLOPS
result for ifft_dp: 29.5340 GFLOPS
result for ifft_dp_pcie: 10.7143 GFLOPS
Running benchmark SGEMM
result for sgemm_n: 633.4490 GFLOPS
result for sgemm_t: 623.6030 GFLOPS
result for sgemm_n_pcie: 526.7400 GFLOPS
result for sgemm_t_pcie: 519.8910 GFLOPS
result for dgemm_n: 190.5370 GFLOPS
result for dgemm_t: 189.6440 GFLOPS
result for dgemm_n_pcie: 153.1140 GFLOPS
result for dgemm_t_pcie: 152.5320 GFLOPS
Running benchmark MD
result for md_sp_bw: 33.5349 GB/s
result for md_sp_bw_pcie: 13.4008 GB/s
result for md_dp_bw: 38.6537 GB/s
result for md_dp_bw_pcie: 18.8973 GB/s
Running benchmark Reduction
result for reduction: 162.8440 GB/s
result for reduction_pcie: 5.7001 GB/s
result for reduction_dp: 158.4610 GB/s
result for reduction_dp_pcie: 5.6931 GB/s
Running benchmark Scan
result for scan: 42.5838 GB/s
result for scan_pcie: 2.9070 GB/s
result for scan_dp: 41.3105 GB/s
result for scan_dp_pcie: 2.8994 GB/s
Running benchmark Sort
result for sort: 0.5838 GB/s
result for sort_pcie: 0.4916 GB/s
Running benchmark Spmv
result for spmv_csr_scalar_sp: 1.3167 Gflop/s
result for spmv_csr_scalar_sp_pcie: 0.3716 Gflop/s
result for spmv_csr_scalar_dp: 1.2942 Gflop/s
result for spmv_csr_scalar_dp_pcie: 0.2638 Gflop/s
result for spmv_csr_scalar_pad_sp: 1.3141 Gflop/s
result for spmv_csr_scalar_pad_sp_pcie: 0.3829 Gflop/s
result for spmv_csr_scalar_pad_dp: 1.3008 Gflop/s
result for spmv_csr_scalar_pad_dp_pcie: 0.2698 Gflop/s
result for spmv_csr_vector_sp: 6.5258 Gflop/s
result for spmv_csr_vector_sp_pcie: 0.4796 Gflop/s
result for spmv_csr_vector_dp: 5.1990 Gflop/s
result for spmv_csr_vector_dp_pcie: 0.3115 Gflop/s
result for spmv_csr_vector_pad_sp: 6.9073 Gflop/s
result for spmv_csr_vector_pad_sp_pcie: 0.5011 Gflop/s
result for spmv_csr_vector_pad_dp: 5.7004 Gflop/s
result for spmv_csr_vector_pad_dp_pcie: 0.3212 Gflop/s
result for spmv_ellpackr_sp: 13.9193 Gflop/s
result for spmv_ellpackr_dp: 6.9415 Gflop/s
Running benchmark Stencil2D
result for stencil: 2.3865 s
result for stencil_dp: 3.8767 s
Running benchmark Triad
result for triad_bw: 5.7513 GB/s
Running benchmark S3D
result for s3d: 80.1375 GFLOPS
result for s3d_pcie: 46.2109 GFLOPS
result for s3d_dp: 32.3825 GFLOPS
result for s3d_dp_pcie: 20.1699 GFLOPS
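
The log above is easier to compare across machines once it is in tabular form. The following is a minimal sketch that relies only on the "result for <name>: <value> <unit>" line format visible in this output; both function names are hypothetical, and the log is assumed to arrive on standard input. The second function prints, for each metric that has a matching _pcie entry, how many times larger the device-only figure is than the figure that includes PCIe transfers.

```shell
# Convert "result for <name>: <value> <unit>" lines to CSV rows.
shoc_results_to_csv() {
    awk '/^result for / {
        name = $3
        sub(/:$/, "", name)          # drop the trailing colon
        print name "," $4 "," $5
    }'
}

# For each metric with a matching <name>_pcie entry, print
# "<name> <ratio>", where ratio = value / value_with_pcie.
pcie_ratio() {
    awk '/^result for / {
        name = $3
        sub(/:$/, "", name)
        val[name] = $4
        order[++n] = name            # remember the order of appearance
    }
    END {
        for (i = 1; i <= n; i++) {
            base = order[i]
            key = base "_pcie"
            if ((key in val) && val[key] > 0)
                printf "%s %.1f\n", base, val[base] / val[key]
        }
    }'
}
```

With the output saved to a file (say shoc.log, a name chosen here for illustration), shoc_results_to_csv < shoc.log gives rows such as reduction,162.8440,GB/s, and pcie_ratio < shoc.log reports reduction 28.6 for the run above, which shows how strongly the PCIe transfer dominates that benchmark.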

2011-11-18 18:07