Introduction:
First of all, NVIDIA's Nsight is frankly an inconsiderate product; the line has already reached the Parallel Nsight 1.5 release. Although the product itself is now free [1], installing it requires first installing a Visual Studio patch, and once you install that patch, your perfectly good unactivated VS instantly drops into trial mode. Worse still, even if you actually bought a legitimate copy of VS: congratulations, if you want to debug CUDA programs you need to buy two graphics cards [2]!!! I won't even get into what NVIDIA cards cost. For beginners like us this is simply a nightmare. So, as an alternative, we will develop with the SDK instead. It is a bit more trouble, and many things have to be configured by hand, but it saves money!!! And it doesn't hurt to understand the software better either.
Second, I have found very few articles that give a detailed, beginner-level guide to building dynamic-link library (i.e., DLL) programs with the SDK, so Xiao E is kindly sharing here the experience gained from several days of fiddling. If you run into problems in practice, you are welcome to ask questions here too~
Preparation:
Before starting the real work, you should have completed the following: 1. successfully installed the Toolkit and run it; 2. successfully installed the SDK; 3. successfully compiled and run the C-language sample programs in the SDK (if they won't run, check the project properties under Linker -> Additional Library Directories; a misconfiguration there produces .lib link errors); 4. gained some familiarity with how DLL files are used (come on, at least know what they are for, and how they relate to .lib import libraries).
If you haven't completed the steps above, please download from [3] first~
The real work:
The exciting main steps are about to begin!!! We will tackle the following major difficulties here: 1. how to turn a sample project that produces an .exe into one that produces a .dll; 2. how to test the DLL project; 3. how to release the final product.
Step 1:
Find a relatively simple, single-file project among the SDK's C samples (yes, we're being lazy...). Xiao E uses "cudaOpenMP" here. Remember to make a backup copy of it first (otherwise, once you've mangled it beyond recognition and want to look at the original sample project again, you'll be in tears~). Open the project and run it; the result should look like this: (Fig.1)
Fig.1
Our goal in this exercise is to split this project into two: one project produces the .dll and the .lib import library, and the other produces an .exe that loads and tests our DLL. The original project contains only a single .cu file, listed here:
#include <omp.h>
#include <stdio.h>      // stdio functions are used since C++ streams aren't necessarily thread safe
#include <cutil_inline.h>

using namespace std;

// a simple kernel that simply increments each array element by b
__global__ void kernelAddConstant(int *g_a, const int b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_a[idx] += b;
}

// a predicate that checks whether each array element is set to its index plus b
int correctResult(int *data, const int n, const int b)
{
    for(int i = 0; i < n; i++)
        if(data[i] != i + b)
            return 0;
    return 1;
}

int main(int argc, char *argv[])
{
    int num_gpus = 0;   // number of CUDA GPUs

    /////////////////////////////////////////////////////////////////
    // determine the number of CUDA capable GPUs
    //
    cudaGetDeviceCount(&num_gpus);
    if(num_gpus < 1)
    {
        printf("no CUDA capable devices were detected\n");
        return 1;
    }

    /////////////////////////////////////////////////////////////////
    // display CPU and GPU configuration
    //
    printf("number of host CPUs:\t%d\n", omp_get_num_procs());
    printf("number of CUDA devices:\t%d\n", num_gpus);
    for(int i = 0; i < num_gpus; i++)
    {
        cudaDeviceProp dprop;
        cudaGetDeviceProperties(&dprop, i);
        printf("   %d: %s\n", i, dprop.name);
    }
    printf("---------------------------\n");

    /////////////////////////////////////////////////////////////////
    // initialize data
    //
    unsigned int n = num_gpus * 8192;
    unsigned int nbytes = n * sizeof(int);
    int *a = 0;     // pointer to data on the CPU
    int b = 3;      // value by which the array is incremented
    a = (int*)malloc(nbytes);
    if(0 == a)
    {
        printf("couldn't allocate CPU memory\n");
        return 1;
    }
    for(unsigned int i = 0; i < n; i++)
        a[i] = i;

    // run as many CPU threads as there are CUDA devices
    //   each CPU thread controls a different device, processing its
    //   portion of the data.  It's possible to use more CPU threads
    //   than there are CUDA devices, in which case several CPU
    //   threads will be allocating resources and launching kernels
    //   on the same device.  For example, try omp_set_num_threads(2*num_gpus);
    //   Recall that all variables declared inside an "omp parallel" scope are
    //   local to each CPU thread
    //
    omp_set_num_threads(num_gpus);  // create as many CPU threads as there are CUDA devices
    //omp_set_num_threads(2*num_gpus);// create twice as many CPU threads as there are CUDA devices
#pragma omp parallel
    {
        unsigned int cpu_thread_id = omp_get_thread_num();
        unsigned int num_cpu_threads = omp_get_num_threads();

        // set and check the CUDA device for this CPU thread
        int gpu_id = -1;
        CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));    // "% num_gpus" allows more CPU threads than GPU devices
        CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));
        printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

        int *d_a = 0;   // pointer to memory on the device associated with this CPU thread
        int *sub_a = a + cpu_thread_id * n / num_cpu_threads;   // pointer to this CPU thread's portion of data
        unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;
        dim3 gpu_threads(128);  // 128 threads per block
        dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

        CUDA_SAFE_CALL(cudaMalloc((void**)&d_a, nbytes_per_kernel));
        CUDA_SAFE_CALL(cudaMemset(d_a, 0, nbytes_per_kernel));
        CUDA_SAFE_CALL(cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice));
        kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);

        CUDA_SAFE_CALL(cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost));
        CUDA_SAFE_CALL(cudaFree(d_a));
    }
    printf("---------------------------\n");

    if(cudaSuccess != cudaGetLastError())
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    /////////////////////////////////////////////////////////////////
    // check the result
    //
    if(correctResult(a, n, b))
        printf("PASSED\n");
    else
        printf("FAILED\n");

    free(a);    // free CPU memory
    cudaThreadExit();
    cutilExit(argc, argv);

    return 0;
}
Step 2:
Modify the project properties so that it becomes a DLL project. Two main things need to change: 1. a DLL is a product of the Windows subsystem, not the console subsystem; 2. the output file must be set to a .dll. Without further ado, here is Fig.2:
Fig.2
1. Right-click the project in Solution Explorer, choose Properties -> Configuration Properties -> General, and Fig.2 appears. Find the "Configuration Type" option and change .exe to .dll.
2. Under Linker, modify the "Output File" option as in Fig.3, changing cudaOpenMP.exe to cudaOpenMP.dll.
Fig.3
3. Under Linker -> System, change the SubSystem option from Console to Windows (see Fig.4).
Fig.4
4. Under CUDA Build Rule v3.0.xx -> Preprocessor, add WIN32;_WINDOWS;_DEBUG;_USRDLL;_WINDLL to the Preprocessor Definitions (see Fig.5).
Fig.5
With that, the property configuration for the DLL project is complete.
Step 3:
Modify the source: remove the main function from the original cudaOpenMP.cu, keeping only the two working functions, and create a .h file to hold the function declarations. This step is not much different from building an ordinary C++ DLL project. The modified source files and their relationship (Fig.6) are given directly:
Fig.6
cudaOpenMP.cu (modified)
#include <windows.h>
#include <omp.h>
#include <stdio.h>      // stdio functions are used since C++ streams aren't necessarily thread safe
#include <cutil_inline.h>
#include <cudaOpenheader.h>

BOOL WINAPI DllMain(HINSTANCE hInstance, DWORD fdwReason, PVOID pvReserved)
{
    return TRUE;
}

// a simple kernel that simply increments each array element by b
EXPORT __global__ void kernelAddConstant(int *g_a, const int b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_a[idx] += b;
}

// a predicate that checks whether each array element is set to its index plus b
EXPORT int correctResult(int *data, const int n, const int b)
{
    for(int i = 0; i < n; i++)
        if(data[i] != i + b)
            return 0;
    return 1;
}
cudaOpenheader.h
// in the .h file we don't need any build tool such as the "CUDA Build Rule"
#ifdef __cplusplus
#define EXPORT extern "C" __declspec(dllexport)
#else
#define EXPORT __declspec(dllexport)
#endif

EXPORT __global__ void kernelAddConstant(int *g_a, const int b);
EXPORT int correctResult(int *data, const int n, const int b);
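One caveat worth flagging: whether a `__global__` function can be meaningfully exported with __declspec(dllexport) and then launched with <<<>>> from another module depends on the CUDA toolchain version. If the link or the launch fails in your setup, a common workaround is to keep the kernel internal to the DLL and export a plain host-side wrapper that performs the launch. A hedged sketch of that alternative (the wrapper name runAddConstant is my own invention, not part of the SDK sample):

// inside the DLL: the kernel stays un-exported, a host wrapper does the launch
__global__ void kernelAddConstant(int *g_a, const int b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_a[idx] += b;
}

// exported host entry point: the calling exe never touches <<<>>> syntax itself
EXPORT void runAddConstant(int *d_a, int b, dim3 blocks, dim3 threads)
{
    kernelAddConstant<<<blocks, threads>>>(d_a, b);
}

With this layout only host functions cross the DLL boundary, which sidesteps the question of exporting device code entirely.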
Now let's compile the project. Did it succeed? If not, you are welcome to discuss it here.
Step 4:
Add a test .exe program. As everyone knows, a DLL alone won't run; you also need an .exe that calls it. So let's get going~
1. Make a backup of the sample project "template" in the SDK.
2. Add the project: at the top left, click File -> Add -> Existing Project and select template. The project should now appear in our existing solution, as in Fig.7.
Fig.7
3. In the template project, keep only the file template.cu and delete the rest. Then create cudaOpenheader.h in the same directory and copy in the file of the same name from the cudaOpenMP project, as shown in Fig.8.
Fig.8
Replace the code in template.cu with the main function from the original cudaOpenMP.cu, then touch it up a little :) The code:
template.cu
#include <omp.h>
#include <stdio.h>      // stdio functions are used since C++ streams aren't necessarily thread safe
#include <cutil_inline.h>
#include <cudaOpenheader.h>

using namespace std;

int main(int argc, char *argv[])
{
    int num_gpus = 0;   // number of CUDA GPUs

    /////////////////////////////////////////////////////////////////
    // determine the number of CUDA capable GPUs
    //
    cudaGetDeviceCount(&num_gpus);
    if(num_gpus < 1)
    {
        printf("no CUDA capable devices were detected\n");
        return 1;
    }

    /////////////////////////////////////////////////////////////////
    // display CPU and GPU configuration
    //
    printf("number of host CPUs:\t%d\n", omp_get_num_procs());
    printf("number of CUDA devices:\t%d\n", num_gpus);
    for(int i = 0; i < num_gpus; i++)
    {
        cudaDeviceProp dprop;
        cudaGetDeviceProperties(&dprop, i);
        printf("   %d: %s\n", i, dprop.name);
    }
    printf("---------------------------\n");

    /////////////////////////////////////////////////////////////////
    // initialize data
    //
    unsigned int n = num_gpus * 8192;
    unsigned int nbytes = n * sizeof(int);
    int *a = 0;     // pointer to data on the CPU
    int b = 3;      // value by which the array is incremented
    a = (int*)malloc(nbytes);
    if(0 == a)
    {
        printf("couldn't allocate CPU memory\n");
        return 1;
    }
    for(unsigned int i = 0; i < n; i++)
        a[i] = i;

    // run as many CPU threads as there are CUDA devices
    //   each CPU thread controls a different device, processing its
    //   portion of the data.  It's possible to use more CPU threads
    //   than there are CUDA devices, in which case several CPU
    //   threads will be allocating resources and launching kernels
    //   on the same device.  For example, try omp_set_num_threads(2*num_gpus);
    //   Recall that all variables declared inside an "omp parallel" scope are
    //   local to each CPU thread
    //
    omp_set_num_threads(num_gpus);  // create as many CPU threads as there are CUDA devices
    //omp_set_num_threads(2*num_gpus);// create twice as many CPU threads as there are CUDA devices
#pragma omp parallel
    {
        unsigned int cpu_thread_id = omp_get_thread_num();
        unsigned int num_cpu_threads = omp_get_num_threads();

        // set and check the CUDA device for this CPU thread
        int gpu_id = -1;
        CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));    // "% num_gpus" allows more CPU threads than GPU devices
        CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));
        printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

        int *d_a = 0;   // pointer to memory on the device associated with this CPU thread
        int *sub_a = a + cpu_thread_id * n / num_cpu_threads;   // pointer to this CPU thread's portion of data
        unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;
        dim3 gpu_threads(128);  // 128 threads per block
        dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

        CUDA_SAFE_CALL(cudaMalloc((void**)&d_a, nbytes_per_kernel));
        CUDA_SAFE_CALL(cudaMemset(d_a, 0, nbytes_per_kernel));
        CUDA_SAFE_CALL(cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice));
        kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);

        CUDA_SAFE_CALL(cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost));
        CUDA_SAFE_CALL(cudaFree(d_a));
    }
    printf("---------------------------\n");

    if(cudaSuccess != cudaGetLastError())
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    /////////////////////////////////////////////////////////////////
    // check the result
    //
    if(correctResult(a, n, b))
        printf("PASSED\n");
    else
        printf("FAILED\n");

    free(a);    // free CPU memory
    cudaThreadExit();
    cutilExit(argc, argv);

    return 0;
}
Step 5:
Set up the relationship between the two projects in this solution, template and cudaOpenMP. With the template project selected in Solution Explorer, choose menu -> Project -> Set as StartUp Project, as shown in Fig.9.
Fig.9
Then, don't forget to click Project Dependencies below and make template depend on cudaOpenMP.
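Setting the dependency gives you implicit linking through the cudaOpenMP.lib import library. As an alternative smoke test of the DLL, you can also load it explicitly at run time. A Windows-only sketch, assuming the DLL sits next to the exe and exports correctResult as in the header above (this test program is my own addition, not part of the SDK sample):

#include <windows.h>
#include <stdio.h>

typedef int (*correctResult_t)(int*, int, int);

int main()
{
    HMODULE h = LoadLibrary(TEXT("cudaOpenMP.dll"));
    if (!h) { printf("LoadLibrary failed\n"); return 1; }

    // the name is unmangled because the header wraps it in extern "C"
    correctResult_t correctResult =
        (correctResult_t)GetProcAddress(h, "correctResult");
    if (!correctResult) { printf("GetProcAddress failed\n"); return 1; }

    int data[4] = {3, 4, 5, 6};     // data[i] == i + 3, so the check should pass
    printf("correctResult: %d\n", correctResult(data, 4, 3));
    FreeLibrary(h);
    return 0;
}

If LoadLibrary succeeds but GetProcAddress returns NULL, check the exported names with a tool such as dumpbin /exports.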
Step 6:
Compile~~~~~~~~~ Did it work? If the output matches Fig.1, congratulations!!!
But don't be discouraged if it didn't work; feel free to talk it over with Xiao E~
Finally, if you want to release the program, keep the whole solution selected in Solution Explorer, then go to menu -> Project -> Properties and change every debug setting to release. A word of warning from Xiao E: after switching to release, it is quite possible that all the menu -> Project -> Properties settings need to be configured all over again, since they revert to their defaults.
For the final result, refer back to your settings from Fig.3. Heh, I won't spell it out in too much detail; Xiao E is a bit lazy.
Afterword:
Today is the Lantern Festival~ so Xiao E wishes everyone reunion and happiness. If you think this write-up is even a little bit useful, leave Xiao E a comment to cheer me on~ And if anything is unclear, point it out and we can discuss it together.
References:
[1] http://www.nvidia.com/object/parallel-nsight.html
[2] http://www.nvidia.com/object/parallel-nsight-requirements.html
[3] http://developer.nvidia.com/object/cuda_3_2_downloads.html
[4] Programming Windows, Charles Petzold, Peking University Press
