Introduction:
First of all, NVIDIA's Nsight is frankly an inconsiderate product; the line has already reached the Parallel Nsight 1.5 release. Although the product itself is now free [1], installing it requires first installing a Visual Studio patch, and once you install that patch, your perfectly good unactivated VS instantly drops into trial mode. Worse still, even if you actually bought a legitimate copy of VS: congratulations, if you want to debug CUDA programs you need to buy two graphics cards [2]!!! I won't even get into what NVIDIA cards cost. For beginners like us this is simply a nightmare. So, as an alternative, we will develop with the SDK instead. It is a bit more trouble, and many things have to be configured by hand, but it saves money!!! And it doesn't hurt to understand the software better either.
Second, I have found very few articles that give a detailed, beginner-level guide to building dynamic-link library (i.e., DLL) programs with the SDK, so Xiao E is kindly sharing here the experience gained from several days of fiddling. If you run into problems in practice, you are welcome to ask questions here too~
Preparation:
Before starting the real work, you should have completed the following: 1. successfully installed the Toolkit and run it; 2. successfully installed the SDK; 3. successfully compiled and run the C-language sample programs in the SDK (if they won't run, check the project properties under Linker -> Additional Library Directories; a misconfiguration there produces .lib link errors); 4. gained some familiarity with how DLL files are used (come on, at least know what they are for, and how they relate to .lib import libraries).
If you haven't completed the steps above, please download from [3] first~
The real work:
The exciting main steps are about to begin!!! We will tackle the following major difficulties here: 1. how to turn a sample project that produces an .exe into one that produces a .dll; 2. how to test the DLL project; 3. how to release the final product.
Step 1:
Find a relatively simple, single-file project among the SDK's C samples (yes, we're being lazy...). Xiao E uses "cudaOpenMP" here. Remember to make a backup copy of it first (otherwise, once you've mangled it beyond recognition and want to look at the original sample project again, you'll be in tears~). Open the project and run it; the result should look like this: (Fig.1)
Fig.1
Our goal in this exercise is to split this project into two: one project produces the .dll and the .lib import library, and the other produces an .exe that loads and tests our DLL. The original project contains only a single .cu file, listed here:
#include <omp.h>
#include <stdio.h>      // stdio functions are used since C++ streams aren't necessarily thread safe
#include <cutil_inline.h>

using namespace std;

// a simple kernel that simply increments each array element by b
__global__ void kernelAddConstant(int *g_a, const int b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_a[idx] += b;
}

// a predicate that checks whether each array element is set to its index plus b
int correctResult(int *data, const int n, const int b)
{
    for(int i = 0; i < n; i++)
        if(data[i] != i + b)
            return 0;
    return 1;
}

int main(int argc, char *argv[])
{
    int num_gpus = 0;   // number of CUDA GPUs

    /////////////////////////////////////////////////////////////////
    // determine the number of CUDA capable GPUs
    //
    cudaGetDeviceCount(&num_gpus);
    if(num_gpus < 1)
    {
        printf("no CUDA capable devices were detected\n");
        return 1;
    }

    /////////////////////////////////////////////////////////////////
    // display CPU and GPU configuration
    //
    printf("number of host CPUs:\t%d\n", omp_get_num_procs());
    printf("number of CUDA devices:\t%d\n", num_gpus);
    for(int i = 0; i < num_gpus; i++)
    {
        cudaDeviceProp dprop;
        cudaGetDeviceProperties(&dprop, i);
        printf("   %d: %s\n", i, dprop.name);
    }
    printf("---------------------------\n");

    /////////////////////////////////////////////////////////////////
    // initialize data
    //
    unsigned int n = num_gpus * 8192;
    unsigned int nbytes = n * sizeof(int);
    int *a = 0;     // pointer to data on the CPU
    int b = 3;      // value by which the array is incremented
    a = (int*)malloc(nbytes);
    if(0 == a)
    {
        printf("couldn't allocate CPU memory\n");
        return 1;
    }
    for(unsigned int i = 0; i < n; i++)
        a[i] = i;

    // run as many CPU threads as there are CUDA devices
    //   each CPU thread controls a different device, processing its
    //   portion of the data.  It's possible to use more CPU threads
    //   than there are CUDA devices, in which case several CPU
    //   threads will be allocating resources and launching kernels
    //   on the same device.  For example, try omp_set_num_threads(2*num_gpus);
    //   Recall that all variables declared inside an "omp parallel" scope are
    //   local to each CPU thread
    //
    omp_set_num_threads(num_gpus);  // create as many CPU threads as there are CUDA devices
    //omp_set_num_threads(2*num_gpus);// create twice as many CPU threads as there are CUDA devices
#pragma omp parallel
    {
        unsigned int cpu_thread_id = omp_get_thread_num();
        unsigned int num_cpu_threads = omp_get_num_threads();

        // set and check the CUDA device for this CPU thread
        int gpu_id = -1;
        CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));    // "% num_gpus" allows more CPU threads than GPU devices
        CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));
        printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

        int *d_a = 0;   // pointer to memory on the device associated with this CPU thread
        int *sub_a = a + cpu_thread_id * n / num_cpu_threads;   // pointer to this CPU thread's portion of data
        unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;
        dim3 gpu_threads(128);  // 128 threads per block
        dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

        CUDA_SAFE_CALL(cudaMalloc((void**)&d_a, nbytes_per_kernel));
        CUDA_SAFE_CALL(cudaMemset(d_a, 0, nbytes_per_kernel));
        CUDA_SAFE_CALL(cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice));
        kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);

        CUDA_SAFE_CALL(cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost));
        CUDA_SAFE_CALL(cudaFree(d_a));
    }
    printf("---------------------------\n");

    if(cudaSuccess != cudaGetLastError())
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    /////////////////////////////////////////////////////////////////
    // check the result
    //
    if(correctResult(a, n, b))
        printf("PASSED\n");
    else
        printf("FAILED\n");

    free(a);    // free CPU memory
    cudaThreadExit();
    cutilExit(argc, argv);

    return 0;
}
Step 2:
Modify the project properties so that it becomes a DLL project. Two main things need to change: 1. a DLL is a product of the Windows subsystem, not the console subsystem; 2. the output file must be set to a .dll. Without further ado, here is Fig.2:
Fig.2
1. Right-click the project in Solution Explorer, choose Properties -> Configuration Properties -> General, and Fig.2 appears. Find the "Configuration Type" option and change .exe to .dll.
2. Under Linker, modify the "Output File" option as in Fig.3, changing cudaOpenMP.exe to cudaOpenMP.dll.
Fig.3
3. Under Linker -> System, change the SubSystem option from Console to Windows (see Fig.4).
Fig.4
4. Under CUDA Build Rule v3.0.xx -> Preprocessor, add WIN32;_WINDOWS;_DEBUG;_USRDLL;_WINDLL to the Preprocessor Definitions (see Fig.5).
Fig.5
With that, the property configuration for the DLL project is complete.
Step 3:
Modify the source: remove the main function from the original cudaOpenMP.cu, keeping only the two working functions, and create a .h file to hold the function declarations. This step is not much different from building an ordinary C++ DLL project. The modified source files and their relationship (Fig.6) are given directly:
Fig.6
cudaOpenMP.cu (modified)
#include <windows.h>
#include <omp.h>
#include <stdio.h>      // stdio functions are used since C++ streams aren't necessarily thread safe
#include <cutil_inline.h>
#include <cudaOpenheader.h>

BOOL WINAPI DllMain(HINSTANCE hInstance, DWORD fdwReason, PVOID pvReserved)
{
    return TRUE;
}

// a simple kernel that simply increments each array element by b
EXPORT __global__ void kernelAddConstant(int *g_a, const int b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_a[idx] += b;
}

// a predicate that checks whether each array element is set to its index plus b
EXPORT int correctResult(int *data, const int n, const int b)
{
    for(int i = 0; i < n; i++)
        if(data[i] != i + b)
            return 0;
    return 1;
}
cudaOpenheader.h
// in the .h file we don't need any build tool such as the "CUDA Build Rule"
#ifdef __cplusplus
#define EXPORT extern "C" __declspec(dllexport)
#else
#define EXPORT __declspec(dllexport)
#endif

EXPORT __global__ void kernelAddConstant(int *g_a, const int b);
EXPORT int correctResult(int *data, const int n, const int b);
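One caveat worth flagging: whether a `__global__` function can be meaningfully exported with __declspec(dllexport) and then launched with <<<>>> from another module depends on the CUDA toolchain version. If the link or the launch fails in your setup, a common workaround is to keep the kernel internal to the DLL and export a plain host-side wrapper that performs the launch. A hedged sketch of that alternative (the wrapper name runAddConstant is my own invention, not part of the SDK sample):

// inside the DLL: the kernel stays un-exported, a host wrapper does the launch
__global__ void kernelAddConstant(int *g_a, const int b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_a[idx] += b;
}

// exported host entry point: the calling exe never touches <<<>>> syntax itself
EXPORT void runAddConstant(int *d_a, int b, dim3 blocks, dim3 threads)
{
    kernelAddConstant<<<blocks, threads>>>(d_a, b);
}

With this layout only host functions cross the DLL boundary, which sidesteps the question of exporting device code entirely.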
Now let's compile the project. Did it succeed? If not, you are welcome to discuss it here.
Step 4:
Add a test .exe program. As everyone knows, a DLL alone won't run; you also need an .exe that calls it. So let's get going~
1. Make a backup of the sample project "template" in the SDK.
2. Add the project: at the top left, click File -> Add -> Existing Project and select template. The project should now appear in our existing solution, as in Fig.7.
Fig.7
3. In the template project, keep only the file template.cu and delete the rest. Then create cudaOpenheader.h in the same directory and copy in the file of the same name from the cudaOpenMP project, as shown in Fig.8.
Fig.8
Replace the code in template.cu with the main function from the original cudaOpenMP.cu, then touch it up a little :) The code:
template.cu
#include <omp.h>
#include <stdio.h>      // stdio functions are used since C++ streams aren't necessarily thread safe
#include <cutil_inline.h>
#include <cudaOpenheader.h>

using namespace std;

int main(int argc, char *argv[])
{
    int num_gpus = 0;   // number of CUDA GPUs

    /////////////////////////////////////////////////////////////////
    // determine the number of CUDA capable GPUs
    //
    cudaGetDeviceCount(&num_gpus);
    if(num_gpus < 1)
    {
        printf("no CUDA capable devices were detected\n");
        return 1;
    }

    /////////////////////////////////////////////////////////////////
    // display CPU and GPU configuration
    //
    printf("number of host CPUs:\t%d\n", omp_get_num_procs());
    printf("number of CUDA devices:\t%d\n", num_gpus);
    for(int i = 0; i < num_gpus; i++)
    {
        cudaDeviceProp dprop;
        cudaGetDeviceProperties(&dprop, i);
        printf("   %d: %s\n", i, dprop.name);
    }
    printf("---------------------------\n");

    /////////////////////////////////////////////////////////////////
    // initialize data
    //
    unsigned int n = num_gpus * 8192;
    unsigned int nbytes = n * sizeof(int);
    int *a = 0;     // pointer to data on the CPU
    int b = 3;      // value by which the array is incremented
    a = (int*)malloc(nbytes);
    if(0 == a)
    {
        printf("couldn't allocate CPU memory\n");
        return 1;
    }
    for(unsigned int i = 0; i < n; i++)
        a[i] = i;

    // run as many CPU threads as there are CUDA devices
    //   each CPU thread controls a different device, processing its
    //   portion of the data.  It's possible to use more CPU threads
    //   than there are CUDA devices, in which case several CPU
    //   threads will be allocating resources and launching kernels
    //   on the same device.  For example, try omp_set_num_threads(2*num_gpus);
    //   Recall that all variables declared inside an "omp parallel" scope are
    //   local to each CPU thread
    //
    omp_set_num_threads(num_gpus);  // create as many CPU threads as there are CUDA devices
    //omp_set_num_threads(2*num_gpus);// create twice as many CPU threads as there are CUDA devices
#pragma omp parallel
    {
        unsigned int cpu_thread_id = omp_get_thread_num();
        unsigned int num_cpu_threads = omp_get_num_threads();

        // set and check the CUDA device for this CPU thread
        int gpu_id = -1;
        CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));    // "% num_gpus" allows more CPU threads than GPU devices
        CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));
        printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

        int *d_a = 0;   // pointer to memory on the device associated with this CPU thread
        int *sub_a = a + cpu_thread_id * n / num_cpu_threads;   // pointer to this CPU thread's portion of data
        unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;
        dim3 gpu_threads(128);  // 128 threads per block
        dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

        CUDA_SAFE_CALL(cudaMalloc((void**)&d_a, nbytes_per_kernel));
        CUDA_SAFE_CALL(cudaMemset(d_a, 0, nbytes_per_kernel));
        CUDA_SAFE_CALL(cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice));
        kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);

        CUDA_SAFE_CALL(cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost));
        CUDA_SAFE_CALL(cudaFree(d_a));
    }
    printf("---------------------------\n");

    if(cudaSuccess != cudaGetLastError())
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    /////////////////////////////////////////////////////////////////
    // check the result
    //
    if(correctResult(a, n, b))
        printf("PASSED\n");
    else
        printf("FAILED\n");

    free(a);    // free CPU memory
    cudaThreadExit();
    cutilExit(argc, argv);

    return 0;
}
Step 5:
Set up the relationship between the two projects in this solution, template and cudaOpenMP. With the template project selected in Solution Explorer, choose menu -> Project -> Set as StartUp Project, as shown in Fig.9.
Fig.9
Then, don't forget to click Project Dependencies below and make template depend on cudaOpenMP.
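Setting the dependency gives you implicit linking through the cudaOpenMP.lib import library. As an alternative smoke test of the DLL, you can also load it explicitly at run time. A Windows-only sketch, assuming the DLL sits next to the exe and exports correctResult as in the header above (this test program is my own addition, not part of the SDK sample):

#include <windows.h>
#include <stdio.h>

typedef int (*correctResult_t)(int*, int, int);

int main()
{
    HMODULE h = LoadLibrary(TEXT("cudaOpenMP.dll"));
    if (!h) { printf("LoadLibrary failed\n"); return 1; }

    // the name is unmangled because the header wraps it in extern "C"
    correctResult_t correctResult =
        (correctResult_t)GetProcAddress(h, "correctResult");
    if (!correctResult) { printf("GetProcAddress failed\n"); return 1; }

    int data[4] = {3, 4, 5, 6};     // data[i] == i + 3, so the check should pass
    printf("correctResult: %d\n", correctResult(data, 4, 3));
    FreeLibrary(h);
    return 0;
}

If LoadLibrary succeeds but GetProcAddress returns NULL, check the exported names with a tool such as dumpbin /exports.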
Step 6:
Compile~~~~~~~~~ Did it work? If the output matches Fig.1, congratulations!!!
But don't be discouraged if it didn't work; feel free to talk it over with Xiao E~
Finally, if you want to release the program, keep the whole solution selected in Solution Explorer, then go to menu -> Project -> Properties and change every debug setting to release. A word of warning from Xiao E: after switching to release, it is quite possible that all the menu -> Project -> Properties settings need to be configured all over again, since they revert to their defaults.
For the final result, refer back to your settings from Fig.3. Heh, I won't spell it out in too much detail; Xiao E is a bit lazy.
Afterword:
Today is the Lantern Festival~ so Xiao E wishes everyone reunion and happiness. If you think this write-up is even a little bit useful, leave Xiao E a comment to cheer me on~ And if anything is unclear, point it out and we can discuss it together.
References:
[1] http://www.nvidia.com/object/parallel-nsight.html
[2] http://www.nvidia.com/object/parallel-nsight-requirements.html
[3] http://developer.nvidia.com/object/cuda_3_2_downloads.html
[4] Programming Windows, Charles Petzold, Peking University Press
