CUDA 5.0 with Visual C++ 2010 Express: The very first program

This post explains how to configure CUDA 5.0 (exactly 5.0.35) with Visual C++ 2010 Express on Windows 7. It also covers a few related issues: how to compile a CUDA program, how to measure the runtime of a function or a part of the code, how to make Visual C++ and Visual Assist X aware of CUDA C++ code, and finally the answer to the question: "Is it possible to program (write and compile code only) on a non-CUDA machine?"

1. Installation

You need two programs: Visual C++ 2010 Express and the CUDA 5 toolkit (32-bit or 64-bit, depending on your system). After downloading them, install Visual C++ first, then CUDA (select all the options). There is nothing special about this step.

2. Write your first program

The files of a CUDA program fall into two types: normal C++ source files (*.cpp, *.h, etc.) and CUDA C++ files (*.cu and *.cuh). CUDA source files must be compiled by NVCC (the compiler from Nvidia), and the resulting binary code is combined with the code from the normal C++ files, which are compiled by the Visual C++ compiler. The problem is how to make this compilation run smoothly. Here are the steps for writing a CUDA program:

+ Open Visual C++ 2010 Express.

+ Choose File->New->Project->Empty Project and enter the name of the project, for example Exp1.

+ In the Solution Explorer tab, add a new source file to your project: choose the C++ File (.cpp) type and enter main.cu as the file name.

[Screenshot: adding main.cu as a new C++ file]

+ Write your code:

/**
 * Matrix multiplication using the (legacy) cuBLAS API.
 */

#include <cstdlib>
#include <iostream>
#include <string>

#include <cuda_runtime.h> // cudaEvent_* timing functions
#include <cublas.h>       // legacy cuBLAS API (cublasInit, cublasAlloc, ...)

typedef float ScalarT;

// Some helper functions //
/**
 * Maps (row r, column c) of a matrix with `rows` rows to a 1D index
 * in column-major order, the storage layout cuBLAS expects.
 */
#define index(r,c,rows) (((c)*(rows))+(r))

// Aborts with the file name and line number if a cuBLAS call did not succeed.
#define CudaSafeCall( err ) __cudaSafeCall( err, __FILE__, __LINE__ )
inline void __cudaSafeCall( cublasStatus err, const char *file, const int line )
{
    if( err != CUBLAS_STATUS_SUCCESS )
    {
        std::cerr << "CUDA call failed at " << file << ":" << line << std::endl;
        exit( EXIT_FAILURE );
    }
}

// Aborts with the file name and line number if a host allocation returned null.
#define AllocCheck( err ) __allocCheck( err, __FILE__, __LINE__ )
inline void __allocCheck( void* err, const char *file, const int line )
{
    if( err == 0 )
    {
        std::cerr << "Allocation failed at " << file << ":" << line << std::endl;
        exit( EXIT_FAILURE );
    }
}

// Prints the top-left corner of a column-major matrix.
void printMat( const ScalarT* const mat, size_t rows, size_t columns, const std::string& prefix = "Matrix:" )
{
    // Maximum to print
    const size_t max_rows = 5;
    const size_t max_columns = 16;

    std::cout << prefix << std::endl;
    for( size_t r = 0; r < rows && r < max_rows; ++r )
    {
        for( size_t c = 0; c < columns && c < max_columns; ++c )
        {
            std::cout << mat[index(r,c,rows)] << " ";
        }
        std::cout << std::endl;
    }
}
// Main program //
int main( int argc, char** argv )
{
    // Dimensions (H = height/rows, W = width/columns).
    // A is HA x WA, B is HB x WB, and the product C = A * B is HC x WC.
    size_t HA = 4200;
    size_t WA = 23000;
    size_t WB = 1300;
    size_t HB = WA;
    size_t WC = WB;
    size_t HC = HA;

    size_t r, c;

    cudaEvent_t tAllStart, tAllEnd;
    cudaEvent_t tKernelStart, tKernelEnd;
    float time;

    // Prepare host memory and input data //
    ScalarT* A = ( ScalarT* )malloc( HA * WA * sizeof(ScalarT) );
    AllocCheck( A );
    ScalarT* B = ( ScalarT* )malloc( HB * WB * sizeof(ScalarT) );
    AllocCheck( B );
    ScalarT* C = ( ScalarT* )malloc( HC * WC * sizeof(ScalarT) );
    AllocCheck( C );

    // Fill A and B with their own 1D indices as test data.
    for( r = 0; r < HA; r++ )
    {
        for( c = 0; c < WA; c++ )
        {
            A[index(r,c,HA)] = ( ScalarT )index(r,c,HA);
        }
    }

    for( r = 0; r < HB; r++ )
    {
        for( c = 0; c < WB; c++ )
        {
            B[index(r,c,HB)] = ( ScalarT )index(r,c,HB);
        }
    }

    // Initialize cuBLAS //

    cublasStatus status;
    status = cublasInit();
    CudaSafeCall( status );

    // Prepare device memory //
    ScalarT* dev_A;
    ScalarT* dev_B;
    ScalarT* dev_C;

    status = cublasAlloc( HA * WA, sizeof(ScalarT), ( void** )&dev_A );
    CudaSafeCall( status );

    status = cublasAlloc( HB * WB, sizeof(ScalarT), ( void** )&dev_B );
    CudaSafeCall( status );

    status = cublasAlloc( HC * WC, sizeof(ScalarT), ( void** )&dev_C );
    CudaSafeCall( status );

    // Start the overall timer (data transfer + multiplication) //
    cudaEventCreate(&tAllStart);
    cudaEventCreate(&tAllEnd);
    cudaEventRecord(tAllStart, 0);

    // Copy the input matrices to the device //
    status = cublasSetMatrix( HA, WA, sizeof(ScalarT), A, HA, dev_A, HA );
    CudaSafeCall( status );

    status = cublasSetMatrix( HB, WB, sizeof(ScalarT), B, HB, dev_B, HB );
    CudaSafeCall( status );

    // Call cuBLAS function //
    cudaEventCreate(&tKernelStart);
    cudaEventCreate(&tKernelEnd);
    cudaEventRecord(tKernelStart, 0);

    // The legacy cuBLAS API used here takes the transpose flags as characters.
    // Passing the enum constant CUBLAS_OP_N instead produces a runtime error,
    // so a separate character constant is used.
    const char NO_TRANS = 'n'; // 'n' indicates that the matrices are non-transposed.
    cublasSgemm( NO_TRANS, NO_TRANS, HA, WB, WA, 1.0f, dev_A, HA, dev_B, HB, 0.0f, dev_C, HC ); // call for float
    // cublasDgemm( NO_TRANS, NO_TRANS, HA, WB, WA, 1.0, dev_A, HA, dev_B, HB, 0.0, dev_C, HC ); // call for double
    status = cublasGetError();
    CudaSafeCall( status );

    cudaEventRecord(tKernelEnd, 0);
    cudaEventSynchronize(tKernelEnd);

    cudaEventElapsedTime(&time, tKernelStart, tKernelEnd);
    std::cout << "time (kernel only): " << time << "ms" << std::endl;

    // Load result from device //
    // The return value must be stored, otherwise the check below tests a stale status.
    status = cublasGetMatrix( HC, WC, sizeof(ScalarT), dev_C, HC, C, HC );
    CudaSafeCall( status );

    cudaEventRecord(tAllEnd, 0);
    cudaEventSynchronize(tAllEnd);

    cudaEventElapsedTime(&time, tAllStart, tAllEnd);

    std::cout << "time (incl. data transfer): " << time << "ms" << std::endl;

    // Print result //
    //printMat( A, HA, WA, "\nMatrix A:" );
    //printMat( B, HB, WB, "\nMatrix B:" );
    //printMat( C, HC, WC, "\nMatrix C:" );

    // Release the timing events //
    cudaEventDestroy(tKernelStart);
    cudaEventDestroy(tKernelEnd);
    cudaEventDestroy(tAllStart);
    cudaEventDestroy(tAllEnd);

    // Free CUDA memory //
    status = cublasFree( dev_A );
    CudaSafeCall( status );

    status = cublasFree( dev_B );
    CudaSafeCall( status );

    status = cublasFree( dev_C );
    CudaSafeCall( status );

    status = cublasShutdown();
    CudaSafeCall( status );

    // Free host memory //
    free( A );
    free( B );
    free( C );

    return EXIT_SUCCESS;
}

+ Configure the project as a CUDA project. In the Solution Explorer, right-click the name of the project and choose Build Customizations. In the dialog that appears, check the CUDA 5.0 option, then click OK.

[Screenshots: the Build Customizations menu entry and the dialog with CUDA 5.0 checked]

+ Right-click the CUDA code file (main.cu in this example) and choose Properties. In the dialog that appears, set Item Type to CUDA C/C++, as in the image below:

[Screenshot: file properties with CUDA C/C++ selected]

+ In the Property Manager tab (View->Property Manager), right-click Microsoft.Cpp.Win32.user, as in the image below, and choose Properties.

[Screenshot: Microsoft.Cpp.Win32.user in the Property Manager]

+ Under VC++ Directories, add the paths of the CUDA include, reference, and library folders, as in the images below (do not close the dialog after this step):

[Screenshots: CUDA paths added under VC++ Directories]

+ In the Linker tree, choose Input and add the library files needed for CUDA programs, as in the image below. This example calls cuBLAS, so cublas.lib must be among them:

[Screenshot: CUDA libraries added under Linker->Input]

+ You will be asked to save the configuration (it applies to all CUDA programs); choose Yes. The configuration steps that start in the Property Manager above are needed only once.
Now you can build your program (use the Release configuration).
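
Once the program builds, it can help to check the cuBLAS result against a plain CPU version on small matrices. The sketch below is not part of the original program: colMajorIndex and referenceGemm are hypothetical helper names. It only mirrors the cublasSgemm call in the listing for the non-transposed case with alpha = 1 and beta = 0, and it uses the same column-major layout as the index macro.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: same mapping as the index(r,c,rows) macro in the listing.
inline std::size_t colMajorIndex(std::size_t r, std::size_t c, std::size_t rows)
{
    return c * rows + r;
}

// Hypothetical CPU reference: returns C = A * B with all matrices stored
// column-major. A is m x k, B is k x n, and the returned C is m x n.
std::vector<float> referenceGemm(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 std::size_t m, std::size_t n, std::size_t k)
{
    assert(A.size() == m * k && B.size() == k * n);

    std::vector<float> C(m * n, 0.0f);
    for (std::size_t col = 0; col < n; ++col)
    {
        for (std::size_t p = 0; p < k; ++p)
        {
            const float b = B[colMajorIndex(p, col, k)];
            // Walk down one column of A and C, which is contiguous in memory.
            for (std::size_t row = 0; row < m; ++row)
            {
                C[colMajorIndex(row, col, m)] += A[colMajorIndex(row, p, m)] * b;
            }
        }
    }
    return C;
}
```

With the dimensions in the listing reduced to something small, the output of referenceGemm(A, B, HA, WB, WA) can be compared element by element with the C array copied back by cublasGetMatrix, allowing a small tolerance for floating-point rounding.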

3. Timing measurement

In earlier versions of CUDA (before 5.0) there were two ways to measure the running time of a program, a function, or a part of the program. In CUDA 5 (to the best of my knowledge) only one remains: using cudaEvent_t.

+ Declaration:

cudaEvent_t tAllStart, tAllEnd;
float time;

+ Start recording time information:

cudaEventCreate(&tAllStart);
cudaEventCreate(&tAllEnd);
cudaEventRecord(tAllStart, 0);

+ Stop recording time information:

cudaEventRecord(tAllEnd, 0);
cudaEventSynchronize(tAllEnd);

+ Get the elapsed time (in milliseconds) and output it:

cudaEventElapsedTime(&time, tAllStart, tAllEnd);
std::cout << "time (incl. data transfer): " << time << "ms" << std::endl;
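
CUDA events time work submitted to the GPU. For purely host-side code, such as the loops that fill A and B in the example, a plain C++ timer is a possible complement. The sketch below is an assumption of mine and not part of the CUDA toolkit: ticksToMs and HostTimer are hypothetical helpers built on std::clock from <ctime>. Note that std::clock reports processor time on most platforms, while Microsoft's C runtime reports elapsed wall-clock time, so treat the numbers as rough.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: converts a difference of std::clock() ticks to milliseconds.
inline double ticksToMs(std::clock_t start, std::clock_t end)
{
    assert(end >= start);
    return 1000.0 * static_cast<double>(end - start) / CLOCKS_PER_SEC;
}

// Hypothetical helper: starts timing when it is constructed.
class HostTimer
{
public:
    HostTimer() : start_(std::clock()) {}

    // Restarts the measurement from now.
    void reset() { start_ = std::clock(); }

    // Milliseconds since construction or the last reset().
    double elapsedMs() const { return ticksToMs(start_, std::clock()); }

private:
    std::clock_t start_;
};
```

To use it, create a HostTimer just before the host code to be measured and print elapsedMs() right after it. Because GPU work is typically launched asynchronously, a host timer wrapped around GPU calls is only meaningful if the device has been synchronized (for example with cudaDeviceSynchronize()) before elapsedMs() is read; for GPU work the cudaEvent_t approach above remains the better fit.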

4. How to make Visual C++ and Visual Assist X aware of CUDA source files?

You can get this information from the links below:

+ Link 1

+ Link 2

5. Is it possible to program (write code and compile only) on a non-CUDA machine?

This question comes from my own situation: I have one CUDA desktop machine at the lab, which I can control remotely from home, so I would like to write and compile programs on my laptop and then copy the program files to the desktop machine to run them. Fortunately, the answer is yes: we can write and compile CUDA programs on a non-CUDA machine. Install Visual C++ first, then the CUDA toolkit, but do not select the CUDA driver option, since the machine does not have any CUDA device. After that, follow the same configuration steps on the laptop as described above.

Your comments on this topic are welcome!
