[CMake] CMake CUDA 3.8+/9 support as a first-class language without Visual Studio support... err what?

Sun Jul 30 20:15:20 EDT 2017

Saga novella continues:

 >> Next I am going to remove all NVIDIA drivers and try a reinstall of
 >> CUDA 7.5 to see if I can get deviceQuery to report 7.5/7.5.

Removed the NVIDIA 352.65 driver via Add/Remove Programs.
Device Manager -> NVIDIA GeForce GTX 960M -> General now reports "device
has been disabled".

Device Query:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Debug>rem start "Device Query" deviceQuery.exe

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Debug>deviceQuery.exe
deviceQuery.exe Starting...

  CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

OK, so great: no driver installed!

Reinstall of CUDA 7.5.18

Run of DeviceQuery:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Debug>rem start "Device Query" deviceQuery.exe

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Debug>deviceQuery.exe
deviceQuery.exe Starting...

  CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 960M"
   CUDA Driver Version / Runtime Version          7.5 / 7.5
   CUDA Capability Major/Minor version number:    5.0
   Total amount of global memory:                 4096 MBytes (4294967296 bytes)
   ( 5) Multiprocessors, (128) CUDA Cores/MP:     640 CUDA Cores
   GPU Max Clock rate:                            1176 MHz (1.18 GHz)
   Memory Clock rate:                             2505 Mhz
   Memory Bus Width:                              128-bit
   L2 Cache Size:                                 2097152 bytes
   Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
   Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
   Total amount of constant memory:               65536 bytes
   Total amount of shared memory per block:       49152 bytes
   Total number of registers available per block: 65536
   Warp size:                                     32
   Maximum number of threads per multiprocessor:  2048
   Maximum number of threads per block:           1024
   Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
   Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
   Maximum memory pitch:                          2147483647 bytes
   Texture alignment:                             512 bytes
   Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
   Run time limit on kernels:                     Yes
   Integrated GPU sharing Host Memory:            No
   Support host page-locked memory mapping:       Yes
   Alignment requirement for Surfaces:            Yes
   Device has ECC support:                        Disabled
   CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
   Device supports Unified Addressing (UVA):      Yes
   Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
   Compute Mode:
      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 960M
Result = PASS

OK, a return to sanity with 7.5/7.5.

And a return to insanity, as the nbody sample still does not work:

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
         -fullscreen       (run n-body simulation in fullscreen mode)
         -fp64             (use double precision floating point values for simulation)
         -hostmem          (stores simulation data in host memory)
         -benchmark        (run benchmark to measure performance)
         -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
         -device=<d>       (where d=0,1,2.... for the CUDA device to use)
         -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
         -compare          (compares simulation results running once on the default GPU and once on the CPU)
         -cpu              (run n-body simulation on the CPU)
         -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

 > Windowed mode
 > Simulation data stored in video memory
 > Single precision floating point simulation
 > 1 Devices used for simulation
 > Compute 5.0 CUDA device: [GeForce GTX 960M]
CUDA error at c:\programdata\nvidia corporation\cuda samples\v7.5\5_simulations\nbody\bodysystemcuda_impl.h:160 code=46(cudaErrorDevicesUnavailable) "cudaEventCreate(&m_deviceData[0].event)"

At this point there is clearly some very odd behavior with CUDA 7.5 and
the GeForce GTX 960M.  CMake can still build a project, but the result
will not run or allocate memory with cudaMalloc etc.

Installed driver at this point is 353.90.

GeForce Experience reports the 381.65 driver, but I have downloaded:

384.94-notebook-win10-64bit-international-whql.exe

So I tried that, and the installed driver is now 384.94.

CUDA 7.5 works with the new driver, but seemingly not with the drivers
shipped with 7.5 or 8.0.  NBody runs.

CMake 3.9 still fails to build a runnable project:

GPU Device 0: "GeForce GTX 960M" with compute capability 5.0

Current device is [0]
Current device is [0]
CUDA error at C:\projects\cmake\cmaketesting\v3.9\cuda_basic\src\cuda_basic_test.cpp:66 code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void **)&dev_mem_ptr, size)"

DeviceQuery now reports:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Debug>deviceQuery.exe
deviceQuery.exe Starting...

  CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 960M"
   CUDA Driver Version / Runtime Version          9.0 / 7.5
   CUDA Capability Major/Minor version number:    5.0
   Total amount of global memory:                 4096 MBytes (4294967296 bytes)
   ( 5) Multiprocessors, (128) CUDA Cores/MP:     640 CUDA Cores
   GPU Max Clock rate:                            1176 MHz (1.18 GHz)
   Memory Clock rate:                             2505 Mhz
   Memory Bus Width:                              128-bit
   L2 Cache Size:                                 2097152 bytes
   Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
   Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
   Total amount of constant memory:               65536 bytes
   Total amount of shared memory per block:       49152 bytes
   Total number of registers available per block: 65536
   Warp size:                                     32
   Maximum number of threads per multiprocessor:  2048
   Maximum number of threads per block:           1024
   Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
   Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
   Maximum memory pitch:                          2147483647 bytes
   Texture alignment:                             512 bytes
   Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
   Run time limit on kernels:                     Yes
   Integrated GPU sharing Host Memory:            No
   Support host page-locked memory mapping:       Yes
   Alignment requirement for Surfaces:            Yes
   Device has ECC support:                        Disabled
   CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
   Device supports Unified Addressing (UVA):      Yes
   Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
   Compute Mode:
      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 960M
Result = PASS

So that is seemingly how CUDA 9 driver support got installed.
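
For what it's worth, the three deviceQuery runs above are consistent
with the usual rule that the installed driver must report a CUDA version
at least as new as the runtime the application was built against.  As
far as I know, both versions are packed as 1000 * major + 10 * minor
(7050 for 7.5, 9000 for 9.0), and a machine with no driver reports 0.
A minimal C++ sketch of that check follows.  The helper names are my own
and not part of any CUDA API, and the packed values are passed in by
hand rather than queried from a device.

```cpp
#include <string>

// A CUDA version unpacked from the integer form 1000 * major + 10 * minor.
struct CudaVersion {
    int major_version;
    int minor_version;
};

// Hypothetical helper (not a CUDA API): 7050 -> {7, 5}, 9000 -> {9, 0}.
CudaVersion decode_cuda_version(int packed) {
    return CudaVersion{packed / 1000, (packed % 1000) / 10};
}

// Hypothetical helper: format a packed version as "major.minor".
std::string format_cuda_version(int packed) {
    const CudaVersion v = decode_cuda_version(packed);
    return std::to_string(v.major_version) + "." +
           std::to_string(v.minor_version);
}

// Hypothetical helper: the runtime can only initialise when the driver's
// version is at least the runtime's.  A missing driver is passed as 0.
bool driver_supports_runtime(int driver_packed, int runtime_packed) {
    return driver_packed >= runtime_packed;
}
```

With the values from the logs, driver_supports_runtime(0, 7050) is false
(the error 35 case), while driver_supports_runtime(7050, 7050) and
driver_supports_runtime(9000, 7050) are both true.  The
cudaErrorDevicesUnavailable failures happen even when this check passes,
so a plain version mismatch does not explain them.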

Tried CMake 3.2 with CUDA 7.5:

GPU Device 0: "GeForce GTX 960M" with compute capability 5.0

Current device is [0]
Current device is [0]
CUDA error at C:\projects\cmake\cmaketesting\v3.2\cuda_basic\src\cuda_basic_test.cpp:67 code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void **)&d_volume, size)"

and sigh!

There is some bizarre behavior going on here.

So, CMake/Kitware: I can get CUDA 7.5 to run the samples with the 384.94
driver and CUDA 8.0 uninstalled, but I cannot get CMake 3.2 using
FindCUDA, or CMake 3.9 using project calls, to build a simple CUDA app
that allocates memory on the device.  What gives?

I have been using CMake since 2.8 and CUDA since 1.3 on C1060s and
mobile Quadros, and have never experienced this.

Clearly NVIDIA is to blame for 7.5 and 8.0 fighting like cats in a bag,
and for 7.5 not working with its own driver but only with the 9.0
driver.  But I cannot get either CMake 3.2 or 3.9 to generate a project
I can run... this is really strange; it has always just worked.  If I
could compile and run a CUDA SDK app, then I knew CMake would work, and
it always has.  What could possibly be going on here?