MATLAB R2024b GPU device validation fails for Multi-Instance GPU (MIG) A100
We are currently installing MATLAB R2024b on our HPC cluster. The installation works beautifully across all of our GPUs except an A100 that uses NVIDIA's Multi-Instance GPU (MIG). When I launch a CLI session using
matlab -nodesktop -nodisplay -nosoftwareopengl
and run "validateGPU", I receive the following error: "Encountered error when calling NVML. The NVML error was: Invalid Argument."
The same sequence does not produce an error when run on one of our other A100 GPUs with the same driver and CUDA version. In MATLAB R2023b we do not receive this error with our MIG GPU, and it runs GPU code successfully. Could someone please let me know whether MATLAB R2024b is able to run on A100 GPUs with MIG and, if it can, what the issue might be?
For completeness, here is the full output:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12 Driver Version: 550.90.12 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:21:00.0 Off | On |
| N/A 29C P0 32W / 250W | 75MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB On | 00000000:81:00.0 Off | On |
| N/A 28C P0 33W / 250W | 75MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-PCIE-40GB On | 00000000:E2:00.0 Off | On |
| N/A 28C P0 34W / 250W | 75MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 2 0 0 | 38MiB / 19968MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Launching a session and attempting to validate the GPU:
matlab -nodesktop -nodisplay -nosoftwareopengl
< M A T L A B (R) >
Copyright 1984-2024 The MathWorks, Inc.
R2024b Update 2 (24.2.0.2773142) 64-bit (glnxa64)
October 22, 2024
To get started, type doc.
For product information, visit www.mathworks.com.
Warning: OpenGL Startup options will be removed in a future release.
>> validateGPU
# Beginning GPU validation
# Performing system validation
# CUDA-supported platform .................................................PASSED
# CUDA-enabled graphics driver exists .....................................PASSED
# Version: 550.90.12
# CUDA-enabled graphics driver load .......................................PASSED
# CUDA environment variables ..............................................PASSED
# CUDA_VISIBLE_DEVICES: "0"
# CUDA device count .......................................................PASSED
# Found 1 devices.
# GPU libraries load ......................................................PASSED
#
# Performing device validation for device index 1
# Device exists ...........................................................FAILED
# Encountered error when calling NVML. The NVML error was:
# Invalid Argument.
#
# Device supported ........................................................SKIPPED
# Device available ........................................................SKIPPED
# Device selectable .......................................................SKIPPED
# Device memory allocation ................................................SKIPPED
# Device kernel launch ....................................................SKIPPED
# Finished GPU validation with 1 failures.
Output using "coder.checkGpuInstall":
>> gpuEnvObj = coder.gpuEnvConfig;
>> gpuEnvObj.GpuId = 0;
>> gpuEnvObj.BasicCodegen = 1;
>> gpuEnvObj.BasicCodeexec = 1;
>> results = coder.checkGpuInstall(gpuEnvObj)
Compatible GPU : FAILED (There is a problem with the graphics driver or with this GPU device. Code execution will not be available. Check that you have a supported GPU and the latest graphics driver.)
CUDA Environment : FAILED (Unable to execute the nvcc command. Check your CUDA Toolkit installation.)
Runtime : PASSED
cuFFT : PASSED
cuSOLVER : PASSED
cuBLAS : PASSED
Host Compiler : PASSED
results =
struct with fields:
gpu: 0
cuda: 0
cudnn: 0
tensorrt: 0
hostcompiler: 1
basiccodegen: 0
basiccodeexec: 0
deepcodegen: 0
tensorrtdatatype: 0
deepcodeexec: 0
Answers (1)
Joss Knight
on 13 Jan 2025
Try running nvidia-smi -L in a terminal to get the UUID of the device, and then set CUDA_VISIBLE_DEVICES to that full UUID instead of the device index, following the advice in the Knowledge Article here. I'm not sure device index works properly with MIG in CUDA 12.
Do you have one A100 divided into 3, or 3 A100s with one in MIG mode? If the latter, I think something is wrong: your driver should not be able to see anything but the MIG device.
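The suggestion to use the full UUID instead of the device index could be sketched as follows. This is a minimal sketch, not a confirmed fix: the helper name first_mig_uuid is hypothetical, and it assumes that nvidia-smi -L prints MIG devices with a UUID beginning with "MIG-", and that the first MIG device listed is the one you want.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first MIG device UUID found in
# `nvidia-smi -L` style text read from stdin.
# Assumes MIG UUIDs start with "MIG-" (for example "MIG-<uuid>" or the
# older "MIG-GPU-<uuid>/<gi>/<ci>" form).
first_mig_uuid() {
    grep -o 'MIG-[0-9A-Za-z/-]*' | head -n 1
}

if command -v nvidia-smi >/dev/null 2>&1; then
    mig_uuid="$(nvidia-smi -L | first_mig_uuid)"
    if [ -n "$mig_uuid" ]; then
        # Expose only the MIG instance to CUDA, by UUID rather than by index.
        export CUDA_VISIBLE_DEVICES="$mig_uuid"
        echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    else
        echo "No MIG device UUID found in nvidia-smi -L output" >&2
    fi
else
    echo "nvidia-smi not found on PATH" >&2
fi
```

After this, start MATLAB from the same shell (for example with matlab -nodesktop -nodisplay) so that it inherits the variable, then rerun validateGPU.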
11 Comments
Joss Knight
on 14 Feb 2025 at 14:41
It seems that the workaround for this is to avoid selecting the GPU device. Creating and using gpuArrays works but querying device properties does not, so gpuDevice, gpuDeviceTable, validateGPU, and canUseGPU will all error.
Can you try this and see if this solves your problem?
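One way to try the workaround from the command line might look like the sketch below. It is an illustration only: the matrix size and the variable name matlab_cmd are arbitrary choices, and it assumes matlab is on PATH and that CUDA_VISIBLE_DEVICES already points at the MIG device. The MATLAB statements use the GPU only through gpuArray and gather, and deliberately avoid the device-query functions.

```shell
#!/usr/bin/env bash
# MATLAB statements that touch the GPU only through gpuArray and gather;
# no device-query functions are called.
matlab_cmd='A = rand(1000, "gpuArray"); B = A * A; disp(gather(sum(B(:))))'

if command -v matlab >/dev/null 2>&1; then
    # -batch runs the statements non-interactively and then exits.
    matlab -batch "$matlab_cmd"
else
    echo "matlab not found on PATH; would run: matlab -batch '$matlab_cmd'"
fi
```

If this prints a number rather than the NVML error, computation on the MIG device works and only the property queries are affected.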