Analysis with NVIDIA Profiler
Not Enough Parallelism
Condition
If the kernel is doing little work, then the overhead of memcpy and kernel launches can
offset any performance gains. Consider working on a larger sample set (thus increasing the
loop size). To detect this condition, look at the nvvp report.
Action
Do more work in the loop or increase the sample set size.
Too Many Local per-Thread Registers
Condition
If the loop body uses too many local or temporary variables, it puts high pressure on the
per-thread register file. You can detect this condition by running in GPU safe-build mode
or by checking the nvvp report.
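As a minimal sketch of turning on safe-build mode, the following assumes a hypothetical
entry-point function myLoopFcn that takes two 1-by-4096 double vectors. The function name and
input sizes are illustrative only. It also assumes the SafeBuild property of the GPU code
configuration object, so check the name against your release.

```matlab
% Sketch only: myLoopFcn and the input sizes are hypothetical.
cfg = coder.gpuConfig('mex');

% Assumption: SafeBuild adds error checking around CUDA API calls and
% kernel launches in the generated code.
cfg.GpuConfig.SafeBuild = true;

inputs = {ones(1,4096), ones(1,4096)};
codegen -config cfg -args inputs myLoopFcn
```

With error checking enabled, a kernel launch that fails because it requests too many
per-thread resources is more likely to be reported when the generated MEX function runs
than to fail silently.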
Action
Consider using different block sizes in the coder.gpu.kernel pragma.
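The following sketch shows one way to set the block size explicitly. The function scaleAdd,
the 4096-element size, and the chosen grid and block dimensions are hypothetical values
picked for illustration. The pragma arguments follow the coder.gpu.kernel(B,T) form, where B
gives the number of blocks and T the number of threads per block.

```matlab
function out = scaleAdd(a, b) %#codegen
% Hypothetical example function; names and sizes are illustrative.
out = coder.nullcopy(zeros(1, 4096));

% Request 32 blocks of 128 threads each (32*128 = 4096 iterations)
% instead of letting GPU Coder infer the launch configuration.
coder.gpu.kernel([32, 1, 1], [128, 1, 1]);
for i = 1:4096
    out(i) = 2 * a(i) + b(i);
end
end
```

A smaller block size, such as 128 threads here, is one setting to try when register use per
thread limits how many threads can be resident at once. Profile again after each change,
because the best value depends on the kernel and the target GPU.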
Related Topics
- Code Generation Using the Command Line Interface
- Code Generation by Using the GPU Coder App
- Code Generation Reports
- Trace Between Generated CUDA Code and MATLAB Source Code
- Generating a GPU Code Metrics Report for Code Generated from MATLAB Code
- Kernel Analysis
- Memory Bottleneck Analysis
- GPU Performance Analyzer