CUDA Performance: What Actually Matters
Shared memory, coalescing, occupancy — the techniques that actually make GPU kernels fast. And the benchmarks that show when they don't.
- Published
- Nov 06, 2025
- Read Time
- 12 min read
- Words
- 2,486
- Author
- Nguyen Xuan Hoa
Motivation & Recap
In the previous article, we got acquainted with CUDA and the big picture of GPU architecture. This time, we go one level deeper: performance engineering.
The performance gap between unoptimized and optimized CUDA code can be 10x, 50x, sometimes over 100x. That gap doesn't come from "writing GPU code" — it comes from understanding how the hardware actually executes your code, and writing code that works with it instead of against it.
This article focuses on the fundamental techniques that determine real-world kernel performance. Not parallel algorithms — those deserve their own series. Just the hardware-level mechanics that matter for every kernel you'll ever write.
Hardware & Experimental Setup
Performance is always limited by hardware. An optimization technique might work effectively on one architecture but be inefficient on another.
To measure objectively, we will use two different hardware platforms:
- NVIDIA GeForce MX330: An older, low-power laptop GPU, representing low-end hardware (Pascal architecture).
- NVIDIA RTX A4000: A modern, powerful workstation GPU based on the Ampere architecture.
Below is a comparison table of the key specifications for these two GPUs:
| Specification | NVIDIA GeForce MX330 | NVIDIA RTX A4000 |
|---|---|---|
| Architecture | Pascal | Ampere |
| Compute Capability | 6.1 | 8.6 |
| CUDA Cores | 384 | 6144 |
| SM Count | 3 | 48 |
| VRAM | 2 GB GDDR5 | 16 GB GDDR6 (ECC) |
| Memory Bandwidth | ~56.1 GB/s | 448 GB/s |
| L2 Cache | 512 KB | 4 MB |
| Registers / SM | 65,536 (32-bit) | 65,536 (32-bit) |
The gap is massive. The A4000 has 16× more SMs and nearly 8× the memory bandwidth. Memory-bound kernels will see the biggest gains there. Compute-bound kernels benefit from the sheer number of CUDA cores.
The important thing: running the same kernel on both lets us separate architecture-dependent wins from genuinely general optimizations. That distinction matters a lot, and the benchmarks below will make it very concrete.
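Most of the numbers in that table can be read straight off the device at runtime. Below is a minimal sketch using `cudaGetDeviceProperties`; device 0 is assumed, and the bandwidth figure is a theoretical estimate derived from the reported memory clock and bus width, not a measured value.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Device:             %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("SM count:           %d\n", prop.multiProcessorCount);
    printf("L2 cache:           %d KB\n", prop.l2CacheSize / 1024);
    printf("Registers per SM:   %d\n", prop.regsPerMultiprocessor);

    // Theoretical peak bandwidth: memory clock (kHz) * bus width (bytes) * 2 (DDR)
    double bw = 2.0 * prop.memoryClockRate * 1e3 * (prop.memoryBusWidth / 8.0) / 1e9;
    printf("Peak bandwidth:     ~%.1f GB/s\n", bw);
    return 0;
}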
Performance Optimization Techniques
Shared Memory: The User-Managed Cache
Accessing Global Memory (the GPU's VRAM) is one of the most expensive operations in a CUDA kernel. Latency can reach hundreds of clock cycles. If your kernel constantly reads from and writes to global memory, that's your bottleneck.
Shared Memory (SMEM) is an on-chip memory region located right inside the SM — small (typically 48KB to 128KB per SM) but extremely fast. Its latency is lower than L2 cache and only slightly higher than registers, making it an effective tool for data reuse within a block. [1]
The general strategy is:
- Load a block of data from Global Memory into Shared Memory (once, ideally coalesced — see the Memory Coalescing section below).
- Have threads perform many computations by reading from Shared Memory.
- Write the final result back to Global Memory.
Example 1: Matrix Multiplication
Let's consider the problem of multiplying two matrices: A of size M × N and B of size N × K, producing C of size M × K.
Naive Kernel (No SMEM)
Each thread calculates one element of C. Simple, but each thread does N reads from A and N reads from B — all from Global Memory.
__global__ void matrix_multiplication_naive(
const float *A, const float *B, float *C,
int M, int N, int K)
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < M && col < K)
{
float sum = 0.0f;
// Loop N times, accessing Global Memory each time
for (int k = 0; k < N; ++k)
{
sum += A[row * N + k] * B[k * K + col];
}
C[row * K + col] = sum;
}
}
Optimized Kernel (Tiling & SMEM)
We use tiling. Each block computes one tile of C by looping over corresponding tiles from A and B, loading them into Shared Memory first.
Execution flow for each thread:
- Loop (over the number of tiles):
  - Load: pull one element of the A tile and the B tile from Global Memory into SMEM.
  - Sync: wait for all threads to finish loading (`__syncthreads()`).
  - Compute: accumulate the partial dot product using data in SMEM.
  - Sync: wait before loading the next tile.
- Write: write the final sum to `C` in Global Memory (once, at the end).
#define TILE_SIZE 16
__global__ void matrix_multiplication_smem(
const float *A, const float *B, float *C,
int M, int N, int K)
{
__shared__ float As[TILE_SIZE][TILE_SIZE];
__shared__ float Bs[TILE_SIZE][TILE_SIZE];
int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
int tx = threadIdx.x;
int ty = threadIdx.y;
float sum = 0.0f;
for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; ++t)
{
int A_col = t * TILE_SIZE + tx;
int B_row = t * TILE_SIZE + ty;
As[ty][tx] = (row < M && A_col < N) ? A[row * N + A_col] : 0.0f;
Bs[ty][tx] = (B_row < N && col < K) ? B[B_row * K + col] : 0.0f;
__syncthreads();
#pragma unroll
for (int i = 0; i < TILE_SIZE; ++i)
{
sum += As[ty][i] * Bs[i][tx];
}
__syncthreads();
}
if (row < M && col < K)
{
C[row * K + col] = sum;
}
}
Results (Example 1: Matrix Multiplication)
| Method | MX330 (ms) | A4000 (ms) |
|---|---|---|
| Naive | 37.240 ms | 1.861 ms |
| Tiling & SMEM | 15.846 ms | 2.077 ms |
| cuBLAS | 2.423 ms | 0.206 ms |
The most important lesson from this benchmark: optimization is architecture-dependent. Always.
- On MX330 (Pascal): Tiling & SMEM gave a clear ~2.35x speedup (37.240 / 15.846). On older hardware with a small L2 cache and limited bandwidth, manually reducing global memory access pays off.
- On RTX A4000 (Ampere): Our "optimized" kernel is actually slower — 2.077 ms vs 1.861 ms for the naive version. This doesn't mean SMEM is useless. It means the naive kernel already benefits from Ampere's large L2 cache (4 MB) and massive bandwidth (448 GB/s), which hide most of the global memory latency automatically. Our SMEM kernel, meanwhile, introduces real overhead: more complex index calculations and `__syncthreads()` barriers. On fast-enough hardware, the synchronization cost outweighs the SMEM benefit.
- cuBLAS: Wins on both, by a lot. On the A4000 it's ~9x faster than naive and ~10x faster than our SMEM kernel. The reason: `cuBLAS` is optimized at the assembly level and can leverage Tensor Cores — specialized matrix hardware that our plain `float` kernel never touches.
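For context, timings like those in the table are usually collected with CUDA events. Here is a minimal sketch, assuming the `matrix_multiplication_naive` kernel above is in the same file; the matrix sizes are illustrative, and warm-up runs plus averaging over many launches are omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

// Assumes matrix_multiplication_naive (defined above) is in the same file.
int main()
{
    const int M = 1024, N = 1024, K = 1024; // illustrative sizes
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, (size_t)M * N * sizeof(float));
    cudaMalloc(&d_B, (size_t)N * K * sizeof(float));
    cudaMalloc(&d_C, (size_t)M * K * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((K + block.x - 1) / block.x, (M + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    matrix_multiplication_naive<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop); // wait for the kernel before reading the timer

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("naive kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}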
Example 2: Optimizing Atomic Operations
atomicAdd() is necessary to avoid race conditions when multiple threads update the same global memory variable. [1]
Problem: Count Array Elements
Given an input array input of size N, count how many elements equal a constant K. The result goes into a single output variable in global memory.
Naive Kernel
Each thread that finds a match calls atomicAdd() directly on output in global memory.
__global__ void count_equal_naive(
const int *input, int *output, int N, int K)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N && input[idx] == K)
{
atomicAdd(output, 1);
}
}
If many elements equal K, thousands of threads simultaneously contend for the same address. Those accesses get serialized — parallelism gone.
Optimized Kernel (SMEM + Parallel Reduction)
The fix: minimize atomic calls to global memory. Call it once per block, not once per thread.
- Each thread counts its own matches locally (using a grid-stride loop to handle any array size).
- Local counts go into Shared Memory.
- Parallel reduction sums those counts entirely within SMEM — no global memory contention.
- Thread 0 makes a single `atomicAdd` call with the block's total.
__global__ void count_equal_optimized(
const int *input, int *output, int N, int K)
{
extern __shared__ int I[];
int g_idx = blockIdx.x * blockDim.x + threadIdx.x;
int l_idx = threadIdx.x;
int stride = gridDim.x * blockDim.x;
// 1. Each thread counts locally
int count = 0;
while (g_idx < N)
{
if (input[g_idx] == K) count++;
g_idx += stride;
}
// 2. Store in SMEM
I[l_idx] = count;
__syncthreads();
// 3. Parallel reduction in SMEM
for (int s = blockDim.x / 2; s > 0; s >>= 1)
{
if (l_idx < s) I[l_idx] += I[l_idx + s];
__syncthreads();
}
// 4. One atomic write per block
if (l_idx == 0) atomicAdd(output, I[0]);
}
Results (Example 2: Optimizing Atomic Operations)
| Method | MX330 (ms) | A4000 (ms) |
|---|---|---|
| Naive `atomicAdd()` | 12.271 ms | 1.000 ms |
| Opt (SMEM + Reduction) | 9.123 ms | 0.990 ms |
- On MX330 (Pascal): ~1.34x speedup (12.271 / 9.123). Global `atomicAdd` contention is a real bottleneck on Pascal. Moving to a SMEM reduction meaningfully reduces it.
- On RTX A4000 (Ampere): Essentially no difference — a 1% improvement at best. Ampere's atomic processing is far more efficient than Pascal's; global memory atomics are resolved at the L2 cache, so contention is already nearly gone. Our "optimization" replaced a bottleneck the hardware had already solved with a bunch of SMEM writes, `__syncthreads()` calls, and reduction logic.
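One practical detail the optimized listing depends on: because `count_equal_optimized` declares its buffer as `extern __shared__ int I[]`, the launch must pass the buffer size as the third launch-configuration argument. A minimal launch sketch follows; the grid and block sizes are illustrative, and `d_input`, `d_output`, `N`, and `K` are assumed to be set up already.
int block_size = 256;
int grid_size  = 512; // with the grid-stride loop, this need not cover N exactly
size_t smem_bytes = block_size * sizeof(int); // one local counter per thread

count_equal_optimized<<<grid_size, block_size, smem_bytes>>>(d_input, d_output, N, K);
Also note that the in-SMEM reduction assumes `blockDim.x` is a power of two.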
General conclusion: Classic techniques like SMEM tiling and atomic reduction are still valuable. But their effectiveness is no longer absolute. A technique that saves your kernel on Pascal can slow it down on Ampere. Profile first. Optimize second. Never assume.
Memory Coalescing: Don't Make the GPU Work 32× Harder
When 32 threads in a warp access global memory, the hardware tries to coalesce those 32 requests into as few transactions as possible.
- Ideal (Coalesced): 32 threads access 32 contiguous addresses — `A[idx]`, `A[idx+1]`, ..., `A[idx+31]`. For 4-byte floats that's 128 bytes, served in one 128-byte transaction.
- Worst case (Uncoalesced): 32 threads access 32 random or strided addresses — e.g., `A[idx * 100]`. They scatter across 32 different memory segments, requiring 32 separate transactions. You're wasting up to 31/32 of your bandwidth.
Coalescing is about what the 32 threads of a warp do in the same instruction, not what one thread does across its loop iterations. In the naive matrix multiplication above, consecutive threads in a warp hold consecutive `col` values, so the `B[k * K + col]` load coalesces, while `A[row * N + k]` is close to a broadcast of a single address. Transpose the thread-to-element mapping so that consecutive threads walk down a column, and the loads become strided by the row width — the classic bandwidth killer. [1]
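To make the difference concrete, here is a minimal sketch of a coalesced copy versus a strided one; the kernel names and the stride are illustrative, not taken from the benchmarks above.
// Coalesced: thread i touches element i; consecutive threads, consecutive addresses.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i touches element i * stride, so the 32 threads of a warp
// scatter across many memory segments and the warp needs many transactions.
// (Arrays must hold at least n * stride elements.)
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[(size_t)i * stride] = in[(size_t)i * stride];
}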
Bank Conflicts: When Shared Memory Fights Itself
SMEM is divided into 32 memory banks. Threads in a warp can access SMEM in parallel — if they hit different banks. When two or more threads hit the same bank, those accesses get serialized. That's a bank conflict.
Rule: For 4-byte words (`float`, `int`), the word at byte address `addr` lives in bank `(addr / 4) % 32`.
Classic example:
__shared__ float A[32][32];
- Row-wise access `A[my_row][threadIdx.x]`: 32 threads hit 32 contiguous addresses → 32 different banks → no conflict, full speed.
- Column-wise access `A[threadIdx.x][my_col]`: thread 0 hits `A[0][c]`, thread 1 hits `A[1][c]` (32 elements away), thread 2 hits `A[2][c]` (64 elements away) — all map to the same bank mod 32. A 32-way conflict. All serialized.
Think of banks as checkout lanes. Row-wise: 32 people, 32 lanes, everyone served at once. Column-wise: 32 people, 1 lane, everyone waits.
Fix: Padding.
__shared__ float A[32][33]; // +1 padding column
Now column-wise accesses shift each row by 33 elements instead of 32. Bank assignments become (c + 33k) % 32, which cycles through all 32 banks as k varies. No more conflicts. You spend a little extra SMEM, but regain full parallel access.
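The classic place this shows up is a shared-memory matrix transpose. Here is a minimal sketch with the padded tile; the 32×32 tile size and the kernel name are illustrative.
#define TDIM 32

// Transpose one 32x32 tile per block via shared memory.
// The +1 padding column keeps the column-wise reads below conflict-free.
__global__ void transpose_padded(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TDIM][TDIM + 1];

    int x = blockIdx.x * TDIM + threadIdx.x;
    int y = blockIdx.y * TDIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x]; // coalesced load

    __syncthreads();

    // Write the transposed tile: swap the block coordinates, read the tile column-wise.
    int tx = blockIdx.y * TDIM + threadIdx.x;
    int ty = blockIdx.x * TDIM + threadIdx.y;
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[threadIdx.x][threadIdx.y]; // coalesced store
}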
Occupancy: Keeping the SM Busy
Occupancy is the ratio of active warps on an SM to the maximum it can support — e.g., 32 active warps / 64 max = 50% occupancy.
Occupancy is the key to latency hiding. When a warp stalls waiting for global memory, the SM scheduler switches to another warp that's ready to run. If there are no other warps, the SM sits idle.
This is the fundamental difference between CPU and GPU design: CPUs are latency-oriented (complete one task as fast as possible); GPUs are throughput-oriented (keep as many tasks running as possible at once). Occupancy is what makes the throughput model work.
What limits occupancy? Whichever resource runs out first:
- Registers: Too many per thread → fewer threads fit on the SM → lower occupancy.
- Shared Memory: Too much per block → fewer blocks fit on the SM → lower occupancy.
- Threads per block: Too low (e.g., 64) → you'll never saturate the SM regardless of other factors.
- Hardware limits: The A4000 (CC 8.6) supports up to 1536 threads/SM (48 warps) and 16 resident blocks/SM. [1]
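You don't have to compute occupancy by hand; the runtime can report it. Below is a minimal sketch using `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, where `my_kernel` and the block size are stand-ins for whatever you're actually tuning.
#include <cstdio>
#include <cuda_runtime.h>

// A stand-in kernel; substitute the kernel you are tuning.
__global__ void my_kernel(float *data) { }

int main()
{
    int block_size = 256;
    int max_blocks_per_sm = 0;

    // How many blocks of my_kernel fit on one SM, given its register
    // and shared-memory usage? (0 = no dynamic shared memory.)
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, my_kernel, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    float occupancy = (float)(max_blocks_per_sm * block_size)
                    / prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
    return 0;
}
This reports the theoretical upper bound; achieved occupancy, which Nsight Compute reports, can be lower if the grid is small or the work is unevenly distributed.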
Kernel Fusion: Eliminating the Round Trip
Kernel fusion is merging multiple sequential kernels into one.
Example — SAXPY variant (Y = a*X + b):
Without fusion:
kernel_scale(X, a, Temp); // Temp = a*X → written to Global Memory
kernel_add(Temp, b, Y);   // Y = Temp + b ← Temp read back from Global Memory
With fusion:
kernel_fused(X, a, b, Y); // Y[i] = a*X[i] + b[i], the intermediate lives in a register
The intermediate Temp never touches global memory. Two expensive read/write round trips — gone.
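As a concrete sketch of the difference, the three kernels might look like this; the names follow the pseudocode above, and treating `b` as a second input vector plus the extra `n` length parameter are my assumptions.
// Unfused: two kernels, with Temp bouncing through global memory in between.
__global__ void kernel_scale(const float *X, float a, float *Temp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) Temp[i] = a * X[i];
}

__global__ void kernel_add(const float *Temp, const float *b, float *Y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) Y[i] = Temp[i] + b[i];
}

// Fused: the intermediate value stays in a register; no Temp array at all.
__global__ void kernel_fused(const float *X, float a, const float *b, float *Y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float tmp = a * X[i]; // lives in a register, never written to VRAM
        Y[i] = tmp + b[i];
    }
}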
Benefits:
- Eliminates intermediate global memory traffic (the biggest win).
- Reduces kernel launch overhead (small, but adds up).
Trade-offs:
- Harder to write and debug.
- More registers per thread → potential occupancy hit.
Kernel fusion trades memory bandwidth for register pressure. Whether it's worth it depends on which one is your actual bottleneck. Measure.
Conclusion
The recurring theme in this article: measure before you optimize. Every benchmark above had at least one "obvious" optimization that backfired on modern hardware. The GPU has gotten smarter, and sometimes the best thing you can do is stay out of its way.
The real skill is identifying what is actually bottlenecking your kernel — memory bandwidth, compute throughput, or latency — and then applying the right technique for that specific problem on that specific hardware. Tools like NVIDIA Nsight are essential here; gut feeling isn't.
Future articles will go deeper: Tensor Cores for matrix ops, asynchronous memory operations with streams, and dynamic parallelism. The fundamentals covered here will make all of that much easier to reason about.