Optimizing PyTorch Data Pipelines: From Bottlenecks to 39× Speedups
I set out to revisit the PyTorch training loop and ended up chasing a bottleneck that had nothing to do with the model.
- Published: Nov 14, 2025
- Read time: 11 min
- Words: 2,332
- Author: Nguyen Xuan Hoa
Introduction
Today I decided to revisit the basic PyTorch training loop before diving into experimenting with some new computer vision algorithms. The main goal was to check if my previous understanding and habits are still effective, or if I'm missing anything in the current landscape. To be frank, sometimes we use "packaged" frameworks (like PyTorch Lightning, Hugging Face Accelerate) so much that we forget what's really happening underneath.
To perform these experiments, I'm using a personal workstation with a pretty powerful configuration, designed for computation and AI tasks:
- CPU: Intel i9-14900K (32 cores) @ 5.700GHz
- GPU: NVIDIA RTX A4000 (16GB VRAM)
- RAM: 32 GB
A powerful setup like this, especially the GPU, will make any slowness from the CPU or I/O pipeline (data loading) more obvious. If the GPU has to wait for the CPU to prepare data, we will immediately see "GPU starvation".
The training loop itself is standard — nothing special here. I'm including it just for completeness.
for epoch in range(10):
    for batch_idx, (x, y) in enumerate(train_loader):
        optimizer.zero_grad()
        predictions = model(x)
        loss = loss_function(predictions, y)
        loss.backward()
        optimizer.step()

Experimental Datasets
In the experimental section, I use two classic datasets in computer vision: MNIST and FashionMNIST.
- MNIST: The handwritten digit dataset published by Yann LeCun, consisting of 10 classes corresponding to digits 0 through 9.
- FashionMNIST: Proposed as a more challenging drop-in replacement for MNIST, consisting of images of fashion products like t-shirts, trousers, shoes, bags, etc.


Both are 28×28 grayscale, 60k train / 10k test, 10 classes — basically identical in structure, which makes them perfect for focusing on the pipeline rather than the data itself.
Thanks to their compact size and ease of processing, MNIST and FashionMNIST are often considered the "Hello World" of Computer Vision problems. They let us focus entirely on optimizing the data loading mechanism, without being dominated by complex transformations or large I/O costs.
An interesting discovery: CSV vs. Parquet
At this point, I noticed the .csv files I was storing (each row is 784 pixels + 1 label) were quite large. I suddenly remembered reading somewhere about Parquet and decided to try storing the dataset in this format.
The results were really interesting: significantly smaller file sizes and much faster read/write speeds.
| Dataset | train.csv (MB) | test.csv (MB) | train.parquet (MB, reduction vs. CSV) | test.parquet (MB, reduction vs. CSV) |
|---|---|---|---|---|
| MNIST | 109.6 | 18.3 | 18.1 (~6.0x) | 3.8 (~4.8x) |
| FashionMNIST | 133 | 22.2 | 37.6 (~3.5x) | 7.3 (~3.0x) |
The reason is columnar storage: Parquet stores data by column rather than by row, so values in the same column tend to share the same type and range, making compression dramatically more effective. On top of that, Parquet supports codecs like Snappy and Zstandard natively. The downside is that it's a binary format — you can't just cat it and read it. For large data though, that's an easy trade-off.
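For reference, the conversion itself is only a couple of lines with pandas. A minimal sketch, assuming pyarrow is installed and the directory layout I use later in the post; the original run doesn't record which codec was used, so this one picks Zstandard explicitly:

import pandas as pd

for split in ["train", "test"]:
    # Each CSV row is 784 pixel columns plus a label column.
    df = pd.read_csv(f"data/FashionMNIST/{split}.csv")
    # Columnar layout plus a codec like zstd is where the size win comes from.
    df.to_parquet(f"data/FashionMNIST/{split}.parquet", compression="zstd")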
A Naive First Pass
Ok, now let's get into the implementation. We need a custom Dataset class to read the Parquet file and a simple CNN model. I fixed random_seed = 42 to ensure consistent results.
Dataset Class (Version 1)
This is the first "naive" implementation. The idea is to load the entire Parquet file into RAM in __init__, then the __getitem__ function will be responsible for retrieving a row (.iloc), processing it, and transforming it into a Tensor.
class MNISTDataset(Dataset):
    def __init__(self, parquet_path: str, num_classes: int = 10):
        # Load the entire file into RAM
        self.df = pd.read_parquet(parquet_path)
        self.label_col = 'label' if 'label' in self.df.columns else None
        self.feature_cols = [c for c in self.df.columns if c != self.label_col]
        self.num_classes = num_classes

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # 1. Get data by row
        row = self.df.iloc[idx]
        # 2. Process features
        pixels = row[self.feature_cols].to_numpy(dtype=np.float32)
        image = torch.tensor(pixels.reshape(1, 28, 28)) / 255.0
        # 3. Process label
        if self.label_col:
            label = torch.tensor(int(row[self.label_col]), dtype=torch.long)
        else:
            label = torch.tensor(-1, dtype=torch.long)
        return {
            "image": image,  # Tensor [1, 28, 28]
            "label": label,  # scalar Tensor
        }

Tiny Neural Network
This model is just a basic CNN to give the GPU something to compute.
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)   # [B, 1, 28, 28] → [B, 16, 28, 28]
        self.pool = nn.MaxPool2d(2, 2)                # [B, 16, 28, 28] → [B, 16, 14, 14]
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)  # [B, 16, 14, 14] → [B, 32, 14, 14]
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

206,922 parameters — a tiny network by any standard.
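That number is easy to check; a quick sketch, not part of the original script:

model = SimpleCNN()
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,}")  # 206,922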
Hyperparameters
I'll try the FashionMNIST dataset first, with the following config:
train_dataset = MNISTDataset("data/FashionMNIST/train.parquet")
test_dataset = MNISTDataset("data/FashionMNIST/test.parquet")

train_dataloader = DataLoader(
    train_dataset,
    batch_size=16,
    shuffle=True,
)
test_dataloader = DataLoader(
    test_dataset,
    batch_size=16,
    shuffle=False,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

I'll start by training for 5 epochs.
Note: I should have split the train set into train and val sets to evaluate on the val set, not the test set. But for now, I'm not worried about it. Also, calling model.train() and model.eval() is very important (especially when using Dropout or BatchNorm), but with this simple model, I'll skip it for now.
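For reference, this is roughly where those calls would go; a sketch only, since I skipped them in the actual runs:

for epoch in range(5):
    model.train()                      # enable Dropout, update BatchNorm statistics
    for batch in train_dataloader:
        ...                            # forward, backward, optimizer.step()

    model.eval()                       # switch layers to inference behaviour
    with torch.no_grad():
        for batch in test_dataloader:
            ...                        # forward only, no gradients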
And the log for batch_size=16:
Average Train Time per Epoch: 17.13s
Average Test Time per Epoch: 2.39s
Total Train Time: 85.64s
Total Test Time: 11.95s

So one epoch takes about 19 seconds on average (17s train + 2s test).
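I haven't shown the timing code anywhere; it's nothing more than wall-clock timing around the epoch loop. A sketch of what I mean, assuming the dict-style batches returned by the Dataset above:

import time

train_times = []
for epoch in range(5):
    start = time.perf_counter()
    for batch in train_dataloader:
        x = batch["image"].to(device)
        y = batch["label"].to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()           # flush queued GPU work before stopping the clock
    train_times.append(time.perf_counter() - start)

print(f"Average Train Time per Epoch: {sum(train_times) / len(train_times):.2f}s")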
Why Bigger Batch Size Didn't Help
A very natural thought is: increase the batch size to speed up computation — as I often do. My RTX A4000 GPU is idling with batch_size=16. I'll try increasing the batch_size to 32, 64, 128, 256, 512.
At this point, something strange happened. The speed did increase a bit (to about 16 seconds/epoch on average), but from batch_size=32 all the way up to batch_size=512, the time did not change significantly.
Here is the log for batch_size=512:
Average Train Time per Epoch: 13.67s
Average Test Time per Epoch: 2.17s
Total Train Time: 68.36s
Total Test Time: 10.87s

Only slightly faster — negligible. What on earth is going on?
Then I realized what was going on, and I have my CUDA studies to thank for spotting it much faster than I would have in the past. The problem wasn't in the compute (GPU) at all — it was in the data pipeline.
More specifically: the data isn't loading fast enough to keep up with the GPU's computation speed. A classic bottleneck. The GPU utilization was very low at this point; it was spending most of its time waiting for the CPU to deliver the next batch.
I realized there were two suspects: the DataLoader configuration, and something uglier hiding inside __getitem__.
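A quick way to separate the two suspects is to time how long each iteration waits on the DataLoader versus how long the actual step takes. A rough sketch, not from the original script:

import time

data_time, compute_time = 0.0, 0.0
tick = time.perf_counter()
for batch in train_dataloader:
    fetched = time.perf_counter()
    data_time += fetched - tick        # time spent waiting for the next batch
    x = batch["image"].to(device)
    y = batch["label"].to(device)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()           # include queued GPU work in the measurement
    tick = time.perf_counter()
    compute_time += tick - fetched

print(f"data wait: {data_time:.1f}s | compute: {compute_time:.1f}s")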
Note: The Dataset class here loads the entire Parquet file into RAM. In reality, if the data is large (hundreds of GB), this is impossible — you'd have to split the file or stream the data. But in this case, the data is small enough to fit in RAM.
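For the record, if the file didn't fit in RAM, one way to stream it would be an IterableDataset that reads Parquet row chunks with pyarrow. This is just a sketch of the idea (worker sharding omitted), not something this post needed:

import pyarrow.parquet as pq
import torch
from torch.utils.data import IterableDataset

class StreamingParquetDataset(IterableDataset):
    def __init__(self, parquet_path: str, chunk_rows: int = 4096):
        self.parquet_path = parquet_path
        self.chunk_rows = chunk_rows

    def __iter__(self):
        pf = pq.ParquetFile(self.parquet_path)
        # Only one chunk of rows is materialized in memory at a time.
        for record_batch in pf.iter_batches(batch_size=self.chunk_rows):
            df = record_batch.to_pandas()
            pixels = df.drop(columns=["label"]).to_numpy(dtype="float32")
            images = torch.from_numpy(pixels.reshape(-1, 1, 28, 28)) / 255.0
            labels = torch.from_numpy(df["label"].to_numpy(dtype="int64"))
            for image, label in zip(images, labels):
                yield {"image": image, "label": label}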
Optimizing DataLoader to Increase Throughput
I started experimenting with the DataLoader first, keeping batch_size=512.
num_workers
num_workers is the number of subprocesses that DataLoader will spawn to load data in parallel. If not specified (default is 0), the data is loaded in the main process — and that's the bottleneck. The main process is busy coordinating the GPU and doing the work of loading data at the same time. In theory, increasing the number of workers should help. I'll try num_workers=4 (my CPU has 32 cores, but 4 is a reasonable starting point).
train_dataloader = DataLoader(
    train_dataset,
    batch_size=512,
    shuffle=True,
    num_workers=4,
)
test_dataloader = DataLoader(
    test_dataset,
    batch_size=512,
    shuffle=False,
    num_workers=4,
)

The result was astonishing. The average time per epoch dropped to about 4.3 seconds — nearly 4× faster than before.
Average Train Time per Epoch: 3.68s
Average Test Time per Epoch: 0.67s
Total Train Time: 18.39s
Total Test Time: 3.37s

What about pushing to 8 or 16 workers? I tried it, and the result was similar to the batch size experiment: the speed didn't improve significantly. It seems num_workers=4 was already enough to saturate the pipeline, so I'll keep this config from here on.
persistent_workers
By default this is False, which means:
- Start of epoch → create num_workers processes.
- Workers load batches.
- End of epoch → kill all workers.
- Next epoch → spawn new workers from scratch.
The overhead is the time to create new processes, reload the dataset file, and reload libraries like numpy/pandas. Setting persistent_workers=True keeps the workers alive across epochs. With only 5 epochs, the benefit is small:
Average Train Time per Epoch: 3.61s
Average Test Time per Epoch: 0.62s
Total Train Time: 18.06s
Total Test Time: 3.09s

Barely moved the needle here — makes sense, since the overhead of spawning workers amortizes quickly. Worth keeping on for longer runs though.
pin_memory
pin_memory is directly related to how PyTorch transfers data from RAM (CPU) to VRAM (GPU). To understand what pin_memory=True does, we need to know about two types of CPU memory.
Pageable Memory is what you get by default. The OS has full control over it — if RAM runs low, it can swap that block of data to disk at any time.
Page-Locked / Pinned Memory is different. When you "pin" a memory region, you're telling the OS: do not move or swap this data. It stays at a fixed physical address in RAM.
Why does this matter? Data transfer from CPU to GPU goes through DMA (Direct Memory Access), which requires a fixed physical address. With Pageable Memory, the OS can move data at any time, so the GPU can't use DMA directly — it either waits or performs an intermediate copy. With Pinned Memory, DMA can work unobstructed.
When pin_memory=True is set in DataLoader, batches are automatically loaded into Pinned Memory instead of Pageable Memory. The theory is solid. The practice, in this case:
train_dataloader = DataLoader(
    train_dataset,
    batch_size=512,
    shuffle=True,
    num_workers=4,
    persistent_workers=True,
    pin_memory=True,
)
test_dataloader = DataLoader(
    test_dataset,
    batch_size=512,
    shuffle=False,
    num_workers=4,
    persistent_workers=True,
    pin_memory=True,
)

Average Train Time per Epoch: 3.64s
Average Test Time per Epoch: 0.62s
Total Train Time: 18.19s
Total Test Time: 3.09s

Almost no impact. For tiny 28×28 images and a small model like this, the overhead of pinning roughly cancels out the gain. On larger inputs — say, ImageNet-scale — this is something you'd want on by default.
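Part of the reason it did nothing here is that pinned memory mostly pays off when paired with asynchronous host-to-device copies, which my loop doesn't do. A minimal sketch of the pairing (I didn't benchmark this variant):

for batch in train_dataloader:          # DataLoader created with pin_memory=True
    # With a pinned source buffer, non_blocking=True lets the copy overlap with GPU compute.
    x = batch["image"].to(device, non_blocking=True)
    y = batch["label"].to(device, non_blocking=True)
    ...                                 # forward / backward / step as usual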
The Real Bottleneck Was in __getitem__
Sorting out the DataLoader removed the single-process loading bottleneck, but the bottleneck had simply shifted to per-sample CPU processing.
The problem was hiding in plain sight inside __getitem__ of Dataset v1:
# Inside __getitem__(self, idx)
row = self.df.iloc[idx]
pixels = row[self.feature_cols].to_numpy(dtype=np.float32)
image = torch.tensor(pixels.reshape(1, 28, 28)) / 255.0

Every single one of these operations — iloc, to_numpy, torch.tensor, reshape, the division by 255.0 — runs on every single call to __getitem__. That's 60,000 times per epoch, across 4 workers. self.df is already in RAM so it's not I/O bound, but the redundant processing still adds up fast.
The fix is obvious in hindsight: if we've already decided to load everything into RAM, why not do all the transformation once at initialization time? Then __getitem__ becomes a pure index lookup.
Dataset Class (Version 2)
class MNISTDataset(Dataset):
    def __init__(self, parquet_path: str, num_classes: int = 10):
        df = pd.read_parquet(parquet_path)
        self.label_col = 'label' if 'label' in df.columns else None
        self.feature_cols = [c for c in df.columns if c != self.label_col]
        self.num_classes = num_classes

        # Process all features ONCE
        features_np = df[self.feature_cols].to_numpy(dtype=np.float32)
        self.images = torch.from_numpy(features_np.reshape(-1, 1, 28, 28)) / 255.0

        # Process all labels ONCE
        if self.label_col:
            labels_np = df[self.label_col].to_numpy(dtype=np.int64)
            self.labels = torch.from_numpy(labels_np).long()
        else:
            self.labels = torch.zeros(len(df), dtype=torch.long)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Simply access by index
        return {
            "image": self.images[idx],
            "label": self.labels[idx],
        }

Putting It All Together
Running again with Dataset v2 and the optimized DataLoader config (bs=512, num_workers=4, pin_memory=True) — the drop is almost uncomfortable to look at:
Average Train Time per Epoch: 0.44s
Average Test Time per Epoch: 0.06s
Total Train Time: 2.21s
Total Test Time: 0.29s

0.5 seconds per epoch (0.44s train + 0.06s test). Nearly 9× faster than after the DataLoader optimizations alone.
Here's the full picture:
| Method | Avg Train Time/Epoch | Avg Test Time/Epoch | Total Time (5 epochs) | Speedup (vs. Naive) |
|---|---|---|---|---|
| Naive | 17.13s | 2.39s | 97.60s | 1x |
| num_workers | 3.68s | 0.67s | 21.75s | ~4.5x |
| persistent_workers | 3.61s | 0.62s | 21.15s | ~4.6x |
| pin_memory | 3.64s | 0.62s | 21.30s | ~4.6x |
| __getitem__ refactor | 0.44s | 0.06s | 2.50s | ~39x |
Conclusion
39× faster, and most of it came from a change in __getitem__ that takes about 10 seconds to make. The big lesson here isn't really about any specific parameter — it's that __getitem__ runs 60,000 times per epoch, so every unnecessary operation in there compounds fast. If your data fits in RAM, preprocess it once at __init__ and be done with it.
I got so caught up in the optimization that I completely forgot to test the MNIST dataset too — that, and the new vision algorithms, will have to wait for the next post.

