I thought this would be easy. cv2.VideoCapture(), throw the frame into YOLO, cv2.imshow(), done. Then I actually ran it and measured the latency: 3 seconds.
My setup was nothing exotic — a Tapo consumer camera (RTSP, 15 FPS, 720p), an NVIDIA GPU for dev, a Jetson Nano for deployment. Simple hardware, simple goal: detect objects with the lowest possible latency. And yet, 3 seconds. In any system that needs to react to the real world, 3 seconds is game over.
This is the log of how I fixed it.
The Naive Approach
```python
import cv2

cap = cv2.VideoCapture("rtsp://admin:pass@ip:554/stream", cv2.CAP_FFMPEG)
while True:
    ret, frame = cap.read()                  # Blocking I/O
    # ... AI processing ...
    cv2.imshow("Camera", frame)              # Render UI
    if cv2.waitKey(1) == ord('q'):
        break
```

The problem is the blocking I/O and the internal buffer. The camera pushes 15 FPS. But if the while loop only runs at 10 FPS — because imshow and the model inference eat time — those 5 extra frames per second have nowhere to go. They pile up in OpenCV/FFmpeg's internal buffer. So when you call cap.read(), you're not getting the frame that just arrived. You're getting the oldest frame in the queue.
This is the key thing to understand about real-time systems: throughput and latency are not the same thing. You can have smooth 60 FPS playback while watching footage from 3 seconds ago. High throughput, terrible latency. For passive recording, fine. For anything that needs to react — robots, safety systems, anything interactive — it's completely wrong.
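To make the backlog concrete, here's a back-of-the-envelope sketch. The 15 FPS camera rate is from my setup; the 10 FPS loop rate is an illustrative assumption:

```python
camera_fps = 15      # frames arriving per second
loop_fps = 10        # frames the naive loop actually consumes per second

# Every second, 5 frames pile up in the FFmpeg/OpenCV buffer.
backlog_per_sec = camera_fps - loop_fps

# After 9 seconds of running, the queue holds 45 stale frames.
backlog = backlog_per_sec * 9

# cap.read() returns the OLDEST frame, so you are watching the past:
latency_sec = backlog / camera_fps
print(latency_sec)   # 3.0 seconds behind the real world
```

The backlog grows without bound until the buffer caps out, which is why the lag gets worse the longer the program runs.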
Fixing It: Two Approaches That Mostly Work
Multithreading
The obvious first fix: split reading and processing into two threads.
- Thread A (Reader): constantly reads the latest frame and pushes it into a Queue(maxsize=1). If the queue is full, discard the old frame and overwrite.
- Thread B (Processor): pulls from the queue and runs inference.
Latency dropped a lot, because we're actively throwing away stale frames instead of processing them. But the code got messy fast — race conditions, thread safety issues. And the buffer could still fill up at the FFmpeg/driver layer before our Python thread even sees the frame.
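As a sketch, the frame-dropping reader looks roughly like this. To keep it self-contained, a fake frame source (integers standing in for images) replaces cap.read(); the queue mechanics are the point:

```python
import queue
import threading
import time

def reader(read_frame, q, stop):
    # Thread A: consume frames as fast as they arrive. If the queue is
    # full (the consumer is behind), discard the stale frame first, so
    # the consumer always gets the newest frame available.
    while not stop.is_set():
        frame = read_frame()          # stands in for cap.read()
        if frame is None:
            break
        if q.full():
            try:
                q.get_nowait()        # drop the old frame
            except queue.Empty:
                pass
        q.put(frame)

# Demo: integers 0..49 stand in for camera frames
frames_src = iter(range(50))
def fake_read():
    time.sleep(0.001)                 # pretend frame interval
    return next(frames_src, None)

q = queue.Queue(maxsize=1)
stop = threading.Event()
t = threading.Thread(target=reader, args=(fake_read, q, stop), daemon=True)
t.start()
t.join()                              # reader drains all 50 fake frames

freshest = q.get()                    # Thread B would do this in a loop
print(freshest)                       # 49: everything older was dropped
```

The drop-then-put dance is safe here only because a single thread produces into the queue; with multiple producers you'd need a lock around it.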
Bypassing OpenCV: FFmpeg CLI and GStreamer
The next step was going deeper and bypassing the OpenCV wrapper entirely.
FFmpeg via subprocess worked well — forcing UDP and nobuffer got latency down to near-realtime. But FFmpeg's whole philosophy is optimized for throughput, so squeezing low latency out of it means fighting its defaults with a long list of flags:
```python
cmd = [
    "ffmpeg",
    "-rtsp_transport", "udp",             # Use UDP to reduce delay
    "-fflags", "nobuffer",                # Disable internal buffer
    "-flags", "low_delay",                # Enable low latency mode
    "-reorder_queue_size", "0",           # Do not reorder packets
    "-use_wallclock_as_timestamps", "1",
    "-i", RTSP_URL,
    "-f", "rawvideo",                     # Output raw
    "-pix_fmt", "bgr24",                  # Pixel format for OpenCV
    "-"
]
```

And piping raw bytes into Python is just... manual and brittle.
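For completeness, the "manual and brittle" part looks roughly like this. The frame dimensions are hardcoded assumptions; to keep the sketch runnable, a BytesIO of zeros stands in for the ffmpeg stdout pipe:

```python
import io
import numpy as np

WIDTH, HEIGHT = 1280, 720
FRAME_BYTES = WIDTH * HEIGHT * 3          # bgr24: 3 bytes per pixel

def frames_from_pipe(pipe):
    # Re-slice ffmpeg's raw byte stream into numpy images.
    # The frame size is hardcoded; if it doesn't match the actual
    # stream, every frame comes out sheared -- that's the brittleness.
    while True:
        raw = pipe.read(FRAME_BYTES)
        if len(raw) < FRAME_BYTES:        # stream ended or short read
            return
        yield np.frombuffer(raw, dtype=np.uint8).reshape(HEIGHT, WIDTH, 3)

# Real pipeline: pipe = subprocess.Popen(cmd, stdout=subprocess.PIPE).stdout
# Demo with two fake all-black frames:
fake_pipe = io.BytesIO(b"\x00" * (FRAME_BYTES * 2))
frames = list(frames_from_pipe(fake_pipe))
print(len(frames), frames[0].shape)       # 2 frames of shape (720, 1280, 3)
```

Nothing tells you when the camera changes resolution or the pixel format is off; you just get garbage arrays.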
GStreamer is the better answer. It's a modular pipeline — you get full control over every element in the chain from source → demux → parse → decode → sink. It supports zero-copy between CPU and GPU better than FFmpeg, and it's what serious players (Hikvision, NVIDIA's DeepStream) build on. The tradeoff is that the learning curve is genuinely steep.
The Dependency Hell Nobody Warns You About
This section took longer than everything else combined. Getting GStreamer + Python + OpenCV to actually work together is a mess.
- Conda: I tried. Library conflicts everywhere, constantly. Not worth it.
- pip install opencv-python: the PyPI package does not ship with GStreamer support. The code will run, but the pipeline will silently fail to open.
- Ultralytics: when you pip install ultralytics, it pulls in opencv-python from PyPI and overwrites your working OpenCV build. This will break your GStreamer setup without any obvious error message. [1]
The actual fix is boring but it works: skip Conda and Docker, use venv with --system-site-packages:
```bash
# 1. Install GStreamer and OpenCV system-wide (Ubuntu)
sudo apt-get install python3-opencv libgstreamer1.0-dev ...

# 2. Create venv with --system-site-packages flag to inherit system libraries
python3 -m venv myenv --system-site-packages
source myenv/bin/activate

# 3. Note: If installing ultralytics, check carefully if it overwrites opencv
```

The Final Architecture
Once the environment was stable, I changed the overall architecture to stop using imshow entirely — it blocks the main thread and isn't usable in a real deployment anyway. Instead, the pipeline streams annotated output to a local RTSP server.
The data flow: GStreamer reads from the camera RTSP stream → OpenCV + YOLO runs inference → GStreamer encodes and pushes to MediaMTX. [1]
One thing that helped me reason about this: the camera is doing 15 FPS, which means a new frame arrives every ~66 ms. YOLO11n on an RTX A4000 runs at ~550–700 FPS, meaning each inference takes roughly 1.5 ms. The GPU sits idle roughly 98% of the time. The bottleneck is the sensor, full stop — the system cannot run faster than 15 FPS no matter how fast everything else is. That also means all the multithreading complexity I added earlier was solving a problem that didn't exist yet. A simple single-threaded loop is fine here. When you upgrade to a 60 FPS camera or a heavier model, then it's time to add workers and queues.
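Spelling that budget out as arithmetic (using the upper end of the measured FPS range above):

```python
camera_fps = 15
frame_interval_ms = 1000 / camera_fps        # ~66.7 ms between frames

yolo_fps = 700                               # upper end of 550-700 FPS
inference_ms = 1000 / yolo_fps               # ~1.4 ms per frame

# Fraction of each frame interval the GPU actually spends working
gpu_busy = inference_ms / frame_interval_ms
print(f"{frame_interval_ms:.1f} ms/frame, {inference_ms:.1f} ms inference, "
      f"GPU busy {gpu_busy:.1%}")
```

Whenever gpu_busy is far below 1, added pipeline parallelism buys you nothing; the camera sets the pace.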
The Code
```python
import cv2
from ultralytics import YOLO

def main():
    # Camera in, local RTSP server (MediaMTX) out
    rtsp_url = "rtsp://admin:pass@192.168.1.50:554/stream"
    output_url = "rtsp://localhost:8554/stream"
    model = YOLO("checkpoints/yolo11n.pt")

    # --- INPUT PIPELINE ---
    # latency=0: try to reduce network buffer delay
    # appsink sync=false drop=true max-buffers=1:
    #   -> This is the key! Keep only the very latest frame, discard all old frames.
    input_pipeline = (
        f"rtspsrc location={rtsp_url} latency=0 ! queue ! "
        f"rtph264depay ! h264parse ! avdec_h264 ! "
        f"videoconvert ! appsink sync=false drop=true max-buffers=1"
    )
    cap = cv2.VideoCapture(input_pipeline, cv2.CAP_GSTREAMER)
    if not cap.isOpened():
        print("Error: Cannot open input pipeline")
        return

    # Get stream parameters
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 25

    # --- OUTPUT PIPELINE ---
    # appsrc -> x264enc -> rtspclientsink
    # tune=zerolatency: optimize encoder for real-time
    # speed-preset=ultrafast: sacrifice compression for speed
    output_pipeline = (
        "appsrc is-live=true ! "
        "videoconvert ! video/x-raw,format=I420 ! "
        "x264enc tune=zerolatency bitrate=2000 speed-preset=ultrafast key-int-max=30 ! "
        f"rtspclientsink location={output_url}"
    )
    writer = cv2.VideoWriter(
        output_pipeline,
        cv2.CAP_GSTREAMER,
        0, fps, (width, height), True
    )
    if not writer.isOpened():
        print("Error: Cannot open output pipeline")
        return

    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            # YOLO inference
            results = model(frame, verbose=False)
            annotated_frame = results[0].plot()
            # Stream output (do not render a GUI)
            writer.write(annotated_frame)
    except KeyboardInterrupt:
        print("Stopping...")
    finally:
        cap.release()
        writer.release()

if __name__ == "__main__":
    main()
```

Latency is now under 100 ms end-to-end. The key line is appsink sync=false drop=true max-buffers=1 — that's what actually solves the buffering problem. Everything else is just plumbing.
Before You Deploy This on a Real Network
This pipeline was built for LAN. If you take it outside that environment, a few things will bite you:
- latency=0 disables the GStreamer jitter buffer entirely. On an unstable network, you'll get stuttering and frame drops.
- max-buffers=1 drop=true means that if inference slows down even briefly, frames get dropped — including potentially important ones.
- UDP is fine on LAN. On WiFi or WAN with interference, expect packet loss and skipped frames.
- ultrafast + zerolatency on x264 generates a high bitrate. On a weak connection, that congestion will add the latency back.
These aren't bugs — they're tradeoffs that make sense in a controlled environment. Just know what you're trading away before deploying.
The Honest Summary
The actual fix was max-buffers=1 drop=true. Two GStreamer parameters. Most of the time went into understanding why the buffer was the problem in the first place, and then fighting the dependency mess to get GStreamer working at all. If you're starting from scratch, fix your environment first — everything else follows.