Benchmarking And Optimization
When a pipeline is ready, developers often need to measure its performance. There are three types of benchmarks developers want to run:
non-real-time single stream performance;
non-real-time multiple stream performance;
real-time multiple stream performance.
There is one more dimension to benchmark:
with adapters (end-to-end);
without adapters (only pipeline).
When benchmarking, we usually measure the aggregate throughput of the pipeline in FPS (frames per second). When the pipeline is intended to work in real time, and in some other cases, end-to-end latency may also be measured, but this document does not discuss it.
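As a minimal sketch of what "aggregate throughput" means here, the snippet below divides the total number of frames processed across all streams by the wall-clock duration of the run. The function name and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
def aggregate_fps(frames_per_stream: list[int], elapsed_seconds: float) -> float:
    """Aggregate throughput: frames processed across all streams / wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return sum(frames_per_stream) / elapsed_seconds


# Hypothetical run: 4 streams, 1500 frames each, finished in 20 seconds.
print(aggregate_fps([1500, 1500, 1500, 1500], 20.0))  # 300.0
```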
Tip
Remember that GPU/CPU overheating significantly degrades performance. Plan the benchmarks with this in mind. Read our article on Medium to find out more.
Warning
Benchmarking in a shared cloud may produce results that vary greatly from launch to launch: because you do not control all the host resources, other users may strongly influence the results. We recommend running benchmarks on dedicated bare-metal servers when possible.
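One way to judge whether results are stable enough to trust is to repeat the benchmark several times and look at the spread. The sketch below is only an illustration: the function name is hypothetical and the FPS values are made-up sample data, not measurements.

```python
import statistics


def summarize_runs(fps_runs: list[float]) -> dict[str, float]:
    """Return the mean FPS and the coefficient of variation (stdev / mean)."""
    if len(fps_runs) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    mean = statistics.mean(fps_runs)
    stdev = statistics.stdev(fps_runs)
    return {"mean": mean, "cv": stdev / mean}


# Hypothetical FPS results from five launches of the same benchmark.
summary = summarize_runs([100.0, 102.0, 98.0, 101.0, 99.0])
print(f"mean={summary['mean']:.1f} FPS, cv={summary['cv']:.1%}")
```

A large coefficient of variation suggests that the environment, rather than the pipeline, dominates the result.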
Measuring Non-Real-Time Single Stream Performance
This benchmark shows how the pipeline performs in batch mode. A typical case is file-by-file processing: you start the pipeline, ingest a file into it, and measure how fast the file is processed. This mode is often used for processing archived videos.
To measure the performance, use the uridecodebin GStreamer source:
pipeline:
  # local video file source, not using source adapter
  source:
    element: uridecodebin
    properties:
      uri: file:///data/file.mp4
  # define pipeline's main elements
  elements:
    ...
  # noop pipeline sink, not using sink adapter
  sink:
    - element: devnull_sink
When processing finishes, you will see the resulting FPS.
Measuring Non-Real-Time Multiple Stream Performance
This kind of benchmarking is valuable for discovering the maximum aggregate FPS the pipeline can reach. The maximum may be reached, for example, when ingesting 32 streams, with each of them processed at a rate of only 5 FPS. That is why this is not real-time performance.
To run such benchmarks, we implemented a special adapter, the Multi-Stream Source Adapter, which ingests a selected video file into the benchmarked pipeline as several parallel streams. By changing the number of parallel streams, you can find the maximum value for the pipeline.
To measure non-real-time performance with it, use SYNC_OUTPUT=False.
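The search over stream counts can be summarized with a small helper. This is a sketch under stated assumptions: the function name is hypothetical, and the measurements are made-up numbers standing in for the aggregate FPS you would record from separate non-real-time runs, one per stream count.

```python
def best_stream_count(results: dict[int, float]) -> tuple[int, float, float]:
    """Pick the stream count with the highest aggregate FPS.

    results maps the number of parallel streams to the measured aggregate FPS.
    Returns (streams, aggregate_fps, per_stream_fps).
    """
    streams, fps = max(results.items(), key=lambda item: item[1])
    return streams, fps, fps / streams


# Hypothetical measurements, one benchmark run per stream count.
measured = {1: 90.0, 4: 140.0, 16: 155.0, 32: 160.0, 64: 158.0}
print(best_stream_count(measured))  # (32, 160.0, 5.0)
```

In this made-up data set the peak is reached at 32 streams, but each stream is processed at only 5 FPS, which matches the non-real-time scenario described above.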
Measuring Real-Time Multiple Stream Performance
This kind of benchmarking is valuable for finding the maximum aggregate FPS the pipeline can handle in real time. Assuming a per-stream rate of 30 FPS, you are looking for working configurations satisfying the equation N = X / 30, where N is the number of streams ingested into the pipeline and X is the aggregate FPS.
To run such benchmarks, you can also use the Multi-Stream Source Adapter, but set SYNC_OUTPUT=True. By changing the number of parallel streams, determine the maximum number the pipeline can sustain.
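The real-time criterion can be checked mechanically against recorded results. The sketch below is an illustration only: the function name, the 2% tolerance for measurement noise, and the sample numbers are assumptions, not part of the adapter.

```python
def max_realtime_streams(
    results: dict[int, float],
    stream_fps: float = 30.0,
    tolerance: float = 0.02,
) -> int:
    """Largest tested stream count N whose aggregate FPS X keeps up with N * stream_fps.

    results maps N (streams) to the measured aggregate FPS X.
    tolerance is the accepted relative shortfall (hypothetical default: 2%).
    Returns 0 if no tested configuration sustains real time.
    """
    sustained = [
        n for n, x in results.items()
        if x >= n * stream_fps * (1.0 - tolerance)
    ]
    return max(sustained, default=0)


# Hypothetical measurements, one real-time run per stream count.
measured = {2: 60.0, 4: 120.0, 6: 171.0, 8: 176.0}
print(max_realtime_streams(measured))  # 4
```

With these sample numbers, 6 streams would need about 180 FPS in aggregate but reach only 171, so 4 is the largest configuration that holds 30 FPS per stream.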
End-To-End or Isolated Benchmarking
In the listing above, the sink for the pipeline is set to:
pipeline:
  # noop pipeline sink, not using sink adapter
  sink:
    - element: devnull_sink
This represents benchmarking without real-life sinks, which can form a bottleneck. It is what you are looking for if you want to test CV/ML performance. However, you often need to test end-to-end, including the specific sink implementation used in practice. In that case, include the additional components in the benchmark, such as the sink adapter you plan to use and any 3rd-party systems.
Tools
The tools for monitoring the benchmarking environment include but are not limited to:
nvidia-smi, tegrastats: analyze GPU performance;
sar: analyze host CPU/RAM utilization;
nvtop: monitor GPU utilization;
htop: monitor CPU/RAM utilization;
OpenTelemetry and ClientSDK: profile the code.
Real-Time Data Sources And The Pipeline is a Bottleneck
If real-time sources are used and the pipeline is a bottleneck, to avoid data loss, the sources must be connected to the pipeline with an in-memory or persistent queue system like Apache Kafka. The same is true for communication between the pipeline and sinks.
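The buffering idea can be illustrated with Python's standard library. This is a toy stand-in, not Kafka: an in-memory queue sits between a source that pushes frames at its own pace and a consumer thread that plays the role of the pipeline. All names are hypothetical.

```python
import queue
import threading


def run_buffered(frames: list[int]) -> list[int]:
    """Feed frames through an in-memory queue to a consumer thread."""
    buffer: queue.Queue = queue.Queue()  # unbounded: the source never blocks or drops
    processed: list[int] = []
    done = object()  # sentinel marking the end of the stream

    def pipeline() -> None:
        while True:
            item = buffer.get()
            if item is done:
                break
            processed.append(item)  # stand-in for the actual pipeline work

    worker = threading.Thread(target=pipeline)
    worker.start()
    for frame in frames:  # the source pushes without waiting for the pipeline
        buffer.put(frame)
    buffer.put(done)
    worker.join()
    return processed


print(len(run_buffered(list(range(100)))))  # 100
```

An unbounded in-memory queue keeps growing if the pipeline is permanently slower than the source, and its contents are lost on a crash, which is where a persistent queue system becomes relevant.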
GIL-Bound Pipelines
Pipeline performance may be limited by the Python Global Interpreter Lock (GIL). This is a frequent case when a lot of unoptimized Python code is used. Such code loads a single CPU core to 100%, while other cores remain underutilized. If htop shows such a picture while nvtop demonstrates that GPU resources are underutilized, the pipeline is GIL-bound.
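The diagnosis above can be expressed as a rough heuristic over utilization readings. This is a sketch, not a tool: the function name and the 90% and 50% thresholds are assumptions chosen for illustration, and the inputs are values you would read manually from htop and nvtop.

```python
def looks_gil_bound(
    core_util: list[float],
    gpu_util: float,
    busy: float = 90.0,
    idle: float = 50.0,
) -> bool:
    """Heuristic: exactly one core saturated, other cores and the GPU underutilized.

    core_util: per-core CPU utilization in percent (as seen in htop).
    gpu_util: GPU utilization in percent (as seen in nvtop).
    busy, idle: hypothetical thresholds, tune them for your setup.
    """
    saturated = [u for u in core_util if u >= busy]
    others = [u for u in core_util if u < busy]
    others_idle = all(u < idle for u in others)
    return len(saturated) == 1 and others_idle and gpu_util < idle


print(looks_gil_bound([100.0, 12.0, 8.0, 15.0], gpu_util=35.0))  # True
print(looks_gil_bound([95.0, 97.0, 96.0, 94.0], gpu_util=35.0))  # False
```

The second call corresponds to all cores being busy, which points away from the GIL as the limiting factor.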
What to do:
switch from VPS to bare metal;
consider using high-frequency CPUs with a small number of cores, fast memory, and a large cache;
move heavyweight operations out of the pipeline (e.g., use Apache Spark or Flink);
release the GIL by introducing GIL-free FFI code (Cython, C, C++, Rust), and replace naive code with optimized computations made with NumPy, Numba, OpenCV;
try pipeline chaining to split workload among several Python processes;
launch multiple instances of a pipeline on a single GPU to distribute the workload between more CPU cores and fully utilize GPU resources.
CPU-Bound Pipelines
The pipeline may apply proper optimization techniques and use all available CPU cores while the GPU remains underutilized.
What to do:
switch from VPS to bare metal;
consider choosing a CPU with a large number of cores;
move heavyweight operations out of the pipeline to a separate host (e.g., use Apache Spark or Flink);
reconfigure the platform, selecting a less capable GPU while keeping the same CPU.
GPU-Bound Pipelines
This is normally a good situation. The following approaches may improve performance:
network pruning;
network quantization;
try pipeline chaining and multiple GPUs;
choosing a more capable GPU model.