Benchmarking And Optimization
When a pipeline is ready, developers often need to measure its performance. There are three types of benchmarks developers want to run:
non-real-time single stream performance;
non-real-time multiple stream performance;
real-time multiple stream performance.
There is one more dimension to benchmark:
with adapters (end-to-end);
without adapters (only pipeline).
When benchmarking, we usually measure the aggregate throughput of the pipeline in frames per second (FPS). When the pipeline is intended to work in real time, and in some other cases, we may also measure end-to-end latency, but this document does not discuss it.
Remember that GPU/CPU overheating significantly degrades performance. Plan your benchmarks with this in mind. Read our article on Medium to find out more.
Benchmarking in a shared cloud may produce results that vary greatly from launch to launch: because you do not control all the host resources, other users may strongly influence the results. We recommend running benchmarks on dedicated bare-metal servers when possible.
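One way to judge whether launch-to-launch variation is a problem is to repeat the benchmark several times and look at the spread of the results. The sketch below is only an illustration: the function name `summarize_runs` and the sample FPS readings are hypothetical and are not part of the framework.

```python
import statistics


def summarize_runs(fps_runs):
    """Return the mean, sample standard deviation, and coefficient of
    variation (stdev / mean) for FPS results of repeated launches."""
    if len(fps_runs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(fps_runs)
    stdev = statistics.stdev(fps_runs)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


# Hypothetical FPS readings from five launches of the same benchmark.
runs = [100.0, 110.0, 90.0, 105.0, 95.0]
summary = summarize_runs(runs)
print(f"mean={summary['mean']:.1f} FPS, cv={summary['cv']:.1%}")
```

A large coefficient of variation suggests that the host is noisy and the numbers should not be compared across configurations until the environment is stabilized.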
Measuring Non-Real-Time Single Stream Performance
This benchmark shows how the pipeline performs in batch mode. A typical case is file-by-file processing, where you start the pipeline, ingest a file into it, and measure how fast the file is processed. This sort of processing is often used for archived videos.
To measure the performance, use the uridecodebin GStreamer source:
pipeline:
  # local video file source, not using source adapter
  source:
    element: uridecodebin
    properties:
      uri: file:///data/file.mp4
  # define pipeline's main elements
  elements:
    ...
  # noop pipeline sink, not using sink adapter
  sink:
    - element: devnull_sink
At the end of the run, you will see the FPS result.
Measuring Non-Real-Time Multiple Stream Performance
This kind of benchmarking reveals the maximum aggregate FPS the pipeline can reach. The maximum may be reached when ingesting, for example, 32 streams, with each of them processed at only 5 FPS. That is why it is not real-time performance.
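The difference between aggregate and per-stream throughput in the example above comes down to simple arithmetic. The following sketch uses hypothetical helper names and is only meant to make the numbers concrete.

```python
def aggregate_fps(per_stream_rates):
    """Aggregate throughput is the sum of the per-stream rates."""
    return sum(per_stream_rates)


def per_stream_fps(aggregate, num_streams):
    """Average rate each stream gets out of the aggregate throughput."""
    if num_streams <= 0:
        raise ValueError("num_streams must be positive")
    return aggregate / num_streams


# The example from the text: 32 parallel streams, each processed at 5 FPS.
rates = [5.0] * 32
total = aggregate_fps(rates)
average = per_stream_fps(total, len(rates))
print(f"aggregate={total} FPS, per-stream={average} FPS")
```

The aggregate figure can look high while each individual stream is still processed far below its source frame rate.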
To run such benchmarks, we implemented a special adapter, the Multi-Stream Source Adapter, which ingests a selected video file into the benchmarked pipeline as multiple parallel streams. By changing the number of parallel streams, you can find the maximum aggregate FPS for the pipeline.
To measure non-real-time performance with it, use
Measuring Real-Time Multiple Stream Performance
This kind of benchmarking reveals the maximum aggregate FPS the pipeline can handle in real time. Assuming each stream runs at 30 FPS, you are looking for working configurations that satisfy the equation
N = X / 30,
where N is the number of streams ingested into the pipeline and X is the aggregate FPS.
To run such benchmarks, you can also use the Multi-Stream Source Adapter, but with SYNC_OUTPUT=True. By changing the number of parallel streams, determine the maximum number the pipeline can sustain.
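The search for the largest sustainable stream count can be sketched as follows. This is a minimal illustration, not part of the framework: the function name, the measurement values, and the 2% tolerance for measurement noise are all assumptions. It takes aggregate FPS values measured for several stream counts and returns the largest count for which the pipeline still keeps up with 30 FPS per stream.

```python
def max_real_time_streams(measurements, stream_fps=30.0, tolerance=0.02):
    """Return the largest stream count the pipeline sustains in real time.

    measurements maps a stream count N to the measured aggregate FPS X.
    A configuration is accepted while X >= N * stream_fps, minus a small
    tolerance (an assumption) that absorbs measurement noise.
    """
    best = 0
    for num_streams, aggregate in sorted(measurements.items()):
        required = num_streams * stream_fps
        if aggregate < required * (1.0 - tolerance):
            break  # the pipeline falls behind from this point on
        best = num_streams
    return best


# Hypothetical results of four benchmark launches.
measurements = {4: 120.0, 8: 239.5, 12: 358.0, 16: 410.0}
result = max_real_time_streams(measurements)
print(f"sustainable real-time streams: {result}")
```

With these sample numbers, 16 streams would require 480 FPS while only 410 FPS is delivered, so the last working configuration is the one with 12 streams.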
End-To-End or Isolated Benchmarking
In the listing above, the sink for the pipeline is set to:
pipeline:
  # noop pipeline sink, not using sink adapter
  sink:
    - element: devnull_sink
This configuration benchmarks the pipeline without real-life sinks, which can form a bottleneck. This is what you want when testing pure CV/ML performance. Often, however, you need to test end-to-end, including the specific sink implementation used in production. In that case, include the additional components in the benchmark, such as the sink adapter you plan to use and any third-party systems.
The tools for monitoring the benchmarking environment include but are not limited to:
Real-Time Data Sources And The Pipeline is a Bottleneck
If real-time sources are used and the pipeline is a bottleneck, to avoid data loss, the sources must be connected to the pipeline with an in-memory or persistent queue system like Apache Kafka. The same is true for communication between the pipeline and sinks.
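The effect of a buffer between a real-time source and a slower pipeline can be illustrated with a toy discrete-time model. This is only a sketch under simplifying assumptions: the function `simulate` and its parameters are hypothetical, the buffer is an in-process `deque` rather than a real queue system such as Apache Kafka, and the source emits exactly one frame per tick.

```python
from collections import deque


def simulate(num_frames, service_ticks, capacity):
    """Simulate a real-time source feeding a slower pipeline via a buffer.

    The source emits one frame per tick for num_frames ticks. The pipeline
    needs service_ticks ticks per frame. A frame that arrives while the
    buffer already holds `capacity` frames is dropped.
    Returns a (processed, dropped) tuple.
    """
    buffer = deque()
    processed = 0
    dropped = 0
    busy_until = 0  # tick at which the pipeline becomes free again
    tick = 0
    while tick < num_frames or buffer:
        # The real-time source cannot wait: it either enqueues or drops.
        if tick < num_frames:
            if len(buffer) < capacity:
                buffer.append(tick)
            else:
                dropped += 1
        # The pipeline takes the next frame as soon as it is free.
        if buffer and tick >= busy_until:
            buffer.popleft()
            busy_until = tick + service_ticks
            processed += 1
        tick += 1
    return processed, dropped


small_buffer = simulate(num_frames=10, service_ticks=2, capacity=1)
large_buffer = simulate(num_frames=10, service_ticks=2, capacity=10)
print(f"capacity=1:  processed={small_buffer[0]}, dropped={small_buffer[1]}")
print(f"capacity=10: processed={large_buffer[0]}, dropped={large_buffer[1]}")
```

In this model, the larger buffer loses no frames, but the pipeline finishes the backlog only after the source has stopped. A queue therefore protects against data loss at the cost of added latency; it does not raise throughput.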
Pipeline performance may be limited by the Python Global Interpreter Lock (GIL). This is common when a lot of unoptimized Python code is used. Such code loads a single CPU core to 100% while the other cores remain underutilized. If htop shows such a picture while nvtop shows that GPU resources are underutilized, the pipeline is GIL-bound.
What to do:
switch from VPS to bare metal;
consider using high-frequency CPUs with a small number of cores, fast memory, and a large cache;
move heavyweight operations out of the pipeline (e.g., use Apache Spark or Flink);
release the GIL by introducing GIL-free FFI code (Cython, C, C++, Rust), and replace naive code with optimized computations built on NumPy, Numba, or OpenCV;
try pipeline chaining to split workload among several Python processes;
launch multiple instances of a pipeline on a single GPU to distribute the workload between more CPU cores and fully utilize GPU resources.
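As an illustration of replacing naive Python code with an optimized NumPy computation, the sketch below filters bounding boxes by area in two ways. The functions and the box layout (x1, y1, x2, y2) are hypothetical examples, not framework API. The vectorized version moves the per-box work into compiled code, and NumPy can release the GIL during many of its inner loops.

```python
import numpy as np


def filter_boxes_naive(boxes, min_area):
    """Pure-Python loop: every iteration runs in the interpreter."""
    kept = []
    for x1, y1, x2, y2 in boxes:
        if (x2 - x1) * (y2 - y1) >= min_area:
            kept.append((x1, y1, x2, y2))
    return kept


def filter_boxes_vectorized(boxes, min_area):
    """Same filter expressed as whole-array NumPy operations."""
    arr = np.asarray(boxes, dtype=np.float32).reshape(-1, 4)
    areas = (arr[:, 2] - arr[:, 0]) * (arr[:, 3] - arr[:, 1])
    return arr[areas >= min_area]


# Hypothetical detections as (x1, y1, x2, y2) boxes.
boxes = [(0, 0, 10, 10), (0, 0, 2, 2), (5, 5, 15, 25)]
naive = filter_boxes_naive(boxes, min_area=50)
vectorized = filter_boxes_vectorized(boxes, min_area=50)
print(naive)
print(vectorized.tolist())
```

Both functions keep the same boxes. The benefit of the vectorized form grows with the number of objects processed per frame.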
It may happen that the pipeline is properly optimized and uses all available CPU cores, while the GPU remains underutilized.
What to do:
switch from VPS to bare metal;
consider choosing a CPU with a large number of cores;
move heavyweight operations out of the pipeline to a separate host (e.g., use Apache Spark or Flink);
reconfigure the platform, selecting a less capable GPU while keeping the same CPU.
When the GPU is the fully loaded resource, the pipeline is GPU-bound. This is normally a good situation. The following approaches may improve performance further:
try pipeline chaining and multiple GPUs;
choose a more capable GPU model.
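The situations described in this section can be told apart from per-core CPU and GPU utilization readings, such as those shown by htop and nvtop. The classifier below is a rough sketch: the function name and the 90% and 50% thresholds are assumptions chosen for illustration, and it assumes the last case above refers to a fully utilized GPU.

```python
def classify_bottleneck(cpu_core_util, gpu_util, busy=90.0, idle=50.0):
    """Classify the bottleneck from utilization percentages.

    cpu_core_util is a list with one reading per CPU core, gpu_util is a
    single GPU reading. The thresholds are illustrative assumptions.
    """
    if not cpu_core_util:
        raise ValueError("at least one CPU core reading is required")
    if gpu_util >= busy:
        return "gpu-bound"
    saturated = [u for u in cpu_core_util if u >= busy]
    if len(saturated) == len(cpu_core_util):
        # All cores are busy while the GPU still has headroom.
        return "cpu-bound"
    others = [u for u in cpu_core_util if u < busy]
    if saturated and sum(others) / len(others) < idle:
        # Some cores are pegged while the rest and the GPU are mostly idle.
        return "gil-bound"
    return "undetermined"


print(classify_bottleneck([100, 5, 8, 3], gpu_util=20))
print(classify_bottleneck([95, 97, 99, 96], gpu_util=40))
print(classify_bottleneck([60, 55, 70, 65], gpu_util=98))
```

Readings should be averaged over a representative interval, since a single snapshot can misclassify a pipeline whose load fluctuates.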