First, your OpenCV build must be compiled with CUDA (and OpenGL) support to test all of these features. You can detect your CUDA hardware from OpenCV like this:
#include <iostream>
#include <opencv2/core.hpp>
#include <opencv2/cudaarithm.hpp>

using namespace std;
using namespace cv;
using namespace cv::cuda;

int main()
{
    // Print a one-line summary of the currently selected CUDA device
    printShortCudaDeviceInfo(getDevice());

    // Number of CUDA-capable devices visible to OpenCV
    int cuda_devices_number = getCudaEnabledDeviceCount();
    cout << "CUDA Device(s) Number: " << cuda_devices_number << endl;

    // A default-constructed DeviceInfo describes the current device
    DeviceInfo device_info;
    bool is_device_compatible = device_info.isCompatible();
    cout << "CUDA Device(s) Compatible: " << is_device_compatible << endl;
    return 0;
}
Run the code in your C++ IDE. If your hardware is CUDA compatible, the output should look like this:
Device 0: "GeForce GTX 1650" 4096Mb, sm_75, Driver/Runtime ver.10.10/10.10
CUDA Device(s) Number: 1
CUDA Device(s) Compatible: 1
When adding CUDA support to your code, the first step is to include the right headers. All the .hpp files are stored in ~\include\opencv2 and ~\include\opencv2\cudalegacy under the install path. For example, we add the headers below when linearly blending two images:
#include <iostream>
using namespace std;
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>
using namespace cv;
//Add CUDA support
#include <opencv2/cudaarithm.hpp>
#include <opencv2/cudafeatures2d.hpp>
using namespace cv::cuda;
Next, we should explain the difference between the basic class cv::Mat and cv::cuda::GpuMat (named cv::gpu::GpuMat in OpenCV 2). First, GpuMat adds two member functions, cv::cuda::GpuMat::upload(cv::InputArray arr) and cv::cuda::GpuMat::download(cv::OutputArray dst). We use them to move data between RAM and GPU memory. They do not link the two: Mat and GpuMat each hold their own data pointer, and the two pointers refer to different memory blocks, so every upload or download is a memory copy between RAM and GPU memory.
For faster access, GpuMat supports only 2D arrays, and each row is padded with unused bytes at the end so that rows start at aligned addresses. The actual size of a row in bytes, padding included, is stored in the member cv::cuda::GpuMat::step.
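The padded row layout described above can be sketched on the host in plain C++. This is only a toy model, not OpenCV's GpuMat: PitchedBuffer and alignedStep are hypothetical names, and the 512-byte alignment used in the usage note is an assumed example value, since the real pitch is chosen by the CUDA allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: round a row's byte width up to the next multiple
// of `alignment`. The real pitch is chosen by the CUDA allocator.
inline std::size_t alignedStep(std::size_t rowBytes, std::size_t alignment) {
    assert(alignment > 0);
    return (rowBytes + alignment - 1) / alignment * alignment;
}

// Toy host-side model of a padded ("pitched") 2D buffer.
struct PitchedBuffer {
    int rows;
    int cols;
    std::size_t elemSize;  // bytes per element
    std::size_t step;      // bytes per row, padding included
    std::vector<std::uint8_t> data;

    PitchedBuffer(int r, int c, std::size_t elem, std::size_t alignment)
        : rows(r), cols(c), elemSize(elem),
          step(alignedStep(static_cast<std::size_t>(c) * elem, alignment)),
          data(static_cast<std::size_t>(r) * step, 0) {}

    // Address of element (row, col): base + row * step + col * elemSize.
    std::uint8_t* ptr(int row, int col) {
        assert(row >= 0 && row < rows && col >= 0 && col < cols);
        return data.data() + static_cast<std::size_t>(row) * step
                           + static_cast<std::size_t>(col) * elemSize;
    }
};
```

For instance, PitchedBuffer buf(480, 641, 3, 512) holds rows of 641 * 3 = 1923 payload bytes, but consecutive rows start 2048 bytes apart. This is why code that walks such a buffer has to advance by step rather than by cols * elemSize.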
Looking deeper, cv::cuda::GpuMat::upload and cv::cuda::GpuMat::download in OpenCV 3 are also designed for asynchronous processing: besides the blocking versions, each has a non-blocking overload that takes a Stream, as their declarations show:
void upload(InputArray arr);
void upload(InputArray arr, Stream& stream);
void download(OutputArray dst) const;
void download(OutputArray dst, Stream& stream) const;
Now let's look at three constructors to see the different ways a GpuMat can be constructed (see the GpuMat Class Reference).
//Default constructor
cv::cuda::GpuMat::GpuMat(GpuMat::Allocator* allocator = GpuMat::defaultAllocator())
//Constructs GpuMat of the specified size and type and fills it with the value s
cv::cuda::GpuMat::GpuMat(int rows, int cols, int type, Scalar s, GpuMat::Allocator* allocator = GpuMat::defaultAllocator())
//Builds GpuMat from host memory (Blocking call)
cv::cuda::GpuMat::GpuMat(InputArray arr, GpuMat::Allocator* allocator = GpuMat::defaultAllocator())
We also have to explain how copying works between GpuMat objects. There are two types of copying: shallow copy and deep copy. Examples:
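//Source data: a host image and a GpuMat already in GPU memory (placeholder sizes)
cv::Mat host_src(480, 640, CV_8UC3, cv::Scalar::all(0));
cv::cuda::GpuMat dst0(host_src);
//Shallow copy: copies only the header and data pointer, GPU(Memory)-GPU(Memory)
cv::cuda::GpuMat dst1 = dst0;
//Deep copy: allocates memory and copies the data, GPU(Memory)-GPU(Memory)
cv::cuda::GpuMat dst2;
dst0.copyTo(dst2);
cv::cuda::GpuMat dst3 = dst0.clone();
//Deep copy: allocates memory and uploads the data, CPU(RAM)-GPU(Memory)
cv::cuda::GpuMat dst4;
dst4.upload(host_src);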
A shallow copy copies only the header and the data pointer, so both objects share the same underlying memory and a change made through one is visible through the other. It only works within the same memory space. A deep copy duplicates the data itself, so changing one object does not affect the other, and it is not restricted to one memory space: the copy can stay in GPU memory or cross between RAM and GPU memory.
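The difference between the two kinds of copy can be modelled in plain C++ with a reference-counted buffer. This is a minimal sketch, assuming a hypothetical ToyMat type that only mimics the copy semantics described above; it is not how OpenCV implements Mat or GpuMat internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Toy model of a matrix header that shares a reference-counted pixel buffer.
// ToyMat is hypothetical and exists only to illustrate copy semantics.
struct ToyMat {
    int rows = 0;
    int cols = 0;
    std::shared_ptr<std::vector<std::uint8_t>> data;

    ToyMat() = default;
    ToyMat(int r, int c, std::uint8_t value)
        : rows(r), cols(c),
          data(std::make_shared<std::vector<std::uint8_t>>(
              static_cast<std::size_t>(r) * c, value)) {}

    // The implicit copy constructor is the shallow copy:
    // it copies rows, cols and the pointer, not the pixels.

    // Deep copy: allocate a new buffer and copy every element into it.
    ToyMat clone() const {
        ToyMat out;
        out.rows = rows;
        out.cols = cols;
        if (data) {
            out.data = std::make_shared<std::vector<std::uint8_t>>(*data);
        }
        return out;
    }

    std::uint8_t& at(int row, int col) {
        assert(data && row >= 0 && row < rows && col >= 0 && col < cols);
        return (*data)[static_cast<std::size_t>(row) * cols + col];
    }
};
```

With ToyMat a(2, 2, 10), writing through ToyMat shallow = a changes what a sees, while writing through ToyMat deep = a.clone() leaves a untouched.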
So the main reason to use CUDA is to increase performance when running complex algorithms on a moderate amount of data in real time, not when pushing huge amounts of data through the GPU. Transferring too much data at once saturates I/O bandwidth, and every transfer between RAM and GPU memory costs computing time and adds latency. To help with this, the OpenCV developers designed two lightweight classes in OpenCV 2, PtrStepSz and PtrStep. They wrap only the data pointer and row step (plus the size, for PtrStepSz), so a GpuMat can be handed to a CUDA kernel without copying its pixel data, which keeps computation fast and latency low.
I am still unsure how to pass a VideoCapture stream directly to GPU memory. In that case we would not need to upload or download frames between RAM and GPU memory; we could grab frames from the stream straight into a GpuMat. The problem is that the NVIDIA Video Decoder (NVCUVID) is deprecated, and cv::Ptr<cv::cudacodec::VideoReader> d_reader = cv::cudacodec::createVideoReader(fname) no longer works. With the older CUDA version 8, createVideoReader() would pass camera frames directly to GPU memory.
Here is part of an example I wrote that grabs frames from a stream and linearly blends each one with a static image using OpenCV CUDA:
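// capture, WINDOW_NAME, GRAB_DELAY, ALPHA and BETA are defined elsewhere in the full program.
Mat video_frame_temp, temp_frame_downloaded;   // host (RAM) frames
GpuMat mask_image, video_frame, temp_frame;    // device (GPU memory) frames

void grabFrame();
void blendImage();
void outputImage();

void initWindow()
{
    namedWindow(WINDOW_NAME, WINDOW_AUTOSIZE);
}

void blendFrame()
{
    // getWindowProperty returns -1 once the window has been closed
    while (getWindowProperty(WINDOW_NAME, WND_PROP_AUTOSIZE) != -1)
    {
        grabFrame();
        if (video_frame_temp.empty())
        {
            break;  // end of stream
        }
        video_frame.upload(video_frame_temp);  // RAM -> GPU memory
        blendImage();
        outputImage();
        waitKey(GRAB_DELAY);
    }
}

void grabFrame()
{
    capture >> video_frame_temp;
}

void blendImage()
{
    // mask_image must have the same size and type as video_frame
    cv::cuda::addWeighted(video_frame, ALPHA, mask_image, BETA, 0.0, temp_frame);
}

void outputImage()
{
    temp_frame.download(temp_frame_downloaded);  // GPU memory -> RAM
    imshow(WINDOW_NAME, temp_frame_downloaded);
}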
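For reference, the arithmetic behind the addWeighted call in blendImage is a per-element weighted sum, dst = saturate(src1 * alpha + src2 * beta + gamma). The following CPU sketch shows it for 8-bit data. blendReference is a hypothetical helper, not an OpenCV function, and its rounding at exact .5 values may differ from OpenCV's.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for an 8-bit weighted sum:
// dst[i] = saturate(a[i] * alpha + b[i] * beta + gamma).
// blendReference is a hypothetical helper, not part of OpenCV.
inline std::vector<std::uint8_t> blendReference(
    const std::vector<std::uint8_t>& a, double alpha,
    const std::vector<std::uint8_t>& b, double beta,
    double gamma = 0.0) {
    assert(a.size() == b.size());  // both inputs must have the same size
    std::vector<std::uint8_t> dst(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double v = a[i] * alpha + b[i] * beta + gamma;
        // Round to nearest, then clamp to the valid 8-bit range [0, 255].
        const long rounded = std::lround(v);
        dst[i] = static_cast<std::uint8_t>(std::min(255L, std::max(0L, rounded)));
    }
    return dst;
}
```

With alpha = 0.7 and beta = 0.3, a frame value of 100 blended with a mask value of 50 gives 100 * 0.7 + 50 * 0.3 = 85. Weights that sum to more than 1 can exceed 255, which is where the clamp matters.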
With OpenGL support, we can simply change namedWindow(WINDOW_NAME, WINDOW_AUTOSIZE) to namedWindow(WINDOW_NAME, WINDOW_OPENGL). imshow can then read temp_frame directly from GPU memory. For example, outputImage could be written like this:
void outputImage()
{
    imshow(WINDOW_NAME, temp_frame);
}
This saves the time otherwise spent copying from GPU memory to RAM, so the processed image is displayed faster.
Hey, great work here. This OpenCV stuff is not very well documented when it comes to the proper way of processing video with CUDA.
Have you solved this piece here?:
I'm trying to use the VideoCapture with FFMPEG backend + CUDA to load the video file direct to GPU and avoid the upload/download delay. Any ideas since you wrote this?
Thanks!