opencv - Why is cv2.dnn work faster when I use CPU rather than GPU?

Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
I am new to openCV - CUDA so I have been testing the most simple one which is loading a model on GPU rather than CPU to see how fast GPU is and I am horrified at the result I get.
----------------------------------------------------------------
---         GPU                vs             CPU            ---
---                                                          ---
--- 21.913758993148804 seconds ---3.0586464405059814 seconds ---
--- 22.379303455352783 seconds ---3.1384341716766357 seconds ---
--- 21.500431060791016 seconds ---2.9400241374969482 seconds ---
--- 21.292986392974854 seconds ---3.3738017082214355 seconds ---
--- 20.88358211517334 seconds  ---3.388749599456787 seconds  ---
I will give my code snippet in case I may be doing something wrong that cause GPU time to spike so high.
def loadYolo():
    net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
    classes = []
    with open("coco.names", "r") as f:
        classes = [line.strip() for line in f.readlines()]
    layer_names = net.getLayerNames()
    output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
    return net,classes,layer_names,output_layers
@socketio.on('image')
def image(data_image):
    sbuf = StringIO()
    sbuf.write(data_image)
    b = io.BytesIO(base64.b64decode(data_image))
    if(str(data_image) == 'data:,'):
    else:
        pimg = Image.open(b)
        frame = cv2.cvtColor(np.array(pimg), cv2.COLOR_RGB2BGR)
        frame = resize(frame, width=700)
        frame = cv2.flip(frame, 1)
        net,classes,layer_names,output_layers=loadYolo()
        height, width, channels = frame.shape
        blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
        swapRB=True, crop=False)
        net.setInput(blob)
        outs = net.forward(output_layers)
        print("--- %s seconds ---" % (time.time() - start_time))
        class_ids = []
        confidences = []
        boxes = []
        for out in outs:
            for detection in out:
                scores = detection[5:]
                class_id = np.argmax(scores)
                confidence = scores[class_id]
                if confidence > 0.5:
                    # Object detected
                    center_x = int(detection[0] * width)
                    center_y = int(detection[1] * height)
                    w = int(detection[2] * width)
                    h = int(detection[3] * height)
                    # Rectangle coordinates
                    x = int(center_x - w / 2)
                    y = int(center_y - h / 2)
                    boxes.append([x, y, w, h])
                    confidences.append(float(confidence))
                    class_ids.append(class_id)
        indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
        font = cv2.FONT_HERSHEY_PLAIN
        colors = np.random.uniform(0, 255, size=(len(classes), 3))
        for i in range(len(boxes)):
            if i in indexes:
                x, y, w, h = boxes[i]
                label = str(classes[class_ids[i]])
                color = colors[class_ids[i]]
                cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
                cv2.putText(frame, label, (x, y + 30), font, 1, color, 2)
        imgencode = cv2.imencode('.jpg', frame)[1]
        stringData = base64.b64encode(imgencode).decode('utf-8')
        b64_src = 'data:image/jpg;base64,'
        stringData = b64_src + stringData
        emit('response_back', stringData)
My Gpu is Nvidia 1050 Ti and my CPU is i5 gen 9 in case someone need the specification. Can someone please enlighten me as I am super confused right now? Thank you very much
EDIT 1: I tried to use cv2.dnn.DNN_TARGET_CUDA instead of cv2.dnn.DNN_TARGET_CUDA_FP16, but the time is still terrible compare to CPU. Below is the GPU result :
--- 10.91195559501648 seconds ---
--- 11.344025135040283 seconds ---
--- 11.754926204681396 seconds ---
--- 12.779674530029297 seconds ---
Below is CPU result :
--- 4.780993223190308 seconds ---
--- 4.910650253295898 seconds ---
--- 4.990436553955078 seconds ---
--- 5.246175050735474 seconds ---
it is still slower than CPU
EDIT 2: OpenCv is 4.5.0, CUDA 11.1 and CUDNN 8.0.1
                Did you check the GPU resources usage during executing GPU version? If yes, what was the load?
– trojek
                Nov 23, 2021 at 9:30
You should definitely only load YOLO once. Recreating it for every image that comes through the socket is slow for both CPU and GPU, but GPU takes longer to initially load which is why you're seeing it run slower than CPU.
I don't understand what you mean by using an LRU cache for your YOLO model. Without seeing the rest of your code structure I can't make any real suggestions, but can you try at least temporarily putting the network into the global space just to see if it runs faster? (remove the function altogether and put its body in the global space)
something like this
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
classes = []
with open("coco.names", "r") as f:
    classes = [line.strip() for line in f.readlines()]
layer_names = net.getLayerNames()
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
@socketio.on('image')
def image(data_image):
    sbuf = StringIO()
    sbuf.write(data_image)
    b = io.BytesIO(base64.b64decode(data_image))
    if(str(data_image) == 'data:,'):
    else:
        pimg = Image.open(b)
        frame = cv2.cvtColor(np.array(pimg), cv2.COLOR_RGB2BGR)
        frame = resize(frame, width=700)
        frame = cv2.flip(frame, 1)
        height, width, channels = frame.shape
        blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
        swapRB=True, crop=False)
        net.setInput(blob)
        outs = net.forward(output_layers)
        print("--- %s seconds ---" % (time.time() - start_time))
        class_ids = []
        confidences = []
        boxes = []
        for out in outs:
            for detection in out:
                scores = detection[5:]
                class_id = np.argmax(scores)
                confidence = scores[class_id]
                if confidence > 0.5:
                    # Object detected
                    center_x = int(detection[0] * width)
                    center_y = int(detection[1] * height)
                    w = int(detection[2] * width)
                    h = int(detection[3] * height)
                    # Rectangle coordinates
                    x = int(center_x - w / 2)
                    y = int(center_y - h / 2)
                    boxes.append([x, y, w, h])
                    confidences.append(float(confidence))
                    class_ids.append(class_id)
        indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
        font = cv2.FONT_HERSHEY_PLAIN
        colors = np.random.uniform(0, 255, size=(len(classes), 3))
        for i in range(len(boxes)):
            if i in indexes:
                x, y, w, h = boxes[i]
                label = str(classes[class_ids[i]])
                color = colors[class_ids[i]]
                cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
                cv2.putText(frame, label, (x, y + 30), font, 1, color, 2)
        imgencode = cv2.imencode('.jpg', frame)[1]
        stringData = base64.b64encode(imgencode).decode('utf-8')
        b64_src = 'data:image/jpg;base64,'
        stringData = b64_src + stringData
        emit('response_back', stringData)
                well doing what you do i get  cv2.error: OpenCV(4.5.1) D:\OpencvBuild\opencv-4.5.1\modules\dnn\src\dnn.cpp:1070: error: (-215:Assertion failed) memHosts.find(lp) == memHosts.end() in function 'cv::dnn::dnn4_v20201117::BlobManager::addHost' haha. I suspect because I am streaming a lot of images doing what you told me to do means there is only one net instances and there is a lot of images need to be processed which make them request the same instances causing that error
– user12088653
                Apr 28, 2021 at 1:52
                the code i share is almost all my flask code XD. did you need me to share my html code too? I am willing to share if you need it.
– user12088653
                Apr 28, 2021 at 1:54
From the previous two answer I manage to get the solution changing :
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
into :
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
have help to twice the GPU speed due to my GPU type is not compatible with FP16 this is thanks to Amir Karami and also despite Ian Chu answer did not solve my problem it give me basis to forcefully make all the images to only use one net instances this actually lower the processing time significantly from each needing 10 second into 0.03-0.04 seconds thus surpassing CPU speed by many times. The reason I did not accept both answer because neither really solve my problem but both become strong basis to my solution so I still upvote them. I just leave my answer here in case anyone encounter this problem like me.
DNN_TARGET_CUDA_FP16 refers to 16-bit floating-point. since your gpu is 1050 Ti, your gpu seems not works too well with FP16.you can check it from here and your compute capability from here.
i think you should change this line :
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
into :
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
                @Albert does it work? cause i just wrote my guess since i dont have 6.1 compute capability gpu.let me know if it works well..thanks
– Amir Karami
                Apr 27, 2021 at 4:54
                @ Amir Karami  Thanks for answering. It did indeed do considerable well compare to previous in fact 2 times faster but still slower much more than CPU which is illogical because GPU should be faster haha
– user12088653
                Apr 27, 2021 at 5:03
                it takes considerable time to load the image over cuda memory so it may not be effective for single image, try running the comparison for a batch of images and check if there is any performance gain.
– flamelite
                Apr 27, 2021 at 5:52
                @flamelite actually it is a few image as you can see i am using socket io and i am actually streaming a lot of images from my client to server to be processed. The CPU is definitely faster for some reason haha. The only side effect is that it eat a ridiculously large amount of CPU that is the reason why i am migrating it to GPU lol but with the current situation the times take too long which is disappointing because according to what i know it should be faster
– user12088653
                Apr 27, 2021 at 5:57
        Thanks for contributing an answer to Stack Overflow!
Please be sure to answer the question. Provide details and share your research!
But avoid …
Asking for help, clarification, or responding to other answers.
Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.