Collectives™ on Stack Overflow
Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
I am new to openCV - CUDA so I have been testing the most simple one which is loading a model on GPU rather than CPU to see how fast GPU is and I am horrified at the result I get.
----------------------------------------------------------------
--- GPU vs CPU ---
--- ---
--- 21.913758993148804 seconds ---3.0586464405059814 seconds ---
--- 22.379303455352783 seconds ---3.1384341716766357 seconds ---
--- 21.500431060791016 seconds ---2.9400241374969482 seconds ---
--- 21.292986392974854 seconds ---3.3738017082214355 seconds ---
--- 20.88358211517334 seconds ---3.388749599456787 seconds ---
I will give my code snippet in case I may be doing something wrong that cause GPU time to spike so high.
def loadYolo():
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
classes = []
with open("coco.names", "r") as f:
classes = [line.strip() for line in f.readlines()]
layer_names = net.getLayerNames()
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
return net,classes,layer_names,output_layers
@socketio.on('image')
def image(data_image):
sbuf = StringIO()
sbuf.write(data_image)
b = io.BytesIO(base64.b64decode(data_image))
if(str(data_image) == 'data:,'):
else:
pimg = Image.open(b)
frame = cv2.cvtColor(np.array(pimg), cv2.COLOR_RGB2BGR)
frame = resize(frame, width=700)
frame = cv2.flip(frame, 1)
net,classes,layer_names,output_layers=loadYolo()
height, width, channels = frame.shape
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
swapRB=True, crop=False)
net.setInput(blob)
outs = net.forward(output_layers)
print("--- %s seconds ---" % (time.time() - start_time))
class_ids = []
confidences = []
boxes = []
for out in outs:
for detection in out:
scores = detection[5:]
class_id = np.argmax(scores)
confidence = scores[class_id]
if confidence > 0.5:
# Object detected
center_x = int(detection[0] * width)
center_y = int(detection[1] * height)
w = int(detection[2] * width)
h = int(detection[3] * height)
# Rectangle coordinates
x = int(center_x - w / 2)
y = int(center_y - h / 2)
boxes.append([x, y, w, h])
confidences.append(float(confidence))
class_ids.append(class_id)
indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
font = cv2.FONT_HERSHEY_PLAIN
colors = np.random.uniform(0, 255, size=(len(classes), 3))
for i in range(len(boxes)):
if i in indexes:
x, y, w, h = boxes[i]
label = str(classes[class_ids[i]])
color = colors[class_ids[i]]
cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
cv2.putText(frame, label, (x, y + 30), font, 1, color, 2)
imgencode = cv2.imencode('.jpg', frame)[1]
stringData = base64.b64encode(imgencode).decode('utf-8')
b64_src = 'data:image/jpg;base64,'
stringData = b64_src + stringData
emit('response_back', stringData)
My Gpu is Nvidia 1050 Ti and my CPU is i5 gen 9 in case someone need the specification. Can someone please enlighten me as I am super confused right now? Thank you very much
EDIT 1: I tried to use cv2.dnn.DNN_TARGET_CUDA instead of cv2.dnn.DNN_TARGET_CUDA_FP16, but the time is still terrible compare to CPU. Below is the GPU result :
--- 10.91195559501648 seconds ---
--- 11.344025135040283 seconds ---
--- 11.754926204681396 seconds ---
--- 12.779674530029297 seconds ---
Below is CPU result :
--- 4.780993223190308 seconds ---
--- 4.910650253295898 seconds ---
--- 4.990436553955078 seconds ---
--- 5.246175050735474 seconds ---
it is still slower than CPU
EDIT 2: OpenCv is 4.5.0, CUDA 11.1 and CUDNN 8.0.1
–
You should definitely only load YOLO once. Recreating it for every image that comes through the socket is slow for both CPU and GPU, but GPU takes longer to initially load which is why you're seeing it run slower than CPU.
I don't understand what you mean by using an LRU cache for your YOLO model. Without seeing the rest of your code structure I can't make any real suggestions, but can you try at least temporarily putting the network into the global space just to see if it runs faster? (remove the function altogether and put its body in the global space)
something like this
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
classes = []
with open("coco.names", "r") as f:
classes = [line.strip() for line in f.readlines()]
layer_names = net.getLayerNames()
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
@socketio.on('image')
def image(data_image):
sbuf = StringIO()
sbuf.write(data_image)
b = io.BytesIO(base64.b64decode(data_image))
if(str(data_image) == 'data:,'):
else:
pimg = Image.open(b)
frame = cv2.cvtColor(np.array(pimg), cv2.COLOR_RGB2BGR)
frame = resize(frame, width=700)
frame = cv2.flip(frame, 1)
height, width, channels = frame.shape
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
swapRB=True, crop=False)
net.setInput(blob)
outs = net.forward(output_layers)
print("--- %s seconds ---" % (time.time() - start_time))
class_ids = []
confidences = []
boxes = []
for out in outs:
for detection in out:
scores = detection[5:]
class_id = np.argmax(scores)
confidence = scores[class_id]
if confidence > 0.5:
# Object detected
center_x = int(detection[0] * width)
center_y = int(detection[1] * height)
w = int(detection[2] * width)
h = int(detection[3] * height)
# Rectangle coordinates
x = int(center_x - w / 2)
y = int(center_y - h / 2)
boxes.append([x, y, w, h])
confidences.append(float(confidence))
class_ids.append(class_id)
indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
font = cv2.FONT_HERSHEY_PLAIN
colors = np.random.uniform(0, 255, size=(len(classes), 3))
for i in range(len(boxes)):
if i in indexes:
x, y, w, h = boxes[i]
label = str(classes[class_ids[i]])
color = colors[class_ids[i]]
cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
cv2.putText(frame, label, (x, y + 30), font, 1, color, 2)
imgencode = cv2.imencode('.jpg', frame)[1]
stringData = base64.b64encode(imgencode).decode('utf-8')
b64_src = 'data:image/jpg;base64,'
stringData = b64_src + stringData
emit('response_back', stringData)
–
–
From the previous two answer I manage to get the solution changing :
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
into :
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
have help to twice the GPU speed due to my GPU type is not compatible with FP16 this is thanks to Amir Karami and also despite Ian Chu answer did not solve my problem it give me basis to forcefully make all the images to only use one net instances this actually lower the processing time significantly from each needing 10 second into 0.03-0.04 seconds thus surpassing CPU speed by many times. The reason I did not accept both answer because neither really solve my problem but both become strong basis to my solution so I still upvote them. I just leave my answer here in case anyone encounter this problem like me.
DNN_TARGET_CUDA_FP16
refers to 16-bit floating-point. since your gpu is 1050 Ti, your gpu seems not works too well with FP16.you can check it from here and your compute capability from here.
i think you should change this line :
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
into :
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
–
–
–
–
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.