import cv2
import numpy as np
import time
# We don't need to write our own loading function
# cv2.dnn.readNet returns a network object that we can use later on for prediction

net = cv2.dnn.readNet('/content/drive/My Drive/ObjectDetectionv2/yolov3.weights','/content/drive/My Drive/ObjectDetectionv2/yolov3.cfg')

# Extract the object names from coco.names and put them into a list
classes = []
with open('/content/drive/My Drive/ObjectDetectionv2/coco.names','r') as f:
  classes = f.read().splitlines()

print(classes)
['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'sofa', 'pottedplant', 'bed', 'diningtable', 'toilet', 'tvmonitor', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']
from google.colab.patches import cv2_imshow

cap = cv2.VideoCapture('/content/drive/My Drive/ObjectDetectionv2/Road_traffic_video2.mp4') # from video
#cap = cv2.VideoCapture(0) # from webcam
#img = cv2.imread('/content/drive/My Drive/ObjectDetectionv2/restaruant2.jpg') # from image
font = cv2.FONT_HERSHEY_PLAIN
colors = np.random.uniform(0, 255, size=(len(classes), 3))
#img = cv2.resize(img, None,fx=0.4, fy=0.4)
start_time = time.time()
img_id = 0
while True:
  ret, img = cap.read()
  if not ret: # stop when the video ends or a frame can't be read
    break
  img_id += 1
  height, width, _ = img.shape
  # we need to resize the image to a square input that YOLOv3 accepts (a multiple of 32)
  # and normalize it by dividing the pixel values by 255
  # here we use 320x320; larger inputs (e.g. 416x416) tend to be more accurate but slower

  blob = cv2.dnn.blobFromImage(img,1/255,(320,320),(0,0,0),swapRB=True,crop=False)
  # first: the image, second: the scale factor (normalization), third: the spatial size,
  # fourth: no mean subtraction, fifth: swapRB=True converts BGR to RGB, sixth: no cropping

  # passing this blob into our model inside a net
  net.setInput(blob)

  # get the output layer names, e.g. ['yolo_82', 'yolo_94', 'yolo_106']
  output_layers_names = net.getUnconnectedOutLayersNames() 

  output_all = net.getLayerNames() # to get names of all layers
  # print(output_layers_names)
  # print(output_all)

  # forward propagation
  # pass the output layer names to get the outputs at those layers
  layersOutputs = net.forward(output_layers_names)
  # we need to extract the bounding boxes
  # confidences and the predicted classes
  boxes = []
  confidences = []
  class_ids = []

  #print(layersOutputs)

  # 3 output scales; each detection row holds box coordinates, an objectness score and the class scores
  for output in layersOutputs:
    # each detection: 4 box coordinates + 1 objectness score + 80 class scores = 85 values
    for detection in output:
      scores = detection[5:] # the 80 class predictions
      class_id = np.argmax(scores) # index of the highest class score
      confidence = scores[class_id]

      if confidence > 0.5:
        # Object detected
        # the network outputs box coordinates normalized to [0, 1] relative to the image,
        # so denormalize by multiplying by the original width and height
        center_x = int(detection[0] * width)
        center_y = int(detection[1] * height)
        w = int(detection[2] * width)
        h = int(detection[3] * height)
        
        # Rectangle coordinates
        # convert the box center to the top-left corner, which is what OpenCV's drawing functions expect
        x = int(center_x - w / 2)
        y = int(center_y - h / 2)
        
        # append everything to draw boxes
        boxes.append([x, y, w, h])
        confidences.append(float(confidence))
        class_ids.append(class_id)

  #print(len(boxes))
  # third parameter is the confidence threshold, the last is the NMS (IoU) threshold
  indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
  #print(indexes)



  if len(indexes)>0:
    for i in indexes.flatten():
      x, y, w, h = boxes[i]
      label = str(classes[class_ids[i]])
      confidence = str(round(confidences[i],2))
      color = colors[class_ids[i]]
      cv2.rectangle(img, (x, y), (x + w, y + h), color, 6)
      cv2.putText(img, label + " " + confidence, (x, y + 50), font, 5, color, 3)
  # cv2.imshow('Image', img) is disabled in Colab because it crashes Jupyter sessions,
  # so we use cv2_imshow from google.colab.patches instead
  elapsed_time = time.time() - start_time
  fps = img_id/elapsed_time
  cv2.putText(img,"FPS: "+str(round(fps,2)),(10,30),font,3,(0,0,0),1)
  cv2_imshow(img)
  key = cv2.waitKey(1)
  if key==27:
    break

cap.release()
cv2.destroyAllWindows()
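
To connect this code to the architecture discussion below, it can help to print the raw shapes of the three YOLO outputs. A small sanity check, assuming the cells above have already run (the exact row counts depend on the blob size, here 320x320):

dummy = np.zeros((320, 320, 3), dtype=np.uint8) # blank test image
blob = cv2.dnn.blobFromImage(dummy, 1/255, (320, 320), (0, 0, 0), swapRB=True, crop=False)
net.setInput(blob)
layer_names = net.getUnconnectedOutLayersNames()
for name, out in zip(layer_names, net.forward(layer_names)):
  print(name, out.shape)
# each output has shape (num_boxes, 85): 85 = 4 box coordinates + 1 objectness score
# + 80 class scores, and num_boxes = grid_w * grid_h * 3 anchors for that scale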

Feature Extractor

The previous YOLO versions used Darknet-19 (a custom neural network architecture written in C and CUDA) as the feature extractor, which, as the name suggests, has 19 layers. YOLO v2 added 11 more layers to Darknet-19, making it a 30-layer architecture in total. Still, the algorithm struggled to detect small objects because downsampling the input image loses fine-grained features.

YOLO v3 came up with a better architecture: the feature extractor is a hybrid of YOLO v2, Darknet-53 (a network pre-trained on ImageNet), and residual networks (ResNet). The network uses 53 convolution layers (hence the name Darknet-53), built from consecutive 3x3 and 1x1 convolution layers followed by skip connections (introduced by ResNet to help activations propagate through deeper layers without the gradient vanishing).
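
The notebook loads a pre-trained network through OpenCV's DNN module and never builds these layers by hand, but as a rough illustration, one Darknet-53 residual block (a 1x1 "reduce" convolution, a 3x3 "expand" convolution, and the skip connection) could be sketched in PyTorch along these lines (illustrative only, not part of the notebook):

import torch.nn as nn

class DarknetResidual(nn.Module):
  # Sketch of the repeated Darknet-53 building block: a 1x1 conv halves the
  # channels, a 3x3 conv restores them, and the input is added back (the skip).
  def __init__(self, channels):
    super().__init__()
    self.block = nn.Sequential(
        nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
        nn.BatchNorm2d(channels // 2),
        nn.LeakyReLU(0.1),
        nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.LeakyReLU(0.1),
    )

  def forward(self, x):
    return x + self.block(x) # skip connection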

The 53 layers of Darknet-53 are stacked with 53 more layers for the detection head, giving YOLO v3 a 106-layer fully convolutional architecture. This larger architecture makes it a bit slower than YOLO v2, but improves accuracy at the same time.

To visualize what the multi-scale extractor looks like, take the example of a 416x416 image. The stride of a layer is defined as the ratio by which it downsamples the input, so the three scales in our case are 52x52, 26x26, and 13x13, where 13x13 is used for larger objects and 26x26 and 52x52 for medium and smaller objects.
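
A quick worked example of those numbers:

input_size = 416
for stride in (32, 16, 8):
  grid = input_size // stride
  print(f"stride {stride}: {grid}x{grid} grid")
# stride 32: 13x13 grid (large objects)
# stride 16: 26x26 grid (medium objects)
# stride 8: 52x52 grid (small objects)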

Multi-scale Detector

An important feature of the YOLO v3 model is its multi-scale detector: detections are made by applying 1x1 detection kernels to feature maps of three different sizes at three different places in the network. The shape of the detection kernel is 1x1x(B*(5+C)).
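
For the COCO setup used in this notebook (B = 3 anchor boxes per scale, C = 80 classes), the kernel depth works out as:

B, C = 3, 80
print(B * (5 + C)) # 255, which is why the detection maps below are 13x13x255, 26x26x255 and 52x52x255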

Complete Network Architecture

The complete architecture of YOLO v3 combines both the feature extractor and the detector.

YOLO v3 makes predictions at three scales, obtained by downsampling the dimensions of the input image by 32, 16, and 8 respectively.

The first detection is made by the 82nd layer. For the first 81 layers, the image is downsampled by the network, such that the 81st layer has a stride of 32. If we have an image of 416 x 416, the resultant feature map is of size 13 x 13. One detection is made here using the 1 x 1 detection kernel, giving us a detection feature map of 13 x 13 x 255.

Then, the feature map from layer 79 is passed through a few convolutional layers before being upsampled by 2x to dimensions of 26 x 26. This feature map is then depth-concatenated with the feature map from layer 61. The combined feature map is again subjected to a few 1 x 1 convolutional layers to fuse the features from the earlier layer (61). Then, the second detection is made by the 94th layer, yielding a detection feature map of 26 x 26 x 255.

A similar procedure is followed again: the feature map from layer 91 is subjected to a few convolutional layers before being depth-concatenated with the feature map from layer 36. As before, a few 1 x 1 convolutional layers follow to fuse the information from the earlier layer (36). The final of the three detections is made at the 106th layer, yielding a feature map of size 52 x 52 x 255.
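
Adding up these three detection maps gives the total number of boxes YOLO v3 predicts for a 416x416 image, before any confidence filtering or non-maximum suppression:

total_boxes = sum(3 * g * g for g in (13, 26, 52))
print(total_boxes) # 10647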

The multi-scale detector ensures that small objects are also detected, something YOLO v2 was frequently criticized for. The upsampled layers, concatenated with the earlier layers, preserve the fine-grained features that help in detecting small objects.

The details of how this kernel looks in our model are described below.

Working of YOLO v3

The YOLO v3 network aims to predict bounding boxes (region of interest of the candidate object) of each object along with the probability of the class which the object belongs to.

For this, the model divides every input image into an SxS grid of cells, and each grid cell predicts B bounding boxes and C class probabilities for the objects whose centers fall inside that cell. The paper states that each bounding box may specialize in detecting a certain kind of object.

The number of bounding boxes B corresponds to the number of anchors being used. Each bounding box has 5 + C attributes, where '5' refers to the five bounding box attributes (the center coordinates (bx, by), height (bh), width (bw), and the confidence score) and C is the number of classes.

Passing this image through the fully convolutional network gives a 3-D tensor as output, because we are working on an SxS grid. The output looks like [S, S, B*(5+C)].

Let’s just understand this better using an example.

In the above example, we see that our input image is divided into 13 x 13 grid cells. Now, let us look at what happens in just a single grid cell.

Due to the multi-scale detection feature of YOLO v3, each detection scale predicts 3 boxes per grid cell, one for each of its anchors, hence B = 3. YOLO v3 was trained on the COCO dataset, which has 80 object categories or classes, hence C = 80.

Thus, the output is a 3-D tensor as mentioned earlier with dimensions (13, 13, 3*(80+5)).
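
A minimal sketch of how that tensor decomposes per grid cell and per anchor box (random values only, just to show the layout):

import numpy as np

out = np.random.rand(13, 13, 3 * (5 + 80)).astype(np.float32)
out = out.reshape(13, 13, 3, 85) # grid_y, grid_x, anchor, attributes
box_xywh = out[..., 0:4]    # bx, by, bw, bh
objectness = out[..., 4]    # confidence that the box contains an object
class_probs = out[..., 5:]  # 80 class scores
print(box_xywh.shape, objectness.shape, class_probs.shape)
# (13, 13, 3, 4) (13, 13, 3) (13, 13, 3, 80)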

Anchor Boxes

In the earlier years of object detection, researchers used the concept of a sliding window and ran an image classification algorithm on each window. They soon realized this was very inefficient, so they moved on to using ConvNets and running the entire image through the network in a single shot. Since a ConvNet outputs square matrices of feature values (something like 13x13 or 26x26 in the case of YOLO), the concept of a "grid" came into the picture. We define the square feature matrix as a grid, but the real problem arises when the objects to detect are not square; they can be of any shape (mostly rectangular). Thus, anchor boxes came into use.

Anchor boxes are pre-defined boxes with fixed aspect ratios. These aspect ratios are chosen before training by running K-means clustering on the bounding boxes of the entire dataset. The anchor boxes anchor to the grid cells and share the same centroid. YOLO v3 uses 3 anchor boxes for each detection scale, making a total of 9 anchor boxes.
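
For reference, the standard yolov3.cfg (the configuration loaded at the top of this notebook) lists these 9 anchors, given as (width, height) in pixels for a 416x416 input and grouped here by the scale that uses them:

anchors_per_scale = {
    "13x13 (large objects)": [(116, 90), (156, 198), (373, 326)],
    "26x26 (medium objects)": [(30, 61), (62, 45), (59, 119)],
    "52x52 (small objects)": [(10, 13), (16, 30), (33, 23)],
}
for scale, anchors in anchors_per_scale.items():
  print(scale, anchors)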

Non-Maximum Suppression

There is a chance that after the single forward pass the predicted output contains multiple bounding boxes for the same object, since they share the same centroid, but we only need the one bounding box that is best suited to the object.

For this, we can use a method called non-maximum suppression (NMS), which essentially cleans up the detections. We first define a confidence threshold and discard every bounding box whose confidence falls below it, which eliminates some boxes. This doesn't remove all the duplicates, so the next step of NMS is applied: sort the remaining bounding boxes by confidence in descending order and choose the one with the highest score as the most appropriate box for the object. Then we find all the other boxes that have a high Intersection over Union (IoU) with this box and eliminate those as well.
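
The notebook above delegates this step to cv2.dnn.NMSBoxes, but the procedure just described can be sketched in plain NumPy as follows (a simplified, illustrative version using the same [x, y, w, h] box format as the detection loop above):

import numpy as np

def iou(box, boxes):
  # Intersection over Union between one box and an array of boxes,
  # all in [x, y, w, h] format with (x, y) as the top-left corner.
  x1 = np.maximum(box[0], boxes[:, 0])
  y1 = np.maximum(box[1], boxes[:, 1])
  x2 = np.minimum(box[0] + box[2], boxes[:, 0] + boxes[:, 2])
  y2 = np.minimum(box[1] + box[3], boxes[:, 1] + boxes[:, 3])
  inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
  union = box[2] * box[3] + boxes[:, 2] * boxes[:, 3] - inter
  return inter / np.maximum(union, 1e-9)

def simple_nms(boxes, confidences, conf_threshold=0.5, iou_threshold=0.4):
  boxes = np.asarray(boxes, dtype=np.float32)
  confidences = np.asarray(confidences, dtype=np.float32)
  keep = confidences > conf_threshold     # 1. drop low-confidence boxes
  boxes, confidences = boxes[keep], confidences[keep]
  order = np.argsort(confidences)[::-1]   # 2. sort by confidence, highest first
  picked = []                             # indices into the filtered arrays
  while order.size > 0:
    best, rest = order[0], order[1:]
    picked.append(best)
    # 3. discard remaining boxes that overlap the chosen box too much
    order = rest[iou(boxes[best], boxes[rest]) < iou_threshold]
  return picked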

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive