(Day 29) How to do object detection?

Ivan Ivanov · January 30, 2024

Hello :) Today is Day 29!

A quick summary of today:

  • Object detection from DeepLearning.AI’s DL course

Today’s material is still part of the CNN course.

image

1. Classification with localization

image

If you give a picture to the model, the output includes both the class of the object you are looking for and the coordinates of that object's ‘bounding box’.

image

For example, y = [pc, bx, by, bh, bw, c1, c2, c3]. Here pc indicates whether there is an object or not; bx, by, bh, bw are the centre coordinates, height and width of the bounding box; and c1, c2, c3 are the class labels (c1 = pedestrian, c2 = car, c3 = motorcycle).
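To make the layout of y concrete, here is a minimal sketch that builds the 8-element label in plain Python. The helper name `make_label`, the example box values and the choice to fill the "don't care" entries with zeros are my own assumptions for illustration, not something from the course.

```python
# Class order follows the post: c1 = pedestrian, c2 = car, c3 = motorcycle.
CLASSES = ["pedestrian", "car", "motorcycle"]


def make_label(box=None, class_name=None):
    """Build y = [pc, bx, by, bh, bw, c1, c2, c3] (hypothetical helper)."""
    if box is None:
        # No object: pc = 0. The other entries are "don't care" values;
        # zeros are used here only as placeholders.
        return [0.0] * (5 + len(CLASSES))
    bx, by, bh, bw = box
    one_hot = [1.0 if name == class_name else 0.0 for name in CLASSES]
    return [1.0, bx, by, bh, bw] + one_hot


# A car centred at (0.5, 0.7) with height 0.3 and width 0.4 (made-up values).
y_car = make_label(box=(0.5, 0.7, 0.3, 0.4), class_name="car")
# A picture with no object at all.
y_background = make_label()
```

When pc is 0, the loss normally only looks at pc, so the placeholder values in the rest of the vector do not matter.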

**How do we train such a model?**

First, we train a classification model that decides whether an image crop contains a car or not. Then we use the Sliding Windows detection method: slide a window across the picture and run the classifier on each crop to find where the car is.

image

However, the computational cost of this method turned out to be too high, so a convolutional implementation was created. In the case above we move the window one position at a time, but with a convolution we can evaluate all windows at once.
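The cost difference is easier to see with numbers. The sketch below lists the window positions a naive sliding window would visit, and the size of the output map a convolutional implementation would produce in a single forward pass. The function names and the window size and stride are illustrative assumptions.

```python
def sliding_windows(img_h, img_w, win, stride):
    """Yield the (top, left) corner of every win x win window in the image."""
    for top in range(0, img_h - win + 1, stride):
        for left in range(0, img_w - win + 1, stride):
            yield top, left


def output_grid(img_h, img_w, win, stride):
    """Size of the prediction map a convolutional implementation outputs.

    Each cell of this map corresponds to one sliding-window position.
    """
    return ((img_h - win) // stride + 1, (img_w - win) // stride + 1)


# A 14x14 window moved over a 16x16 picture with stride 2 (assumed sizes).
positions = list(sliding_windows(16, 16, win=14, stride=2))
small_grid = output_grid(16, 16, win=14, stride=2)
large_grid = output_grid(28, 28, win=14, stride=2)
```

The naive version would call the classifier once per entry in `positions`, while the convolutional version shares the computation of overlapping windows and returns the whole grid in one pass.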

image

2. YOLO

image

The You Only Look Once (YOLO) algorithm is one of the most widely used object detection algorithms. It is fast because it processes the whole image in a single pass. Using a ‘grid-based approach’, the picture is divided into several cells, and the model calculates pc and the bounding box of the object for each cell.
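As a rough sketch of the grid idea: when building training labels, each object is assigned to the one cell that contains its centre point. The helper below is hypothetical and assumes the centre is given in normalised image coordinates between 0 and 1.

```python
def assign_to_cell(cx, cy, grid_size):
    """Return (row, col, bx, by) for an object centred at (cx, cy).

    cx, cy are normalised image coordinates in [0, 1].
    bx, by are the centre's position relative to the cell's top-left corner.
    """
    # Clamp so that a centre exactly on the right/bottom edge stays in range.
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    bx = cx * grid_size - col
    by = cy * grid_size - row
    return row, col, bx, by


# An object in the middle of the picture lands in the centre cell of a 3x3 grid.
row, col, bx, by = assign_to_cell(0.5, 0.5, grid_size=3)
```

With a 19x19 grid the same function works unchanged; only `grid_size` differs.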

3. Object localization evaluation

image

Localization can be evaluated with the Intersection over Union (IoU) method. Given a predicted bounding box and a ground-truth bounding box, the IoU is the area of overlap (yellow) divided by the area of union (blue).
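The IoU calculation can be sketched in a few lines. This version assumes boxes are given as corner coordinates (x1, y1, x2, y2), which is my own choice of format rather than the one used in the course.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes that share a 1x1 corner: overlap 1, union 4 + 4 - 1 = 7.
partial = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

An IoU of 1 means the boxes match exactly, and 0 means they do not overlap at all.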

image image

In practice, YOLO models often use a 19x19 grid, so the same object may be detected in several cells. A bounding box whose pc (probability of object presence) is below a threshold such as 0.6 can be discarded.
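Here is a minimal sketch of that filtering step. It first discards boxes with low pc, then greedily keeps the most confident box and drops any remaining box that overlaps it heavily (commonly called non-max suppression). The IoU threshold of 0.5, the box format (x1, y1, x2, y2) and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_and_suppress(detections, pc_threshold=0.6, iou_threshold=0.5):
    """detections is a list of (pc, box) pairs; returns the boxes to keep."""
    # Step 1: drop boxes with low confidence, most confident first.
    candidates = sorted(
        (d for d in detections if d[0] >= pc_threshold),
        key=lambda d: d[0],
        reverse=True,
    )
    # Step 2: keep a box only if it does not overlap an already kept box.
    kept = []
    for pc, box in candidates:
        if all(iou(box, kept_box) < iou_threshold for _, kept_box in kept):
            kept.append((pc, box))
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # overlaps the first box heavily
    (0.7, (20, 20, 30, 30)),  # a separate object
    (0.3, (40, 40, 50, 50)),  # below the pc threshold
]
kept = filter_and_suppress(detections)
```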

4. Anchor boxes

image

The idea of anchor boxes is that different objects have different typical shapes: for example, a car is usually wide, while a standing person is tall. If you define two anchor boxes for these two shapes, the output includes a separate prediction for each anchor box.

image

So the output is no longer 3x3x8 but 3x3x(8x2) = 3x3x16, because there are 2 anchor boxes.
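The shape arithmetic above can be checked with a small sketch. It uses nested Python lists instead of a tensor library, and the convention that anchor index 0 is the tall box and index 1 the wide box is a hypothetical choice of mine.

```python
def output_shape(grid_size, num_anchors, num_classes):
    """Each anchor predicts pc, bx, by, bh, bw plus one score per class."""
    per_anchor = 5 + num_classes
    return (grid_size, grid_size, num_anchors * per_anchor)


def empty_target(grid_size, num_anchors, num_classes):
    """Create an all-zero target volume as nested lists."""
    rows, cols, depth = output_shape(grid_size, num_anchors, num_classes)
    return [[[0.0] * depth for _ in range(cols)] for _ in range(rows)]


def write_object(target, row, col, anchor_idx, label):
    """Copy one 8-element label into the slot of the chosen anchor box."""
    start = anchor_idx * len(label)
    target[row][col][start:start + len(label)] = label


target = empty_target(grid_size=3, num_anchors=2, num_classes=3)
# A car (c2) in the centre cell, stored in the wide anchor (index 1).
car_label = [1.0, 0.5, 0.5, 0.3, 0.8, 0.0, 1.0, 0.0]
write_object(target, row=1, col=1, anchor_idx=1, label=car_label)
```

The first 8 values of that cell stay zero (no tall object), and the last 8 hold the car.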

5. Regional proposal (R-CNN)

image

R-CNN first runs a segmentation algorithm that marks candidate regions as blobs of different colours (numbers), and then runs the classifier only on those proposed regions rather than on every sliding window.

image

Below are algorithms that have been found to be faster than R-CNN

image image

6. Semantic segmentation

image

A semantically segmented image can be produced by removing the Dense layers of a normal classification model and upsampling with transpose convolutions instead (hence the Dense layers are removed in the pic above).
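To see how a transpose convolution makes the feature map bigger, here is a minimal pure-Python sketch with no padding. Each input value is multiplied by the whole filter and the result is added into the output at a stride-spaced position. The function name and the example numbers are my own.

```python
def transpose_conv2d(x, kernel, stride):
    """2D transpose convolution without padding, on nested lists."""
    n_h, n_w = len(x), len(x[0])
    f_h, f_w = len(kernel), len(kernel[0])
    # Output size for no padding: (n - 1) * stride + f.
    out_h = (n_h - 1) * stride + f_h
    out_w = (n_w - 1) * stride + f_w
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(n_h):
        for j in range(n_w):
            # Stamp a scaled copy of the filter at position (i*stride, j*stride);
            # overlapping stamps are summed.
            for a in range(f_h):
                for b in range(f_w):
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out


x = [[1, 2], [3, 4]]
ones = [[1, 1], [1, 1]]
upsampled = transpose_conv2d(x, ones, stride=2)    # 2x2 -> 4x4, no overlap
overlapped = transpose_conv2d(x, ones, stride=1)   # 2x2 -> 3x3, stamps overlap
```

With stride 2 and a 2x2 filter of ones, each input value is simply copied into its own 2x2 block, which is why the output is twice as large in each direction.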

image image

This method is called U-Net.

image

The U-Net implementation seems a bit hard right now, so I will take a note to do it later.

KCSE 2024

I don’t think I can learn much on my own for the next three days, starting tomorrow, because I will be at the KCSE 2024 conference, but I will summarize what is presented there. My presentation is on February 3rd and I’m looking forward to it!


That is all for today!

See you tomorrow :)

Original post in Korean