(Day 29) How to do object detection?

Ivan Ivanov · January 30, 2024

deep-learning cnn

Hello :) Today is Day 29!

A quick summary of today:

Object detection from DeepLearning.AI’s DL course

Today’s material is still part of the CNN course.

1. Classification with localization

Given a picture, the model outputs not only the class of the object it was looking for but also the coordinates of that object’s ‘bounding box’.

For example, the label y is [pc, bx, by, bh, bw, c1, c2, c3]. pc indicates whether there is an object or not, bx and by are the coordinates of the bounding box centre, bh and bw are its height and width, and c1, c2, c3 are the class indicators (c1 = pedestrian, c2 = car, c3 = motorcycle).
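To make the label layout concrete, here is a minimal sketch of how such a y could be built. The helper name `make_target` and the example box values are my own assumptions for illustration; in the "no object" case only pc matters, so the other entries are just filled with zeros as placeholders.

```python
# Hypothetical helper: builds y = [pc, bx, by, bh, bw, c1, c2, c3].
CLASSES = ["pedestrian", "car", "motorcycle"]

def make_target(label=None, box=None):
    """Return the target vector for one image; label=None means no object."""
    if label is None:
        # pc = 0; the remaining entries are "don't care" (zeros as placeholders).
        return [0.0] * (5 + len(CLASSES))
    bx, by, bh, bw = box
    # One-hot class indicators c1, c2, c3.
    one_hot = [1.0 if c == label else 0.0 for c in CLASSES]
    return [1.0, bx, by, bh, bw] + one_hot

y_car = make_target("car", (0.5, 0.7, 0.3, 0.4))  # made-up box values
y_background = make_target()
```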

**How do we train such a model?**

First, we train a classification model on closely cropped images so that it can tell whether a crop contains a car or not. Then we apply it with the Sliding Windows detection method: a window is moved across the picture and every crop is classified as car or not.

However, this method turned out to be too computationally expensive, so a convolutional implementation was created. Instead of moving the window one position at a time and running the classifier on each crop, a single convolutional pass over the whole image gives the predictions for all window positions at once.
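The cost of plain sliding windows is easier to see when the window positions are listed explicitly. This is only a sketch: the function name and the 16x16 image with a 14x14 window and stride 2 are assumed values for illustration. Each position would need its own classifier pass, which is exactly the repeated work the convolutional version shares.

```python
def sliding_windows(img_h, img_w, win, stride):
    """Yield the (top, left) corner of every win x win crop inside the image."""
    for top in range(0, img_h - win + 1, stride):
        for left in range(0, img_w - win + 1, stride):
            yield top, left

# Assumed sizes: 16x16 image, 14x14 window, stride 2.
positions = list(sliding_windows(16, 16, win=14, stride=2))
num_classifier_runs = len(positions)  # one forward pass per crop
```

With these sizes there are only 4 crops, but the count grows quickly for larger images and smaller strides.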

2. YOLO

The You Only Look Once (YOLO) algorithm is one of the most widely used object detection algorithms. It is fast because it processes the whole image in a single pass. Using the ‘grid-based approach’, the picture is divided into several cells (windows), and the model predicts pc and the bounding box of the object for each cell.
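As a rough sketch of the grid idea: an object is assigned to the cell that contains the centre of its bounding box, and bx, by are then expressed relative to that cell. The function name and the assumption that centres are given as fractions of the image size are mine, not from the course code.

```python
def assign_to_cell(cx, cy, grid=3):
    """Map a box centre (cx, cy), given in [0, 1) image-relative
    coordinates, to its grid cell and to cell-relative offsets."""
    col = min(int(cx * grid), grid - 1)
    row = min(int(cy * grid), grid - 1)
    bx = cx * grid - col  # horizontal offset inside the cell, in [0, 1)
    by = cy * grid - row  # vertical offset inside the cell, in [0, 1)
    return row, col, bx, by

# A centre in the middle horizontally and three quarters of the way down.
row, col, bx, by = assign_to_cell(0.5, 0.75, grid=3)
```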

3. Object localization evaluation

Localization can be evaluated with the Intersection over Union (IoU) method. Given a predicted bounding box and a real bounding box, the IoU is the area of overlap (yellow) divided by the area of union (blue).

In practice, a 19x19 grid is often used with the YOLO model, so the same object may be detected in several cells. Bounding boxes whose pc (probability of object presence) is below a threshold such as 0.6 can be discarded.
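A minimal sketch of the IoU calculation, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (this corner format is my assumption; the labels above use centre, height and width):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at 0 when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

score = iou((0, 0, 2, 2), (1, 1, 3, 3))  # overlap 1, union 4 + 4 - 1 = 7
```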
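The thresholding step could look roughly like this. The (pc, box) tuple format, the function name and the sample predictions are made up for illustration, and whether a box with pc exactly equal to the threshold is kept is my own choice here.

```python
def filter_by_score(predictions, threshold=0.6):
    """Keep only predictions whose pc reaches the threshold.
    Each prediction is a (pc, box) tuple (assumed format)."""
    return [p for p in predictions if p[0] >= threshold]

# Made-up predictions from several grid cells.
predictions = [
    (0.9, (1, 1, 3, 3)),
    (0.55, (1, 2, 3, 4)),
    (0.6, (0, 0, 2, 2)),
    (0.1, (5, 5, 6, 6)),
]
kept = filter_by_score(predictions, threshold=0.6)
```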

4. Anchor boxes

The idea behind anchor boxes is that objects have typical shapes: for example, a car is usually wide (horizontal), while a standing person is tall (vertical). So if you define one anchor box for each of these 2 shapes, the output includes a separate set of predictions for each anchor box.

So the output is no longer 3x3x8; it becomes 3x3x(8x2) = 3x3x16, because there are 2 anchor boxes.
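The shape arithmetic can be written down as a small sketch (the function name is hypothetical): each anchor box contributes pc, four box values and one score per class.

```python
def output_shape(grid, num_classes, num_anchors):
    """Shape of the output volume for a grid x grid layout."""
    per_anchor = 5 + num_classes  # pc, bx, by, bh, bw + class scores
    return (grid, grid, num_anchors * per_anchor)

single = output_shape(grid=3, num_classes=3, num_anchors=1)
double = output_shape(grid=3, num_classes=3, num_anchors=2)
```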

5. Region proposals (R-CNN)

R-CNN does not run the classifier on every window. Instead, a segmentation algorithm first proposes a small number of regions that are likely to contain an object, and the classifier is run only on those regions.

Faster variants of R-CNN, such as Fast R-CNN and Faster R-CNN, were proposed later.

6. Semantic segmentation

Semantic segmentation assigns a class to every pixel of the image. Such an output can be produced by removing the Dense layer at the end of a normal CNN and using transpose convolutions instead, which upsample the feature maps back to the size of the image.

This method is called U-Net.
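To see how a transpose convolution upsamples, here is a small pure-Python sketch. It assumes a square input, a square kernel and no padding (a simplification of my own), so the output side is (n - 1) * stride + k. Every input value is multiplied by the whole kernel and added into the output at its strided position.

```python
def transpose_conv2d(x, kernel, stride=2):
    """Naive 2D transpose convolution without padding.
    x and kernel are square lists of lists."""
    n, k = len(x), len(kernel)
    out_size = (n - 1) * stride + k
    out = [[0.0] * out_size for _ in range(out_size)]
    for i in range(n):
        for j in range(n):
            for a in range(k):
                for b in range(k):
                    # Overlapping contributions are summed.
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out

# A 2x2 input is upsampled to 4x4 with a 2x2 kernel of ones and stride 2.
up = transpose_conv2d([[1.0, 2.0], [3.0, 4.0]],
                      [[1.0, 1.0], [1.0, 1.0]], stride=2)
```

With stride equal to the kernel size there is no overlap, so each input value simply fills its own 2x2 block of the output.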

The U-Net implementation seems a bit hard right now, so I will take a note to do it later.

KCSE 2024

Starting tomorrow I will be at the KCSE 2024 conference for three days, so I don’t think I can learn much on my own, but I will summarize what is presented at the conference. My presentation is on February 3rd and I’m looking forward to it!

That is all for today!

See you tomorrow :)

Original post in Korean