Hello :) Today is Day 38!
A quick summary of today:
- Traffic sign detection model with PyTorch
- Learned about how Tesla’s autopilot works [1, 2, 3]
These days I stay up late, studying and then writing these blog posts ~
First, the model I struggled with yesterday - a model where I give it a picture of a sign and get back what kind of sign it is + the sign’s bounding box. Today, I am happy to say that I got it working :)
The dataset I used looks like this:
For the processed version, I took the bounding box (bbox) data, the ClassId and the image path. Augmentations - only a resize to 112x112 (3 channels by default). I used the albumentations library since it rescales the bounding boxes alongside the image, so no extra resizing of the bboxes was needed.
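The transform itself is tiny - a rough sketch, assuming the boxes are in pascal_voc style (xmin, ymin, xmax, ymax in pixels); the exact format string depends on how the annotations are stored:

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Resize to 112x112; bbox_params tells albumentations to rescale the boxes too.
# 'pascal_voc' (xmin, ymin, xmax, ymax in pixels) is an assumption - use whatever
# format the dataset's annotations are actually in.
transform = A.Compose(
    [A.Resize(112, 112), ToTensorV2()],  # ToTensorV2 just converts to a PyTorch tensor
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_ids"]),
)

# out = transform(image=image, bboxes=[[xmin, ymin, xmax, ymax]], class_ids=[14])
# out["image"] is the 112x112 tensor, out["bboxes"] are the rescaled boxes.
```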
Final train and val lengths were 31367 and 7842, after which I batched with size 32. An example image from the data:
Class 14 = Stop sign
The classification model was the same as yesterday, a variation of AlexNet.
For the bbox regression model, I used a pretrained efficientnet_b0, and made the final “CombinedModel”
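Roughly, it looks something like this (a simplified sketch, not my exact code - torchvision’s AlexNet stands in for yesterday’s variant, and the 43-class count is just a placeholder):

```python
import torch
import torch.nn as nn
from torchvision import models

class CombinedModel(nn.Module):
    """Sketch: one branch classifies the sign, the other regresses its bbox."""
    def __init__(self, num_classes: int = 43):  # placeholder class count
        super().__init__()
        # Classification branch - stand-in for yesterday's AlexNet variation
        self.classifier = models.alexnet(weights=None, num_classes=num_classes)
        # Bbox regression branch - pretrained efficientnet_b0 with a 4-value head
        self.regressor = models.efficientnet_b0(weights="IMAGENET1K_V1")
        in_features = self.regressor.classifier[1].in_features  # 1280 for b0
        self.regressor.classifier[1] = nn.Linear(in_features, 4)

    def forward(self, x: torch.Tensor):
        class_logits = self.classifier(x)  # (B, num_classes)
        bbox = self.regressor(x)           # (B, 4): xmin, ymin, xmax, ymax
        return class_logits, bbox
```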
For the losses:
- MSE for bbox
- cross entropy for the classification
- Adam optimizer with 0.001 learning rate
Trained for 10 epochs
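The training loop, roughly (assuming the CombinedModel sketch above and a train_loader that yields (images, bboxes, labels) batches - the actual batch format may differ):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CombinedModel().to(device)

cls_loss_fn = nn.CrossEntropyLoss()  # classification loss
bbox_loss_fn = nn.MSELoss()          # bbox regression loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    model.train()
    for images, bboxes, labels in train_loader:  # assumed batch format
        images = images.to(device).float()
        bboxes = bboxes.to(device).float()
        labels = labels.to(device)

        class_logits, pred_boxes = model(images)
        # Total loss = cross entropy (class) + MSE (box coordinates)
        loss = cls_loss_fn(class_logits, labels) + bbox_loss_fn(pred_boxes, bboxes)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```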
Seems like good results
Finally, showing some examples from the val data:
Maybe the data quality is not great, or maybe it is the resizing, but some of the images are very hard even for humans to distinguish.
In the end, I am glad - this is my first more complex model, and I also used transfer learning for the first time. Of course, it is a very simple model (give it an image of a sign, get back its classification and bounding box), but going forward, making it more complex - multi-object classification + bbox models - is a good idea.
For the second interesting thing today, I saw Andrej Karpathy’s name again and decided to search for some of his talks - turns out there are quite a few (nice!).
1st one - at ICML 2019 about Tesla’s AI, and 2nd one - AI for Full-Self Driving at Tesla
Both are around 30 minutes and cover similar material, which I will try to summarise below through some screenshots I took.
Some other companies use lidar, which sends out laser pulses from the top of the vehicle and maps the area around it so that the vehicle can ‘see’. But at Tesla they use just camera vision - more specifically, 8 cameras, each registering new information every 27ms.
At the start, their model was very simple - draw rectangles around cars:
But even with cars, this quickly became impractical, as cars come in many different shapes:
So their training process became a loop: get data from cars -> train -> deploy -> find new inaccuracies -> re-evaluate -> deploy again, etc.
So, next the model started to have a variety of inputs and classes (other vehicles, road markings, people, signs, traffic lights, environment, crosswalks, lane lines, etc.)
And each task has additional sub-tasks. For example, vehicles can come in many shapes and forms.
Apparently, as of 2020, they had 48 networks in total.
Given the complex nature of the problem, getting the loss weighting, training time and regularization right is hard, because some tasks have a lot of data (e.g. seeing cars) while others have very little.
He gave a funny example: early stopping is fine when we have 1 task:
But what happens when you have many?
Some tasks will overfit faster than others because they have more or less data, or are simply easier. The problem becomes how to manipulate the oversampling rates, task weights or regularization so that these lines’ minima land at the same spot.
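Just to make the idea concrete for myself (this is not Tesla’s code, only a toy illustration): those knobs boil down to things like per-task weights on the individual losses.

```python
import torch

def multitask_loss(task_losses: dict, task_weights: dict) -> torch.Tensor:
    """Weighted sum of per-task losses - the weights are one of the knobs to tune."""
    total = torch.zeros(())
    for name, loss in task_losses.items():
        total = total + task_weights.get(name, 1.0) * loss
    return total

# Hypothetical numbers: a data-rich task (cars) gets a lower weight,
# a data-poor one (signs) a higher weight, to balance how fast each overfits.
losses = {"cars": torch.tensor(0.3), "signs": torch.tensor(1.2), "lanes": torch.tensor(0.7)}
weights = {"cars": 0.5, "signs": 2.0, "lanes": 1.0}
print(multitask_loss(losses, weights))  # tensor(3.2500)
```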
His 3rd talk - Tesla AI day 2021
He mentioned a little about the structure. They take 1280x960 images from the cameras.
The backbone goes from a big picture with few channels to a smaller, more abstract representation with 512 channels.
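To get an intuition for what that means, here is a toy stack (not their actual RegNet backbone): each stride-2 stage halves the spatial size and grows the channel count, ending at 512 channels.

```python
import torch
import torch.nn as nn

# Toy backbone: every stage halves height/width and increases the channel count.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(),
)

x = torch.randn(1, 3, 960, 1280)  # one 1280x960 camera frame (H x W)
print(backbone(x).shape)          # torch.Size([1, 512, 30, 40])
```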
A high-level explanation would be: these feature maps are then fed into each task, and the model tries to understand the area it is in. They use ‘multi-camera’ fusion. For example, one long truck may appear on multiple cameras, and instead of each camera trying to identify the truck on its own, the cameras can pass information between them so that the whole truck is identified. He gave examples with cars passing by: when that happens, multiple cameras see the car and we get multiple detections of that car (orange), one from each single camera, but by using multi-camera fusion the vehicle gets one smooth representation.
Also, to remember what the car saw earlier, they started using a kind of RNN, so that the car knows its surroundings even if it does not see everything at the moment. For example, there is a bike 2 lanes to the right, but currently a truck is passing by and the cameras cannot see the bike; thanks to the ‘memory’, the car still knows that the bike is there (or could be moving alongside).
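Again, purely illustrative (not their actual architecture): a recurrent cell can carry a hidden ‘memory’ across frames, so earlier observations persist even when something is currently occluded.

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim = 512, 256
memory = nn.GRUCell(feature_dim, hidden_dim)

hidden = torch.zeros(1, hidden_dim)  # the running "memory" of the surroundings
for t in range(10):                  # one step per new camera frame
    fused_features = torch.randn(1, feature_dim)  # stand-in for the fused multi-camera features
    hidden = memory(fused_features, hidden)       # earlier observations persist in `hidden`

print(hidden.shape)  # torch.Size([1, 256])
```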
I am not yet able to completely understand such complex structures, but seeing the path ahead is exciting! I need to start reading research papers to broaden my knowledge of models and what is happening in academia.
That is all for today!
See you tomorrow :)