Video Segmentation & Understanding of Automotive Scenes

Overview

One pipeline, three tasks, temporal memory

Models that process one frame at a time tend to flicker: the same pixel jumps between labels from one frame to the next. This project treats a driving clip as a sequence instead. It carries a learned geometry state forward through the clip so that segmentation and depth stay steady over time, while a separate filter estimates how the camera itself moved.

Task 1

Semantic Segmentation

Per-pixel labelling into 22 CARLA classes (road, vehicles, pedestrians, buildings, vegetation…) using DeepLabv3 with a ResNet-50 backbone.

Task 2

Monocular Depth

Per-pixel metric depth from a single RGB frame via a ResNet50 encoder and a custom progressive-upsampling decoder.

Task 3

Camera Ego-Motion

How the camera itself moved between two frames, expressed as a 4×4 transform. The network regresses three translation and three rotation values from motion features.
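
As an illustration of how six regressed values can become a 4×4 transform, the sketch below assumes the three rotation values are Euler angles in radians (roll about x, pitch about y, yaw about z, composed as R = Rz·Ry·Rx). The page does not state the rotation parameterisation, so that convention and the function name pose_to_matrix are assumptions, not the project's CameraHead.

```python
import math

def pose_to_matrix(tx, ty, tz, roll, pitch, yaw):
    """Build a 4x4 homogeneous transform from 3 translations and 3 Euler angles.

    Assumption: angles are in radians and composed as
    R = Rz(yaw) @ Ry(pitch) @ Rx(roll).
    """
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr, tx],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr, ty],
        [-sp,     cp * sr,                cp * cr,                tz],
        [0.0,     0.0,                    0.0,                    1.0],
    ]

# A pure translation leaves the rotation block as the identity.
T = pose_to_matrix(1.0, 2.0, 3.0, 0.0, 0.0, 0.0)
```

The bottom row stays fixed at (0, 0, 0, 1), so only the six regressed values determine the matrix.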

Architecture

How the system fits together

A GeometryFilter carries a ConvGRU memory at the network bottleneck (2048 channels, 32×64 spatial) to produce temporally consistent segmentation and depth. Its embeddings feed an EgoMotionFilter that regresses the camera pose between frames.
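
The gated update that lets a GRU-style memory damp frame-to-frame flicker can be sketched on a single scalar feature. A ConvGRU applies the same equations at every pixel, with convolutions in place of the scalar multiplications. The weights, names, and gate convention below are illustrative assumptions, not values from the project.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    """One GRU update for a scalar input x and scalar hidden state h.

    w is a dict of hypothetical scalar weights; in a ConvGRU each product
    below would be a convolution over a feature map.
    """
    z = sigmoid(w["zx"] * x + w["zh"] * h + w["zb"])             # update gate
    r = sigmoid(w["rx"] * x + w["rh"] * h + w["rb"])             # reset gate
    cand = math.tanh(w["cx"] * x + w["ch"] * (r * h) + w["cb"])  # candidate state
    return (1.0 - z) * h + z * cand                              # blend old and new

# Hypothetical weights: a negative update-gate bias keeps most of the old state.
weights = {"zx": 0.0, "zh": 0.0, "zb": -2.0,
           "rx": 0.0, "rh": 0.0, "rb": 0.0,
           "cx": 1.0, "ch": 0.0, "cb": 0.0}

noisy_input = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # a feature that flips every frame
h = 0.0
states = []
for x in noisy_input:
    h = gru_step(x, h, weights)
    states.append(h)
```

With these weights the input swings between -1 and 1 while the hidden state moves by a much smaller amount per frame, which is the smoothing behaviour the memory is meant to provide.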

Input RGB sequence · 6 frames · 256×512

▼

GeometryFilter (273.4M params)

ResNet-50 backbone → ConvGRU (2048 ch) → Segmentation head (22 classes) + Depth decoder (1 ch)

▼ embeddings

EgoMotionFilter (40.5M params)

MotionEstimator (4096→128) → GRU (128) → CameraHead → 4×4 transform matrix

Model	Architecture	Parameters	Output
SegmentationModel	DeepLabv3-ResNet50	39.6M	22-class logits
DepthEstimationModel	ResNet50 + DepthDecoder	30.7M	1-channel depth
GeometryFilter	DeepLabv3 + ConvGRU + dual heads	273.4M	seg + depth
EgoMotionFilter	MotionEstimator + GRU + CameraHead	40.5M	4×4 pose matrix
Full pipeline	GeometryFilter + EgoMotionFilter	313.9M	seg + depth + pose

Results

Temporal consistency in action

The biggest win is stability over time. The naïve frame-by-frame baseline flickers; adding the ConvGRU geometry filter and then fine-tuning it gives noticeably steadier segmentation across the whole clip.
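
One simple way to put a number on flicker is the fraction of pixels whose predicted label changes between consecutive frames. This is an illustrative measure, not one the project reports, and it also counts legitimate changes caused by scene motion. The helper name flicker_rate is hypothetical.

```python
def flicker_rate(label_frames):
    """Fraction of pixels whose label differs between consecutive frames.

    label_frames: list of frames, each a 2D list of integer class IDs
    with identical shape.
    """
    changed = 0
    total = 0
    for prev, curr in zip(label_frames, label_frames[1:]):
        for prev_row, curr_row in zip(prev, curr):
            for a, b in zip(prev_row, curr_row):
                changed += a != b
                total += 1
    return changed / total if total else 0.0

# Toy 2x2 clips using class IDs 7 (Road) and 8 (Sidewalk).
steady = [[[7, 7], [8, 8]]] * 3
flickery = [[[7, 7], [8, 8]], [[7, 8], [8, 8]], [[7, 7], [8, 8]]]
```

Here flicker_rate(steady) is 0.0, and flicker_rate(flickery) is 0.25 because one of four pixels changes in each of the two frame transitions.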

Segmentation, sequences 0 to 4

Each sequence is shown as three clips:

Naïve baseline: per-frame, no temporal memory
Fine-tuned: temporal model, fine-tuned
End-to-end: full temporal pipeline

Depth target maps

These maps show the depth the model was trained to recover, with closer surfaces lighter and distant ones darker. The model's own depth head turned out to be the weakest part of the system, so we show the target maps here and let the point clouds below carry the depth story.

Depth target, sequence 0

Depth target, sequence 1

3D point-cloud reconstruction

Depth lifted into 3D and coloured by predicted semantic class, then placed in the scene using the estimated camera poses.
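
Lifting a depth value into a world-frame point can be sketched with a pinhole camera model. The intrinsics below come from an assumed 90° horizontal field of view at 512 pixels width (the page does not state the camera intrinsics), depth is treated as distance along the optical axis, and the axis convention (x right, y down, z forward) is likewise an assumption. All function names are hypothetical.

```python
import math

def intrinsics_from_fov(width, height, fov_deg):
    """Pinhole focal length and principal point from a horizontal field of view."""
    focal = width / (2.0 * math.tan(math.radians(fov_deg) / 2.0))
    return focal, width / 2.0, height / 2.0

def backproject(u, v, depth, focal, cx, cy):
    """Pixel (u, v) with depth along the optical axis -> camera-frame point."""
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    return (x, y, depth)

def to_world(point, cam_to_world):
    """Apply a 4x4 camera-to-world transform (nested lists) to a 3D point."""
    x, y, z = point
    return tuple(
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in cam_to_world[:3]
    )

focal, cx, cy = intrinsics_from_fov(512, 256, 90.0)  # assumed 90 degree FOV

# A camera translated by (1, 2, 3) with no rotation.
pose = [[1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 2.0],
        [0.0, 0.0, 1.0, 3.0],
        [0.0, 0.0, 0.0, 1.0]]
centre_point = to_world(backproject(256, 128, 10.0, focal, cx, cy), pose)
```

Repeating this for every pixel, and tagging each point with its predicted class ID for colouring, gives a semantic point cloud of the kind described above.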

Semantic point cloud, sequence 0

Semantic point cloud, sequence 1

Quantitative evaluation (test set, Town 02)

Metric	Value	Notes
Pixel accuracy	83.69%	Per-pixel, averaged over the test set
mean IoU	0.3575	Averaged across the 22 classes
Average loss	0.5585	Mean of segmentation + depth loss over the test set

These are the exact numbers reported by the evaluation in demonstration.ipynb. For reference, the loss terms on the very last evaluation batch read about 0.514 for segmentation, 0.00007 for depth, and 3.01 for camera pose. The camera-pose term is tracked separately and is not part of the average loss above.
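
Pixel accuracy and mean IoU of the kind reported above are commonly computed from a confusion matrix. The sketch below is a generic formulation, not the code from demonstration.ipynb. In particular, skipping classes that appear in neither the prediction nor the ground truth when averaging IoU is an assumption.

```python
def confusion_matrix(targets, predictions, num_classes):
    """Count (true class, predicted class) pairs over flat lists of class IDs."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(targets, predictions):
        matrix[t][p] += 1
    return matrix

def pixel_accuracy(matrix):
    """Correctly labelled pixels divided by all pixels."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def mean_iou(matrix):
    """Average of per-class intersection over union."""
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(matrix[r][c] for r in range(n)) - tp
        union = tp + fp + fn
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(tp / union)
    return sum(ious) / len(ious)

# Toy example with 3 classes and 6 pixels.
targets = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(targets, predictions, num_classes=3)
```

In this toy case four of six pixels are correct, and the per-class IoUs are 1/3, 2/3 and 1/2, so the mean IoU is 0.5. A gap between accuracy and mean IoU, as in the table, is typical when a few large classes dominate the pixel count.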

A note on honesty: segmentation is the strongest part of this system. The depth branch was the weakest, so it is best treated as a rough qualitative signal rather than a precise depth estimator.

Training

Four-stage curriculum

The tasks are trained one stage at a time rather than all at once. A single-frame backbone is trained first, then reused and frozen so the later stages learn on top of stable features. The values below are read straight from the training scripts in the repo.

Stage	Model	Epochs	Batch	LR	Loss
1 · Segmentation	DeepLabv3-ResNet50	20	8	3e-4	CrossEntropy
2 · Depth	ResNet50 + DepthDecoder	20	8	3e-4	MSE
3 · Ego-motion	EgoMotionFilter (frozen geometry features)	20	4	3e-4	L1
4 · Sequence	GeometryFilter (ego-motion loaded, frozen)	10	2	3e-4	CrossEntropy + depth L1 (÷6000) + pose MSE
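
Read literally, the stage-4 loss sums a cross-entropy term, a depth L1 term divided by 6000, and a pose MSE term. The sketch below spells that out on tiny inputs. How the project reduces each term (mean over pixels, mean over pose entries) is an assumption, and the function names are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one pixel: -log softmax(logits)[target]."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]

def l1(pred, target):
    """Mean absolute error over paired values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def mse(pred, target):
    """Mean squared error over paired values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def sequence_loss(seg_logits, seg_targets, depth_pred, depth_target,
                  pose_pred, pose_target, depth_scale=6000.0):
    """CrossEntropy + depth L1 / depth_scale + pose MSE (each term averaged)."""
    seg = sum(
        cross_entropy(pixel_logits, t)
        for pixel_logits, t in zip(seg_logits, seg_targets)
    ) / len(seg_targets)
    depth = l1(depth_pred, depth_target) / depth_scale
    pose = mse(pose_pred, pose_target)
    return seg + depth + pose

# One pixel with two classes, one depth value, two pose entries.
total = sequence_loss(
    seg_logits=[[0.0, 0.0]], seg_targets=[0],
    depth_pred=[6000.0], depth_target=[0.0],
    pose_pred=[1.0, 0.0], pose_target=[0.0, 0.0],
)
```

Dividing the depth term by a large constant keeps it from swamping the segmentation term when depth errors are expressed in large raw units.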

Optimiser & schedule

Adam throughout, with ReduceLROnPlateau (factor 0.1). In the sequence stage the optimiser updates only the geometry filter (its depth decoder and ConvGRU); the backbone, the segmentation head, and the ego-motion filter are all kept fixed, with the ego-motion filter loaded from its own earlier training.
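
The effect of a plateau scheduler with factor 0.1 can be seen in a few lines. This is a simplified stand-in for illustration, not PyTorch's ReduceLROnPlateau. The patience of 2 epochs and the validation losses are made up, since the page gives only the factor.

```python
class PlateauScheduler:
    """Multiply the learning rate by `factor` once the loss stops improving."""

    def __init__(self, lr, factor=0.1, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience  # assumed value, not from the source
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

scheduler = PlateauScheduler(lr=3e-4)
made_up_losses = [0.9, 0.7, 0.71, 0.72, 0.73, 0.6]
history = [scheduler.step(loss) for loss in made_up_losses]
```

The rate stays at 3e-4 for the first four epochs and drops to roughly 3e-5 on the fifth, after three epochs in a row without improvement.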

Temporal augmentations

Gaussian noise (σ=0.05), ColorJitter, simulated lighting shifts across a clip, and random occluding clutter (p=0.25). Each one is applied consistently across a whole sequence, so the model learns to be robust without losing its sense of time.
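
Keeping an augmentation consistent across a clip comes down to sampling its random parameters once per sequence and reusing them for every frame. The sketch below does this for a brightness shift and an occluding box on toy grayscale frames. The parameter ranges, box size, and function names are illustrative assumptions; only the occlusion probability of 0.25 is taken from the text.

```python
import random

def sample_clip_params(height, width, rng, occlusion_p=0.25):
    """Draw augmentation parameters once for a whole clip."""
    params = {"brightness": rng.uniform(0.8, 1.2), "box": None}  # assumed range
    if rng.random() < occlusion_p:
        box_h, box_w = height // 2, width // 2                   # assumed size
        top = rng.randrange(height - box_h + 1)
        left = rng.randrange(width - box_w + 1)
        params["box"] = (top, left, box_h, box_w)
    return params

def augment_frame(frame, params):
    """Apply the shared parameters to one frame (2D list of floats in [0, 1])."""
    out = [[min(1.0, px * params["brightness"]) for px in row] for row in frame]
    if params["box"] is not None:
        top, left, box_h, box_w = params["box"]
        for r in range(top, top + box_h):
            for c in range(left, left + box_w):
                out[r][c] = 0.0
    return out

def augment_clip(frames, rng):
    """Sample once, then apply the same parameters to every frame."""
    params = sample_clip_params(len(frames[0]), len(frames[0][0]), rng)
    return [augment_frame(f, params) for f in frames], params

rng = random.Random(0)
clip = [[[0.5] * 4 for _ in range(4)] for _ in range(6)]  # 6 identical 4x4 frames
augmented, used = augment_clip(clip, rng)
```

Because the clip's frames start out identical and share one set of parameters, the augmented frames are identical too; sampling per frame instead would introduce exactly the kind of artificial flicker the temporal model is meant to ignore.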

Dataset

CARLA-simulated driving sequences

Synchronised RGB, semantic segmentation, depth, and per-frame camera extrinsics captured from the CARLA 0.9.16 simulator. Each raw clip holds 20 frames; the temporal model is trained on 6-frame windows from those clips.
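
Cutting 6-frame training windows out of a 20-frame clip can be sketched as a sliding window. The stride is not stated on the page; a stride of 1 is assumed here, which yields 15 windows per clip. The function name is hypothetical.

```python
def sliding_windows(num_frames, window, stride=1):
    """Return the frame indices of every window that fits inside the clip."""
    return [
        list(range(start, start + window))
        for start in range(0, num_frames - window + 1, stride)
    ]

windows = sliding_windows(num_frames=20, window=6)            # assumed stride of 1
disjoint = sliding_windows(num_frames=20, window=6, stride=6)
```

With a stride of 6 the same clip gives only three non-overlapping windows and leaves the last two frames unused, so the choice of stride trades the number of training samples against overlap between them.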

Simulator	CARLA 0.9.16 (Unreal Engine 4.26)
Frames per raw clip	20
Frames used by temporal model	6
Training resolution	256 × 512
Semantic classes	22
Modalities	RGB · segmentation · depth · extrinsics

Split	Towns
Train	Town 1, 3, 4, 5, 6, 7
Validation	Town 10
Test	Town 2

22 semantic classes

ID	Class
0	Unlabeled
1	Building
2	Fence
3	Other
4	Pedestrian
5	Pole
6	Road Line
7	Road
8	Sidewalk
9	Vegetation
10	Vehicles
11	Wall
12	Traffic Sign
13	Sky
14	Ground
15	Bridge
16	Rail Track
17	Guard Rail
18	Traffic Light
19	Static
20	Dynamic
21	Water
