CARLA · Temporal Perception · PyTorch

Video Segmentation & Understanding of Automotive Scenes

A temporal deep-learning pipeline that reads short driving-video clips and predicts three things at once: semantic segmentation, monocular depth, and camera ego-motion. A ConvGRU geometry filter gives it a memory, so its predictions stay steady from one frame to the next.

Pixel Accuracy: 83.69%
mean IoU (22 classes): 0.358
Pipeline Parameters: 313.9M
3-in-1: Seg · Depth · Ego-motion

Overview

One pipeline, three tasks, temporal memory

Models that look at one frame at a time tend to flicker, where the same pixel jumps between labels from one moment to the next. This project treats a driving clip as a sequence instead. It carries a learned geometry state forward through the clip so segmentation and depth stay steady over time, and a separate filter works out how the camera itself moved.

Task 1

Semantic Segmentation

Per-pixel labelling into 22 CARLA classes (road, vehicles, pedestrians, buildings, vegetation…) using a DeepLabv3-ResNet50 backbone.

Task 2

Monocular Depth

Per-pixel metric depth from a single RGB frame via a ResNet50 encoder and a custom progressive-upsampling decoder.

Task 3

Camera Ego-Motion

How the camera itself moved between two frames, written as a 4×4 transform. The network regresses 3 translation and 3 rotation values from motion features.
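
To illustrate how six regressed values can become a 4×4 transform, here is a minimal sketch. The page does not say which rotation parameterisation the network uses, so this example assumes an axis-angle vector converted with Rodrigues' formula; the function name `pose_vector_to_matrix` is hypothetical.

```python
import math

def pose_vector_to_matrix(tx, ty, tz, rx, ry, rz):
    """Build a 4x4 rigid transform from 3 translation + 3 rotation values.

    Assumption: (rx, ry, rz) is an axis-angle vector whose length is the
    rotation angle in radians. The project may use a different convention.
    """
    theta = math.sqrt(rx * rx + ry * ry + rz * rz)
    if theta < 1e-12:
        # No rotation: use the identity matrix.
        R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    else:
        kx, ky, kz = rx / theta, ry / theta, rz / theta
        c, s = math.cos(theta), math.sin(theta)
        v = 1.0 - c
        # Rodrigues' formula: R = c*I + (1 - c)*k*k^T + s*[k]x
        R = [
            [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
            [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
            [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
        ]
    return [
        R[0] + [tx],
        R[1] + [ty],
        R[2] + [tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

# Example: move 1 unit along x while turning 90 degrees about the z axis.
T = pose_vector_to_matrix(1.0, 0.0, 0.0, 0.0, 0.0, math.pi / 2)
```

The bottom row is always `[0, 0, 0, 1]`, which is what lets consecutive frame-to-frame transforms be chained by plain matrix multiplication.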

Architecture

How the system fits together

A GeometryFilter carries a ConvGRU memory at the network bottleneck (2048 channels, 32×64 spatial) to produce temporally consistent segmentation and depth. Its embeddings feed an EgoMotionFilter that regresses the camera pose between frames.
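
As a rough illustration of what a ConvGRU memory does, the sketch below implements a generic convolutional GRU cell in PyTorch and runs it over a 6-frame clip of bottleneck features. It is not the project's code: the class name `ConvGRUCell`, the 3×3 kernel, and the tiny channel count are assumptions chosen so the example runs quickly.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Generic convolutional GRU cell: GRU gates computed with 2D convolutions."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # Update (z) and reset (r) gates share one convolution over [x, h].
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=padding)
        # The candidate state is computed from [x, r * h].
        self.candidate = nn.Conv2d(2 * channels, channels, kernel_size, padding=padding)

    def forward(self, x, h=None):
        if h is None:
            # Start of a clip: begin with an empty memory.
            h = torch.zeros_like(x)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        # Blend the old state and the candidate, per pixel and per channel.
        return (1 - z) * h + z * h_tilde

# Toy sizes for speed; the page quotes 2048 channels at 32x64 resolution.
cell = ConvGRUCell(channels=8)
clip = torch.randn(6, 1, 8, 4, 8)  # (frames, batch, channels, height, width)
state = None
for frame_features in clip:
    state = cell(frame_features, state)  # the state is carried to the next frame
```

Because the state has the same shape as the bottleneck features, the segmentation and depth heads could read from it in place of the single-frame features.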

Input RGB sequence · 6 frames · 256×512
GeometryFilter (273.4M params)
ResNet-50 backbone → ConvGRU (2048 ch) → Segmentation head (22 classes) + Depth decoder (1 ch)
▼  embeddings
EgoMotionFilter (40.5M params)
MotionEstimator (4096→128) → GRU (128) → CameraHead → 4×4 transform matrix
Model | Architecture | Parameters | Output
SegmentationModel | DeepLabv3-ResNet50 | 39.6M | 22-class logits
DepthEstimationModel | ResNet50 + DepthDecoder | 30.7M | 1-channel depth
GeometryFilter | DeepLabv3 + ConvGRU + dual heads | 273.4M | seg + depth
EgoMotionFilter | MotionEstimator + GRU + CameraHead | 40.5M | 4×4 pose matrix
Full pipeline | GeometryFilter + EgoMotionFilter | 313.9M | seg + depth + pose
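
The parameter counts can be sanity-checked with some arithmetic. The sketch below estimates the size of a 2048-channel ConvGRU under the assumption of three 3×3 convolutions (update, reset, candidate), each mapping the concatenated input and state (4096 channels) to 2048 channels, with biases. The kernel size and gate layout are assumptions, not values taken from the project.

```python
def conv2d_params(in_ch, out_ch, kernel=3, bias=True):
    """Number of weights (and optional biases) in a 2D convolution."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)

CHANNELS = 2048

# Assumed layout: update gate, reset gate, candidate state.
convgru = 3 * conv2d_params(2 * CHANNELS, CHANNELS)
print(f"ConvGRU estimate: {convgru / 1e6:.1f}M")  # 226.5M

# Figures quoted in the table, in millions of parameters.
geometry_filter_m = 273.4
ego_motion_m = 40.5
print(f"Pipeline total: {geometry_filter_m + ego_motion_m:.1f}M")  # 313.9M
print(f"Left for backbone + heads: {geometry_filter_m - convgru / 1e6:.1f}M")  # 46.9M
```

Under these assumptions the recurrent memory accounts for most of the GeometryFilter, and the remainder is in the same range as the 39.6M segmentation model plus a decoder.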

Results

Temporal consistency in action

The biggest win is stability over time. The naïve frame-by-frame baseline flickers; adding the ConvGRU geometry filter and then fine-tuning it gives noticeably steadier segmentation across the whole clip.
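
One simple way to put a number on flicker is to count how many pixels change label between consecutive frames. The sketch below is illustrative only: the `flicker_rate` helper is hypothetical and is not a metric reported by this project.

```python
def flicker_rate(label_frames):
    """Fraction of pixels whose label changes between consecutive frames.

    label_frames: list of frames, each a list of rows of integer class ids.
    """
    changed = total = 0
    for prev, curr in zip(label_frames, label_frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                changed += a != b
                total += 1
    return changed / total if total else 0.0

# Three 2x2 frames of road (7) and sidewalk (8) labels.
steady = [[[7, 7], [8, 8]], [[7, 7], [8, 8]], [[7, 7], [8, 8]]]
flickery = [[[7, 7], [8, 8]], [[7, 8], [8, 8]], [[7, 7], [8, 8]]]

print(flicker_rate(steady))    # 0.0
print(flicker_rate(flickery))  # 0.25
```

Labels also change when the scene genuinely moves, so a number like this is only meaningful when comparing two models on the same clip.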

Segmentation, sequences 0 to 4

Each of the five sequences is shown as three clips:
Naïve baseline: per-frame, no temporal memory
Fine-tuned: temporal model, fine-tuned
End-to-end: full temporal pipeline

Depth target maps

These show the depth that the model was trained to recover, with closer surfaces shown lighter and far-away ones darker. The model's own depth head turned out to be the weakest part of the system, so we show the target maps here and let the point clouds below carry the depth story.

Depth target, sequence 0
Depth target, sequence 1

3D point-cloud reconstruction

Depth lifted into 3D and coloured by predicted semantic class, then placed in the scene using the estimated camera poses.
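
The lifting step can be sketched with a pinhole camera model. This is a minimal illustration, not the project's code: it assumes a 90° horizontal field of view, a camera frame with x right, y down and z forward, and a camera-to-world 4×4 pose. CARLA's own coordinate system differs, so a real pipeline would need an extra axis conversion. The function names are hypothetical.

```python
import math

def intrinsics_from_fov(width, height, fov_deg):
    """Pinhole focal length and principal point from a horizontal field of view."""
    f = width / (2.0 * math.tan(math.radians(fov_deg) / 2.0))
    return f, f, width / 2.0, height / 2.0

def backproject(depth, labels, pose, fx, fy, cx, cy):
    """Lift a depth map to world-space points tagged with class ids.

    depth, labels: lists of rows; pose: 4x4 camera-to-world matrix.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            # Pixel (u, v) at depth z becomes a point in the camera frame.
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Apply the estimated camera pose to place it in the scene.
            world = tuple(sum(pose[r][c] * cam[c] for c in range(4)) for r in range(3))
            points.append((world, labels[v][u]))
    return points

fx, fy, cx, cy = intrinsics_from_fov(512, 256, 90.0)
identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
depth = [[10.0, 10.0], [10.0, 10.0]]  # toy 2x2 depth map
labels = [[7, 7], [1, 1]]             # road, road, building, building
cloud = backproject(depth, labels, identity, fx, fy, cx, cy)
```

Each point keeps its class id, so colouring the cloud is a lookup from class id to colour.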

Semantic point cloud, sequence 0
Semantic point cloud, sequence 1

Quantitative evaluation (test set, Town 02)

Metric | Value | Notes
Pixel accuracy | 83.69% | Per-pixel, averaged over the test set
mean IoU | 0.3575 | Averaged across the 22 classes
Average loss | 0.5585 | Mean of segmentation + depth loss over the test set

These are the exact numbers reported by the evaluation in demonstration.ipynb. For reference, the loss terms on the very last evaluation batch read about 0.514 for segmentation, 0.00007 for depth, and 3.01 for camera pose. The camera-pose term is tracked separately and is not part of the average loss above.
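
For readers unfamiliar with the two segmentation metrics, the sketch below computes pixel accuracy and mean IoU from a confusion matrix on a toy example. It shows the standard definitions, not the notebook's code; in particular, skipping classes that appear in neither prediction nor target is an assumption about how absent classes are handled.

```python
def confusion_matrix(pred, target, num_classes):
    """cm[t][p] counts pixels with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        cm[t][p] += 1
    return cm

def pixel_accuracy(cm):
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def mean_iou(cm):
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(len(cm))) - tp
        fn = sum(cm[c]) - tp
        denom = tp + fp + fn
        if denom:  # assumption: skip classes absent from both pred and target
            ious.append(tp / denom)
    return sum(ious) / len(ious)

# Six flattened pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(pred, target, num_classes=3)
acc = pixel_accuracy(cm)  # 5 of 6 pixels correct
miou = mean_iou(cm)       # mean of per-class IoUs 1, 1/2 and 2/3
```

Pixel accuracy is dominated by frequent classes, while mean IoU weights every class equally, which is one reason the two figures above can sit far apart.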

A note on honesty: segmentation is the strongest part of this system. The depth branch was the weakest, so it is best treated as a rough qualitative signal rather than a precise depth estimator.

Training

Four-stage curriculum

The tasks are trained one stage at a time rather than all at once. A single-frame backbone is trained first, then reused and frozen so the later stages learn on top of stable features. The values below are read straight from the training scripts in the repo.

Stage | Model | Epochs | Batch | LR | Loss
1 · Segmentation | DeepLabv3-ResNet50 | 20 | 8 | 3e-4 | CrossEntropy
2 · Depth | ResNet50 + DepthDecoder | 20 | 8 | 3e-4 | MSE
3 · Ego-motion | EgoMotionFilter (frozen geometry features) | 20 | 4 | 3e-4 | L1
4 · Sequence | GeometryFilter (ego-motion loaded, frozen) | 10 | 2 | 3e-4 | CrossEntropy + depth L1 (÷6000) + pose MSE
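
The stage 4 loss combines three terms. The sketch below shows one way to read the table entry, assuming the ÷6000 applies only to the depth L1 term and that the segmentation and pose terms are unweighted. The function name and the input values are illustrative, not taken from the training scripts.

```python
def sequence_loss(seg_ce, depth_l1, pose_mse, depth_divisor=6000.0):
    """Combine per-batch loss terms as CrossEntropy + depth L1 / 6000 + pose MSE.

    Assumption: only the depth term is rescaled; the other two are used as-is.
    """
    depth_term = depth_l1 / depth_divisor
    total = seg_ce + depth_term + pose_mse
    return total, {"seg": seg_ce, "depth": depth_term, "pose": pose_mse}

# Made-up per-batch values, just to show the scale of each term.
total, parts = sequence_loss(seg_ce=0.5, depth_l1=3.0, pose_mse=0.1)
```

Dividing the depth term by a large constant keeps it from dominating the sum when raw depth errors are numerically much larger than the cross-entropy.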

Optimiser & schedule

Adam throughout, with ReduceLROnPlateau (factor 0.1). In the sequence stage the optimiser updates only the geometry filter (its depth decoder and ConvGRU); the backbone, the segmentation head, and the ego-motion filter are all kept fixed, with the ego-motion filter loaded from its own earlier training.
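
The freezing scheme for the sequence stage can be sketched in PyTorch as follows. The modules here are tiny stand-ins with hypothetical names, and the `patience` value is an assumption; only Adam, the 3e-4 learning rate and the 0.1 factor come from the description above.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the real modules; names and shapes are hypothetical.
model = nn.ModuleDict({
    "backbone": nn.Conv2d(3, 8, 3, padding=1),
    "seg_head": nn.Conv2d(8, 22, 1),
    "conv_gru": nn.Conv2d(16, 8, 3, padding=1),
    "depth_decoder": nn.Conv2d(8, 1, 1),
    "ego_motion": nn.Linear(16, 6),
})

# Freeze everything, then re-enable only what the sequence stage updates.
for p in model.parameters():
    p.requires_grad = False
for name in ("conv_gru", "depth_decoder"):
    for p in model[name].parameters():
        p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=3e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=2  # patience is an assumption
)

# A validation loss that stops improving triggers a 10x learning-rate cut.
for val_loss in [1.0, 0.9, 0.9, 0.9, 0.9]:
    scheduler.step(val_loss)
current_lr = optimizer.param_groups[0]["lr"]
```

Passing only the trainable parameters to the optimiser means the frozen modules carry no optimiser state.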

Temporal augmentations

Gaussian noise (σ=0.05), ColorJitter, simulated lighting shifts across a clip, and random occluding clutter (p=0.25). Each one is applied consistently across a whole sequence, so the model learns to be robust without losing its sense of time.
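
One way to keep augmentations consistent across a clip is to sample their parameters once per clip and reuse them for every frame. The sketch below does this for a brightness shift and an occluding box, using plain Python lists as stand-in images. The brightness range, the box size and the function names are assumptions; only σ=0.05 and p=0.25 come from the text above.

```python
import random

def sample_clip_params(rng, height, width, occlusion_p=0.25):
    """Draw augmentation parameters once for a whole clip."""
    params = {"brightness": rng.uniform(-0.2, 0.2), "noise_sigma": 0.05, "box": None}
    if rng.random() < occlusion_p:
        h, w = max(1, height // 4), max(1, width // 4)
        top = rng.randrange(height - h + 1)
        left = rng.randrange(width - w + 1)
        params["box"] = (top, left, h, w)
    return params

def augment_clip(clip, rng, occlusion_p=0.25):
    """clip: list of frames, each a list of rows of floats in [0, 1]."""
    height, width = len(clip[0]), len(clip[0][0])
    params = sample_clip_params(rng, height, width, occlusion_p)
    out = []
    for frame in clip:
        new = []
        for row in frame:
            # Same brightness shift and noise level on every frame of the clip.
            noisy = [px + params["brightness"] + rng.gauss(0.0, params["noise_sigma"])
                     for px in row]
            new.append([min(1.0, max(0.0, px)) for px in noisy])
        if params["box"] is not None:
            # The occluder covers the same pixels in every frame.
            top, left, h, w = params["box"]
            for r in range(top, top + h):
                for c in range(left, left + w):
                    new[r][c] = 0.0
        out.append(new)
    return out, params

clip = [[[0.5] * 16 for _ in range(8)] for _ in range(6)]  # 6 grey 8x16 frames
augmented, used = augment_clip(clip, random.Random(0))
```

If the parameters were sampled per frame instead, the augmentation itself would introduce the frame-to-frame flicker the temporal model is meant to suppress.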

Dataset

CARLA-simulated driving sequences

Synchronised RGB, semantic segmentation, depth, and per-frame camera extrinsics captured from the CARLA 0.9.16 simulator. Each raw clip holds 20 frames; the temporal model is trained on 6-frame windows from those clips.
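
The page does not say how the 6-frame windows are picked from each 20-frame clip. As one plausible reading, the sketch below enumerates sliding windows of frame indices; the stride is an assumption and the helper name is hypothetical.

```python
def clip_windows(num_frames=20, window=6, stride=1):
    """Frame-index windows of a fixed length taken from one raw clip."""
    last_start = num_frames - window
    return [list(range(s, s + window)) for s in range(0, last_start + 1, stride)]

dense = clip_windows()             # overlapping windows, stride 1
sparse = clip_windows(stride=6)    # non-overlapping windows

print(len(dense), dense[0], dense[-1])
# 15 [0, 1, 2, 3, 4, 5] [14, 15, 16, 17, 18, 19]
print(len(sparse))
# 3
```

With non-overlapping windows the last two frames of a 20-frame clip go unused, while a stride of 1 yields 15 training windows per clip at the cost of heavy overlap.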

Simulator | CARLA 0.9.16 (Unreal Engine 4.26)
Frames per raw clip | 20
Frames used by temporal model | 6
Training resolution | 256 × 512
Semantic classes | 22
Modalities | RGB · segmentation · depth · extrinsics

Split | Towns
Train | Town 1, 3, 4, 5, 6, 7
Validation | Town 10
Test | Town 2

22 semantic classes

0 Unlabeled
1 Building
2 Fence
3 Other
4 Pedestrian
5 Pole
6 Road Line
7 Road
8 Sidewalk
9 Vegetation
10 Vehicles
11 Wall
12 Traffic Sign
13 Sky
14 Ground
15 Bridge
16 Rail Track
17 Guard Rail
18 Traffic Light
19 Static
20 Dynamic
21 Water

Compute

Training environment

The models were trained on a shared academic GPU cluster. The exact hardware (GPU model, CPU, memory) was never recorded in the project, so we don't claim any specific specs here. Training runs on CUDA, and the batch sizes above reflect how much memory the 273M-parameter geometry filter needs.

Explore

Dig deeper

The full technical walkthrough, the reference paper this project was built from, and the complete source are all available.