A temporal deep-learning pipeline that reads short driving-video clips and predicts three things at once: semantic segmentation, monocular depth, and camera ego-motion. A ConvGRU geometry filter gives it a memory, so its predictions stay steady from one frame to the next.
Models that look at one frame at a time tend to flicker: the same pixel jumps between labels from one frame to the next. This project instead treats a driving clip as a sequence. It carries a learned geometry state forward through the clip so that segmentation and depth stay steady over time, while a separate filter estimates how the camera itself moved.
Semantic segmentation: per-pixel labelling into 22 CARLA classes (road, vehicles, pedestrians, buildings, vegetation…) using a DeepLabv3-ResNet50 backbone.
Depth: per-pixel metric depth from a single RGB frame via a ResNet50 encoder and a custom progressive-upsampling decoder.
Ego-motion: how the camera itself moved between two frames, written as a 4×4 transform. The network regresses three translation and three rotation values from motion features.
A GeometryFilter carries a ConvGRU memory at the network bottleneck (2048 channels, 32×64 spatial) to produce temporally-consistent segmentation and depth. Its embeddings feed an EgoMotionFilter that regresses the camera pose between frames.
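The memory update that a ConvGRU performs can be sketched without any framework. The snippet below is a minimal illustration, not the project's GeometryFilter: it applies the standard GRU gate equations element by element with scalar weights, where a real ConvGRU would compute each gate with a convolution over the bottleneck feature map. The function name `gru_step` and the weight names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(state, features, w):
    """Blend the previous state with the current frame's features, element by element.

    `w` holds scalar weights (hypothetical names); a ConvGRU would use conv kernels instead.
    """
    new_state = []
    for h, x in zip(state, features):
        z = sigmoid(w["wz"] * x + w["uz"] * h + w["bz"])   # update gate: how much to overwrite
        r = sigmoid(w["wr"] * x + w["ur"] * h + w["br"])   # reset gate: how much history the candidate sees
        cand = math.tanh(w["wh"] * x + w["uh"] * (r * h))  # candidate state
        new_state.append((1.0 - z) * h + z * cand)         # convex blend of old state and candidate
    return new_state

# Toy weights and two toy "frames" of three features each (illustrative values only).
weights = {"wz": 1.0, "uz": 0.5, "bz": 0.0,
           "wr": 1.0, "ur": 0.5, "br": 0.0,
           "wh": 1.0, "uh": 1.0}
state = [0.0, 0.0, 0.0]
for frame_features in ([0.2, -0.1, 0.4], [0.3, -0.2, 0.5]):
    state = gru_step(state, frame_features, weights)
```

When the update gate is close to zero the previous state passes through almost unchanged, which is the mechanism that lets a recurrent state damp frame-to-frame flicker.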
| Model | Architecture | Parameters | Output |
|---|---|---|---|
| SegmentationModel | DeepLabv3-ResNet50 | 39.6M | 22-class logits |
| DepthEstimationModel | ResNet50 + DepthDecoder | 30.7M | 1-channel depth |
| GeometryFilter | DeepLabv3 + ConvGRU + dual heads | 273.4M | seg + depth |
| EgoMotionFilter | MotionEstimator + GRU + CameraHead | 40.5M | 4×4 pose matrix |
| Full pipeline | GeometryFilter + EgoMotionFilter | 313.9M | seg + depth + pose |
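The 4×4 pose output listed for the EgoMotionFilter can be assembled from the six regressed values. A minimal sketch follows, assuming the three rotation values are Euler angles in radians composed as Rz·Ry·Rx. The source does not state the rotation parameterisation, so that convention and the names `pose_to_matrix` and `compose` are assumptions.

```python
import math

def pose_to_matrix(tx, ty, tz, rx, ry, rz):
    """Build a 4x4 homogeneous transform from 3 translations and 3 Euler angles (radians).

    Assumed convention: R = Rz(rz) @ Ry(ry) @ Rx(rx).
    """
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx, tx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx, ty],
        [-sy,     cy * sx,                cy * cx,                tz],
        [0.0,     0.0,                    0.0,                    1.0],
    ]

def compose(a, b):
    """Matrix product of two 4x4 transforms, used to chain frame-to-frame poses."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# Chain two relative motions: 1 unit forward along x, then 1 more with a 90-degree yaw.
step_a = pose_to_matrix(1.0, 0.0, 0.0, 0.0, 0.0, 0.0)
step_b = pose_to_matrix(1.0, 0.0, 0.0, 0.0, 0.0, math.pi / 2)
accumulated = compose(step_a, step_b)
```

Chaining relative transforms this way is how per-frame motion estimates are usually accumulated into a trajectory over a clip.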
The biggest win is stability over time. The naïve frame-by-frame baseline flickers; adding the ConvGRU geometry filter and then fine-tuning it gives noticeably steadier segmentation across the whole clip.
These maps show the target depth that the model was trained to recover, with closer surfaces lighter and distant ones darker. The model's own depth head turned out to be the weakest part of the system, so we show the target maps here and let the point clouds below carry the depth story.


Depth lifted into 3D and coloured by predicted semantic class, then placed in the scene using the estimated camera poses.
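Lifting a depth map into a posed point cloud follows the pinhole camera model. The sketch below assumes depth is the distance along the optical axis, uses hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and treats the pose as a 4×4 camera-to-world matrix. None of these conventions is confirmed by the source.

```python
def backproject(depth, labels, fx, fy, cx, cy, cam_to_world):
    """Turn an HxW depth map into a list of (x, y, z, class_id) world-space points."""
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            # Pinhole model: pixel (u, v) at depth d -> point in the camera frame.
            xc = (u - cx) * d / fx
            yc = (v - cy) * d / fy
            zc = d
            # Apply the rotation and translation stored in the top three rows of the pose.
            xw, yw, zw = (
                r[0] * xc + r[1] * yc + r[2] * zc + r[3] for r in cam_to_world[:3]
            )
            points.append((xw, yw, zw, labels[v][u]))
    return points

# Toy 2x2 example with unit focal length and an identity pose (illustrative values only).
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
depth = [[2.0, 2.0], [4.0, 4.0]]
labels = [[7, 7], [1, 1]]
cloud = backproject(depth, labels, fx=1.0, fy=1.0, cx=0.0, cy=0.0, cam_to_world=identity)
```

Each point keeps its predicted class id, so a viewer can colour the cloud by semantic class.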


| Metric | Value | Notes |
|---|---|---|
| Pixel accuracy | 83.69% | Per-pixel, averaged over the test set |
| mean IoU | 0.3575 | Averaged across the 22 classes |
| Average loss | 0.5585 | Mean of segmentation + depth loss over the test set |
These are the exact numbers reported by the evaluation in demonstration.ipynb. For reference, the loss terms on the very last evaluation batch read about 0.514 for segmentation, 0.00007 for depth, and 3.01 for camera pose. The camera-pose term is tracked separately and is not part of the average loss above.
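Pixel accuracy and mean IoU of the kind reported above can be computed from a confusion matrix. This is a minimal sketch, not the code in demonstration.ipynb. It assumes that classes appearing in neither the prediction nor the target are skipped when averaging, which the source does not specify.

```python
def segmentation_metrics(pred, target, num_classes):
    """Return (pixel_accuracy, mean_iou) for flat lists of predicted and target class ids."""
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        conf[t][p] += 1  # rows: target class, columns: predicted class

    total = sum(sum(row) for row in conf)
    correct = sum(conf[c][c] for c in range(num_classes))

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fp = sum(conf[t][c] for t in range(num_classes)) - tp
        fn = sum(conf[c]) - tp
        union = tp + fp + fn
        if union > 0:  # assumption: ignore classes absent from both pred and target
            ious.append(tp / union)
    return correct / total, sum(ious) / len(ious)

# Six toy pixels over three of the 22 classes.
pred   = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
accuracy, miou = segmentation_metrics(pred, target, num_classes=22)
```

In this toy case four of six pixels are correct and the per-class IoUs are 1/3, 2/3 and 1/2, so the accuracy is about 0.667 while the mean IoU is 0.5. Mean IoU penalises errors on small classes more heavily, which is consistent with a gap between the two figures in the table.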
The tasks are trained one stage at a time rather than all at once. A single-frame backbone is trained first, then reused and frozen so the later stages learn on top of stable features. The values below are read straight from the training scripts in the repo.
| Stage | Model | Epochs | Batch | LR | Loss |
|---|---|---|---|---|---|
| 1 · Segmentation | DeepLabv3-ResNet50 | 20 | 8 | 3e-4 | CrossEntropy |
| 2 · Depth | ResNet50 + DepthDecoder | 20 | 8 | 3e-4 | MSE |
| 3 · Ego-motion | EgoMotionFilter (frozen geometry features) | 20 | 4 | 3e-4 | L1 |
| 4 · Sequence | GeometryFilter (ego-motion loaded, frozen) | 10 | 2 | 3e-4 | CrossEntropy + depth L1 (÷6000) + pose MSE |
All stages use Adam with a ReduceLROnPlateau scheduler (factor 0.1). In the sequence stage the optimiser updates only the geometry filter (its depth decoder and ConvGRU); the backbone, the segmentation head, and the ego-motion filter are all kept fixed, with the ego-motion filter loaded from its own earlier training.
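The stage-4 objective in the table combines three terms. A framework-free sketch of that sum is below. It assumes the ÷6000 factor scales only the depth L1 term and that the three terms are added with no other weights, which is how the table reads but is not confirmed here. The helper names are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one pixel: -log softmax(logits)[target], computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]

def l1(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def sequence_loss(seg_logits, seg_target, depth_pred, depth_target,
                  pose_pred, pose_target, depth_scale=6000.0):
    """Sum of segmentation CE, scaled depth L1, and pose MSE (assumed unweighted sum)."""
    seg = sum(cross_entropy(lg, t) for lg, t in zip(seg_logits, seg_target)) / len(seg_target)
    depth = l1(depth_pred, depth_target) / depth_scale
    pose = mse(pose_pred, pose_target)
    return seg + depth + pose, {"seg": seg, "depth": depth, "pose": pose}

# Toy inputs: two pixels with two classes, two depth values, one 6-value pose.
total, parts = sequence_loss(
    seg_logits=[[2.0, 0.0], [0.0, 2.0]], seg_target=[0, 1],
    depth_pred=[10.0, 20.0], depth_target=[13.0, 17.0],
    pose_pred=[0.0] * 6, pose_target=[0.1] * 6,
)
```

Dividing the depth term by a large constant keeps it from dominating the cross-entropy term when depth is expressed in large raw units.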
Training augmentations are Gaussian noise (σ=0.05), ColorJitter, simulated lighting shifts across a clip, and random occluding clutter (p=0.25). Each one is applied consistently across a whole sequence, so the model learns to be robust without losing its sense of time.
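One way to keep augmentation consistent across a sequence is to sample the random parameters once per clip and reuse them for every frame. The sketch below illustrates this with a brightness shift and an occluding patch on toy greyscale frames. The brightness range, the patch geometry, and the function name are assumptions; only the occlusion probability p=0.25 is taken from the text.

```python
import random

def augment_clip(frames, rng, occlusion_p=0.25):
    """Apply one randomly sampled augmentation setting to every frame of a clip.

    `frames` is a list of HxW nested lists of floats in [0, 1]; inputs are not modified.
    """
    height, width = len(frames[0]), len(frames[0][0])
    # Sample once per clip, not once per frame.
    brightness = rng.uniform(-0.2, 0.2)  # assumed range
    occlude = rng.random() < occlusion_p
    top, left = rng.randrange(height), rng.randrange(width)

    out = []
    for frame in frames:
        new = [[min(1.0, max(0.0, px + brightness)) for px in row] for row in frame]
        if occlude:
            # Black out a patch of up to 2x2 pixels at the same location in every frame.
            for y in range(top, min(top + 2, height)):
                for x in range(left, min(left + 2, width)):
                    new[y][x] = 0.0
        out.append(new)
    return out

clip = [[[0.5] * 4 for _ in range(3)] for _ in range(6)]  # six identical 3x4 frames
augmented = augment_clip(clip, random.Random(0))
```

Because the same parameters hit every frame, any change the model sees between consecutive frames still comes from the scene rather than from the augmentation.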
The dataset consists of synchronised RGB, semantic segmentation, depth, and per-frame camera extrinsics captured from the CARLA 0.9.16 simulator. Each raw clip holds 20 frames; the temporal model is trained on 6-frame windows from those clips.
| Property | Value |
|---|---|
| Simulator | CARLA 0.9.16 (Unreal Engine 4.26) |
| Frames per raw clip | 20 |
| Frames used by temporal model | 6 |
| Training resolution | 256 × 512 |
| Semantic classes | 22 |
| Modalities | RGB · segmentation · depth · extrinsics |
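Cutting 6-frame training windows out of a 20-frame clip can be sketched as a sliding window. The stride is not given in the source, so `stride=1` below is an assumption, as is the function name `clip_windows`.

```python
def clip_windows(num_frames=20, window=6, stride=1):
    """Return the frame indices of each training window inside one raw clip."""
    return [
        list(range(start, start + window))
        for start in range(0, num_frames - window + 1, stride)
    ]

windows = clip_windows()              # overlapping windows, stride 1 (assumed)
disjoint = clip_windows(stride=6)     # non-overlapping alternative
```

With a stride of 1 a 20-frame clip yields 15 overlapping windows; with a stride of 6 it yields 3 disjoint ones and leaves the last two frames unused.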
| Split | Towns |
|---|---|
| Train | Town 1, 3, 4, 5, 6, 7 |
| Validation | Town 10 |
| Test | Town 2 |
The full technical walkthrough, the reference paper this project was built from, and the complete source are all available.