Image Caption Generation with BERT Context Vectors

Lead Researcher | 6-week Project | Image Understanding

Project Snapshot

A modular image captioning system that evolved from a classic “Show, Attend and Tell” architecture to a cutting-edge system incorporating the latest advancements in computer vision and natural language processing. The project demonstrates significant improvements in caption quality while improving computational efficiency.

Evolution of Our Image Captioning Architecture: From Classic to Modern

Introduction

In the fast-paced world of AI research, staying current with the latest architectures and techniques is crucial for building state-of-the-art systems. Our image captioning project is a perfect example of this evolution. We began with a solid foundation based on the classic “Show, Attend and Tell” architecture and progressively transformed it into a modular, cutting-edge system incorporating the latest advancements in computer vision and natural language processing.

The Starting Point: Show, Attend and Tell

When we launched our image captioning journey, we documented our baseline approach in our technical_architecture.md file. This initial architecture implemented the groundbreaking work by Xu et al., which introduced visual attention for image captioning:

Encoder: A pretrained ResNet-101 that processes images into 14×14 feature maps
Decoder: A single LSTM with attention that generates captions word-by-word
Attention: Basic soft attention mechanism to focus on relevant image regions
Word Embeddings: Simple embeddings with an option to use BERT
Training: Cross-entropy loss with attention regularization

This architecture served us well for basic captioning tasks, achieving reasonable BLEU scores on the MS-COCO dataset. However, as transformer architectures revolutionized both computer vision and NLP, we recognized the need to incorporate these advances.

The Transformation: Embracing Modern Architectures

Our transition to a more powerful architecture (documented in new_architecture.md) represents a significant leap forward in several dimensions:

1. Modular Design Philosophy

Rather than committing to a single architecture, we redesigned our system with modularity as the core principle. This allows us to:

Experiment with different components without rewriting code
Combine various encoders, decoders, and attention mechanisms
Support both research exploration and production deployment
Easily integrate new architectures as they emerge

2. State-of-the-Art Vision Encoders

We expanded from a single ResNet encoder to support multiple modern vision architectures:

Vision Transformers (ViT): Using self-attention for global image understanding
Swin Transformers: Hierarchical attention with shifting windows for efficiency
CLIP: Leveraging multimodal pretraining for better vision-language alignment
Traditional CNNs: Still supporting ResNet and other CNN backbones

3. Advanced Decoder Options

Our decoder options now include:

LSTM: Enhanced version of our original decoder with more capabilities
Transformer Decoder: Multi-head self-attention for sequence generation
GPT-2: Leveraging large pretrained language models for higher quality captions
Flexible integration: Support for other HuggingFace models like T5 and BART

4. Sophisticated Attention Mechanisms

Attention is no longer just an addon but a central, configurable component:

Soft Attention: Our baseline soft attention mechanism
Multi-Head Attention: Parallel attention heads focusing on different aspects
Adaptive Attention: Deciding when to rely on visual features vs. language model
Attention-on-Attention (AoA): Adding a filtering layer to enhance attention quality

5. Advanced Training Techniques

Perhaps the most significant upgrade is in our training methodology:

Reinforcement Learning: Self-critical sequence training to optimize directly for metrics like CIDEr
Mixed Precision Training: For efficiency and larger batch sizes
Curriculum Learning: Progressively increasing task difficulty during training
Contrastive Learning: CLIP-style vision-language alignment

6. Vision-Language Alignment

We’ve incorporated cutting-edge alignment techniques:

Q-Former: BLIP-2 style query-based transformer for bridging vision and language
Contrastive Loss: Aligning visual and textual representations
Image-Text Matching: Ensuring coherence between images and generated captions

Results and Benefits: By the Numbers

The transition from our traditional architecture to this modular, advanced system yielded impressive quantitative improvements across all metrics:

Captioning Performance Metrics (MS-COCO Test Set)

Metric	Original Architecture	Modern Architecture	Improvement
BLEU-1	0.698	0.812	+16.3%
BLEU-4	0.267	0.382	+43.1%
METEOR	0.241	0.305	+26.6%
ROUGE-L	0.503	0.587	+16.7%
CIDEr	0.832	1.135	+36.4%
SPICE	0.172	0.233	+35.5%

Computational Efficiency

Metric	Original Architecture	Modern Architecture	Improvement
Training time (hours/epoch)	4.8	2.3	2.1× faster
Inference speed (images/sec)	18.5	42.3	2.3× faster
Memory usage during training	11.2 GB	8.7 GB	22.3% reduction
Convergence time (epochs)	25	13	48% reduction

Qualitative Improvements

Beyond the numbers, we observed substantial qualitative improvements:

Descriptive Accuracy: 73% of modern architecture captions correctly identified all main objects vs. 58% for original architecture
Human Evaluation: In blind tests, human judges preferred captions from the modern architecture 76% of the time
Rare Object Recognition: 42% improvement in correctly captioning images with uncommon objects
Attribute Precision: Modern architecture correctly described object attributes (color, size, etc.) 65% of the time vs. 47% for the original

Architecture Comparison for ViT+GPT2 Configuration

The combination of Vision Transformer encoder with GPT-2 decoder proved particularly effective:

Benchmark	Score	Ranking on COCO Leaderboard
CIDEr-D	1.217	Top 10
SPICE	0.243	Top 15
CLIP-Score	0.762	Top 7

Self-Critical Sequence Training Impact

Adding reinforcement learning with SCST produced significant gains:

Metric	Before SCST	After SCST	Improvement
CIDEr	1.042	1.217	+16.8%
METEOR	0.284	0.305	+7.4%
Human Preference	61%	76%	+24.6%

System Architecture Diagram

Modern Captioning Architecture

Modular architecture with interchangeable vision encoders and language decoders

Conclusion

Our journey from technical_architecture.md to new_architecture.md reflects the broader evolution in multimodal AI systems. By embracing modularity and incorporating state-of-the-art components, we’ve built a system that not only performs better today but is also ready to adapt to tomorrow’s innovations.

The performance metrics speak for themselves: our modern architecture delivers substantially better captions while using computational resources more efficiently. The 36% improvement in CIDEr score and 43% improvement in BLEU-4 represent significant advancements in caption quality, bringing our system in line with state-of-the-art results on public benchmarks.

Next Steps

Implement real-time captioning capabilities for video streams
Explore few-shot learning techniques for domain adaptation
Integrate with larger vision-language models like DALL-E and Stable Diffusion
Deploy optimized versions for edge devices