How I Designed My First Vision-Language Model for Healthcare

Early 2023 is still clear in my memory. I was sitting in a meeting room with my manager at Siemens Healthineers, sketching ideas on a small whiteboard. We were talking about something exciting — how to combine the new wave of large language models with medical images.

Siemens Healthineers knows medical imaging from the inside out: CT, MRI, X-ray, ultrasound — they build the machines themselves. I had experience with LLMs and 3D CNNs. But combining them? That was something new.

We wanted to build our own multi-modal medical AI:
a Vision-Language Model specialized for healthcare.

I had no idea how hard it would be.


The Challenge: My First Real Multi-Modal Model

Until then, I had worked with:

  • LSTMs

  • Vision CNNs

  • 3D CNNs

  • LLMs

But multi-modal fusion — taking images and language and making them talk to each other — was unknown territory. A completely different level of complexity.

I felt like a junior engineer again, learning from scratch.


My First Attempts: Reading, Testing, Failing (a lot)

I started where every engineer starts: reading everything.

I dug into every early VLM I could find before 2023:

  • BLIP

  • BLIP-2

  • Flamingo

  • GIT

  • PaLI

  • OWL-ViT

  • SimVLM

I studied their architectures piece by piece.

Then I tried running open-source VLMs — especially BLIP-2 — on medical images. I asked simple questions like:

“Where is the tumor?”
“Is there a fracture?”
“What organ is this CT image showing?”

The results were… not good.
Sometimes funny, sometimes random, sometimes completely wrong.

So I started designing my own architecture:

  • Image encoder — a CNN or ViT

  • LLM — starting with BloomZ (different sizes)

  • Fusion module — to connect them; I used a simple gated cross-attention module

Early prototype architecture for multimodal VLM
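The fusion bullet above can be sketched in a few lines of PyTorch. This is a minimal, single-block illustration of Flamingo-style gated cross-attention, not the prototype itself: the dimensions, head count, and zero-initialized tanh gate are all illustrative choices.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Gated cross-attention sketch: text tokens attend to image features.

    The tanh gate is initialized at zero, so the block starts as an
    identity mapping and the pretrained LLM's behavior is untouched
    at the beginning of training.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> identity at init

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text: (batch, text_len, dim); image: (batch, num_patches, dim)
        attended, _ = self.attn(self.norm(text), image, image)
        return text + torch.tanh(self.gate) * attended

# Fuse 32 image-patch features into a sequence of 16 text tokens
fusion = GatedCrossAttention(dim=64)
text = torch.randn(2, 16, 64)
image = torch.randn(2, 32, 64)
out = fusion(text, image)
print(out.shape)  # torch.Size([2, 16, 64])
```

Because the gate starts at zero, the LLM initially ignores the image stream and only gradually learns to use it — the property that makes this kind of fusion safe to bolt onto a frozen language model.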

I implemented everything using Hugging Face libraries.
For training data, I used the MedVQA challenge datasets — clean, labeled, well-organized.

And the result?

Just okay.
Sometimes slightly better than open-source general VLMs.
Sometimes worse.

It wasn’t bad.
But it wasn’t special.

And “not special” isn’t enough for medical AI.


The Turning Point: Talking With Experts

Everything changed after a long conversation with two domain experts — one in medical imaging, one in medical LLMs.

They both told me the same thing:

“Your architecture is fine. Your data isn’t.”

Medical images are extremely domain-specific.
General datasets don’t help much.
General encoders don’t understand the shapes, textures, densities.
General LLMs don’t speak the language of radiologists.

So I focused on data + domain adaptation.

I dug deeper and found something promising:

  • 1.6 million medical image–caption pairs
    Perfect for pre-training a VLM.

I also started experimenting with better image encoders:

  • general CNNs

  • ViTs

  • medical-fine-tuned CNNs

  • medical ViTs

And I tried different LLMs — some general, some medically adapted.

Dozens of draft architectures that I tried :)

Slowly, things started to make sense.


The Final Approach

After dozens of prototypes, the best-performing architecture was:

🧩 Vision Encoder: BioMedCLIP ViT

Pre-trained on huge radiology datasets — it "sees" like a radiologist.

🧠 Language Model: RadBloomZ-7B

A specialized medical version of BloomZ.

🔗 Fusion Module: Query Transformer (Q-Former style)

Transforms vision features → LLM token space.

This became the backbone of the system.
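The Q-Former-style fusion step can be sketched roughly as follows. This is a simplified single-block version with illustrative dimensions — the real Q-Former in BLIP-2 is a multi-layer BERT-style transformer with both cross- and self-attention — but it shows the core idea: a fixed set of learnable queries compresses many patch features into a handful of "visual tokens" in the LLM's embedding space.

```python
import torch
import torch.nn as nn

class QueryTransformer(nn.Module):
    """Q-Former-style fusion sketch: learnable queries cross-attend to
    frozen vision features, then a linear layer projects the result
    into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int,
                 num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads,
                                                batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        batch = vision_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        fused, _ = self.cross_attn(q, vision_feats, vision_feats)
        return self.proj(fused)  # (batch, num_queries, llm_dim)

# 196 ViT patch features (dim 768) -> 32 visual tokens in a 4096-dim LLM space
qformer = QueryTransformer(vision_dim=768, llm_dim=4096)
visual_tokens = qformer(torch.randn(2, 196, 768))
print(visual_tokens.shape)  # torch.Size([2, 32, 4096])
```

The key design choice is that the number of visual tokens is fixed (32 here) no matter how many patches the encoder produces, which keeps the LLM's context cost constant.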


Training: The Unexpected Nightmare

Designing the architecture was hard.
Training it was harder.

RadBloomZ-7B is large.
BioMedCLIP ViT is heavy.
The fusion layer adds more parameters.

Even on A100 GPUs with 40GB memory, it didn’t fit.

I tried everything I knew:

  • mixed precision (FP16, BF16)

  • gradient checkpointing

  • LoRA

  • PyTorch DDP
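Of these, gradient checkpointing is the least invasive to wire in. A minimal pure-PyTorch sketch (toy layer sizes, not the real model): activations inside each checkpointed segment are discarded on the forward pass and recomputed during backward, trading compute for memory.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy stack of 8 blocks; only segment boundaries keep their activations,
# everything in between is recomputed during the backward pass.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(256, 256), nn.GELU()) for _ in range(8)]
)

x = torch.randn(4, 256, requires_grad=True)
out = checkpoint_sequential(model, 4, x, use_reentrant=False)  # 4 segments
out.sum().backward()
print(x.grad.shape)  # torch.Size([4, 256])
```

With 4 segments, roughly 4 activation snapshots are stored instead of 8, at the cost of one extra forward pass through each segment during backward.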

Nothing was enough. At some point I realized:
I was not fighting the model — I was fighting physics.

To understand what was happening, I revisited the classic DeepSpeed ZeRO memory model. It explains exactly why training a 7B parameter model is impossible on a single GPU with standard data parallelism.

Memory Explosion: Why 7B Doesn’t Fit on a 40GB GPU

DeepSpeed’s paper breaks down memory for standard data-parallel training:

Baseline (no ZeRO) memory usage:

Memory = (2 + 2 + K) × Ψ

Where:

  • Ψ = number of model parameters

  • 2 = FP16 model weights

  • 2 = FP16 gradients

  • K = optimizer constant (Adam → 12: an FP32 copy of the weights plus the FP32 m and v states, 4 bytes each)

So for Adam, the baseline is:

(2 + 2 + 12) = 16 bytes per parameter

Example: 7.5B parameter model

Memory = 16 × 7.5B ≈ 120 GB

That’s ~120 GB — before activations, before batch size, and before temporary buffers.

No A100 with 40GB can hold that. Not even close.

This was the moment I understood why nothing fit, no matter how much I optimized manually.
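The back-of-the-envelope arithmetic above is easy to encode. A small helper using the paper's (2 + 2 + K) × Ψ formula (the function name is mine, not from DeepSpeed):

```python
def baseline_training_memory_gb(num_params: float, optimizer_k: int = 12) -> float:
    """Per-GPU memory for standard data-parallel mixed-precision training:
    2 bytes FP16 weights + 2 bytes FP16 gradients + K bytes of optimizer
    state per parameter (K = 12 for Adam: FP32 weight copy, m, and v)."""
    bytes_per_param = 2 + 2 + optimizer_k
    return bytes_per_param * num_params / 1e9

print(baseline_training_memory_gb(7.5e9))  # 120.0
```

Plugging in 7.5B parameters reproduces the 120 GB figure — three times what a 40 GB A100 offers, before a single activation is stored.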

Memory savings and communication volume across ZeRO stages (compared to standard data-parallel). Adapted from Microsoft Research, "ZeRO: New system optimizations enable training models with over 100 billion parameters," 2020.

The figure shows exactly why pure data-parallel training breaks at this scale — and why ZeRO-1/2/3 reduce memory so aggressively.


Using ZeRO and Distributed Training to Survive

Once I accepted that baseline parallelism was impossible, I moved everything to DeepSpeed ZeRO:

  • ZeRO-1: shards optimizer states

  • ZeRO-2: shards optimizer + gradients

  • ZeRO-3: shards optimizer + gradients + model weights

With ZeRO-3, the effective memory per GPU becomes approximately:

  • with 4 GPUs: 120 GB / 4 = 30 GB

  • with 8 GPUs (sometimes I get lucky :)): 120 GB / 8 = 15 GB

Both setups finally fit inside an A100’s 40 GB, even after including activations (thanks to checkpointing + BF16 + offloading).
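For reference, a ZeRO-3 setup along these lines can be expressed as the config dict you hand to DeepSpeed — all values here are illustrative, not my exact production config:

```python
# Illustrative DeepSpeed ZeRO-3 configuration (a dict like this can be
# passed to deepspeed.initialize as the config). Values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,      # tiny batches were unavoidable
    "gradient_accumulation_steps": 16,        # recover an effective batch size
    "bf16": {"enabled": True},                # BF16 was more stable than FP16
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 3,                           # shard optimizer + grads + weights
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        "overlap_comm": True,                 # hide some communication cost
    },
}

print(ds_config["zero_optimization"]["stage"])  # 3
```

Stage 3 with CPU offloading is the most memory-aggressive combination, which is exactly why the communication and speed costs described below show up.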


Slower, But Possible

Once I switched to ZeRO-2 / ZeRO-3, the model finally ran, but:

  • batch size was tiny

  • communication overhead was heavy

  • training speed dropped

  • checkpointing took forever

  • BF16 vs FP16 stability mattered a lot

But the most important thing was this:

The model finally trained.
Not fast. Not easily.
But it trained.

And that moment felt like a breakthrough.


Results + What I Learned

The final VLM wasn’t perfect — no medical model ever is.

But it was far better than my early versions, and significantly more consistent than general-purpose VLMs on medical tasks. It proved one thing strongly:

In medical AI, data + domain expertise matter more than complex architecture.


Along the way, this work also led to something I never expected in the beginning: a paper published at NAACL ClinicalNLP 2024 (check it out here) and a patent filed together with my company. At that moment, it felt like all the experiments, failures, and late-night debugging sessions were turning into something real and lasting.

My Lessons

  • Be creative. Architecture is a space of exploration, not rules.

  • Talk to experts early. They save you months of wrong assumptions.

  • Data is everything. Especially in specialized domains.

  • Narrow your focus. Trying all ideas at once slows you down.

  • Don’t be afraid of things you’ve never built. That’s where the most growth happens. :)