How I Designed My First Vision-Language Model for Healthcare

Early 2023 is still clear in my memory. I was sitting in a meeting room with my manager at Siemens Healthineers, sketching ideas on a small whiteboard. We were talking about something exciting — how to combine the new wave of large language models with medical images.

Siemens Healthineers knows medical imaging from the inside out: CT, MRI, X-ray, ultrasound — they build the machines themselves. I had experience with LLMs and 3D CNNs. But combining them? That was something new.

We wanted to build our own multi-modal medical AI:
a Vision-Language Model specialized for healthcare.

I had no idea how hard it would be.


The Challenge: My First Real Multi-Modal Model

Until then, I had worked with:

  • LSTMs

  • Vision CNNs

  • 3D CNNs

  • LLMs

But multi-modal fusion — taking images and language and making them talk to each other — was unknown territory. A completely different level of complexity.

I felt like a junior engineer again, learning from scratch.


My First Attempts: Reading, Testing, Failing (a lot)

I started where every engineer starts: reading everything.

I dug into every early VLM I could find before 2023:

  • BLIP

  • BLIP-2

  • Flamingo

  • GIT

  • PaLI

  • OWL-ViT

  • SimVLM

I studied their architectures piece by piece.

Then I tried running open-source VLMs — especially BLIP-2 — on medical images. I asked simple questions like:

“Where is the tumor?”
“Is there a fracture?”
“What organ is this CT image showing?”

The results were… not good.
Sometimes funny, sometimes random, sometimes completely wrong.

So I started designing my own architecture:

  • Image encoder — a CNN or ViT

  • LLM — starting with BloomZ (different sizes)

  • Fusion module — to connect them; I used a simple gated cross-attention module

Early prototype architecture for multimodal VLM
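The fusion bullet above can be sketched in a few lines of PyTorch. This is a minimal, single-block illustration of Flamingo-style gated cross-attention, not the prototype itself: the dimensions, head count, and zero-initialized tanh gate are all illustrative choices.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Gated cross-attention sketch: text tokens attend to image features.

    The tanh gate is initialized at zero, so the block starts as an
    identity mapping and the pretrained LLM's behavior is untouched
    at the beginning of training.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> identity at init

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text: (batch, text_len, dim); image: (batch, num_patches, dim)
        attended, _ = self.attn(self.norm(text), image, image)
        return text + torch.tanh(self.gate) * attended

# Fuse 32 image-patch features into a sequence of 16 text tokens
fusion = GatedCrossAttention(dim=64)
text = torch.randn(2, 16, 64)
image = torch.randn(2, 32, 64)
out = fusion(text, image)
print(out.shape)  # torch.Size([2, 16, 64])
```

Because the gate starts at zero, the LLM initially ignores the image stream and only gradually learns to use it — the property that makes this kind of fusion safe to bolt onto a frozen language model.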

I implemented everything using Hugging Face libraries.
For training data, I used the MedVQA challenge datasets — clean, labeled, well-organized.

And the result?

Just okay.
Sometimes slightly better than open-source general VLMs.
Sometimes worse.

It wasn’t bad.
But it wasn’t special.

And “not special” isn’t enough for medical AI.


The Turning Point: Talking With Experts

Everything changed after a long conversation with two domain experts — one in medical imaging, one in medical LLMs.

They both told me the same thing:

“Your architecture is fine. Your data isn’t.”

Medical images are extremely domain-specific.
General datasets don’t help much.
General encoders don’t understand the shapes, textures, densities.
General LLMs don’t speak the language of radiologists.

So I focused on data + domain adaptation.

I dug deeper and found something promising:

  • 1.6 million medical image–caption pairs
    Perfect for pre-training a VLM.

I also started experimenting with better image encoders:

  • general CNNs

  • ViTs

  • medical-fine-tuned CNNs

  • medical ViTs

And I tried different LLMs — some general, some medically adapted.

Dozens of draft architectures that I tried :)

Slowly, things started to make sense.


The Final Approach

After dozens of prototypes, the best-performing architecture was:

🧩 Vision Encoder: BioMedCLIP ViT

Pre-trained on huge radiology datasets — it "sees" like a radiologist.

🧠 Language Model: RadBloomZ-7B

A specialized medical version of BloomZ.

🔗 Fusion Module: Query Transformer (Q-Former style)

Transforms vision features → LLM token space.

This became the backbone of the system.
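The Q-Former-style fusion step can be sketched roughly as follows. This is a simplified single-block version with illustrative dimensions — the real Q-Former in BLIP-2 is a multi-layer BERT-style transformer with both cross- and self-attention — but it shows the core idea: a fixed set of learnable queries compresses many patch features into a handful of "visual tokens" in the LLM's embedding space.

```python
import torch
import torch.nn as nn

class QueryTransformer(nn.Module):
    """Q-Former-style fusion sketch: learnable queries cross-attend to
    frozen vision features, then a linear layer projects the result
    into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int,
                 num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads,
                                                batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        batch = vision_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        fused, _ = self.cross_attn(q, vision_feats, vision_feats)
        return self.proj(fused)  # (batch, num_queries, llm_dim)

# 196 ViT patch features (dim 768) -> 32 visual tokens in a 4096-dim LLM space
qformer = QueryTransformer(vision_dim=768, llm_dim=4096)
visual_tokens = qformer(torch.randn(2, 196, 768))
print(visual_tokens.shape)  # torch.Size([2, 32, 4096])
```

The key design choice is that the number of visual tokens is fixed (32 here) no matter how many patches the encoder produces, which keeps the LLM's context cost constant.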


Training: The Unexpected Nightmare

Designing the architecture was hard.
Training it was harder.

RadBloomZ-7B is large.
BioMedCLIP ViT is heavy.
The fusion layer adds more parameters.

Even on A100 GPUs with 40GB memory, it didn’t fit.

I tried everything I knew:

  • mixed precision (FP16, BF16)

  • gradient checkpointing

  • LoRA

  • PyTorch DDP
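Of these, gradient checkpointing is the least invasive to wire in. A minimal pure-PyTorch sketch (toy layer sizes, not the real model): activations inside each checkpointed segment are discarded on the forward pass and recomputed during backward, trading compute for memory.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy stack of 8 blocks; only segment boundaries keep their activations,
# everything in between is recomputed during the backward pass.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(256, 256), nn.GELU()) for _ in range(8)]
)

x = torch.randn(4, 256, requires_grad=True)
out = checkpoint_sequential(model, 4, x, use_reentrant=False)  # 4 segments
out.sum().backward()
print(x.grad.shape)  # torch.Size([4, 256])
```

With 4 segments, roughly 4 activation snapshots are stored instead of 8, at the cost of one extra forward pass through each segment during backward.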

Nothing was enough. At some point I realized:
I was not fighting the model — I was fighting physics.

To understand what was happening, I revisited the classic DeepSpeed ZeRO memory model. It explains exactly why training a 7B parameter model is impossible on a single GPU with standard data parallelism.

Memory Explosion: Why 7B Doesn’t Fit on a 40GB GPU

DeepSpeed’s paper breaks down memory for standard data-parallel training:

Baseline (no ZeRO) memory usage:

Memory = (2 + 2 + K) × Ψ

Where:

  • Ψ = number of model parameters

  • 2 = FP16 model weights

  • 2 = FP16 gradients

  • K = optimizer constant (Adam → 12: an FP32 copy of the weights plus the FP32 m and v states, 4 bytes each)

So for Adam, the baseline is:

(2 + 2 + 12) = 16 bytes per parameter

Example: 7.5B parameter model

Memory = 16 × 7.5B ≈ 120 GB

That’s ~120 GB — before activations, before batch size, and before temporary buffers.

No A100 with 40GB can hold that. Not even close.

This was the moment I understood why nothing fit, no matter how much I optimized manually.
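The back-of-the-envelope arithmetic above is easy to encode. A small helper using the paper's (2 + 2 + K) × Ψ formula (the function name is mine, not from DeepSpeed):

```python
def baseline_training_memory_gb(num_params: float, optimizer_k: int = 12) -> float:
    """Per-GPU memory for standard data-parallel mixed-precision training:
    2 bytes FP16 weights + 2 bytes FP16 gradients + K bytes of optimizer
    state per parameter (K = 12 for Adam: FP32 weight copy, m, and v)."""
    bytes_per_param = 2 + 2 + optimizer_k
    return bytes_per_param * num_params / 1e9

print(baseline_training_memory_gb(7.5e9))  # 120.0
```

Plugging in 7.5B parameters reproduces the 120 GB figure — three times what a 40 GB A100 offers, before a single activation is stored.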

Memory savings and communication volume across ZeRO stages (compared to standard data-parallel). Adapted from Microsoft Research, "ZeRO: New system optimizations enable training models with over 100 billion parameters," 2020.

The figure shows exactly why pure data-parallel training breaks at this scale — and why ZeRO-1/2/3 reduce memory so aggressively.


Using ZeRO and Distributed Training to Survive

Once I accepted that baseline parallelism was impossible, I moved everything to DeepSpeed ZeRO:

  • ZeRO-1: shards optimizer states

  • ZeRO-2: shards optimizer + gradients

  • ZeRO-3: shards optimizer + gradients + model weights

With ZeRO-3, the effective memory per GPU becomes approximately:

  • with 4 GPUs: 120 GB / 4 = 30 GB

  • with 8 GPUs (sometimes I get lucky :)): 120 GB / 8 = 15 GB

Both setups finally fit inside an A100’s 40 GB, even after including activations (thanks to checkpointing + BF16 + offloading).
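For reference, a ZeRO-3 setup along these lines can be expressed as the config dict you hand to DeepSpeed — all values here are illustrative, not my exact production config:

```python
# Illustrative DeepSpeed ZeRO-3 configuration (a dict like this can be
# passed to deepspeed.initialize as the config). Values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,      # tiny batches were unavoidable
    "gradient_accumulation_steps": 16,        # recover an effective batch size
    "bf16": {"enabled": True},                # BF16 was more stable than FP16
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 3,                           # shard optimizer + grads + weights
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        "overlap_comm": True,                 # hide some communication cost
    },
}

print(ds_config["zero_optimization"]["stage"])  # 3
```

Stage 3 with CPU offloading is the most memory-aggressive combination, which is exactly why the communication and speed costs described below show up.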


Slower, But Possible

Once I switched to ZeRO-2 / ZeRO-3, the model finally ran, but:

  • batch size was tiny

  • communication overhead was heavy

  • training speed dropped

  • checkpointing took forever

  • BF16 vs FP16 stability mattered a lot

But the most important thing was this:

The model finally trained.
Not fast. Not easily.
But it trained.

And that moment felt like a breakthrough.


Results + What I Learned

The final VLM wasn’t perfect — no medical model ever is.

But it was far better than my early versions, and significantly more consistent than general-purpose VLMs on medical tasks. It proved one thing strongly:

In medical AI, data + domain expertise matter more than complex architecture.


Along the way, this work also led to something I never expected in the beginning: a paper published at NAACL ClinicalNLP 2024 (check it out here) and a patent filed together with my company. At that moment, it felt like all the experiments, failures, and late-night debugging sessions were turning into something real and lasting.

My Lessons

  • Be creative. Architecture is a space of exploration, not rules.

  • Talk to experts early. They save you months of wrong assumptions.

  • Data is everything. Especially in specialized domains.

  • Narrow your focus. Trying all ideas at once slows you down.

  • Don’t be afraid of things you’ve never built. That’s where the most growth happens. :)