SH knows medical imaging from the inside out: CT, MRI, X-ray, ultrasound — they build the machines themselves. I had experience with LLMs and 3D CNNs. But combining them? That was something new.
We wanted to build our own multi-modal medical AI.
A Vision-Language Model specialized for healthcare.
I had no idea how hard it would be.
Until then, I had worked with:
LSTMs
Vision CNNs
3D CNNs
LLMs
But multi-modal fusion — taking images and language and making them talk to each other — was unknown territory. A completely different level of complexity.
I felt like a junior engineer again, learning from scratch.
I started where every engineer starts: reading everything.
I dug into every early VLM I could find before 2023:
BLIP
BLIP-2
Flamingo
GIT
PaLI
OWL-ViT
SimVLM
I studied their architectures piece by piece.
(I’ll add the figures and diagrams here later.)
Then I tried running open-source VLMs — especially BLIP-2 — on medical images. I asked simple questions like:
“Where is the tumor?”
“Is there a fracture?”
“What organ is this CT image showing?”
The results were… not good.
Sometimes funny, sometimes random, sometimes completely wrong.
So I started designing my own architecture:
Image encoder — a CNN or ViT
LLM — starting with BloomZ (different sizes)
Fusion module — to connect them, I used a simple gated cross-attention module
I implemented everything using HuggingFace.
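To make the fusion idea concrete, here is a minimal NumPy sketch of a gated cross-attention block (the real module was a trained PyTorch layer inside the HuggingFace stack; all names and dimensions here are illustrative). Text tokens attend to image patches, and a tanh gate initialized at zero lets the LLM start from its pure language behavior:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GatedCrossAttention:
    """Text tokens attend to image patches; tanh gate starts at 0,
    so at initialization the block is an identity on the text stream."""
    def __init__(self, d_model, d_vision, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        self.Wq = rng.normal(0.0, s, (d_model, d_model))
        self.Wk = rng.normal(0.0, s, (d_vision, d_model))
        self.Wv = rng.normal(0.0, s, (d_vision, d_model))
        self.gate = 0.0  # trainable scalar; tanh(0) = 0

    def __call__(self, text_tokens, image_patches):
        q = text_tokens @ self.Wq      # (T, d_model)
        k = image_patches @ self.Wk    # (P, d_model)
        v = image_patches @ self.Wv    # (P, d_model)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        # Residual connection: gated visual signal added to the text stream
        return text_tokens + np.tanh(self.gate) * (attn @ v)

fusion = GatedCrossAttention(d_model=16, d_vision=32)
text = np.ones((4, 16))      # 4 text tokens
patches = np.ones((9, 32))   # 9 image patches
fused = fusion(text, patches)  # shape (4, 16); equals `text` while gate == 0
```

The zero-initialized gate is the same trick Flamingo uses: the pretrained LLM is undisturbed at the start of training, and visual information is blended in gradually.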
For training data, I used the MedVQA challenge datasets — clean, labeled, well-organized.
And the result?
Just okay.
Sometimes slightly better than open-source general VLMs.
Sometimes worse.
It wasn’t bad.
But it wasn’t special.
And “not special” isn’t enough for medical AI.
Everything changed after a long conversation with two domain experts — one in medical imaging, one in medical LLMs.
They both told me the same thing:
“Your architecture is fine. Your data isn’t.”
Medical images are extremely domain-specific.
General datasets don’t help much.
General encoders don’t understand the shapes, textures, densities.
General LLMs don’t speak the language of radiologists.
So I focused on data + domain adaptation.
I dug deeper and found something promising:
1.6 million medical image–caption pairs
Perfect for pre-training a VLM.
I also started experimenting with better image encoders:
general CNNs
ViTs
medical-fine-tuned CNNs
medical ViTs
And I tried different LLMs — some general, some medically adapted.
Dozens of draft architectures that I tried :)
Slowly, things started to make sense.
After dozens of prototypes, the best-performing architecture was:
Image encoder — BioMedCLIP ViT, pre-trained on huge radiology datasets: it "sees" like a radiologist.
LLM — RadBloomZ-7B, a specialized medical version of BloomZ.
Fusion — a gated cross-attention layer that transforms vision features into the LLM token space.
This became the backbone of the system.
Designing the architecture was hard.
Training it was harder.
RadBloomZ-7B is large.
BioMedCLIP ViT is heavy.
The fusion layer adds more parameters.
Even on A100 GPUs with 40GB memory, it didn’t fit.
I tried everything I knew:
mixed precision (FP16, BF16)
gradient checkpointing
LoRA
PyTorch DDP
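To see why LoRA helps (and why it still wasn't enough on its own), here is a minimal NumPy sketch of the idea, with illustrative dimensions: freeze the pretrained weight W and train only a low-rank update B·A, which shrinks the trainable parameter count — and with it the gradient and optimizer-state memory — by orders of magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 8                       # hidden size, LoRA rank (illustrative)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable, small Gaussian init
B = np.zeros((d, r))                 # trainable, zero init -> update starts at 0

x = rng.normal(size=(d,))
y = W @ x + B @ (A @ x)              # LoRA forward: frozen path + low-rank path

full = d * d                         # params if we fine-tuned W directly
lora = d * r + r * d                 # params LoRA actually trains
ratio = lora / full                  # 2r/d = 0.015625 here, ~1.6%
```

But LoRA only shrinks the trainable parameters. The frozen 7B weights still have to sit in GPU memory for the forward pass, which is exactly the wall described next.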
Nothing was enough. At some point I realized:
I was not fighting the model — I was fighting physics.
To understand what was happening, I revisited the classic DeepSpeed ZeRO memory model. It explains exactly why training a 7B parameter model is impossible on a single GPU with standard data parallelism.
DeepSpeed's paper breaks down the per-GPU memory for standard data-parallel training as:

memory = (2 + 2 + K) · Ψ bytes

Where:
Ψ = number of model parameters
2 = FP16 model weights (2 bytes per parameter)
2 = FP16 gradients (2 bytes per parameter)
K = optimizer constant (Adam → 12, for the FP32 master weights plus the m and v states)

So for Adam, the baseline is (2 + 2 + 12) · Ψ = 16Ψ bytes. For this model — a 7B LLM plus the vision encoder and fusion layer, roughly 7.5B parameters in total — that's ~120 GB, before activations, before batch size, and before temporary buffers.
No A100 with 40GB can hold that. Not even close.
This was the moment I understood why nothing fit, no matter how much I optimized manually.
Memory savings and communication volume across ZeRO stages (compared to standard data-parallel). Adapted from Microsoft Research, "ZeRO: New system optimizations enable training models with over 100 billion parameters," 2020.
The figure shows exactly why pure data-parallel training breaks at this scale — and why ZeRO-1/2/3 reduce memory so aggressively.
Once I accepted that baseline parallelism was impossible, I moved everything to DeepSpeed ZeRO:
ZeRO-1: shards optimizer states
ZeRO-2: shards optimizer + gradients
ZeRO-3: shards optimizer + gradients + model weights
With ZeRO-3, optimizer states, gradients, and weights are all sharded across the GPUs, so the effective memory per GPU becomes approximately 16Ψ / N:

4 GPUs → 120 GB / 4 = 30 GB per GPU
8 GPUs → 120 GB / 8 = 15 GB per GPU
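The arithmetic above can be checked in a few lines (assuming Ψ ≈ 7.5B total parameters, and ignoring activations and temporary buffers):

```python
def baseline_gb(params, k=12):
    """Bytes per parameter: 2 (FP16 weights) + 2 (FP16 grads) + K (Adam states)."""
    return (2 + 2 + k) * params / 1e9

def zero3_per_gpu_gb(params, n_gpus, k=12):
    """ZeRO-3 shards weights, gradients and optimizer states across all GPUs."""
    return baseline_gb(params, k) / n_gpus

print(baseline_gb(7.5e9))          # 120.0 GB -- far beyond one 40 GB A100
print(zero3_per_gpu_gb(7.5e9, 4))  # 30.0 GB per GPU
print(zero3_per_gpu_gb(7.5e9, 8))  # 15.0 GB per GPU
```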
Both configurations finally fit inside an A100's 40 GB, even after including activations (thanks to checkpointing + BF16 + offloading).
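For reference, a minimal DeepSpeed config in the spirit of that setup (a sketch, not my exact file — the batch-size and accumulation values are placeholders):

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  },
  "gradient_clipping": 1.0
}
```

Activation checkpointing itself is enabled on the model side (e.g. `gradient_checkpointing_enable()` on a HuggingFace model) rather than in this config.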
Once I switched to ZeRO-2 / ZeRO-3, the model finally ran, but:
batch size was tiny
communication overhead was heavy
training speed dropped
checkpointing took forever
BF16 vs FP16 stability mattered a lot
But the most important thing was this:
The model finally trained.
Not fast. Not easily.
But it trained.
And that moment felt like a breakthrough.
The final VLM wasn’t perfect — no medical model ever is.
But it was far better than my early versions, and significantly more consistent than general-purpose VLMs on medical tasks. It proved one thing strongly:
In medical AI, data + domain expertise matter more than complex architecture.
Be creative. Architecture is a space of exploration, not rules.
Talk to experts early. They save you months of wrong assumptions.
Data is everything. Especially in specialized domains.
Narrow your focus. Trying all ideas at once slows you down.
Don’t be afraid of things you’ve never built. That’s where the most growth happens. :)
It started as a small idea during a brainstorming session:
could we build a robot that helps farmers by separating male and female rapeseed plants in breeding fields? In rapeseed hybrid breeding, workers normally walk through long rows, manually marking the male sterile lines (female lines) and the fertile male lines - a slow, tiring, and repetitive job. We wondered if a simple robot could do this automatically using cameras, a bit of computer vision, and some creativity.
So we began in the most humble way possible: with cardboard, masking tape, and a motor. We sketched a little chassis, built a tiny circuit, wrote a simple program to make the robot go straight, turn left, turn right - and celebrated every tiny movement like kids playing with a toy car. Then came the vision module: a small CNN model that tried to classify rapeseed male and female lines in the field. Eventually, after many small fixes and many laughs with teammates, our cardboard box evolved into a fully moving robot that could actually walk between rows and do its job.
It wasn’t perfect, and it wasn’t fancy. But we built it together - and for a moment, it felt like the future of sustainable agriculture was walking on four wheels right in front of us.
My teammate after a sleepless night.
This project started with a deceptively simple question:
Can we identify return trips in New York City taxi data?
A taxi trip b is a return trip for a if:
b picks up within 8 hours after a drops off
b picks up within r meters of a’s dropoff
b drops off within r meters of a’s pickup
where r ∈ {50, 100, 150, 200} meters
The baseline implementation looked straightforward.
In practice, the first version took 250 seconds — more than four minutes.
My beginner approach looked like this:
val joined = trips.alias("a").crossJoin(trips.alias("b"))
If you have ~10 million trips, that means:
10M × 10M = 100 trillion comparisons
Even after adding time filters, distance calculations, and reasonable Spark tuning, it still ran around 250 seconds. It was clear Spark wasn’t the problem — my thinking was.
I shifted the question from:
“How do I make Spark faster?”
to
“Why am I comparing trips that will never match?”
That shift unlocked the solution.
NYC covers around 1,200 km², inside a bounding box of roughly 50 km × 50 km.
Our matching radius is 50–200 meters.
A tiny dot inside a huge canvas.
So I mapped NYC into grid cells with side length r, where:
Any two points within r meters of each other are guaranteed to land in the same cell or in one of its 8 neighboring cells.
Instead of joining across the entire city, a trip in cell (i, j) only needs to compare with trips in these 9 spatial neighborhoods:
+-----------+-----------+-----------+
| (i-1,j-1) | (i-1, j)  | (i-1,j+1) |
+-----------+-----------+-----------+
| (i, j-1)  | (i, j)    | (i, j+1)  |
+-----------+-----------+-----------+
| (i+1,j-1) | (i+1, j)  | (i+1,j+1) |
+-----------+-----------+-----------+
Just 9 boxes, instead of the entire city.
This alone removed ~99% of unnecessary comparisons.
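In code, the cell index is just integer division of metric offsets. A sketch — the southwest reference corner and the meters-per-degree constants below are approximations I am assuming, not values from the actual pipeline:

```python
import math

def cell_index(lat, lon, r_m, lat0=40.49, lon0=-74.27):
    """Map a coordinate to a grid cell of side r_m meters.
    Any two points within r_m meters of each other land in the
    same cell or one of its 8 neighbors."""
    m_per_deg_lat = 111_320.0                                  # ~constant
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(40.7))   # at NYC's latitude
    i = int((lat - lat0) * m_per_deg_lat // r_m)
    j = int((lon - lon0) * m_per_deg_lon // r_m)
    return i, j

# Two dropoffs ~25 m apart must land in the same or adjacent cells (r = 50 m)
a = cell_index(40.7000, -74.0000, 50)
b = cell_index(40.7002, -74.0001, 50)
```

The indices are plain integers, which is exactly what makes the later join cheap.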
Return trips must happen within 8 hours.
So I discretized time into 8-hour buckets:
time_idx = floor(timestamp / 8 hours)
Return-trip candidates satisfy:
time_idx_B ∈ [time_idx_A, time_idx_A + 1]
Again, only local neighbors — not the whole month.
I duplicated each trip in tripAB for each valid combination:
9 spatial neighbor cells
× 2 time buckets
= 18 rows per trip
Usually duplication is bad, but here it made the join trivial.
The join key became three integers:
(lat_idx, lon_idx, time_idx)
Spark can hash-partition this instantly — no trig functions, no shuffle explosions, no Cartesian nightmares.
The expensive haversine distance and time checks were applied after the join, on a tiny set of candidates.
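Putting the pieces together: each trip expands into 18 candidate join keys, and the exact distance check runs only on the survivors. A pure-Python sketch of the logic (the real pipeline expressed this with Spark DataFrames):

```python
import math

EIGHT_HOURS = 8 * 3600  # seconds

def keys_for_trip(lat_idx, lon_idx, ts):
    """18 join keys per trip: 9 spatial neighbor cells x 2 eight-hour buckets."""
    t = ts // EIGHT_HOURS
    return [(lat_idx + di, lon_idx + dj, t + dt)
            for di in (-1, 0, 1) for dj in (-1, 0, 1) for dt in (0, 1)]

def haversine_m(lat1, lon1, lat2, lon2):
    """Exact great-circle distance in meters — run only AFTER the integer join."""
    r_earth = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r_earth * math.asin(math.sqrt(a))

ks = keys_for_trip(467, 455, 100_000)  # 18 distinct (lat_idx, lon_idx, time_idx) keys
```

The trigonometry is the expensive part, so it runs last, on the tiny candidate set the integer join leaves behind.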
The optimized pipeline didn’t run in four minutes.
It didn’t run in one minute.
It ran in:
👉 55 milliseconds
From 250 seconds → 0.055 seconds.
A speedup of over 4,500×.
Not from tuning Spark, but from giving Spark far less work.
Don’t optimize the join. Reduce the search space.
Most pairs should never be compared; remove them before the join exists.
Geography and time are not fields — they are structure.
Use them to build natural partitions the data actually lives in.
Spark isn’t slow. Wrong data shapes are slow.
Make your data small, local, and integer-joinable.
Trigonometry is expensive. Run it last, not first.
Sometimes duplication speeds things up.
If it turns a hard join into a cheap one, do it.
The best optimization is a better question.
“Why am I comparing these rows at all?”
Two months into my time at Zalo, we won our first hackathon - thanks to our wonderful team!
We built a scalable e-commerce chatbot in the Zalo super app - helping customers find products, place orders, and purchase effortlessly.