SH knows medical imaging from the inside out: CT, MRI, X-ray, ultrasound — they build the machines themselves. I had experience with LLMs and 3D CNNs. But combining them? That was something new.
We wanted to build our own multi-modal medical AI.
A Vision-Language Model specialized for healthcare.
I had no idea how hard it would be.
Until then, I had worked with:
LSTMs
Vision CNNs
3D CNNs
LLMs
But multi-modal fusion — taking images and language and making them talk to each other — was unknown territory. A completely different level of complexity.
I felt like a junior engineer again, learning from scratch.
I started where every engineer starts: reading everything.
I dug into every early VLM I could find before 2023:
BLIP
BLIP-2
Flamingo
GIT
PaLI
OWL-ViT
SimVLM
I studied their architectures piece by piece.
(I’ll add the figures and diagrams here later.)
Then I tried running open-source VLMs — especially BLIP-2 — on medical images. I asked simple questions like:
“Where is the tumor?”
“Is there a fracture?”
“What organ is this CT image showing?”
The results were… not good.
Sometimes funny, sometimes random, sometimes completely wrong.
So I started designing my own architecture:
Image encoder — a CNN or ViT
LLM — starting with BloomZ (different sizes)
Fusion module — to connect them, I used a simple gated cross-attention module
I implemented everything using HuggingFace.
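To make the fusion idea concrete, here is a minimal NumPy sketch of a gated cross-attention block (the real module was a trained PyTorch layer inside the HuggingFace stack; all names and dimensions here are illustrative). Text tokens attend to image patches, and a tanh gate initialized at zero lets the LLM start from its pure language behavior:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GatedCrossAttention:
    """Text tokens attend to image patches; tanh gate starts at 0,
    so at initialization the block is an identity on the text stream."""
    def __init__(self, d_model, d_vision, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        self.Wq = rng.normal(0.0, s, (d_model, d_model))
        self.Wk = rng.normal(0.0, s, (d_vision, d_model))
        self.Wv = rng.normal(0.0, s, (d_vision, d_model))
        self.gate = 0.0  # trainable scalar; tanh(0) = 0

    def __call__(self, text_tokens, image_patches):
        q = text_tokens @ self.Wq      # (T, d_model)
        k = image_patches @ self.Wk    # (P, d_model)
        v = image_patches @ self.Wv    # (P, d_model)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        # Residual connection: gated visual signal added to the text stream
        return text_tokens + np.tanh(self.gate) * (attn @ v)

fusion = GatedCrossAttention(d_model=16, d_vision=32)
text = np.ones((4, 16))      # 4 text tokens
patches = np.ones((9, 32))   # 9 image patches
fused = fusion(text, patches)  # shape (4, 16); equals `text` while gate == 0
```

The zero-initialized gate is the same trick Flamingo uses: the pretrained LLM is undisturbed at the start of training, and visual information is blended in gradually.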
For training data, I used the MedVQA challenge datasets — clean, labeled, well-organized.
And the result?
Just okay.
Sometimes slightly better than open-source general VLMs.
Sometimes worse.
It wasn’t bad.
But it wasn’t special.
And “not special” isn’t enough for medical AI.
Everything changed after a long conversation with two domain experts — one in medical imaging, one in medical LLMs.
They both told me the same thing:
“Your architecture is fine. Your data isn’t.”
Medical images are extremely domain-specific.
General datasets don’t help much.
General encoders don’t understand the shapes, textures, densities.
General LLMs don’t speak the language of radiologists.
So I focused on data + domain adaptation.
I dug deeper and found something promising:
1.6 million medical image–caption pairs
Perfect for pre-training a VLM.
I also started experimenting with better image encoders:
general CNNs
ViTs
medical-fine-tuned CNNs
medical ViTs
And I tried different LLMs — some general, some medically adapted.
Dozens of draft architectures that I tried :)
Slowly, things started to make sense.
After dozens of prototypes, the best-performing architecture was:
Image encoder — BioMedCLIP ViT, pre-trained on huge radiology datasets: it "sees" like a radiologist.
LLM — RadBloomZ-7B, a specialized medical version of BloomZ.
Fusion — a gated cross-attention layer that transforms vision features into the LLM token space.
This became the backbone of the system.
Designing the architecture was hard.
Training it was harder.
RadBloomZ-7B is large.
BioMedCLIP ViT is heavy.
The fusion layer adds more parameters.
Even on A100 GPUs with 40GB memory, it didn’t fit.
I tried everything I knew:
mixed precision (FP16, BF16)
gradient checkpointing
LoRA
PyTorch DDP
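To see why LoRA helps (and why it still wasn't enough on its own), here is a minimal NumPy sketch of the idea, with illustrative dimensions: freeze the pretrained weight W and train only a low-rank update B·A, which shrinks the trainable parameter count — and with it the gradient and optimizer-state memory — by orders of magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 8                       # hidden size, LoRA rank (illustrative)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable, small Gaussian init
B = np.zeros((d, r))                 # trainable, zero init -> update starts at 0

x = rng.normal(size=(d,))
y = W @ x + B @ (A @ x)              # LoRA forward: frozen path + low-rank path

full = d * d                         # params if we fine-tuned W directly
lora = d * r + r * d                 # params LoRA actually trains
ratio = lora / full                  # 2r/d = 0.015625 here, ~1.6%
```

But LoRA only shrinks the trainable parameters. The frozen 7B weights still have to sit in GPU memory for the forward pass, which is exactly the wall described next.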
Nothing was enough. At some point I realized:
I was not fighting the model — I was fighting physics.
To understand what was happening, I revisited the classic DeepSpeed ZeRO memory model. It explains exactly why training a 7B parameter model is impossible on a single GPU with standard data parallelism.
DeepSpeed's paper breaks down the per-GPU memory for standard data-parallel training as:

memory = (2 + 2 + K) · Ψ bytes

Where:
Ψ = number of model parameters
2 = FP16 model weights (2 bytes per parameter)
2 = FP16 gradients (2 bytes per parameter)
K = optimizer constant (Adam → 12, for the FP32 master weights plus the m and v states)

So for Adam, the baseline is (2 + 2 + 12) · Ψ = 16Ψ bytes. For this model — a 7B LLM plus the vision encoder and fusion layer, roughly 7.5B parameters in total — that's ~120 GB, before activations, before batch size, and before temporary buffers.
No A100 with 40GB can hold that. Not even close.
This was the moment I understood why nothing fit, no matter how much I optimized manually.
Memory savings and communication volume across ZeRO stages (compared to standard data-parallel). Adapted from Microsoft Research, "ZeRO: New system optimizations enable training models with over 100 billion parameters," 2020.
The figure shows exactly why pure data-parallel training breaks at this scale — and why ZeRO-1/2/3 reduce memory so aggressively.
Once I accepted that baseline parallelism was impossible, I moved everything to DeepSpeed ZeRO:
ZeRO-1: shards optimizer states
ZeRO-2: shards optimizer + gradients
ZeRO-3: shards optimizer + gradients + model weights
With ZeRO-3, optimizer states, gradients, and weights are all sharded across the GPUs, so the effective memory per GPU becomes approximately 16Ψ / N:

4 GPUs → 120 GB / 4 = 30 GB per GPU
8 GPUs → 120 GB / 8 = 15 GB per GPU
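The arithmetic above can be checked in a few lines (assuming Ψ ≈ 7.5B total parameters, and ignoring activations and temporary buffers):

```python
def baseline_gb(params, k=12):
    """Bytes per parameter: 2 (FP16 weights) + 2 (FP16 grads) + K (Adam states)."""
    return (2 + 2 + k) * params / 1e9

def zero3_per_gpu_gb(params, n_gpus, k=12):
    """ZeRO-3 shards weights, gradients and optimizer states across all GPUs."""
    return baseline_gb(params, k) / n_gpus

print(baseline_gb(7.5e9))          # 120.0 GB -- far beyond one 40 GB A100
print(zero3_per_gpu_gb(7.5e9, 4))  # 30.0 GB per GPU
print(zero3_per_gpu_gb(7.5e9, 8))  # 15.0 GB per GPU
```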
Both configurations finally fit inside an A100's 40 GB, even after including activations (thanks to checkpointing + BF16 + offloading).
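For reference, a minimal DeepSpeed config in the spirit of that setup (a sketch, not my exact file — the batch-size and accumulation values are placeholders):

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  },
  "gradient_clipping": 1.0
}
```

Activation checkpointing itself is enabled on the model side (e.g. `gradient_checkpointing_enable()` on a HuggingFace model) rather than in this config.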
Once I switched to ZeRO-2 / ZeRO-3, the model finally ran, but:
batch size was tiny
communication overhead was heavy
training speed dropped
checkpointing took forever
BF16 vs FP16 stability mattered a lot
But the most important thing was this:
The model finally trained.
Not fast. Not easily.
But it trained.
And that moment felt like a breakthrough.
The final VLM wasn’t perfect — no medical model ever is.
But it was far better than my early versions, and significantly more consistent than general-purpose VLMs on medical tasks. It proved one thing strongly:
In medical AI, data + domain expertise matter more than complex architecture.
Be creative. Architecture is a space of exploration, not rules.
Talk to experts early. They save you months of wrong assumptions.
Data is everything. Especially in specialized domains.
Narrow your focus. Trying all ideas at once slows you down.
Don’t be afraid of things you’ve never built. That’s where the most growth happens. :)
It started as a small idea during a brainstorming session:
could we build a robot that helps farmers by separating male and female rapeseed plants in breeding fields? In rapeseed hybrid breeding, workers normally walk through long rows, manually marking the male sterile lines (female lines) and the fertile male lines - a slow, tiring, and repetitive job. We wondered if a simple robot could do this automatically using cameras, a bit of computer vision, and some creativity.
So we began in the most humble way possible: with cardboard, masking tape, and a motor. We sketched a little chassis, built a tiny circuit, wrote a simple program to make the robot go straight, turn left, turn right - and celebrated every tiny movement like kids playing with a toy car. Then came the vision module: a small CNN model that tried to classify rapeseed male and female lines in the field. Eventually, after many small fixes and many laughs with teammates, our cardboard box evolved into a fully moving robot that could actually walk between rows and do its job.
It wasn’t perfect, and it wasn’t fancy. But we built it together - and for a moment, it felt like the future of sustainable agriculture was walking on four wheels right in front of us.
My teammate after a sleepless night.
This project started with a deceptively simple question:
Can we identify return trips in New York City taxi data?
A taxi trip b is a return trip for a if:
b picks up within 8 hours after a drops off
b picks up within r meters of a’s dropoff
b drops off within r meters of a’s pickup
where r ∈ {50, 100, 150, 200} meters
The baseline implementation looked straightforward.
In practice, the first version took 250 seconds — more than four minutes.
My beginner approach looked like this:
val joined = trips.alias("a").crossJoin(trips.alias("b"))
If you have ~10 million trips, that means:
10M × 10M = 100 trillion comparisons
Even after adding time filters, distance calculations, and reasonable Spark tuning, it still ran around 250 seconds. It was clear Spark wasn’t the problem — my thinking was.
I shifted the question from:
“How do I make Spark faster?”
to
“Why am I comparing trips that will never match?”
That shift unlocked the solution.
NYC covers around 1,200 km², inside a bounding box of roughly 50 km × 50 km.
Our matching radius is 50–200 meters.
A tiny dot inside a huge canvas.
So I mapped NYC into grid cells with side length r, where:
Any two points within r meters of each other are guaranteed to land in the same cell or in one of its 8 neighboring cells.
Instead of joining across the entire city, a trip in cell (i, j) only needs to compare with trips in these 9 spatial neighborhoods:
+-----------+-----------+-----------+
| (i-1,j-1) | (i-1, j)  | (i-1,j+1) |
+-----------+-----------+-----------+
| (i, j-1)  | (i, j)    | (i, j+1)  |
+-----------+-----------+-----------+
| (i+1,j-1) | (i+1, j)  | (i+1,j+1) |
+-----------+-----------+-----------+
Just 9 boxes, instead of the entire city.
This alone removed ~99% of unnecessary comparisons.
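In code, the cell index is just integer division of metric offsets. A sketch — the southwest reference corner and the meters-per-degree constants below are approximations I am assuming, not values from the actual pipeline:

```python
import math

def cell_index(lat, lon, r_m, lat0=40.49, lon0=-74.27):
    """Map a coordinate to a grid cell of side r_m meters.
    Any two points within r_m meters of each other land in the
    same cell or one of its 8 neighbors."""
    m_per_deg_lat = 111_320.0                                  # ~constant
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(40.7))   # at NYC's latitude
    i = int((lat - lat0) * m_per_deg_lat // r_m)
    j = int((lon - lon0) * m_per_deg_lon // r_m)
    return i, j

# Two dropoffs ~25 m apart must land in the same or adjacent cells (r = 50 m)
a = cell_index(40.7000, -74.0000, 50)
b = cell_index(40.7002, -74.0001, 50)
```

The indices are plain integers, which is exactly what makes the later join cheap.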
Return trips must happen within 8 hours.
So I discretized time into 8-hour buckets:
time_idx = floor(timestamp / 8 hours)
Return-trip candidates satisfy:
time_idx_B ∈ [time_idx_A, time_idx_A + 1]
Again, only local neighbors — not the whole month.
I duplicated each trip in tripAB for each valid combination:
9 spatial neighbor cells
× 2 time buckets
= 18 rows per trip
Usually duplication is bad, but here it made the join trivial.
The join key became three integers:
(lat_idx, lon_idx, time_idx)
Spark can hash-partition this instantly — no trig functions, no shuffle explosions, no Cartesian nightmares.
The expensive haversine distance and time checks were applied after the join, on a tiny set of candidates.
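Putting the pieces together: each trip expands into 18 candidate join keys, and the exact distance check runs only on the survivors. A pure-Python sketch of the logic (the real pipeline expressed this with Spark DataFrames):

```python
import math

EIGHT_HOURS = 8 * 3600  # seconds

def keys_for_trip(lat_idx, lon_idx, ts):
    """18 join keys per trip: 9 spatial neighbor cells x 2 eight-hour buckets."""
    t = ts // EIGHT_HOURS
    return [(lat_idx + di, lon_idx + dj, t + dt)
            for di in (-1, 0, 1) for dj in (-1, 0, 1) for dt in (0, 1)]

def haversine_m(lat1, lon1, lat2, lon2):
    """Exact great-circle distance in meters — run only AFTER the integer join."""
    r_earth = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r_earth * math.asin(math.sqrt(a))

ks = keys_for_trip(467, 455, 100_000)  # 18 distinct (lat_idx, lon_idx, time_idx) keys
```

The trigonometry is the expensive part, so it runs last, on the tiny candidate set the integer join leaves behind.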
The optimized pipeline didn’t run in four minutes.
It didn’t run in one minute.
It ran in:
👉 55 milliseconds
From 250 seconds → 0.055 seconds.
A speedup of over 4,500×.
Not from tuning Spark, but from giving Spark far less work.
Don’t optimize the join. Reduce the search space.
Most pairs should never be compared; remove them before the join exists.
Geography and time are not fields — they are structure.
Use them to build natural partitions the data actually lives in.
Spark isn’t slow. Wrong data shapes are slow.
Make your data small, local, and integer-joinable.
Trigonometry is expensive. Run it last, not first.
Sometimes duplication speeds things up.
If it turns a hard join into a cheap one, do it.
The best optimization is a better question.
“Why am I comparing these rows at all?”
Two months into my time at Zalo, we won our first hackathon - thanks to our wonderful team!
We built a scalable e-commerce chatbot in the Zalo super app - helping customers find products, place orders, and purchase effortlessly.