VariantFormer × 1000 Genomes: Imputed Gene Expression for the Community
We started Strand AI to build modality transformation models for biology: tools that fill in the missing pieces across patient datasets so researchers can do better science.
While we're developing our own models, we thought it'd be fun to run VariantFormer (CZI Biohub's 1.2B-parameter DNA-to-RNA model) on new data and give the results back to the community.
What we did
We took the 1000 Genomes Project expansion pack, over 500 individuals who weren't in the original training set, and generated imputed RNA-seq expression for samples that never had expression measured.
The result: imputed gene expression across 4,500 genes, 45 tissues, and 538 samples.
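To get a feel for the scale, here is a rough back-of-the-envelope sketch in Python. It assumes every gene, tissue, and sample combination has a value and that each value is stored as a 4-byte float; both are our assumptions for illustration, not a description of the actual file format.

```python
# Dimensions quoted in the post.
N_GENES = 4_500
N_TISSUES = 45
N_SAMPLES = 538


def dense_value_count(genes: int, tissues: int, samples: int) -> int:
    """Number of values if every gene/tissue/sample combination is present."""
    return genes * tissues * samples


def dense_size_bytes(n_values: int, bytes_per_value: int = 4) -> int:
    """Uncompressed size, assuming a fixed-width value (4 bytes = float32)."""
    return n_values * bytes_per_value


n_values = dense_value_count(N_GENES, N_TISSUES, N_SAMPLES)
size_gb = dense_size_bytes(n_values) / 1e9

print(f"{n_values:,} values, roughly {size_gb:.2f} GB uncompressed as float32")
```

Under those assumptions that is about 109 million predicted values, small enough to load on a laptop.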
We also tuned the inference pipeline to run 37× faster on cheaper A100s instead of H100s. Way cheaper, still fast.
Explore the data
We built an interactive visualizer so you can poke around. Filter by tissue, gene, or population and explore expression patterns across the 1000 Genomes cohort.
You can explore the data here, or download the full dataset from the explorer.
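If you'd rather work with the download directly, the same tissue, gene, and population filters are easy to reproduce in a few lines of Python. This is a minimal sketch that assumes a long-format CSV with columns named `sample_id`, `population`, `tissue`, `gene`, and `expression`. Those column names are hypothetical, and the rows below are made-up placeholders, not real predictions, so check the actual file layout before adapting it.

```python
import csv
import io

# Placeholder rows for illustration only. Column names are assumed and the
# values are invented; they are NOT taken from the released dataset.
SAMPLE_CSV = """sample_id,population,tissue,gene,expression
S1,POP_A,liver,GENE1,1.5
S2,POP_A,lung,GENE1,0.7
S3,POP_B,liver,GENE1,2.5
S4,POP_B,liver,GENE2,3.0
"""


def filter_rows(rows, tissue=None, gene=None, population=None):
    """Yield rows matching every filter that is set; None means 'any'."""
    wanted = {"tissue": tissue, "gene": gene, "population": population}
    for row in rows:
        if all(value is None or row[key] == value for key, value in wanted.items()):
            yield row


def mean_expression(rows):
    """Average the expression column, or return None if there are no rows."""
    values = [float(row["expression"]) for row in rows]
    return sum(values) / len(values) if values else None


# Swap io.StringIO(SAMPLE_CSV) for open("your_download.csv") on a real file.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
liver_gene1 = list(filter_rows(rows, tissue="liver", gene="GENE1"))

print(len(liver_gene1), "rows, mean expression:", mean_expression(liver_gene1))
```

We stuck to the standard library here so the sketch runs anywhere; for the full dataset, a dataframe library would be the more comfortable choice.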
Why we're sharing this
Our main business is licensing multimodal biological datasets, but this one's free. It's a nice way to give back while showing what modality transformation can do.
Have ideas for what models or datasets we should run next? Reach out at founders@strandai.com.
Thanks to the team at Chan Zuckerberg Initiative / Biohub and the 1000 Genomes Project for the foundational work that made this possible.