Yunha Hwang
@micro_yunha
Building genomic intelligence @tatta_bio
ID: 1125797908262027264
https://www.yunhahwang.com/ 07-05-2019 16:21:52
510 Tweet
1,1K Followers
1,1K Following
Arne Elofsson @arneelof.bsky.social Yes, it's all about the data you train on. Ultimately, we believe pLMs/gLMs are just storing coevolution matrices. ESM2 was never provided with pairs of proteins, so it never "stored" this information (unless the pair of "proteins" are also "domains" in another organism). (1/2)
Patrick Bryant We expect any LLM (any task) to be highly dependent on the space it was trained on. This was the motivation behind the OMG database and subsequent training: to diversify and expand into new sequence space (new protein families not in UniProt, or multi-protein-spanning sequences).
We are releasing OMG, an Open MetaGenomic dataset on Hugging Face. Similar to FineWeb for NLP, OMG is a massive dataset for open science in genomics. We train a genomic language model, gLM2, on OMG, demonstrating new capabilities like unsupervised protein-protein interaction.
Thanks to Sergey Ovchinnikov for the Colab. These results are super cool... This is a homotetramer: I used ColabFold to predict the structure and find interfaces, then overlaid them onto the co-evolution plot. The results, although rough, are pretty interesting.
See our work profiled in the new Asimov Press pandemic prevention mini-issue! Learn why we think antivirals are needed for "Day Zero" of the next pandemic – and get a glimpse into the research that we + others are carrying out to try and make that possible.
Fine-tuning protein language models boosts predictions across diverse tasks | Nature Communications - Finetune pLMs (ESM2, ProtT5, Ankh) on different tasks (GB1, GFP, AAV, Location, Meltome, Stability, Disorder Prediction, and Secondary Structure Prediction) - Explore various PEFT
September 3rd: Kaiyi Jiang, EVOLVEpro; September 17th: Jeff Ruffolo, ProseLM; October 1st: Amy Lu, CHEAP; October 15th: Kapil Devkota, Ray-gun; October 29th: Andre Cornman, The OMG dataset & gLM. With more announced soon ✨
🎉Our paper 'Beware of data leakage from protein LLM pretraining' was accepted at #MLCB2024! Meet Leon and Tobias at the spotlight talk and poster session on Thursday in Seattle to chat about how to address this important problem!! Jakub Bartoszewicz x.com/jmbartoszewicz…