Pierre Colombo (@pierrecolombo6)'s Twitter Profile
Pierre Colombo

@pierrecolombo6

Associate Professor at CentraleSupelec (Paris Saclay) - CSO equall.ai - NLP/Law

ID: 1316746180085403655

Link: https://pierrecolombo.github.io/
Joined: 15-10-2020 14:22:18

604 Tweets

513 Followers

1.1K Following

Duarte Alves (@duartemralves)'s Twitter Profile Photo


🚀 Excited to announce EuroBERT: a new multilingual encoder model family for European & global languages! 🌍

🔹 EuroBERT is trained on a massive 5 trillion-token dataset across 15 languages and includes recent architecture advances such as GQA, RoPE & RMSNorm.
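
Those architecture claims should be visible directly in the released config. A minimal sketch, assuming the checkpoints live under an EuroBERT org on the Hugging Face Hub and ship custom modelling code (hence `trust_remote_code=True`); the Llama-style field names are an assumption and may differ in the actual config:

```python
# Hedged sketch: inspect the config for GQA and RoPE settings.
# "EuroBERT/EuroBERT-210m" is an assumed Hub id, not confirmed by the thread.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("EuroBERT/EuroBERT-210m", trust_remote_code=True)

# GQA shows up as fewer key/value heads than attention (query) heads.
print(getattr(config, "num_attention_heads", None),
      getattr(config, "num_key_value_heads", None))
# RoPE is usually parameterised by a base frequency such as rope_theta.
print(getattr(config, "rope_theta", None))
```
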
Duarte Alves (@duartemralves)'s Twitter Profile Photo

🧵 (3/7) 🌍 EuroBERT is open-source:
👉 Models (210M, 610M, 2.1B params)
👉 Training snapshots
👉 Full training framework
Explore here: huggingface.co/EuroBERT
Code coming soon! github.com/Nicolas-BZRD/E…
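
Loading any of the three sizes is a one-liner; a hedged sketch, where the exact checkpoint names under huggingface.co/EuroBERT are assumptions:

```python
# Pull a checkpoint and verify the advertised parameter count.
# "EuroBERT/EuroBERT-210m" (and "...-610m", "...-2.1B") are assumed Hub ids.
from transformers import AutoModel, AutoTokenizer

name = "EuroBERT/EuroBERT-210m"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # roughly 210M
```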

Duarte Alves (@duartemralves)'s Twitter Profile Photo

🧵 (7/7) 📖 Check out our blog post for more insights: huggingface.co/blog/EuroBERT/… 📄 Read more in our paper: arxiv.org/abs/2503.05500

Manuel Faysse (@manuelfaysse)'s Twitter Profile Photo

🚨 Introducing EuroBERT, a family of multilingual encoder models (210M to 2.1B params) trained on 5T tokens with an 8,192-token sequence length and all the modern bells and whistles! It's open source, and hopefully the perfect base model for training multilingual embeddings! (1/N)

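To illustrate the embedding use case the tweet hopes for, a minimal sketch: encode a small multilingual batch and mean-pool the last hidden state over non-padding tokens. The Hub id is an assumption, and mean pooling is a common default here, not the authors' prescription:

```python
# Hedged sketch: multilingual sentence embeddings from the base encoder.
import torch
from transformers import AutoModel, AutoTokenizer

name = "EuroBERT/EuroBERT-210m"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

sentences = ["Ein Satz auf Deutsch.", "A sentence in English."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state       # (batch, seq, dim)
mask = batch["attention_mask"].unsqueeze(-1)        # ignore padding positions
embeddings = (hidden * mask).sum(1) / mask.sum(1)   # mean pooling
print(embeddings.shape)
```
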
Antoine Chaffin (@antoine_chaffin)'s Twitter Profile Photo

More encoder upgrades, and multilingual this time (so no excuse not to try it)! Great work from the team; I have been in some discussions regarding these models and was really looking forward to the release! 🚀 Congratulations Nicolas Boizard and Manuel Faysse!

Benjamin Clavié (@bclavie)'s Twitter Profile Photo

More BERTs for the Modern era. This is super exciting; encoders are no longer dead 😄 The coolest aspect: with so many new proofs that encoders are small-param-count beasts, this'll hopefully spark a lot more research into making them even better in creative ways...

tomaarsen (@tomaarsen)'s Twitter Profile Photo


An assembly of 18 European companies, labs, and universities has banded together to launch 🇪🇺 EuroBERT!

It's a state-of-the-art multilingual encoder for 15 European languages, designed to be finetuned for retrieval, classification, etc.

Details in 🧵
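
For the classification side, a hedged sketch, assuming the released modelling code registers a sequence-classification head usable through the Auto classes:

```python
# Hedged sketch: wrap the encoder with a classification head and finetune.
# Model id and label count are placeholders.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "EuroBERT/EuroBERT-210m"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=3, trust_remote_code=True  # 3 labels as a placeholder
)
batch = tokenizer(["Ceci est un exemple."], return_tensors="pt")
print(model(**batch).logits.shape)  # (1, 3); train with your usual loop or Trainer
```
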
Fanny Jourdan (@fannyjrd_)'s Twitter Profile Photo


EuroBERT is out and it's insane! 🇪🇺
It's the most powerful multilingual encoder model family, at SOTA across a wide range of tasks: retrieval, classification, regression, maths, and code.
3 sizes: 210M, 610M, and 2.1B parameters, with support for sequence lengths up to 8,192 tokens. 📖 ⤵️
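
The 8,192-token context is easy to exercise directly; a sketch under the same (assumed) Hub ids:

```python
# Hedged sketch: encode a long document up to the advertised 8,192 tokens.
import torch
from transformers import AutoModel, AutoTokenizer

name = "EuroBERT/EuroBERT-210m"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

long_text = " ".join(["word"] * 20000)  # stand-in for a long document
batch = tokenizer(long_text, truncation=True, max_length=8192, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)
print(batch["input_ids"].shape, out.last_hidden_state.shape)
```
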
Igor Tica (@itica007)'s Twitter Profile Photo


[EuroBERT: Multilingual Embedding Model] 🔥

There is a new open embedding model in town: EuroBERT claims superior performance across a diverse set of benchmarks, spanning multilingual capabilities, mathematics, and coding.

It even outperforms ModernBERT on code and math.
Peter Sarlin (@petersarlin)'s Twitter Profile Photo


Not just yet another LLM 🇪🇺 EuroBERT provides open multilingual encoder models to power retrieval, classification & embeddings across 15 languages. But it is yet another model trained on AMD compute platforms. 🚀

European companies, labs, and universities have made this come…
Nicolas Boizard (@n1colais)'s Twitter Profile Photo


Great to see the community using EuroBERT! As hoped, it's proving to be an excellent foundation model, especially for information retrieval tasks across multiple languages after just one epoch of finetuning. Check it out: huggingface.co/Omartificial-I…
Eng.Omar (@Engomar_10)
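
The one-epoch retrieval finetuning mentioned here maps naturally onto sentence-transformers. A hedged sketch with placeholder training pairs (the linked dataset above is truncated, so nothing below refers to it):

```python
# Hedged sketch: one epoch of retrieval finetuning on (query, positive) pairs.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("EuroBERT/EuroBERT-210m", trust_remote_code=True)  # assumed id

train_examples = [  # illustrative pairs; in practice, load a real dataset
    InputExample(texts=["what is the capital of France?", "Paris is the capital of France."]),
    InputExample(texts=["ما هي عاصمة مصر؟", "القاهرة هي عاصمة مصر."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives
model.fit(train_objectives=[(loader, loss)], epochs=1)  # single epoch, as above
```
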
Manuel Faysse (@manuelfaysse)'s Twitter Profile Photo

🚨 We are moving Visual Document Retrieval Evaluation to MTEB! Starting today, ViDoRe V1 and V2, but soon joined by many other benchmarks, will benefit from first-class support in MTEB, enabling adding models and tasks more easily and more collaboratively! More in 🧵 (1/N)

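For context, generic MTEB usage looks like this today; the filter below is a plain text-retrieval smoke test, not the ViDoRe visual-document workflow, and the model id is just an example embedding model:

```python
# Hedged sketch: run a single MTEB retrieval task as a smoke test.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")  # any embedding model
tasks = mteb.get_tasks(task_types=["Retrieval"], languages=["eng"])
evaluation = mteb.MTEB(tasks=list(tasks)[:1])  # one task keeps the run short
results = evaluation.run(model, output_folder="results")
print(results)
```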