HPLT (@hplt_eu)'s Twitter Profile
HPLT

@hplt_eu

Horizon Europe - High Performance Language Technology (HPLT)

ID: 1542506822573051910

Link: http://hplt-project.org

Joined: 30-06-2022 13:54:35

59 Tweets

254 Followers

16 Following

Silicon Vikings (also on Blueskye and Threads) (@siliconvikings)

Europe’s largest private #AI lab #helyes @silo_AI, #tribetampere Turun yliopisto - University of Turku’s research group TurkuNLP + HPLT released the 1st multilingual large language model (LLM) for all Nordic languages + English + programming languages. By Tech.eu tech.eu/2024/05/15/sil… #NordicMade

Ona de Gibert (@onadegibert)

LREC COLING 2024 has arrived! We will be presenting our work on how we built the HPLT datasets! 📅 Friday 24th of May ⏰ 9.20h-9.40h 📍Room Londra ⌛️Session D3-S1-R3 - Multilinguality, Machine Translation, and Translation Aids II

Ona de Gibert (@onadegibert)

Still recovering from the excitement of #lreccoling2024, where we presented the HPLT resources! We introduce:
- monoHPLT: monolingual collection covering 75 languages
- biHPLT: parallel data for 18 language pairs
- multiHPLT: synthetic data obtained through pivoting

Institute of Formal and Applied Linguistics (@ufal_cuni)

📢 Job offer: Work with us! 🤓 Institute of Formal and Applied Linguistics Matematicko-fyzikální fakulta Univerzity Karlovy is looking for 🖥️ a Front-End and ⌨️ a Back-End Java developer to work on 🇪🇺 European Open Science Cloud. More details are at ufal.mff.cuni.cz/jobs. The application deadline is 🗓️ Aug 28.

Institute of Formal and Applied Linguistics (@ufal_cuni)

The MT Marathon continues on its third day! We already had great talks by Ondrej Bojar, @prajdabre1, Vilém Zouhar, and Elizabeth Salesky 👏 and a poster session with 10 posters 🖼️. Today, we continue with more talks, and of course, the week-long hackathon continues with interesting projects.

HPLT (@hplt_eu)

🚀 INTRODUCING THE LATEST HPLT MONOLINGUAL DATASETS! TL;DR: 🔍 4.5 PB of web crawls 📄 21 billion documents 💝 careful extraction, dedup, annotation and cleaning 💥 193 languages! Explore and download the new HPLT Monolingual Datasets NOW! hplt-project.org/datasets/v2.0 #HPLT

Jan Hajic (@hajicjan)

Just finished: very interesting talk about Open Source and LLMs by Percy Liang from Stanford University at EMNLP 2025: what we can and cannot do with closed vs. (various levels of) open(ness) in LLMs. Very relevant for HPLT & all future LLM projects. CLARIN ERIC Matematicko-fyzikální fakulta Univerzity Karlovy Univerzita Karlova

HPLT (@hplt_eu)

We are speaking about HPLT datasets and HPLT Analytics as a way to inspect them for quantitative analysis at #LI2024 today. We are introducing samples as a new feature! If you 😍 or 🤮 our dataset, now is the time to tell us! If you want us to take a look at yours, DM us.

HPLT (@hplt_eu)

Join us for a new edition of the Winter School! "Pretraining Data Quality 🧐 and Multilingual Evaluation of LLMs👀" 🪂Feb. 3–5, 2025, Norway. More info and registration: wiki.nlpl.eu/Community/trai… Jointly organised by HPLT and the Nordic Language Processing Laboratory (NLPL)

HPLT (@hplt_eu)

🥳 Amazing performance of the #HPLT v2 dataset! HuggingFace multilingual evaluation + HPLT English internal evaluation show that HPLT v2 is one of the best datasets for training LLMs. Downloads and more at either HPLT ➡️ hplt-project.org/hplt-v2-datase… or HF ➡️ huggingface.co/datasets/HPLT/…

HPLT (@hplt_eu)

We are happy to announce the second release of HPLT bilingual datasets:
- 50 English-centric language pairs = 380M parallel sentences (HPLT) 🤩
- 1,275 non-English-centric language pairs = 16.7B parallel sentences (MultiHPLT) 😮

Available at the HPLT dataset catalogue and OPUS.

HPLT (@hplt_eu)

New paper on the HPLT v2 dataset making-of:
- pipeline documentation and code
- extensive analysis of the quality and characteristics
- evaluation of the performance of language models and machine translation systems trained on it

🤓Happy reading! arxiv.org/pdf/2503.10267

HPLT (@hplt_eu)

HPLT v2 datasets now enriched with register labels from Turun yliopisto - University of Turku. As Amanda Myntti and Veronika Laippala show: "Appropriate metadata increases the value of a dataset".
- Blog post: hplt-project.org/register-labels
- Datasets + register labels (to be merged): hplt-project.org/datasets/v2.0

HPLT (@hplt_eu)

HPLT stopped by MTSummit2025 last week. We exchanged info with participants at a crowded poster session about HPLT v2 datasets while v3 is still in the oven. Next stop, ACL 2025!

HPLT (@hplt_eu)

Great use of HPLT v2 datasets! Eager to hear more about #HPLT? Join us at ACL 2025:
- BoF "Multilingualism: from data crawling to evaluation" on July 29, 16:00
- Poster "An Expanded Massive Multilingual Dataset for High-Performance Language Technologies" on July 30, 11:00

HPLT (@hplt_eu)

It's happening now. The HPLT v2 dataset's language coverage is awesome; it provides competitive and stable results and complements other data beautifully. We are at ACL 2025, come and say hi! #hplt #datasets