Enrico Shippole (@enricoshippole)'s Twitter Profile

Enrico Shippole

@enricoshippole

We need better evaluations

ID: 1195196276557541376

Joined: 15-11-2019 04:26:55

1.1K Tweets

2.2K Followers

46 Following

Simo Ryu (@cloneofsimo)'s Twitter Profile Photo

I really hate some people in the ML research community who love attaching a new name to existing research and calling it theirs: "Let me add one line of modification and call it something brand new, so next time they will cite mine instead of the original work."

Shayne Longpre (@shayneredford)'s Twitter Profile Photo

Thrilled our global data ecosystem audit was accepted to #ICLR2025!

Empirically, we find:

1⃣ Soaring synthetic text data: ~10M tokens (pre-2018) to 100B+ (2024).

2⃣ YouTube is now 70%+ of speech/video data but could block third-party collection.

3⃣ <0.2% of data from
tomaarsen (@tomaarsen)'s Twitter Profile Photo

I just released Sentence Transformers v4.1, featuring ONNX and OpenVINO backends for rerankers offering 2-3x speedups, and improved hard negative mining, which helps prepare stronger training datasets.

Details in 🧵
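As a rough illustration of the hard negative mining idea mentioned above (a generic sketch, not the Sentence Transformers API; all names here are hypothetical), one can rank candidate passages by similarity to a query and keep the top-scoring candidates that are not the labeled positive:

```python
# Minimal sketch of hard negative mining: given similarity scores between a
# query and candidate passages, keep the highest-scoring candidates that are
# NOT the known positive. These "hard" negatives make training pairs harder
# to distinguish and typically yield stronger retrieval/reranking models.

def mine_hard_negatives(scores, positive_id, num_negatives=2):
    """scores: dict mapping candidate id -> similarity to the query."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    negatives = [cid for cid in ranked if cid != positive_id]
    return negatives[:num_negatives]

# Toy example: candidate "c" is the labeled positive; "b" and "d" score
# highest among the rest, so they are selected as hard negatives.
scores = {"a": 0.21, "b": 0.88, "c": 0.95, "d": 0.74}
print(mine_hard_negatives(scores, positive_id="c"))  # ['b', 'd']
```

In practice the scores would come from an embedding model or reranker rather than a hand-written dictionary; the selection logic stays the same.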
Enrico Shippole (@enricoshippole)'s Twitter Profile Photo

Looks like I am not going anywhere for a long time. This is one of the reasons LLMs rarely produce usable outputs without significant guidance in my line of work.

Enrico Shippole (@enricoshippole)'s Twitter Profile Photo

Tested this out, and there are noticeable issues above ~20 seconds of generated audio. The input needs to be chunked for generation to work properly. Voice cloning also likely needs fine-tuning to work out-of-distribution. Additionally, streaming support is needed for production use.
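The chunking workaround described above can be sketched as follows. The ~20-second limit is the observation from the tweet; the splitting heuristic (pack whole sentences into a word budget) and the assumed speaking rate are illustrative choices, not any specific TTS library's API:

```python
# Hedged sketch: split long input text into chunks small enough that each
# generated audio clip stays under the model's reliable window (~20 s per the
# observation above). Duration is approximated by word count; the
# words-per-second rate is an assumed constant, not a measured value.

import re

WORDS_PER_SECOND = 2.5          # assumed average speaking rate
MAX_SECONDS = 20.0              # observed reliability limit
MAX_WORDS = int(WORDS_PER_SECOND * MAX_SECONDS)  # ~50 words per chunk

def chunk_text(text, max_words=MAX_WORDS):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each chunk would then be passed to the TTS model separately and the
# resulting clips concatenated (a single over-long sentence still becomes
# its own chunk and may need further splitting).
text = "One short sentence. " * 30
for chunk in chunk_text(text):
    assert len(chunk.split()) <= MAX_WORDS
```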

Alexander Doria (@dorialexander)'s Twitter Profile Photo

Breaking: pleias releases a new generation of small reasoning models for RAG and source synthesis. Pleias-RAG-350M and Pleias-RAG-1B come with built-in support for source citation, SOTA performance, and accuracy comparable to models ten times their size.

Vik Paruchuri (@vikparuchuri)'s Twitter Profile Photo

We shipped an alpha version of the new Surya OCR model. No hype, just facts:

- 90+ languages (focus on en, romance langs, zh, ar, ja, ko)
- LaTeX and formatting
- Char/word/line bboxes
- ~500M non-embed params
- 10-20 pages/s
Simo Ryu (@cloneofsimo)'s Twitter Profile Photo

10B parameter DiT trained on 80M images, all owned by Freepik. Model commercially usable, raw model without distillation, open sourced.

Proud to demonstrate our first model-training project with our client Freepik: "F-Lite", from fal