Alex Strick van Linschoten (@strickvl)'s Twitter Profile
Alex Strick van Linschoten

@strickvl

Machine Learning Engineer (@zenml_io), researcher (& author of a few books). (Mastodon): @[email protected] (ML) and @[email protected] (Maths)

ID:244511094

Link: https://mlops.systems
Joined: 29-01-2011 13:38:59

1,2K Tweets

1,5K Followers

454 Following

Argilla (@argilla_io)

💥After months of work, we're thrilled to introduce ⚗️distilabel 1.0.0!

🚀More flexible, robust, and powerful.

🙌 Let's empower the community to build the most impactful datasets for Open Source AI!

Blogpost: argilla.io/blog/introduci…
Github: github.com/argilla-io/dis…

Andy Carnevale (@AndyCarnevale)

Replying to Alex Strick van Linschoten and Jeremy Howard: In my experience, using XML makes a big difference in the quality of output. The cool thing is that you can just make up your own XML tags and Claude will just play along!

docs.anthropic.com/claude/docs/us…
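A minimal sketch of the idea in the tweet above: wrapping each part of a prompt in made-up XML tags before sending it to a model. The tag names (`instructions`, `document`) and the helper functions are hypothetical, chosen only for illustration; no API call is made here, the code only builds the prompt string.

```python
def wrap_in_tag(tag: str, content: str) -> str:
    """Wrap content in an arbitrary, made-up XML-style tag."""
    return f"<{tag}>\n{content}\n</{tag}>"


def build_prompt(instructions: str, document: str) -> str:
    """Assemble a prompt whose sections are delimited by XML-style tags.

    The tag names are illustrative assumptions, not a fixed schema.
    """
    sections = [
        wrap_in_tag("instructions", instructions),
        wrap_in_tag("document", document),
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    "Summarise the document in one sentence.",
    "Quarterly revenue grew while costs fell.",
)
print(prompt)
```

Delimiting sections this way keeps the instructions and the material they apply to visibly separate, which is the property the tweet credits for the better output.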

Naheed Mustafa (@NaheedMustafa)

The persistence in western media of the view that a person speaking on their own condition is an unreliable narrator of that condition is shameful. This also goes to the heart of the cultural churn in newsrooms here.
(Not a comment on the tweet. Just a related thought).

The Atlantic (@TheAtlantic)

AI understands just a tiny fraction of the thousands of languages used worldwide. That means the promised chatbot revolution could shut out billions of people, and even push some languages to extinction, Matteo Wong writes: theatlantic.com/technology/arc…

Sebastian Ruder (@seb_ruder)

Command R+ has strong multilingual capabilities. Its tokenizer also compresses multilingual text much better than other tokenizers. For example, in comparison the OpenAI tokenizer uses:
- 1.18x more tokens for Portuguese
- 1.54x more tokens for Chinese
- 1.67x more tokens for…

Alex Strick van Linschoten (@strickvl)

Is it just me or did Apple remove the Colemak keyboard support from iOS devices? Suddenly my device no longer has it… 😱

tomaarsen (@tomaarsen)

Big update for the Massive Text Embedding Benchmark (MTEB) intended to simplify finding a good embedding model! Model filtering, search, memory usage, model size in parameters.
The updated leaderboard: huggingface.co/spaces/mteb/le…
Details in 🧵:

Stefano Giomo (@stgiomo)

@TheZachMueller @GoogleColab @choldgraf Just `pip install testcell` ;-)

KAGGLE: kaggle.com/code/artste/in…
COLAB: colab.research.google.com/github/artste/…
GITHUB: github.com/artste/testcell
Stefano Giomo (@stgiomo)

@TheZachMueller @GoogleColab @choldgraf I developed the 🧪 %%testcell 🧪 cell-magic with a similar need in mind: make it easy to do exploratory programming, try out various experiments without the risk of a single function or variable 'polluting' the global scope.

Launching blog post: artste.github.io/blog/posts/int…
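To illustrate the general idea of running experimental code without 'polluting' the global scope, here is a rough sketch in plain Python. It is not how `testcell` is implemented; `run_isolated` is a hypothetical helper that executes a snippet against a throwaway copy of a namespace, so names the snippet defines or rebinds never reach the original.

```python
def run_isolated(source: str, namespace: dict) -> dict:
    """Run source in a scratch copy of namespace.

    Returns the names the snippet created or rebound. The copy is shallow,
    so in-place mutation of shared objects (e.g. list.append) still leaks.
    """
    scratch = dict(namespace)           # throwaway copy of the caller's names
    exec(source, scratch)               # the snippet only sees the copy
    scratch.pop("__builtins__", None)   # exec() injects this key; drop it
    return {
        name: value
        for name, value in scratch.items()
        if name not in namespace or namespace[name] is not value
    }


session = {"x": 1}
created = run_isolated("y = x + 1\nx = 100", session)

print(session)  # {'x': 1}
print(created)  # {'x': 100, 'y': 2}
```

The snippet can read `x` from the session, but its reassignment of `x` and its new variable `y` stay in the scratch copy.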

Jay Alammar (@JayAlammar)

In a lot of other models, Arabic/اللغة العربية (and other non-English languages) are more expensive to use because the tokenizer basically punishes other languages by requiring a lot more tokens to represent their text.

Command R+ improves the tokenization, requiring less tokens…

Sebastian Ruder (@seb_ruder)

Command R+ (⌘ R+) is our most capable model (with open weights!) yet! I’m particularly excited about its multilingual capabilities. It should do pretty well in 10 languages (English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, and Chinese).

You can…

Pablo Montalvo (@m_olbap)

It was hard to find quality OCR data... until today! Super excited to announce the release of the 2 largest public OCR datasets ever 📜 📜

OCR is critical for document AI: here, 26M+ pages, 18b text tokens, 6TB! Thanks to UCSF Library, Industry Documents Library and PDF Association
🧶 ↓

Alex Strick van Linschoten (@strickvl)

Publishing a new dataset today, this time a unique collection of translations from the 2006-2009 period in Afghanistan. Back then I founded and ran a media monitoring organisation/startup in Kabul, staffed by a small team of amazing translators.
mlops.systems/posts/2024-04-…

حسام شبات (@HossamShabat)

I have been working nonstop for the past 6 months covering what’s happening in Gaza, but what I saw today while visiting Al-Shifa hospital was unlike anything I’ve ever witnessed before :

Israeli occupation forces executed 300 Palestinians in and around the hospital, and this…

Devansh (⚡, 🥷) (@0xAsm0d3us)

If you, like many, think relying just on `cat` command's output is enough to be sure about the integrity of a bash file. Think twice, you could get hacked. Read below 👇
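The tweet does not spell out the mechanism, so the following is only a sketch under one assumption: that the risk comes from terminal control characters (escape sequences, carriage returns) that a terminal interprets rather than displays. The hypothetical helper `reveal_control_chars` renders such characters in caret notation, similar in spirit to `cat -v`, so they become visible when inspecting a script.

```python
def reveal_control_chars(text: str) -> str:
    """Make ASCII control characters visible using caret notation.

    Newlines and tabs are kept as-is; ESC becomes ^[, carriage return ^M,
    DEL becomes ^?. All other characters pass through unchanged.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if ch in "\n\t":
            out.append(ch)
        elif code < 32:
            out.append("^" + chr(code + 64))
        elif code == 127:
            out.append("^?")
        else:
            out.append(ch)
    return "".join(out)


# Illustrative input: an erase-line escape sequence followed by a carriage return.
sample = "echo hello\x1b[2K\recho bye\n"
print(reveal_control_chars(sample))  # echo hello^[[2K^Mecho bye
```

Run on a real file, this would be applied to its contents read in text mode; anything shown with a leading `^` is a character a terminal might act on instead of printing.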
