煒清 WeiChing Lin (@thesuperching)'s Twitter Profile
煒清 WeiChing Lin

@thesuperching

engineer on data @🇹🇼
a little contrarian, with weak frontal lobe.
(۶•̀ᴗ•́)۶//

ID: 1624748894

Joined: 27-07-2013 06:16:12

2.2K Tweets

198 Followers

1.1K Following

Michael Ryan (@michaelryan207)

MIPROv2, our new state-of-the-art optimizer for LM programs, is live in DSPy Stanford NLP Group! It's even faster, cheaper, and more accurate than MIPRO. MIPROv2 proposes instructions, bootstraps demonstrations, and optimizes combinations. Let’s dive into a visual 🧵of how it works!
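The instruction/demonstration search described above can be sketched in plain Python. This is purely illustrative, not the DSPy API: the function name, the scoring callback, and the naive exhaustive-plus-sampled search are all assumptions for the sketch (the real optimizer searches this space far more cleverly).

```python
import random

def miprov2_style_search(instructions, demo_pool, score_fn,
                         n_demos=2, subsets_per_instruction=5, seed=0):
    """Toy sketch of MIPROv2's final stage: jointly search over proposed
    instructions and bootstrapped demonstration subsets, keeping the
    combination that scores best on a validation metric (score_fn)."""
    rng = random.Random(seed)
    best_inst, best_demos, best_score = None, None, float("-inf")
    for inst in instructions:
        for _ in range(subsets_per_instruction):
            demos = rng.sample(demo_pool, k=min(n_demos, len(demo_pool)))
            score = score_fn(inst, demos)
            if score > best_score:
                best_inst, best_demos, best_score = inst, demos, score
    return best_inst, best_demos, best_score
```

In a real run, `score_fn` would execute the LM program with that prompt configuration over a validation set and return the metric, which is what makes the search expensive and the v2 efficiency gains matter.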

ARC Prize (@arcprize)

Introducing SnakeBench, an experimental benchmark side quest.

We made 50 LLMs battle each other in head-to-head snake 🐍

2.8K matches showed which models are the best at real-time strategy and spatial reasoning in snake.

Here’s the top match between o3-mini and DeepSeek-R1 🧵

Bernal Jiménez (@bernaaaljg)

Introducing ✨HippoRAG 2 ✨

📣 📣 “From RAG to Memory: Non-Parametric Continual Learning for Large Language Models”

HippoRAG 2 is a memory framework for LLMs that elevates our brain-inspired HippoRAG system to new levels of performance and robustness.

🔓 Unlocks Memory

METR (@metr_evals)

When will AI systems be able to carry out long projects independently?

In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
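The claimed doubling law implies a simple exponential model of agent task horizons. As a minimal sketch, with the ~7-month doubling time taken from the thread and the starting horizon purely illustrative:

```python
import math

def task_horizon(h0_minutes, months_elapsed, doubling_months=7.0):
    """Task length (minutes) an agent can complete after `months_elapsed`,
    assuming the ~7-month doubling time reported by METR holds."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)

def months_to_reach(h0_minutes, target_minutes, doubling_months=7.0):
    """Months until the horizon grows from h0 to target under the same model."""
    return doubling_months * math.log2(target_minutes / h0_minutes)
```

For example, under this model a 30-minute horizon reaches a full 8-hour workday (480 minutes) after 7 × log2(16) = 28 months.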

Kawin Ethayarajh (@ethayarajh)

There are actually three lines here, one for scaling pre-training, one for post-training, and one for test-time compute. Every paradigm shift seems to be making progress more efficient -- we're successfully playing whack-a-mole.

That said, these tasks are all software/ML

METR (@metr_evals)

METR tested pre-release versions of o3 + o4-mini on tasks involving autonomy and AI R&D. For each model, we examined how capable it is on our tasks & how often it tries to “hack” them. We detail our findings in a new report, a summary of which is included in OpenAI's system card.

Andrew Lampinen (@andrewlampinen)

How do language models generalize from information they learn in-context vs. via finetuning? We show that in-context learning can generalize more flexibly, illustrating key differences in the inductive biases of these modes of learning — and ways to improve finetuning. Thread: 1/

Omar Khattab (@lateinteraction)

DSPy's biggest strength is also the reason it can admittedly be hard to wrap your head around. It basically says: LLMs & their methods will continue to improve, but not equally along every axis, so:

- What's the smallest set of fundamental abstractions that allow you to build

Noam Brown (@polynoamial)

The paper itself does a good job highlighting the limitation. But notice the difference in the plot from the paper vs the plots that are commonly shared.

The paper is here: arxiv.org/pdf/2503.14499

Sully (@sullyomarr)

honestly incredible how little you need to do to win these days

respond quickly, don't be afraid to try things, go out of your comfort zone, show up everyday, and you're already in the top 1%

Andrej Karpathy (@karpathy)

My sleep scores during recent travel were in the 90s. Now back in SF I am consistently back down to 70s, 80s. I am increasingly convinced that this is due to traffic noise from a nearby road/intersection where I live - every ~10min, a car, truck, bus, or motorcycle with a very

Nirant (@nirantk)

MR's tylercowen has a nerd-famous maxim from the early 2020s: "context is that which is scarce."

The 2025 version of that would be: "feedback is that which is scarce."

Every system (a company is a system) that has a good signal-to-noise feedback loop fixes things 10x faster, all else equal.

Bespoke Labs (@bespokelabsai)

Understanding what’s in the data is a high leverage activity when it comes to training/evaluating models and agents.

This week we will drill down into a few popular benchmarks and share some custom viewers that will help pop up various insights. 

Our viewer for GPQA (Google

Andreas Kirsch 🇺🇦 (@blackhc)

I'm late to review the "Illusion of Thinking" paper, so let me collect some of the best threads and critical takes by Lisan al Gaib in one place and sprinkle in some of my own thoughts as well. The paper is rather critical of reasoning LLMs (LRMs): x.com/MFarajtabar/st…

Epoch AI (@epochairesearch)

SWE-bench Verified is one of the main benchmarks to assess AI coding skills. But what does it actually measure?

We found that it's one of the best tests of AI coding, but limited by its focus on simple bug fixes in familiar repositories.

Here’s a summary of our article 🧵

Polina Kirichenko (@polkirichenko)

Excited to release AbstentionBench -- our paper and benchmark on evaluating LLMs’ *abstention*: the skill of knowing when NOT to answer!

Key finding: reasoning LLMs struggle with unanswerable questions and hallucinate!

Details and links to paper & open source code below!
🧵1/9
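Abstention as a metric can be sketched very simply: on the unanswerable slice, count how often the model declines to answer. The phrase-matching judge and marker list below are illustrative stand-ins, not how the benchmark actually scores responses.

```python
ABSTAIN_MARKERS = ("i don't know", "cannot be determined",
                   "not enough information", "unanswerable")

def abstention_rate(answers, is_answerable):
    """Fraction of unanswerable questions on which the model abstained,
    judged here by naive phrase matching (a real benchmark would use a
    stronger judge)."""
    unanswerable = [a for a, ok in zip(answers, is_answerable) if not ok]
    if not unanswerable:
        return None  # no unanswerable questions to score
    abstained = sum(
        any(marker in answer.lower() for marker in ABSTAIN_MARKERS)
        for answer in unanswerable
    )
    return abstained / len(unanswerable)
```

The thread's key finding is exactly that this rate stays low for reasoning LLMs: instead of abstaining, they hallucinate an answer.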

jack morris (@jxmnop)

my definition of science:
the continual process of (a) generating artifacts that are surprising or useful and (b) explaining them

these days there's exponential growth in papers, and in useful artifacts, but we don't often discover good explanations

so i was happy to stumble

Jia-Bin Huang (@jbhuang0604)

Why More Researchers Should be Content Creators

Just trying something new! I recorded one of my recent talks, sharing what I learned from starting as a small content creator. 

youtu.be/0W_7tJtGcMI

We all benefit when there are more content creators!

Philipp Schmid (@_philschmid)

What is context Engineering?

“Context Engineering is the discipline of designing and building dynamic systems that provides the right information and tools, in the right format, at the right time, to give a LLM everything it needs to accomplish a task.”

Read it:
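The quoted definition can be made concrete with a minimal sketch: assemble the right information and tools, in a consistent format, for one task. The section layout, the word-overlap relevance heuristic, and the character budget are all illustrative choices, not from the article.

```python
def build_context(task, documents, tools, max_chars=2000):
    """Toy context-engineering step: dynamically select relevant
    documents and list available tools for a given task."""
    task_words = set(task.lower().split())
    # naive relevance filter: keep documents sharing a word with the task
    relevant = [d for d in documents
                if task_words & set(d.lower().split())]
    sections = [
        "## Task\n" + task,
        "## Available tools\n" + "\n".join(f"- {t}" for t in tools),
        "## Retrieved context\n" + "\n".join(relevant),
    ]
    return "\n\n".join(sections)[:max_chars]  # enforce the context budget
```

A production system would replace the word-overlap filter with real retrieval and make tool selection dynamic too, but the shape is the same: the prompt is built per request, not hand-written once.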