煒清 WeiChing Lin (@thesuperching)'s Twitter Profile
煒清 WeiChing Lin

@thesuperching

engineer on data @🇹🇼
a little contrarian, with weak frontal lobe.
(۶•̀ᴗ•́)۶//

ID: 1624748894

Joined: 27-07-2013 06:16:12

2.2K Tweets

198 Followers

1.1K Following

Michael Ryan (@michaelryan207)

MIPROv2, our new state-of-the-art optimizer for LM programs, is live in DSPy Stanford NLP Group! It's even faster, cheaper, and more accurate than MIPRO. MIPROv2 proposes instructions, bootstraps demonstrations, and optimizes combinations. Let’s dive into a visual 🧵of how it works!
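The instruction/demonstration search described above can be sketched in plain Python. This is purely illustrative, not the DSPy API: the function name, the scoring callback, and the naive exhaustive-plus-sampled search are all assumptions for the sketch (the real optimizer searches this space far more cleverly).

```python
import random

def miprov2_style_search(instructions, demo_pool, score_fn,
                         n_demos=2, subsets_per_instruction=5, seed=0):
    """Toy sketch of MIPROv2's final stage: jointly search over proposed
    instructions and bootstrapped demonstration subsets, keeping the
    combination that scores best on a validation metric (score_fn)."""
    rng = random.Random(seed)
    best_inst, best_demos, best_score = None, None, float("-inf")
    for inst in instructions:
        for _ in range(subsets_per_instruction):
            demos = rng.sample(demo_pool, k=min(n_demos, len(demo_pool)))
            score = score_fn(inst, demos)
            if score > best_score:
                best_inst, best_demos, best_score = inst, demos, score
    return best_inst, best_demos, best_score
```

In a real run, `score_fn` would execute the LM program with that prompt configuration over a validation set and return the metric, which is what makes the search expensive and the v2 efficiency gains matter.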

ARC Prize (@arcprize)

Introducing SnakeBench, an experimental benchmark side quest.

We made 50 LLMs battle each other in head-to-head snake 🐍

2.8K matches showed which models are the best at real-time strategy and spatial reasoning in snake.

Here’s the top match between o3-mini and DeepSeek-R1 🧵

Bernal Jiménez (@bernaaaljg)

Introducing ✨HippoRAG 2 ✨

📣 📣 “From RAG to Memory: Non-Parametric Continual Learning for Large Language Models”

HippoRAG 2 is a memory framework for LLMs that elevates our brain-inspired HippoRAG system to new levels of performance and robustness.

🔓 Unlocks Memory

METR (@metr_evals)

When will AI systems be able to carry out long projects independently?

In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
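The claimed doubling law implies a simple exponential model of agent task horizons. As a minimal sketch, with the ~7-month doubling time taken from the thread and the starting horizon purely illustrative:

```python
import math

def task_horizon(h0_minutes, months_elapsed, doubling_months=7.0):
    """Task length (minutes) an agent can complete after `months_elapsed`,
    assuming the ~7-month doubling time reported by METR holds."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)

def months_to_reach(h0_minutes, target_minutes, doubling_months=7.0):
    """Months until the horizon grows from h0 to target under the same model."""
    return doubling_months * math.log2(target_minutes / h0_minutes)
```

For example, under this model a 30-minute horizon reaches a full 8-hour workday (480 minutes) after 7 × log2(16) = 28 months.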

Kawin Ethayarajh (@ethayarajh)

There are actually three lines here, one for scaling pre-training, one for post-training, and one for test-time compute. Every paradigm shift seems to be making progress more efficient -- we're successfully playing whack-a-mole.

That said, these tasks are all software/ML

METR (@metr_evals)

METR tested pre-release versions of o3 + o4-mini on tasks involving autonomy and AI R&D. For each model, we examined how capable it is on our tasks & how often it tries to “hack” them. We detail our findings in a new report, a summary of which is included in OpenAI's system card.

Andrew Lampinen (@andrewlampinen)

How do language models generalize from information they learn in-context vs. via finetuning? We show that in-context learning can generalize more flexibly, illustrating key differences in the inductive biases of these modes of learning — and ways to improve finetuning. Thread: 1/

Omar Khattab (@lateinteraction)

DSPy's biggest strength is also the reason it can admittedly be hard to wrap your head around. It basically says: LLMs & their methods will continue to improve, but not equally along every axis, so:

- What's the smallest set of fundamental abstractions that allow you to build

Noam Brown (@polynoamial)

The paper itself does a good job highlighting the limitation. But notice the difference in the plot from the paper vs the plots that are commonly shared.

The paper is here: arxiv.org/pdf/2503.14499

Sully (@sullyomarr)

honestly incredible how little you need to do to win these days

respond quickly, don't be afraid to try things, go out of your comfort zone, show up everyday, and you're already in the top 1%

Andrej Karpathy (@karpathy)

My sleep scores during recent travel were in the 90s. Now back in SF I am consistently back down to 70s, 80s. I am increasingly convinced that this is due to traffic noise from a nearby road/intersection where I live - every ~10min, a car, truck, bus, or motorcycle with a very

Nirant (@nirantk)

MR's tylercowen has a nerd-famous maxim from the early 2020s: "context is that which is scarce."

The 2025 version of that would be: "feedback is that which is scarce."

Every system (a company is a system) that has a good signal-to-noise feedback loop fixes things 10x faster, all else equal.

Bespoke Labs (@bespokelabsai)

Understanding what’s in the data is a high leverage activity when it comes to training/evaluating models and agents.

This week we will drill down into a few popular benchmarks and share some custom viewers that will help pop up various insights. 

Our viewer for GPQA (Google

Andreas Kirsch 🇺🇦 (@blackhc)

I'm late to review the "Illusion of Thinking" paper, so let me collect some of the best threads and critical takes by Lisan al Gaib in one place and sprinkle in some of my own thoughts as well. The paper is rather critical of reasoning LLMs (LRMs): x.com/MFarajtabar/st…

Epoch AI (@epochairesearch)

SWE-bench Verified is one of the main benchmarks to assess AI coding skills. But what does it actually measure?

We found that it's one of the best tests of AI coding, but limited by its focus on simple bug fixes in familiar repositories.

Here’s a summary of our article 🧵

Polina Kirichenko (@polkirichenko)

Excited to release AbstentionBench -- our paper and benchmark on evaluating LLMs’ *abstention*: the skill of knowing when NOT to answer!

Key finding: reasoning LLMs struggle with unanswerable questions and hallucinate!

Details and links to paper & open source code below!
🧵1/9
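Abstention as a metric can be sketched very simply: on the unanswerable slice, count how often the model declines to answer. The phrase-matching judge and marker list below are illustrative stand-ins, not how the benchmark actually scores responses.

```python
ABSTAIN_MARKERS = ("i don't know", "cannot be determined",
                   "not enough information", "unanswerable")

def abstention_rate(answers, is_answerable):
    """Fraction of unanswerable questions on which the model abstained,
    judged here by naive phrase matching (a real benchmark would use a
    stronger judge)."""
    unanswerable = [a for a, ok in zip(answers, is_answerable) if not ok]
    if not unanswerable:
        return None  # no unanswerable questions to score
    abstained = sum(
        any(marker in answer.lower() for marker in ABSTAIN_MARKERS)
        for answer in unanswerable
    )
    return abstained / len(unanswerable)
```

The thread's key finding is exactly that this rate stays low for reasoning LLMs: instead of abstaining, they hallucinate an answer.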

jack morris (@jxmnop)

my definition of science:
the continual process of (a) generating artifacts that are surprising or useful and (b) explaining them

these days there's exponential growth in papers, and in useful artifacts, but we don't often discover good explanations

so i was happy to stumble

Jia-Bin Huang (@jbhuang0604)

Why More Researchers Should be Content Creators

Just trying something new! I recorded one of my recent talks, sharing what I learned from starting as a small content creator. 

youtu.be/0W_7tJtGcMI

We all benefit when there are more content creators!

Philipp Schmid (@_philschmid)

What is context Engineering?

“Context Engineering is the discipline of designing and building dynamic systems that provides the right information and tools, in the right format, at the right time, to give a LLM everything it needs to accomplish a task.”

Read it:
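The quoted definition can be made concrete with a minimal sketch: assemble the right information and tools, in a consistent format, for one task. The section layout, the word-overlap relevance heuristic, and the character budget are all illustrative choices, not from the article.

```python
def build_context(task, documents, tools, max_chars=2000):
    """Toy context-engineering step: dynamically select relevant
    documents and list available tools for a given task."""
    task_words = set(task.lower().split())
    # naive relevance filter: keep documents sharing a word with the task
    relevant = [d for d in documents
                if task_words & set(d.lower().split())]
    sections = [
        "## Task\n" + task,
        "## Available tools\n" + "\n".join(f"- {t}" for t in tools),
        "## Retrieved context\n" + "\n".join(relevant),
    ]
    return "\n\n".join(sections)[:max_chars]  # enforce the context budget
```

A production system would replace the word-overlap filter with real retrieval and make tool selection dynamic too, but the shape is the same: the prompt is built per request, not hand-written once.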