Matt Mistele (@mattgmistele) 's Twitter Profile
Matt Mistele

@mattgmistele

Eng manager for NLU at @Moveworks, making a daily assistant for all employees and the best place to build AI workflows for work

ID: 2887883744

Joined: 02-11-2014 23:19:47

143 Tweets

145 Followers

973 Following

Russell Kaplan (@russelljkaplan) 's Twitter Profile Photo

Friend who is a doctor told me everyone in his hospital uses ChatGPT now. Me: “Do you all use o3?” Him: “No, 4o. Isn’t it best to use the latest model? 4 vs 3?” OpenAI we really gotta fix these model names 🤦‍♂️

Andreas Kirsch 🇺🇦 (@blackhc) 's Twitter Profile Photo

I'm late to review the "Illusion of Thinking" paper, so let me collect some of the best threads and critical takes by Lisan al Gaib in one place and sprinkle some of my own thoughts in as well. The paper is rather critical of reasoning LLMs (LRMs): x.com/MFarajtabar/st…

Jiaxin Wen @ICLR2025 (@jiaxinwen22) 's Twitter Profile Photo

New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive or better than using human supervision.

Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.

Alex Albert (@alexalbert__) 's Twitter Profile Photo

Heard that some eng teams at big co's are now testing their API designs against LLMs before release. They run evals to see which API structure is easiest for the model to work with and redesign it if the model struggles to understand the format. I expect this to scale to
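
The tweet cuts off, but the mechanics are easy to picture. Here is a minimal sketch of the idea, assuming the standard OpenAI chat API; the two candidate schemas, the extraction task, and the model choice are all illustrations, not any team's actual harness:

```python
import json

from openai import OpenAI

client = OpenAI()

# Two candidate response shapes for the same (hypothetical) endpoint.
DESIGNS = {
    "nested": {"user": {"contact": {"email": "ada@example.com"}}},
    "flat": {"user_contact_email": "ada@example.com"},
}

def ask(payload: dict) -> str:
    """Ask the model to extract a field from one candidate response shape."""
    prompt = (
        "Given this API response, what is the user's email address? "
        "Reply with the address only.\n" + json.dumps(payload)
    )
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# Score each design by how often the model extracts the field correctly.
for name, payload in DESIGNS.items():
    answers = [ask(payload) for _ in range(5)]
    accuracy = sum(a == "ada@example.com" for a in answers) / len(answers)
    print(f"{name}: {accuracy:.0%} of extractions correct")
```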

Andriy Burkov (@burkov) 's Twitter Profile Photo

Reinforcement learning in action: Gemini "thinks" for much longer at the beginning of the debugging. But as the conversation grows, its thinking becomes shorter and shorter. Close to 300k tokens, it almost doesn't think at all. This is because it doesn't matter for the model

Simon Willison (@simonw) 's Twitter Profile Photo

"Design Patterns for Securing LLM Agents against Prompt Injections" is an excellent new paper that provides six design patterns to help protect LLM tool-using systems (call them "agents" if you like) against prompt injection attacks

"Design Patterns for Securing LLM Agents against Prompt Injections" is an excellent new paper that provides six design patterns to help protect LLM tool-using systems (call them "agents" if you like) against prompt injection attacks
Pliny the Liberator 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭 (@elder_plinius) 's Twitter Profile Photo

I'm not the biggest fan of how the memory (bios tool) in ChatGPT is constantly creating overly simplistic memories, almost always without user permission or a chance to review/edit. So I made a new memory that vastly improves my bios functionality!

Now, ChatGPT writes far more

Simon Willison (@simonw) 's Twitter Profile Photo

If you use "AI agents" (LLMs that call tools) you need to be aware of the Lethal Trifecta Any time you combine access to private data with exposure to untrusted content and the ability to externally communicate an attacker can trick the system into stealing your data!

If you use "AI agents" (LLMs that call tools) you need to be aware of the Lethal Trifecta

Any time you combine access to private data with exposure to untrusted content and the ability to externally communicate an attacker can trick the system into stealing your data!
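
A toy way to make the combination concrete, with entirely hypothetical tool names and capability tags; the point is that no single leg is dangerous on its own:

```python
# Capability tags are hypothetical labels for this illustration.
TRIFECTA = {"private_data", "untrusted_content", "external_comms"}

def has_lethal_trifecta(tools: dict[str, set[str]]) -> bool:
    """True if the agent's combined toolset covers all three legs."""
    combined = set().union(*tools.values())
    return TRIFECTA <= combined

agent_tools = {
    "read_email": {"private_data", "untrusted_content"},  # inbox text is attacker-controlled
    "web_search": {"untrusted_content"},
    "send_email": {"external_comms"},
}

# True: an injected email can instruct the agent to mail private data out.
print(has_lethal_trifecta(agent_tools))
```
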
Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

Tool-calling turns GPT-4.1 into a near-o1-preview without a single gradient step.

No retraining, just smarter prompts for near-RL performance. 🤯

pass@1 performance on AIME2024 from 26.7% to 43.3%, bringing it very close to the performance of o1-preview.

Swapping one prompt
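
The tweet is truncated and the underlying paper's exact prompt isn't shown, but the general recipe, giving the model a code-execution tool through standard OpenAI function calling, looks roughly like this. The tool schema and system prompt here are assumptions, and the exec call must be sandboxed in anything real:

```python
import contextlib
import io
import json

from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute Python code and return whatever it prints.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

messages = [
    {"role": "system", "content": "Solve math problems step by step. When the "
     "arithmetic or enumeration gets heavy, write Python and call run_python "
     "instead of computing in your head."},
    {"role": "user", "content": "How many integers 1 <= n <= 1000 are divisible by 7 or 11?"},
]

while True:
    resp = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)  # final answer
        break
    messages.append(msg)
    for call in msg.tool_calls:
        code = json.loads(call.function.arguments)["code"]
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):  # capture print() output
            exec(code, {})  # WARNING: sandbox this outside of a toy demo
        messages.append({"role": "tool", "tool_call_id": call.id, "content": buf.getvalue()})
```
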
Essential AI (@essential_ai) 's Twitter Profile Photo

[1/5]

🚀 Meet Essential-Web v1.0, a 24-trillion-token pre-training dataset with rich metadata built to effortlessly curate high-performing datasets across domains and use cases!
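
The pitch is that curation becomes simple metadata filtering. A sketch of what that might look like with Hugging Face datasets, where the repo id and metadata field names are assumptions rather than the dataset's documented schema:

```python
from datasets import load_dataset

# Repo id and field names below are assumptions for illustration.
ds = load_dataset("EssentialAI/essential-web-v1.0", split="train", streaming=True)

# Keep only documents whose (assumed) taxonomy metadata marks them as medical.
medical = (ex for ex in ds if "medicine" in str(ex.get("metadata", "")).lower())

for _, ex in zip(range(3), medical):  # peek at the first few matches
    print(ex.get("text", "")[:200], "\n---")
```
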
John Loeber 🎢 (@johnloeber) 's Twitter Profile Photo

Nobody understands the true mechanics of signing bonuses, so let me spell it out:

Conventionally, a signing bonus always comes with a clawback — if you leave within 12/24/etc months, you have to pay it back

But! Signing bonuses are taxed, and the taxes must be paid on day one!
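
To make the trap concrete with entirely hypothetical numbers (whether a clawback demands the gross or the net amount varies by contract):

```python
bonus_gross = 100_000          # hypothetical signing bonus
withholding_rate = 0.40        # hypothetical combined day-one withholding

cash_received = bonus_gross * (1 - withholding_rate)   # $60,000 actually lands
clawback_owed = bonus_gross                            # many contracts claw back the GROSS
shortfall = clawback_owed - cash_received              # $40,000 you never saw

print(f"Received on day one: ${cash_received:,.0f}")
print(f"Owed if you leave early: ${clawback_owed:,.0f}")
print(f"Cash shortfall until taxes are recovered at filing: ${shortfall:,.0f}")
```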

Simon Willison (@simonw) 's Twitter Profile Photo

Context rot is another reason I don't trust the new ChatGPT memory feature where it includes summarized notes from previous conversations automatically - makes it harder to truly reset the context when something rotten makes it in there

Gabriele Berton (@gabriberton) 's Twitter Profile Photo

A few months ago, researchers from Meta empirically found that LayerNorm acts similarly to a tanh, squeezing the weights that are too high into more tractable values

They tried replacing the LayerNorm with a tanh, and achieved similar results at higher speed

What??? [1/8]
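
The work being described is Meta's "Transformers without Normalization" (Dynamic Tanh, or DyT). A minimal PyTorch sketch of the replacement, with the alpha init value chosen here for illustration:

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Drop-in LayerNorm replacement: squash activations with a scaled tanh
    instead of normalizing by mean and variance."""

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar
        self.weight = nn.Parameter(torch.ones(dim))   # per-channel scale, like LN's gamma
        self.bias = nn.Parameter(torch.zeros(dim))    # per-channel shift, like LN's beta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * torch.tanh(self.alpha * x) + self.bias

# Usage: swap nn.LayerNorm(d_model) for DynamicTanh(d_model) in a block.
x = torch.randn(2, 16, 64)
print(DynamicTanh(64)(x).shape)  # torch.Size([2, 16, 64])
```
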
Matt Mistele (@mattgmistele) 's Twitter Profile Photo

Super cool! Latest in a series of papers showing that LLMs can in fact apply learnings, in ways we might not have expected after the reversal curse paper (subtitled “LLMs trained on ‘A is B’ fail to learn ‘B is A’”, circa GPT-3.5 and GPT-4)

Matt Mistele (@mattgmistele) 's Twitter Profile Photo

Do not be fooled by 3s and 4s

star wars 3: 2005 tech & $113M
star wars 4: 1977 tech & $11M

o3: 2025 tech & $$$
4o: 2024 tech & $$

Jeremy Howard (@jeremyphoward) 's Twitter Profile Photo

Study by my genius friend to CURE DEAFNESS cancelled because, I shit you not: "Research programs based primarily on artificial and non-scientific categories, including amorphous equity objectives, are antithetical to the scientific inquiry". Decel anti-science brain-dead take.

Dave Kline (@dklineii) 's Twitter Profile Photo

Company culture isn't hard to understand. It's the sum of each individual's behavior.

- It's how the CEO interacts with their admin
- It's how the veterans welcome new hires
- It's how the recruiter responds to you

Culture is nothing more than how we treat each other.