Karthik Narasimhan (@karthik_r_n) Twitter Tweets • TwiCopy

Talor Abramovich

a year ago

We're launching EnIGMA, our state-of-the-art AI agent for offensive cybersec! It uses tools like Ghidra & pwntools, can debug, connect to servers, and exploit vulnerabilities to solve CTF challenges. Built with researchers from Princeton, NYU, and TAU. enigma-agent.github.io

44 likes · 2 replies · 15 reposts

Karthik Narasimhan

@karthik_r_n

a year ago

In a year or two from now, 'fine-tuning' will become synonymous with 'training' (as used in the good old ML days). LLMs will be seen more widely as starting points, just like weight initialization or choosing the number of layers for a Transformer. Pick a starting point, curate

141 likes · 11 replies · 13 reposts

Sierra

@sierraplatform

a year ago

Sierra partnered with Casper to launch Luna 2.0, their AI agent delivering 24/7 personalized customer support. From helping with mattress purchases to driving lifelong loyalty, Luna 2.0 is transforming the shopping experience!💤✨️ Learn more: sierra.ai/customers/casp…

17 likes · 0 replies · 5 reposts

John Yang

@jyangballin

a year ago

We're launching SWE-bench Multimodal to eval agents' ability to solve visual GitHub issues.
- 617 *brand new* tasks from 17 JavaScript repos
- Each task has an image!
Existing agents struggle here! We present SWE-agent Multimodal to remedy some issues.
Led w/ carlos 🧵

268 likes · 8 replies · 62 reposts

Sierra

@sierraplatform

a year ago

Today we're excited to announce a new way to interact with Sierra agents: voice. Learn more about how this new capability is transforming customer interactions in our latest blog post: sierra.ai/blog/sierra-sp…

20 likes · 0 replies · 4 reposts

Common Sense Machines

@csm_ai

a year ago

Today we're releasing Common Sense Agents, a new backbone for agentic creative computing:
💻 Windows VMs for safe and repeatable workflows
🔧 Long workflows broken down into reusable tasks
🦾 Support for off-the-shelf agents like Claude
⌛️ Data recording + finetuning infra

118 likes · 3 replies · 29 reposts

Karthik Narasimhan

@karthik_r_n

a year ago

The biggest mistake we can make right now is not dreaming big enough, especially w.r.t AI

155 likes · 5 replies · 13 reposts

Kilian Lieret @ICLR

@klieret

10 months ago

SWE-agent 1.0 is the open-source SOTA on SWE-bench Lite! Tons of new features: massively parallel runs; cloud-based deployment; extensive configurability with tool bundles; new command line interface & utilities.

60 likes · 3 replies · 18 reposts

Karthik Narasimhan

@karthik_r_n

10 months ago

The best thing about SWE-agents and tools like Cursor is the amount of additional agency they provide us

13 likes · 0 replies · 1 repost

Sierra

@sierraplatform

9 months ago

In the AI age, agent reliability is key, and Sierra’s 𝜏-bench is setting the standard—shaping academic research, industry applications and next-generation development. Read more: sierra.ai/blog/tau-bench….

11 likes · 0 replies · 4 reposts

Karthik Narasimhan

@karthik_r_n

9 months ago

Interesting tidbits on using dedicated "thinking" steps in agents from Anthropic.
Also loved seeing full pass^k curves for τ-bench - measuring this was the primary motivation of the benchmark, not just avg scores!

11 likes · 0 replies · 0 reposts

Shunyu Yao

@shunyuyao12

8 months ago

I’m at ICLR to present a poster and give a talk, both related to the second half blogpost. See you there if you wanna chat about it :)

114 likes · 4 replies · 7 reposts

Karthik Narasimhan

@karthik_r_n

7 months ago

Humans evolved to communicate so we could coordinate better. But these days, it feels like we communicate so much, yet coordinate so little.

20 likes · 2 replies · 0 reposts

Sierra

@sierraplatform

7 months ago

Successful agents are the result of collaboration between teams: engineering, operations, customer experience, and marketing. Yet every platform available today except Sierra forces businesses to optimize for one group over another. Our Agent OS enables both no code and

17 likes · 0 replies · 5 reposts

Clay Bavor

@claybavor

7 months ago

Like all great products, the best agents are the product of many teams working together — some technical, some non-technical. Sierra’s Agent OS uniquely supports both no code and programmatic agent development, enabling customer experience and engineering teams alike to build

10 likes · 1 reply · 2 reposts

Alex Zhang

@a1zhang

7 months ago

Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? 𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇

518 likes · 23 replies · 71 reposts

Ben Shi

@benshi34

6 months ago

As we optimize model reasoning over verifiable objectives, how does this affect human understanding of said reasoning to achieve superior collaborative outcomes? In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:

177 likes · 6 replies · 39 reposts

Sierra

@sierraplatform

6 months ago

Last year, we introduced 𝜏-bench, a benchmark for evaluating AI agents on realistic, multi-step tasks involving tool use and domain-specific constraints. It surfaced a critical limitation in LLM-based agents: low repeatability, even under identical conditions. Now, we’re

29 likes · 1 reply · 4 reposts

Sierra

@sierraplatform

6 months ago

Learn more: sierra.ai/blog/benchmark…

2 likes · 0 replies · 1 repost

Clay Bavor

@claybavor

6 months ago

Today we announced a set of major advances to our agent benchmark, 𝜏-bench. This new benchmark, 𝜏², introduces the notion of "dual control", where AI agents are challenged not just to reason and act, but to coordinate, guide, and assist a user in achieving a shared objective.

35 likes · 4 replies · 6 reposts