tusharkhot (@tusharkhot)'s Twitter Profile

Senior Research Scientist, Allen Institute for AI

ID: 13091972

Link: https://allenai.org/team/tushark/
Joined: 05-02-2008 13:05:09

105 Tweets

309 Followers

187 Following

Archiki Prasad (@archikiprasad)

I'll be presenting my #NAACL2024 work:✨ADaPT✨ in-person 🇲🇽 tomorrow (June 19) at 11 AM in poster session 7! ADaPT enables LLMs to "adapt" to task complexity & execution failures by decomposing recursively w/ Alexander Koller M Hartmann, P Clark Ashish Sabharwal Mohit Bansal tusharkhot

Bodhisattwa Majumder (@mbodhisattwa)

Can LLMs help accelerate the discovery of data-driven scientific hypotheses? 🧬📊 We benchmark this in DiscoveryBench: 264 discovery tasks from 6 scientific domains, from humanities to biology: arxiv.org/pdf/2407.01725… Ai2 Aristo Team at AI2 Harshit Surana UMass Amherst

Bodhisattwa Majumder (@mbodhisattwa)

To y'all attending #icml24, gearing up to argue what general-purpose generative models/agents can and cannot do for automated scientific discovery. Would love to hear your thoughts and counter-arguments on Wed (7/24) at 1:30, Hall C 4-9 #117. + discuss what's cooking Ai2

Archiki Prasad (@archikiprasad)

Symbolic planners do explicit search for most steps. However, LLMs can solve easy sub-goals directly (Sys-1) and verbalize search for harder ones (Sys-2). In✨System-1.x✨, we train LLMs to efficiently balance Sys-1 & 2. Users can simply control hybridization via the dial x! 🎛️

AK (@_akhaliq)

AppWorld A Controllable World of Apps and People for Benchmarking Interactive Coding Agents Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household), must not only operate multiple apps (e.g., notes, messaging, shopping app) via

Harsh Trivedi (@harsh3vedi)

🔥 Autonomous AI Assistants (e.g., #googleio2024, #WWDC24) and coding agents (e.g., #Devin, #SWEAgent) have garnered a lot of attention recently. We can envision coding agents autonomously completing complex day-to-day tasks across apps using APIs on our behalf. But how can we

ACL 2025 (@aclmeeting)

🏆 ACL Best Theme Paper Award: - OLMo: Accelerating the Science of Language Models by Groeneveld et al. #NLProc #ACL2024NLP

Niranjan (@b_niranjan)

🏆 AppWorld won a #ACL2024NLP Best Resource Paper Award. 🥳 Congrats team! I'm so happy for Harsh Trivedi. The time & care he put in is inspiring. #proudadvisor 🚨He is on the job market.🚨 Hire him! 🌐Check out appworld.dev Stony Brook University Dept. of Computer Science @AI_SBU #NLProc Ai2

Ben Bogin (@ben_bogin)

📢 New Benchmark: SUPER for Setting UP and Executing tasks from Research repositories Reproducibility is crucial in science. We introduce SUPER to evaluate LLMs' capabilities in autonomously running experiments from research repositories. ⬇️ arxiv.org/pdf/2409.07440

Ai2 (@allen_ai)

Meet Molmo: a family of open, state-of-the-art multimodal AI models. Our best model outperforms proprietary systems, using 1000x less data. Molmo doesn't just understand multimodal data—it acts on it, enabling rich interactions in both the physical and virtual worlds. Try it

Harsh Trivedi (@harsh3vedi)

📢 I am giving a talk on 🌎 AppWorld & its future works at 13+ universities (USC, UCI, Stanford, Berkeley, Princeton, JHU, ..) and companies (Ai2, Google, Apple, Semantic Machines, ..) in the next 1-2 months 📅 Schedule+Details: appworld.dev/talks x.com/harsh3vedi/sta…

Peter Jansen ( @peterjansen-ai.bsky.social ) (@peterjansen_ai)

Can language models perform end-to-end scientific discovery? In our NeurIPS Spotlight paper, we show: very rarely. Our best model found <20% of discoveries, our best PhDs found nearly all. Paper: arxiv.org/pdf/2406.06769 Code/Web: allenai.github.io/discoveryworld Ai2 Microsoft Research

Stanford NLP Group (@stanfordnlp)

For this week’s NLP Seminar, we are thrilled to host Harsh Trivedi to talk about AppWorld: Reliable Evaluation of Interactive Agents in a World of Apps and People! When: 10/10 Thurs 11am PT Non-Stanford affiliates registration form: forms.gle/UjWyX6dn7mQafj… (closed at 9am PT on

Archiki Prasad (@archikiprasad)

🚨Looking to self-align models on complex problem-solving tasks without gold answers or labels? Checkout my internship work on ✨Self-consistency Preference Optimization (ScPO)✨, where we use the self-consistency concept to help train models by iteratively training consistent

Google Gemini App (@geminiapp)

Introducing Deep Research, your personal agentic AI research assistant. Rolling out starting today in Gemini Advanced. With Deep Research, you can create in-depth research reports on complex topics, complete with source links, giving you hours of research at your fingertips in

Peter Jansen ( @peterjansen-ai.bsky.social ) (@peterjansen_ai)

AI & Scientific Discovery Workshop deadline extended, with a few more days to submit! Submit your archival & non-archival papers broadly in the AI & Scientific Discovery space, for a great opportunity to attend a workshop with like-minded folks. ai-and-scientific-discovery.github.io

Bill Yuchen Lin (@billyuchenlin)

If you're interested in LLMs like o1 and R1 for complex reasoning, check out this paper — we show that logical reasoning tasks are ideal for evaluating and understanding their scaling limits. 🦓 ZebraLogic-Bench is a dataset of 1K constraint satisfaction problems (CSPs)

Google Gemini App (@geminiapp)

We’re also rolling out a version of 2.0 Flash Thinking that can interact with apps like YouTube, Google Search and @GoogleMaps. These connected apps already make the Gemini app a uniquely helpful AI-powered assistant, and we’re exploring how new reasoning capabilities can