Ashutosh Baheti (@abaheti95) 's Twitter Profile
Ashutosh Baheti

@abaheti95

Research Scientist @DbrxMosaicAI

I'm interested in Large Language Models, Multimodal Models, Reinforcement Learning and making a JARVIS 🤖

ID: 3083528599

Link: https://abaheti95.github.io/
Joined: 15-03-2015 04:48:22

84 Tweets

344 Followers

409 Following

Alex Trott (@alexrtrott) 's Twitter Profile Photo

Ever wonder what it'd look like if an LLM Judge and a Reward Model had a baby? So did we, which is why we created PGRM -- the Prompt-Guided Reward Model.

TLDR: You get the instructability of an LLM judge + the calibration of an RM in a single speedy package (1/n)

Jonathan Frankle (@jefrankle) 's Twitter Profile Photo

Not that I have a favorite recent project, but... 🧵

LLM judges are the popular way to evaluate generative models. But they have drawbacks. They're:
* Generative, so slow and expensive.
* Nondeterministic.
* Uncalibrated. They don't know how uncertain they are.

Meet PGRM!

Andrew Drozdov (@mrdrozdov) 's Twitter Profile Photo

We built a thing! The Databricks Reranker is now in Public Preview. It's as easy as changing the arguments to your vector search call, and doesn't require any additional setup. Read more: databricks.com/blog/reranking…

Ali Ghodsi (@alighodsi) 's Twitter Profile Photo

Databricks just signed a Series K term sheet at >$100B valuation to scale two flagship products:
🔥 Lakebase — serverless Postgres with true compute/storage separation
🧠 Agent Bricks — agentic framework with built-in reasoning guardrails for enterprise data

Ivan Zhou (@ivanzhouyq) 's Twitter Profile Photo

Automated prompt optimization (GEPA) can push open-source models beyond frontier performance on enterprise tasks — at a fraction of the cost!

🔑 Key results from our Databricks Mosaic Research:
1⃣ gpt-oss-120b + GEPA beats Claude Opus 4.1 on Information Extraction (+2.2 points) —

Dipendra Misra (@dipendramisra) 's Twitter Profile Photo

Our paper showing that with a general-purpose RLVR recipe, we can get SOTA on the BIRD benchmark is out:

arxiv.org/pdf/2509.21459

Our hybrid approach performs offline RL and online RL to fine-tune a 32B model. This, along with self-consistency, was sufficient to get us to SOTA.

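Self-consistency here just means sampling several candidate answers and keeping the majority. A minimal sketch of that voting step (the function name and toy data below are illustrative, not from the paper):

```python
from collections import Counter

def self_consistency(samples):
    """Majority vote over stochastic model outputs.

    `samples` stands in for n generations from the fine-tuned model,
    e.g. n candidate SQL queries for one BIRD question. Returns the
    most frequent answer and the fraction of samples that agree.
    """
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)

# Toy example: 8 sampled queries, 6 of which agree.
candidates = ["SELECT a FROM t"] * 6 + ["SELECT b FROM t"] * 2
best, agreement = self_consistency(candidates)
# best == "SELECT a FROM t", agreement == 0.75
```

For text-to-SQL, votes are often taken over execution results rather than raw query strings, since syntactically different queries can be semantically equivalent.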
Ivan Zhou (@ivanzhouyq) 's Twitter Profile Photo

🚀 Our Databricks Mosaic Research team is looking for research interns for Summer 2026!

Our team explores exciting challenges at the intersection of AI and data, especially in how AI agents can help enterprises reason over knowledge and automate data workflows.

We work on

Databricks (@databricks) 's Twitter Profile Photo

80% of enterprise data is unstructured, locked in PDFs, reports, and diagrams that traditional tools can’t parse or govern. Introducing ai_parse_document, state-of-the-art document intelligence on Databricks. With a single SQL command, teams can now turn any document into

Matei Zaharia (@matei_zaharia) 's Twitter Profile Photo

Great new capability in Databricks powered by our AI research team! We trained a document parsing system that delivers leading quality at 3-5x lower cost and outperforms leading VLMs like GPT-5 and Claude. This is critical to connect AI to so many kinds of data.

Jonathan Frankle (@jefrankle) 's Twitter Profile Photo

I'm missing NeurIPS BUT my extraordinary Databricks colleagues will be there:
🧱 Erich Elsen (multimodal)
🧱 Ashutosh Baheti, Abhay Gupta, Jose Javier Gonzalez (RL at scale)
🧱 Jacob Portes (search)
🧱 Veronica Qing Lyu (feedback)
Hang out with them, and you won't miss me at all 🙂

Abhay Gupta (@gupta__abhay) 's Twitter Profile Photo

I’ll be at #NeurIPS2025 from 2nd-6th Dec. DM if you want to chat about MoEs, scaling training and inference, making GPUs go brrr and other fun stuff!!

Ashutosh Baheti (@abaheti95) 's Twitter Profile Photo

Will be at #NeurIPS2025 from 2nd to 6th Dec. Excited to chat about async RL, Environment Exploration, Agents/Tool use, User Simulator, Synthetic Data Generation or any other topic!! You can find me at the Databricks booth @ Tue 12 - 4pm

Databricks (@databricks) 's Twitter Profile Photo

Today we’re introducing OfficeQA, a new benchmark grounded in ~89,000 pages of U.S. Treasury Bulletins that reflects the complex, document-heavy tasks enterprises actually face.

Unlike existing benchmarks, OfficeQA measures economically valuable, real-world reasoning: parsing

Krista Opsahl-Ong (@kristahopsalong) 's Twitter Profile Photo

Today we’re releasing OfficeQA — a new benchmark for end-to-end grounded reasoning that reflects the real work enterprises need AI agents to do. More details below 👇

Ashutosh Baheti (@abaheti95) 's Twitter Profile Photo

Interested in training RL agents for long horizon enterprise tasks? 🤖👾 Come work with our phenomenal and cracked team! 🚀

Andrew Drozdov (@mrdrozdov) 's Twitter Profile Photo

Instructed Retriever is a multi-tiered declarative approach for building high quality search agents. It's an example of an "instructed system", which goes beyond prompt tuning and tool calling by passing data among modules which work together to fulfill an information need.

Matei Zaharia (@matei_zaharia) 's Twitter Profile Photo

Lakebase is GA! We think this is going to make it radically simpler and more reliable to work with online databases. You can instantly branch your DB, take snapshots, roll back to a point in time, or create a copy for offline analysis, whether it's humans or AI doing the work.

Matei Zaharia (@matei_zaharia) 's Twitter Profile Photo

Agent memory is a simple and powerful way to do continual learning! With the new MemAlign method from Databricks Research, we can build better LLM judges from examples of human ratings, and they scale with more data. Now in Databricks and MLflow. databricks.com/blog/memalign-…
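The tweet describes MemAlign only at a high level. As a hedged sketch of the general idea — a judge improved by retrieving the most similar human-rated examples from memory into its prompt — something like the following; the word-overlap retriever and all names here are illustrative stand-ins, not the actual MemAlign method:

```python
def build_judge_prompt(query, memory, k=2):
    """Sketch of memory-augmented judging: pull the k most similar
    human-rated examples into the judge prompt as few-shot guidance.

    `memory` is a list of (example_text, human_rating) pairs; the
    word-overlap similarity is a toy stand-in for a real retriever.
    """
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))

    nearest = sorted(memory, key=lambda m: -overlap(m[0], query))[:k]
    shots = "\n".join(f"Example: {t}\nHuman rating: {r}" for t, r in nearest)
    return f"{shots}\n\nNow rate this response:\n{query}"

memory = [
    ("Concise, correct answer with citations", 5),
    ("Rambling answer that ignores the question", 1),
    ("Partially correct answer, missing edge cases", 3),
]
prompt = build_judge_prompt("A concise correct answer", memory, k=2)
```

The point of "scaling with more data" is that appending new rated examples to `memory` improves the judge without any retraining.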

Ali Ghodsi (@alighodsi) 's Twitter Profile Photo

I now constantly get questions about the SaaS meltdown, the role of AI, systems of record, etc. I don't have answers to all of these. But I do know that we saw an acceleration in our business in Q2, Q3, and now finished the year with an accelerating Q4. The question is, why? Short

Wen Sun (@wensun1) 's Twitter Profile Photo

Many recent works try to force GRPO to be on-policy by adding things like extra importance weighting, clipping, masking, data deletion, inference engine edits, router replay… But are these actually needed? We push the other direction: make it maximally off-policy and keep it
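For context, the "extra importance weighting, clipping" being debated is the standard PPO-style off-policy correction applied per token. A minimal generic sketch (not any specific paper's variant):

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped policy-gradient surrogate for one token.

    r = pi_new / pi_old is the importance ratio correcting for the
    sample coming from a stale (off-)policy; clipping r to
    [1 - eps, 1 + eps] limits how far one update can move the policy.
    Returns the objective to maximize (negate for a loss).
    """
    r = math.exp(logp_new - logp_old)
    unclipped = r * advantage
    clipped = max(min(r, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)

# On-policy (r = 1): the surrogate is just the advantage.
on_policy = clipped_surrogate(0.0, 0.0, 1.0)        # 1.0
# Off-policy (r = 2): clipping caps the update at (1 + eps) * advantage.
off_policy = clipped_surrogate(math.log(2), 0.0, 1.0)  # 1.2
```

Dropping such corrections entirely, as this thread proposes, amounts to trusting that the group-relative advantages in GRPO stay informative even when the sampling policy is far from the current one.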