RELAI (@reliableai)'s Twitter Profile
RELAI

@reliableai

Making AI reliability accessible and achievable for everyone.

ID: 1459987295289937927

Link: https://relai.ai/

Joined: 14-11-2021 20:51:33

12 Tweets

150 Followers

0 Following

RELAI (@reliableai):

We evaluated several hallucination detection methods on OpenAI's recently released SimpleQA benchmark. RELAI agents detected over 76% of GPT-4o's hallucinations with just a 5% false positive rate. Even more impressively, RELAI detected nearly 1/3 of GPT-4o's hallucinations with

Soheil Feizi (@feizisoheil):

Honored to have contributed to the House Bipartisan Task Force on Artificial Intelligence's report. Learn more about it here: science.house.gov/press-releases… Access the full report: speaker.gov/wp-content/upl… I hope this effort contributes to advancing AI that is reliable, safe, and

RELAI (@reliableai):

Our leaderboard is now live! Check it out at: relai.ai Let us know if you want to add your model or data to our leaderboard: calendly.com/d/crx2-k7b-pcm

RELAI (@reliableai):

Our legal reasoning benchmark is live! Check it out! Let us know if you want to add your model or data to our leaderboard: calendly.com/d/crx2-k7b-pcm

Soheil Feizi (@feizisoheil):

Looking forward to the Agentic AI Summit in Berkeley! If you’re working on agentic systems and interested in deep AI agent optimization, let’s chat!

Soheil Feizi (@feizisoheil):

How good (or bad) is GPT-5 — and does it matter for you? I’ve been seeing a lot of posts lately debating the quality of GPT-5’s responses. I tried a few of the examples people mentioned. Here’s one from my own experiment (screenshot attached): I asked GPT-5 to solve a simple

Soheil Feizi (@feizisoheil):

Introducing Maestro: the holistic optimizer for AI agents. Maestro optimizes the agent graph and tunes prompts/models/tools, fixing agent failure modes that prompt-only or RL weight tuning can’t touch. Maestro outperforms leading prompt optimizers (e.g., MIPROv2, GEPA) on

RELAI (@reliableai):

Prompt Tuning ≠ System Tuning. Most AI agent failures are structural; we keep the agent graph frozen (modules & info flow), then wonder why agents hallucinate, misroute tools, or break guidelines. Meet Maestro: the first joint graph + config optimizer for AI agents. It

Soheil Feizi (@feizisoheil):

Let’s talk instruction-following: In prod, “did it follow the spec?” matters more than vibes. IFBench is a challenging benchmark to check whether agents/models obey unseen output/format constraints (length windows, HTML/Markdown rules, sectioning, etc.). That’s a real

RELAI (@reliableai):

🎃 Here’s a sweet Halloween treat from RELAI: We built an AI agent that maps the best trick-or-treat route for you—optimized for time, distance, candy variety, and real walking paths. 👉 Try it free: platform.relai.ai/halloween Built at RELAI.ai, where we ship

Soheil Feizi (@feizisoheil):

🚀 Build AI agents that actually work — in just 2 hours! We’re launching Reliable AI Agent Sprints—free, fully virtual sessions to build practical, reliable agentic solutions. This isn’t a flashy demo contest; we’ll design, simulate, evaluate, and optimize real agents,

RELAI (@reliableai):

One notebook to: "build -> simulate -> evaluate -> optimize" your agentic RAG! 🔗 Notebook: colab.research.google.com/drive/1N9l0PhO…