Abhinav Maurya (@ahmaurya) 's Twitter Profile
Abhinav Maurya

@ahmaurya

AI builder. Scientist (Google, Amazon), engineer (Microsoft), ML research (IIT Bombay, CMU). Chasing epistemological thrills. Making interesting mistakes.

ID: 17162491

Link: http://ahmaurya.github.io · Joined: 04-11-2008 17:46:27

7.7K Tweets

215 Followers

20 Following

Hamel Husain (@hamelhusain) 's Twitter Profile Photo

TOC for the open book "Beyond Naive RAG: Practical Advanced Methods" from our RAG series.

This condenses 5 hours of instruction into something you can read in ~30 minutes.

Link: maven.com/p/945082/beyon…

Ben Clavié, Nandan Thakur, Orion Weller, Antoine Chaffin, Bryan Bischof fka Dr. Donut
Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

This survey maps the LLM benchmark landscape and shows where current tests fall short.

283 benchmarks across 3 categories: general, domain-specific, and target-specific.

The paper flags 3 traps that skew results.

Training data leaks inflate scores.

Culture and language bias tilt
Georgi Gerganov (@ggerganov) 's Twitter Profile Photo

To run gpt-oss-20b on a 16GB Mac, use these commands:

brew install llama.cpp
llama-server -hf ggml-org/gpt-oss-20b-GGUF --n-cpu-moe 12 -fa -c 32768 --jinja --no-mmap

Then open the browser at http://127.0.0.1:8080
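Besides the browser UI, llama-server also exposes an OpenAI-compatible chat completions endpoint. A minimal sketch of a request body you could POST to it (the model name and prompt here are placeholders, not values from the tweet):

```python
import json

# Sketch of a request body for llama-server's OpenAI-compatible
# chat endpoint at http://127.0.0.1:8080/v1/chat/completions.
# Model name and prompt are placeholders for illustration.
payload = {
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
body = json.dumps(payload)
print(body)
# POST `body` with Content-Type: application/json, e.g. via curl or requests.
```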

Ahmad (@theahmadosman) 's Twitter Profile Photo

GLM 4.5 with Claude Code is the closest thing to Opus 4 imo.

For agentic coding tools: GLM 4.5 > Kimi K2 > Qwen 3 235B non-thinking > Qwen 3 Coder 480B

Raphaël Dabadie🇫🇷 (@raphaeldabadie) 's Twitter Profile Photo

🐺 Introducing the Werewolf Benchmark, an AI test for social reasoning under pressure.

Can models lead, bluff, and resist manipulation in live, adversarial play?

👉 We made 7 of the strongest LLMs, both open-source and closed-source, play 210 full games of Werewolf. 

Below is
Ahmad (@theahmadosman) 's Twitter Profile Photo

Comparing & Contrasting Recent LLMs Architecture

> DeepSeek-V3/R1
> OLMo 2
> Gemma 3
> Mistral Small 3.1
> Llama 4
> Qwen3 (dense+MoE)
> SmolLM3
> Kimi 2
> GPT-OSS

Are 2025 LLMs really that different from each other?

MoE, MLA, GQA, sliding window, normalization games & more.
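Of the techniques named, GQA (grouped-query attention) is simple to sketch: several query heads share a single key/value head, shrinking the KV cache by the group factor. A minimal NumPy illustration — head counts, sequence length, and dimensions here are made up for the example, not taken from any of the models above:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: many query heads share fewer K/V heads.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d), n_kv_heads divides n_q_heads.
    """
    group = q.shape[0] // k.shape[0]
    # Repeat each K/V head so every query head has a K/V pair to attend with.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq, d = 8, 16
q = rng.standard_normal((8, seq, d))  # 8 query heads
k = rng.standard_normal((2, seq, d))  # only 2 K/V heads -> 4x smaller KV cache
v = rng.standard_normal((2, seq, d))
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 8, 16)
```

With 8 query heads and 2 K/V heads, the KV cache holds a quarter of what full multi-head attention would, which is the point of the technique.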
Ivan Fioravanti ᯅ (@ivanfioravanti) 's Twitter Profile Photo

gpt-oss-120b-mxfp4-Q8 MMLU Pro is here!
After ~33 hours on M3 Ultra 512GB 🤯

Overall: 68.1% 

All default settings in this first run, so reasoning level should be low. I will now benchmark gpt-oss-120b-mxfp4-bf16 (Attention layers not quantized) and medium reasoning for Q8.
Simone Scardapane (@s_scardapane) 's Twitter Profile Photo

*Alice's book got a (minor) upgrade!*

Thanks to the dozens of people who gave feedback, now with 1000% less typos and errors, a novel set of Colab lab sessions, and a brand-new CC-BY-SA license. 🙃

sscardapane.it/alice-book/
SemiAnalysis (@semianalysis_) 's Twitter Profile Photo

With each successive generation of Tensor Cores, NVIDIA continues to add lower-precision data types, from 16-bit down to 4-bit. This is because deep learning workloads are extremely tolerant of low precision. This is especially true for inference, where even

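That tolerance is easy to see in a toy experiment: quantizing random Gaussian weights to 4-bit integers with a single symmetric scale loses surprisingly little. A sketch under simplifying assumptions (one per-tensor scale; real 4-bit kernels like mxfp4 use per-group scales and different encodings):

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-tensor 4-bit quantization: integer levels in [-7, 7]."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize_int4(w)
w_hat = q.astype(np.float32) * scale  # dequantize back to float
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative error: {rel_err:.3f}")
```

Despite storing only 16 distinct levels per weight, the reconstruction error stays modest, which is why inference in particular can shed so many bits.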
Zhihu Frontier (@zhihufrontier) 's Twitter Profile Photo

🚀 Introducing slime v0.1.0 — An open-source RL infra powering models like GLM-4.5, built by THUDM & Zhipu AI.

Z.ai RL infra engineer Zhu Xiaolin shared a deep dive on Zhihu into how they redefined high-performance RL infra👇

🛠️ What's new in v0.1.0?
• High-performance inference for
Kyle Corbitt (@corbtt) 's Twitter Profile Photo

🚨 We’ve just published a recipe to train a frontier-level deep research agent using RL.

With just 30 hours on an H200, any developer can now beat Sonnet-4 on DeepResearch Bench using open-source tools.

(Thread 🧵)
Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

When an article about hpc/optimization is called "anatomy of" then you know the authors know their shit.

For those who don't know, this is basically HPC's "attention is all you need" but less overcooked:
The Metropolitan Museum of Art (@metmuseum) 's Twitter Profile Photo

On September 7, 1917, American modernist Jacob Lawrence was born. Lawrence is celebrated for his visual interpretations of Black life and history, which he often explored in series.

Today, Lawrence is considered one of the leading innovators of modernism in the United States,
Tim Dettmers (@tim_dettmers) 's Twitter Profile Photo

It feels like the coding-agent frontier is now open-weights:

GLM 4.5 costs only $3/month and is on par with Sonnet.
Kimi K2.1 Turbo is 3x the speed and 7x cheaper vs Opus 4.1, but just as good.
Kimi K2.1 feels clean. The best model for me.
GPT-5 is only good for complicated specs -- too slow.

lmarena.ai (formerly lmsys.org) (@lmarena_ai) 's Twitter Profile Photo

🚨 Leaderboard Disrupted!
Two new models have entered the Top 10 Text leaderboard:

🔸#6 Qwen3-max-preview (Proprietary) by Qwen
🔸#8 Kimi-K2-0905-preview (Modified MIT) by Kimi.ai, tied with 7 others.

Note that this puts Kimi-K2-0905-preview in a tight race for
bolt.new (@boltdotnew) 's Twitter Profile Photo

Claude Code & OpenAI Codex are coming to Bolt. Build enterprise-grade products visually, right in your browser. No setup. No CLI tools. No 💔 error loops. Which agent are you most excited for?

Greg Brockman (@gdb) 's Twitter Profile Photo

We've released a large-scale study on how people are using ChatGPT.

Consumer adoption has broadened beyond early-user groups, and lots of economic value is being created through both personal and professional use:

openai.com/index/how-peop…
Ethan Mollick (@emollick) 's Twitter Profile Photo

Some useful findings:
1) Working with AI boosts the performance of people solving math, science & ethics questions
2) The biggest boost is for the hardest problems
3) High performers remain highest performing, but low performers gain more
4) People who are good with AI gain most