Morteza Heidari (@mortezaheidarii)'s Twitter Profile
Morteza Heidari

@mortezaheidarii

Staff machine learning engineer @Intel. Ex-@Philips.
Ph.D. @UofOklahoma.
Passionate about #AI #ML #DL #NLP #LLM

ID: 1492509661

Website: https://morteza89.github.io/ · Joined: 08-06-2013 09:54:28

90 Tweets

104 Followers

337 Following

Jared Roesch (@roeschinc):

Thrilled to announce we're open-sourcing the CUDA Tile dialect and bytecode! github.com/NVIDIA/cuda-ti…

What's included:
• CUDA Tile MLIR dialect
• Bytecode serialization/deserialization support
• MLIR Python bindings for programmatic IR construction
•
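
A minimal sketch of what "programmatic IR construction" through the Python bindings looks like, assuming the upstream MLIR Python bindings are installed; the CUDA Tile dialect would need to be registered with the context, and no actual cuda_tile ops are shown:

```python
# Generic MLIR Python-bindings usage (upstream mlir package). The
# cuda_tile dialect from the repo above would have to be registered
# with the Context before its ops could be built; "module {}" below is
# plain builtin IR, shown only to illustrate the workflow.
from mlir.ir import Context, Location, Module

with Context() as ctx, Location.unknown():
    module = Module.parse("module {}")
    print(module)  # round-trips the parsed IR back to text
```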

Rohan Paul (@rohanpaul_ai):

Fascinating Google paper: just repeating your prompt 2 times can seriously boost LLM performance, sometimes pushing accuracy from 21% to 97% on certain search tasks.

An LLM reads your prompt left to right, so early words get processed before the model has seen the later words
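
The trick is easy to reproduce. A minimal sketch, where call_llm is a hypothetical stand-in for whatever client you use:

```python
# Prompt repetition as described above: send the same prompt twice in
# one request, so every token in the second copy can attend to a full
# copy of the prompt that appears earlier in the context.
def repeat_prompt(prompt: str, times: int = 2) -> str:
    return "\n\n".join([prompt] * times)

def answer(call_llm, prompt: str) -> str:
    # call_llm is a hypothetical stand-in for your LLM client call.
    return call_llm(repeat_prompt(prompt))
```
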
Tips Excel (@gudanglifehack):

🚨 Anthropic dropped a FREE 33-page playbook revealing Claude's very own cheat code:

The 'Skills' folder.

Spend 30 minutes building it, and you’ll never have to explain your process again.

Top-tier users don't just type commands, they build systems.

Grab your free copy of
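
For the curious, a rough sketch of scaffolding such a folder in Python. The SKILL.md layout (YAML frontmatter with a name and description) follows my reading of Anthropic's Skills docs; verify field names against the playbook itself:

```python
# Hypothetical skill scaffold: one directory per skill containing a
# SKILL.md whose frontmatter names and describes the skill. The skill
# name and instructions below are made-up examples.
from pathlib import Path

SKILL_MD = """\
---
name: weekly-report
description: Formats raw notes into my standard weekly status report.
---

## Instructions
1. Group notes by project.
2. Summarize each project in at most 3 bullets.
3. End with a "Blockers" section.
"""

skill_dir = Path("skills") / "weekly-report"
skill_dir.mkdir(parents=True, exist_ok=True)
(skill_dir / "SKILL.md").write_text(SKILL_MD)
```
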
Clutch God (@xsports_1):

Anthropic CEO: “50% of all entry-level Lawyers, Consultants, and Finance Professionals will be completely wiped out within the next 1–5 years.” Grad students and junior hires are cooked.

Google Research (@googleresearch):

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
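
For intuition, an illustrative sketch of KV-cache quantization in PyTorch. This is not the TurboQuant algorithm (its details are in the linked blog), just the basic idea of storing keys and values at low precision with per-row scales:

```python
# Symmetric per-row int4 quantization of a KV tensor. Storing 4-bit
# values instead of fp16 cuts cache memory roughly 4x (8x vs fp32);
# TurboQuant reports >=6x with additional techniques on top.
import torch

def quantize_kv(kv: torch.Tensor):
    # kv: [num_tokens, head_dim]; one scale per row (token).
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(kv / scale), -8, 7).to(torch.int8)
    return q, scale  # packing two int4 values per byte halves this again

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(scale.dtype) * scale

kv = torch.randn(16, 128, dtype=torch.float16)
q, s = quantize_kv(kv)
print((dequantize_kv(q, s) - kv).abs().max())  # small quantization error
```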

Haocheng Xi (@haochengxiucb):

Really exciting to see KV-cache compression getting attention.

A similar bottleneck shows up beyond LLMs: for world models and autoregressive long-video generation, KV cache can quickly dominate memory and limit long-horizon consistency.

Our recent work, Quant VideoGen,
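
A back-of-envelope calculation makes the bottleneck concrete. All numbers below are illustrative assumptions, not Quant VideoGen's actual configuration:

```python
# KV-cache size for autoregressive video generation: every generated
# frame's tokens stay resident as keys and values in every layer.
def kv_cache_bytes(frames, tokens_per_frame, layers, heads, head_dim,
                   bytes_per_elem=2):  # fp16
    tokens = frames * tokens_per_frame
    return 2 * tokens * layers * heads * head_dim * bytes_per_elem  # 2 = K and V

# Hypothetical long-horizon run: 600 frames, 1560 tokens/frame,
# 28 layers, 16 heads, head_dim 128:
gb = kv_cache_bytes(600, 1560, 28, 16, 128) / 1e9
print(f"{gb:.1f} GB")  # ~214.7 GB, which is why the cache dominates memory
```
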
Vivek Galatage (@vivekgalatage):

Inside NVIDIA GPUs: Anatomy of high-performance matmul kernels

One of the finest and most in-depth posts that everyone MUST read. Amazing work by Aleksa Gordić (水平问题)!!

aleksagordic.com/blog/matmul
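
The core idea of the post, loop tiling for cache reuse, can be sketched in a few lines of NumPy. This only demonstrates the blocking structure, not GPU performance:

```python
# Tiled matmul: compute C = A @ B block by block so each tile of A and
# B is reused many times while it is resident in fast memory. On a GPU
# the same structure maps to shared-memory tiles per thread block.
import numpy as np

def tiled_matmul(A, B, tile=64):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Accumulate one tile-sized partial product.
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```
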
elvis (@omarsar0):

NEW research from NVIDIA.

Post-training agents with RL is powerful but expensive.

Every parameter update needs full multi-turn rollouts with environment interactions, making end-to-end RL prohibitively costly for long-horizon agentic tasks.

This research offers a practical
Ben Sigman (@bensig):

A 30-second explanation of the MemPalace by Milla Jovovich. By day she’s filming action movies, walking Miu Miu fashion shows, and being a mom. By night she’s coding. She’s the most creative, brilliant, and hilarious person I know. I’m honored to be working with her on this

PyTorch (@pytorch):

Improve latency up to 1.68x with NVFP4 and MXFP8 using Diffusers and TorchAO on Blackwell across a suite of different models 🔥. 

Squeeze out maximum performance with recipes involving selective quantization and regional compilation.

🔗 Read our latest blog from Vasiliy Kuznetsov (Meta)
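
For orientation, the general shape of a TorchAO post-training quantization pass. The blog's NVFP4/MXFP8 recipes use their own configs on Blackwell hardware; int8 weight-only below is just a widely available stand-in to show the quantize_ pattern:

```python
# Selective quantization with torchao: filter_fn restricts which
# modules convert, mirroring the "selective quantization" recipe.
# int8_weight_only is a stand-in; the blog's NVFP4/MXFP8 configs differ.
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).eval()

quantize_(model, int8_weight_only(),
          filter_fn=lambda m, name: isinstance(m, torch.nn.Linear))

# "Regional compilation": compile a submodule rather than the whole model.
model[0] = torch.compile(model[0])
```
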
NVIDIA (@nvidia):

Open-source software never stops. It only accelerates. Dynamo, SGLang, TensorRT LLM, and vLLM are constantly optimized by a vast ecosystem of developers building on top of the NVIDIA platform. The result: your token output keeps improving and token cost keeps

Piotr Nawrot (@p_nawrot):

⚡🔧 PyTorch inference optimization just got a lot simpler

Introducing AITune — NVIDIA's new library that automatically finds the fastest inference backend for any PyTorch model. It covers TensorRT, Torch Inductor, TorchAO and more, benchmarks all of them on your model and
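
Conceptually, such a tuner times the model under each candidate backend and keeps the fastest. The sketch below is not AITune's API, just the benchmarking pattern it automates, with eager vs. torch.compile as two example backends:

```python
# Measure average latency per backend and pick the fastest.
import time
import torch

def bench(fn, x, iters=50):
    for _ in range(5):  # warmup (also triggers compilation)
        fn(x)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - t0) / iters

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()).eval()
x = torch.randn(64, 512)
backends = {"eager": model, "inductor": torch.compile(model)}
with torch.no_grad():
    best = min(backends, key=lambda name: bench(backends[name], x))
print("fastest backend:", best)
```
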
Morteza Heidari (@mortezaheidarii):

This is a masterclass in optimization. A deep dive into how cache locality, tiling, SIMD/AVX2, and memory management can turn a "simple" matrix transpose into a 71× speedup. Essential reading for anyone interested in high-performance computing.
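
A toy version of the cache-blocking idea behind that speedup: walk the matrix in small square tiles so reads and writes both stay cache-friendly. NumPy only illustrates the access pattern; the real gains come from C and SIMD as in the post:

```python
# Blocked transpose: a naive transpose strides through memory on either
# the read or the write side, thrashing the cache; transposing one
# small tile at a time keeps both sides within cached lines.
import numpy as np

def blocked_transpose(A, tile=64):
    n, m = A.shape
    out = np.empty((m, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            out[j:j+tile, i:i+tile] = A[i:i+tile, j:j+tile].T
    return out

A = np.random.rand(1024, 1024)
assert np.array_equal(blocked_transpose(A), A.T)
```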