Morteza Heidari (@mortezaheidarii)'s Twitter Profile
Morteza Heidari

@mortezaheidarii

Staff machine learning engineer @Intel. Ex-@Philips.
Ph.D. @UofOklahoma.
Passionate about #AI #ML #DL #NLP #LLM

ID: 1492509661

Website: https://morteza89.github.io/ · Joined: 08-06-2013 09:54:28

90 Tweets

104 Followers

337 Following

Jared Roesch (@roeschinc):

Thrilled to announce we're open-sourcing the CUDA Tile dialect and bytecode! github.com/NVIDIA/cuda-ti…

What's included:
• CUDA Tile MLIR dialect
• Bytecode serialization/deserialization support
• MLIR Python bindings for programmatic IR construction
•
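
A minimal sketch of what "programmatic IR construction" through the Python bindings looks like, assuming the upstream MLIR Python bindings are installed; the CUDA Tile dialect would need to be registered with the context, and no actual cuda_tile ops are shown:

```python
# Generic MLIR Python-bindings usage (upstream mlir package). The
# cuda_tile dialect from the repo above would have to be registered
# with the Context before its ops could be built; "module {}" below is
# plain builtin IR, shown only to illustrate the workflow.
from mlir.ir import Context, Location, Module

with Context() as ctx, Location.unknown():
    module = Module.parse("module {}")
    print(module)  # round-trips the parsed IR back to text
```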

Rohan Paul (@rohanpaul_ai):

Fascinating Google paper: just repeating your prompt 2 times can seriously boost LLM performance, sometimes pushing accuracy from 21% to 97% on certain search tasks.

An LLM reads your prompt left to right, so early words get processed before the model has seen the later words
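
The trick is easy to reproduce. A minimal sketch, where call_llm is a hypothetical stand-in for whatever client you use:

```python
# Prompt repetition as described above: send the same prompt twice in
# one request, so every token in the second copy can attend to a full
# copy of the prompt that appears earlier in the context.
def repeat_prompt(prompt: str, times: int = 2) -> str:
    return "\n\n".join([prompt] * times)

def answer(call_llm, prompt: str) -> str:
    # call_llm is a hypothetical stand-in for your LLM client call.
    return call_llm(repeat_prompt(prompt))
```
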
Tips Excel (@gudanglifehack):

🚨 Anthropic dropped a FREE 33-page playbook revealing Claude's very own cheat code:

The 'Skills' folder.

Spend 30 minutes building it, and you’ll never have to explain your process again.

Top-tier users don't just type commands, they build systems.

Grab your free copy of
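
For the curious, a rough sketch of scaffolding such a folder in Python. The SKILL.md layout (YAML frontmatter with a name and description) follows my reading of Anthropic's Skills docs; verify field names against the playbook itself:

```python
# Hypothetical skill scaffold: one directory per skill containing a
# SKILL.md whose frontmatter names and describes the skill. The skill
# name and instructions below are made-up examples.
from pathlib import Path

SKILL_MD = """\
---
name: weekly-report
description: Formats raw notes into my standard weekly status report.
---

## Instructions
1. Group notes by project.
2. Summarize each project in at most 3 bullets.
3. End with a "Blockers" section.
"""

skill_dir = Path("skills") / "weekly-report"
skill_dir.mkdir(parents=True, exist_ok=True)
(skill_dir / "SKILL.md").write_text(SKILL_MD)
```
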
Clutch God (@xsports_1):

Anthropic CEO: “50% of all entry-level Lawyers, Consultants, and Finance Professionals will be completely wiped out within the next 1–5 years.” Grad students and junior hires are cooked.

Google Research (@googleresearch):

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
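
For intuition, an illustrative sketch of KV-cache quantization in PyTorch. This is not the TurboQuant algorithm (its details are in the linked blog), just the basic idea of storing keys and values at low precision with per-row scales:

```python
# Symmetric per-row int4 quantization of a KV tensor. Storing 4-bit
# values instead of fp16 cuts cache memory roughly 4x (8x vs fp32);
# TurboQuant reports >=6x with additional techniques on top.
import torch

def quantize_kv(kv: torch.Tensor):
    # kv: [num_tokens, head_dim]; one scale per row (token).
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(kv / scale), -8, 7).to(torch.int8)
    return q, scale  # packing two int4 values per byte halves this again

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(scale.dtype) * scale

kv = torch.randn(16, 128, dtype=torch.float16)
q, s = quantize_kv(kv)
print((dequantize_kv(q, s) - kv).abs().max())  # small quantization error
```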

Haocheng Xi (@haochengxiucb):

Really exciting to see KV-cache compression getting attention.

A similar bottleneck shows up beyond LLMs: for world models and autoregressive long-video generation, KV cache can quickly dominate memory and limit long-horizon consistency.

Our recent work, Quant VideoGen,
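
A back-of-envelope calculation makes the bottleneck concrete. All numbers below are illustrative assumptions, not Quant VideoGen's actual configuration:

```python
# KV-cache size for autoregressive video generation: every generated
# frame's tokens stay resident as keys and values in every layer.
def kv_cache_bytes(frames, tokens_per_frame, layers, heads, head_dim,
                   bytes_per_elem=2):  # fp16
    tokens = frames * tokens_per_frame
    return 2 * tokens * layers * heads * head_dim * bytes_per_elem  # 2 = K and V

# Hypothetical long-horizon run: 600 frames, 1560 tokens/frame,
# 28 layers, 16 heads, head_dim 128:
gb = kv_cache_bytes(600, 1560, 28, 16, 128) / 1e9
print(f"{gb:.1f} GB")  # ~214.7 GB, which is why the cache dominates memory
```
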
Vivek Galatage (@vivekgalatage):

Inside NVIDIA GPUs: Anatomy of high-performance matmul kernels

One of the finest and most in-depth posts that everyone MUST read. Amazing work by Aleksa Gordić (水平问题)!!

aleksagordic.com/blog/matmul
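
The core idea of the post, loop tiling for cache reuse, can be sketched in a few lines of NumPy. This only demonstrates the blocking structure, not GPU performance:

```python
# Tiled matmul: compute C = A @ B block by block so each tile of A and
# B is reused many times while it is resident in fast memory. On a GPU
# the same structure maps to shared-memory tiles per thread block.
import numpy as np

def tiled_matmul(A, B, tile=64):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Accumulate one tile-sized partial product.
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```
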
elvis (@omarsar0):

NEW research from NVIDIA.

Post-training agents with RL is powerful but expensive.

Every parameter update needs full multi-turn rollouts with environment interactions, making end-to-end RL prohibitively costly for long-horizon agentic tasks.

This research offers a practical
Ben Sigman (@bensig):

A 30-second explanation of the MemPalace by Milla Jovovich. By day she’s filming action movies, walking Miu Miu fashion shows, and being a mom. By night she’s coding. She’s the most creative, brilliant, and hilarious person I know. I’m honored to be working with her on this

PyTorch (@pytorch):

Improve latency up to 1.68x with NVFP4 and MXFP8 using Diffusers and TorchAO on Blackwell across a suite of different models 🔥. 

Squeeze out maximum performance with recipes involving selective quantization and regional compilation.

🔗 Read our latest blog from Vasiliy Kuznetsov (Meta)
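
For orientation, the general shape of a TorchAO post-training quantization pass. The blog's NVFP4/MXFP8 recipes use their own configs on Blackwell hardware; int8 weight-only below is just a widely available stand-in to show the quantize_ pattern:

```python
# Selective quantization with torchao: filter_fn restricts which
# modules convert, mirroring the "selective quantization" recipe.
# int8_weight_only is a stand-in; the blog's NVFP4/MXFP8 configs differ.
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).eval()

quantize_(model, int8_weight_only(),
          filter_fn=lambda m, name: isinstance(m, torch.nn.Linear))

# "Regional compilation": compile a submodule rather than the whole model.
model[0] = torch.compile(model[0])
```
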
NVIDIA (@nvidia):

Open-source software never stops. It only accelerates. Dynamo, SGLang, TensorRT LLM, and vLLM are constantly optimized by a vast ecosystem of developers building on top of the NVIDIA platform. The result: your token output keeps improving and token cost keeps

Piotr Nawrot (@p_nawrot):

⚡🔧 PyTorch inference optimization just got a lot simpler

Introducing AITune — NVIDIA's new library that automatically finds the fastest inference backend for any PyTorch model. It covers TensorRT, Torch Inductor, TorchAO and more, benchmarks all of them on your model and
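
Conceptually, such a tuner times the model under each candidate backend and keeps the fastest. The sketch below is not AITune's API, just the benchmarking pattern it automates, with eager vs. torch.compile as two example backends:

```python
# Measure average latency per backend and pick the fastest.
import time
import torch

def bench(fn, x, iters=50):
    for _ in range(5):  # warmup (also triggers compilation)
        fn(x)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - t0) / iters

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()).eval()
x = torch.randn(64, 512)
backends = {"eager": model, "inductor": torch.compile(model)}
with torch.no_grad():
    best = min(backends, key=lambda name: bench(backends[name], x))
print("fastest backend:", best)
```
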
Morteza Heidari (@mortezaheidarii):

This is a masterclass in optimization. A deep dive into how cache locality, tiling, SIMD/AVX2, and memory management can turn a "simple" matrix transpose into a 71× speedup. Essential reading for anyone interested in high-performance computing.
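
A toy version of the cache-blocking idea behind that speedup: walk the matrix in small square tiles so reads and writes both stay cache-friendly. NumPy only illustrates the access pattern; the real gains come from C and SIMD as in the post:

```python
# Blocked transpose: a naive transpose strides through memory on either
# the read or the write side, thrashing the cache; transposing one
# small tile at a time keeps both sides within cached lines.
import numpy as np

def blocked_transpose(A, tile=64):
    n, m = A.shape
    out = np.empty((m, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            out[j:j+tile, i:i+tile] = A[i:i+tile, j:j+tile].T
    return out

A = np.random.rand(1024, 1024)
assert np.array_equal(blocked_transpose(A), A.T)
```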