Hailey Collet (@haileystormc)'s Twitter Profile
Hailey Collet

@haileystormc

Mother. Ex Controls Engineer. Software dev. AI enthusiast & tinkerer.

ID: 1044998777017266176

Joined: 26-09-2018 17:14:58

2.2K Tweets

304 Followers

58 Following

Cassidy Laidlaw (@cassidy_laidlaw)'s Twitter Profile Photo

We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵

Taelin (@victortaelin)'s Twitter Profile Photo

QuickSort has been synthesized!

I've finally added varlen uints to NeoGen, and, for the first time ever, it has invented, on its own, an O(n*log(n)) sorting algorithm!

NeoGen isn't an AI. There is no model, no pre-training, no previous knowledge of sorting. In fact, the system
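
NeoGen's synthesized program isn't shown in the tweet; for reference, a textbook quicksort of the kind described (average-case O(n*log(n))) looks like this:

```python
# Reference implementation of the target algorithm, not NeoGen's output.
def quicksort(xs: list[int]) -> list[int]:
    if len(xs) <= 1:
        return xs
    pivot = xs[len(xs) // 2]
    return (quicksort([x for x in xs if x < pivot])
            + [x for x in xs if x == pivot]
            + quicksort([x for x in xs if x > pivot]))
```
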
Fern (@hi_tysam)'s Twitter Profile Photo

I've successfully integrated DiLoCo w/ modded-nanoGPT, and made a few changes that appear to decrease error over the baseline by up to ~8-9%.

Experiment notes & future directions for experimentation below!
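
For context on what DiLoCo itself does (this is background, not the modded-nanoGPT integration code): each worker takes a burst of local optimizer steps, then an outer optimizer treats "old global weights minus the averaged worker weights" as a pseudo-gradient. A minimal single-process sketch, with hypothetical helpers `make_inner_opt`, `data_iters`, and `loss_fn`:

```python
import torch

def diloco_round(global_model, workers, make_inner_opt, outer_opt,
                 local_steps, data_iters, loss_fn):
    # Broadcast the current global weights to every worker replica.
    for w in workers:
        w.load_state_dict(global_model.state_dict())

    # Inner phase: each worker optimizes locally for `local_steps` steps.
    for w, data in zip(workers, data_iters):
        inner_opt = make_inner_opt(w)  # e.g. AdamW in the DiLoCo paper
        for _ in range(local_steps):
            x, y = next(data)
            loss = loss_fn(w(x), y)
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()

    # Outer phase: pseudo-gradient = old global params - mean of worker params,
    # applied by the outer optimizer.
    with torch.no_grad():
        for name, p in global_model.named_parameters():
            avg = torch.stack([dict(w.named_parameters())[name] for w in workers]).mean(0)
            p.grad = p.detach() - avg
    outer_opt.step()
    outer_opt.zero_grad()
```

Here `outer_opt` would be something like `torch.optim.SGD(global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True)`, the Nesterov-momentum outer optimizer used in the DiLoCo paper.
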
zed (@zmkzmkz)'s Twitter Profile Photo

EARLY PREPRINT:
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

Why do we use softmax in attention, even though we don’t really need non-zero probabilities that sum to one, causing attention sink and large hidden state activations?

Let that sink in.
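
The preprint's exact formula isn't quoted in the thread; as a rough sketch of what a rectified softmax can look like (my reading of the idea, not necessarily the paper's formulation), the numerator is rectified so a score can receive exactly zero weight, and the weights are no longer forced to sum to one:

```python
import torch

def rectified_softmax(scores: torch.Tensor, dim: int = -1, eps: float = 1e-6) -> torch.Tensor:
    # relu(exp(x) - 1) lets a score map to exactly zero attention weight;
    # dividing by the sum of absolute values drops the sum-to-one constraint.
    # A real kernel would also need a numerically stable formulation.
    e = torch.exp(scores) - 1.0
    return torch.relu(e) / (e.abs().sum(dim=dim, keepdim=True) + eps)
```
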
Hailey Collet (@haileystormc)'s Twitter Profile Photo

I've played with Claude 4.0 (mostly Sonnet, afaik) for a few weeks. Code only. It's definitely better with few-turn, long-sequence work, and it's a fair bit better with PyTorch. 3.7 rarely if ever outperforms it. It suffers many of the same flaws. Code-wise it's worthy of the 4.0 moniker, but def not 🤯

Sukjun (June) Hwang (@sukjun_hwang)'s Twitter Profile Photo

Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
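
The tweet doesn't spell out the mechanism; as a toy illustration of the general idea of dynamic chunking (not H-Net's actual design): score a boundary probability at every byte position, then pool the byte-level states between predicted boundaries into chunk vectors for the rest of the model to operate on.

```python
import torch
import torch.nn as nn

class ToyChunker(nn.Module):
    """Toy dynamic chunking: illustrative only, not the H-Net mechanism."""
    def __init__(self, d_model: int):
        super().__init__()
        self.boundary_score = nn.Linear(d_model, 1)  # hypothetical boundary scorer

    def forward(self, byte_states: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        # byte_states: (seq_len, d_model) hidden states from a byte-level encoder.
        # Hard thresholding is shown for clarity; a trainable version needs a
        # differentiable routing scheme.
        p = torch.sigmoid(self.boundary_score(byte_states)).squeeze(-1)
        boundaries = (p > threshold).nonzero().squeeze(-1).tolist()
        chunks, start = [], 0
        for b in boundaries + [byte_states.size(0)]:
            if b > start:
                chunks.append(byte_states[start:b].mean(dim=0))  # mean-pool each chunk
                start = b
        return torch.stack(chunks)  # (num_chunks, d_model)
```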

Hailey Collet (@haileystormc)'s Twitter Profile Photo

I'm still working on the spectral weight normalization thing, and other backlogged stuff, but... H-Net + HRM (+ MoE including non-HRM "knowledge store" experts with HRM-H attention retrieval, and new experts for continual learning).
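
The "spectral weight normalization thing" isn't spelled out in the tweet; for reference (and possibly not the exact variant meant), the standard form rescales a weight matrix by its largest singular value, estimated via power iteration:

```python
import torch
import torch.nn.functional as F

def spectral_normalize(weight: torch.Tensor, n_iter: int = 3) -> torch.Tensor:
    # Divide the weight by an estimate of its largest singular value (spectral norm).
    w = weight.reshape(weight.size(0), -1)
    u = torch.randn(w.size(0), device=w.device)
    for _ in range(n_iter):  # power iteration for the top singular vectors
        v = F.normalize(w.t() @ u, dim=0)
        u = F.normalize(w @ v, dim=0)
    sigma = u @ w @ v  # estimated largest singular value
    return weight / sigma
```

PyTorch also ships a built-in version of this as torch.nn.utils.parametrizations.spectral_norm.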

Mehul Damani @ ICLR (@mehuldamani2)'s Twitter Profile Photo

🚨New Paper!🚨
We trained reasoning LLMs to reason about what they don't know.

o1-style reasoning training improves accuracy but produces overconfident models that hallucinate more.

Meet RLCR: a simple RL method that trains LLMs to reason and reflect on their uncertainty --
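
The tweet cuts off before the reward definition; a minimal sketch of one way to combine a correctness reward with a Brier-style calibration penalty on the model's verbalized confidence, in the spirit described (the paper's exact reward and weighting may differ):

```python
def calibration_aware_reward(is_correct: bool, stated_confidence: float) -> float:
    # Reward correctness, penalize miscalibrated confidence (Brier-style term).
    correctness = 1.0 if is_correct else 0.0
    brier_penalty = (stated_confidence - correctness) ** 2  # 0 when perfectly calibrated
    return correctness - brier_penalty

# A wrong answer asserted at 0.9 confidence scores -0.81,
# while a wrong answer hedged at 0.1 confidence scores only -0.01.
```
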
fofr (@fofrai)'s Twitter Profile Photo

🤯 So I just kept adding stuff, and Qwen just did it...

> a man, a woman and a dog are standing against a backdrop, the backdrop is divided equally in thirds, left side is red, middle is white, right side is gold, the woman is wearing a beige t-shirt with a yoda motif, she is
Hailey Collet (@haileystormc)'s Twitter Profile Photo

GPT-5 is a "tock" release (cf. Intel). There are some meaningful capability bumps, but it's not a "generational leap." But it's less jagged, and looking at speed & prices *it's a fair bit more efficient*. No wall, just consolidation. (From an API POV / discounting router weirdness.)

Flowers (@flowersslop)'s Twitter Profile Photo

Some of you asked me about my blind test, so I created a quick website for y'all to test 4o against 5 yourself. Both have the same system message to give short outputs without formatting, because otherwise it's too easy to see which one is which. gptblindvoting.vercel.app

Bryan Catanzaro (@ctnzr)'s Twitter Profile Photo

Today we're releasing NVIDIA Nemotron Nano v2 - a 9B hybrid SSM that is 6X faster than similarly sized models, while also being more accurate.

Along with this model, we are also releasing most of the data we used to create it, including the pretraining corpus.

Links to the
Keller Jordan (@kellerjordan0)'s Twitter Profile Photo

Great to see this effort towards rigorous hyperparameter tuning. Two areas for improvement:
1. IIUC, the scaled up run here isn't actually tuned at all - its hparams are set via extrapolation
2. Sensitive hparams need a more granular sweep than power-of-2
x.com/percyliang/sta…
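
To illustrate the granularity point in item 2 (values hypothetical): a power-of-2 sweep moves in 2x jumps, while a quarter-octave sweep over the same range moves in roughly 1.19x jumps, which is closer to what a sensitive hyperparameter like learning rate needs to resolve its optimum.

```python
# Hypothetical learning-rate grids around the same center value.
base_lr = 1e-3
power_of_two   = [base_lr * 2 ** k       for k in range(-2, 3)]   # 2x jumps: coarse
quarter_octave = [base_lr * 2 ** (k / 4) for k in range(-8, 9)]   # ~1.19x jumps: finer
print(power_of_two)
print(quarter_octave)
```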