Julen Etxaniz
@juletxara
PhD Student in Language Analysis and Processing at @upvehu @Hitz_zentroa @IxaTaldea. Working on Improving Language Models for Low-resource Languages.
ID: 813409458546216961
https://julenetxaniz.eus 26-12-2016 15:41:42
1.1K Tweets
286 Followers
416 Following
We uncover a new vulnerability: Pre-Fine-Tuning Weight Recovery. With a few LoRA fine-tuned models we recover the pre-fine-tuning weights of SoTA models, undoing Stable Diffusion personalization training and Mistral alignment. Project: vision.huji.ac.il/spectral_detun… 🧵
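The tweet names the result rather than the method, so here is a minimal sketch of the underlying idea, assuming the standard LoRA form W_i = W_pre + B_i A_i: with several fine-tuned copies sharing one base, the base can be estimated by alternating a rank-r fit of each residual with an average over the de-tuned copies. The names (recover_pre_ft_weights, low_rank, r, n_iters) are illustrative assumptions, and the published method is more elaborate than this toy.

```python
# Minimal sketch (not the authors' code) of the core idea behind
# pre-fine-tuning weight recovery: each LoRA-fine-tuned matrix is
# W_i = W_pre + B_i @ A_i with a rank-r update, so W_pre can be estimated
# by alternating a low-rank fit of each residual with an average of the
# de-tuned copies. All names here are illustrative assumptions.
import numpy as np

def low_rank(M, r):
    """Best rank-r approximation of M via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def recover_pre_ft_weights(fine_tuned, r, n_iters=100):
    """Estimate the shared pre-fine-tuning matrix from several LoRA-tuned copies."""
    W_est = np.mean(fine_tuned, axis=0)  # crude initial guess
    for _ in range(n_iters):
        # fit a rank-r update for every fine-tuned copy given the current estimate
        deltas = [low_rank(W_i - W_est, r) for W_i in fine_tuned]
        # re-estimate the shared base weights from the de-tuned copies
        W_est = np.mean([W_i - d for W_i, d in zip(fine_tuned, deltas)], axis=0)
    return W_est

# toy check: five LoRA-style fine-tunes of a hidden 64x64 base matrix
rng = np.random.default_rng(0)
W_pre = rng.normal(size=(64, 64))
tuned = [W_pre + rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64)) for _ in range(5)]
W_rec = recover_pre_ft_weights(tuned, r=4)
print(np.linalg.norm(W_rec - W_pre) / np.linalg.norm(W_pre))  # relative recovery error
```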
FP8 makes RL faster, but at the cost of performance. We present FlashRL, the first open-source & working RL recipe that applies FP8/INT8 for rollout without losing performance compared to BF16! Blog:
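The recipe itself lives in the linked blog; as a hedged illustration of why quantized rollouts need care, the sketch below shows one generic correction, a policy-gradient loss with truncated importance-sampling ratios between the BF16 training policy and an FP8/INT8 rollout policy. Everything here (tis_policy_loss, the cap value, the toy tensors) is an assumption for illustration, not FlashRL's actual implementation.

```python
# Hedged sketch of a generic correction, not FlashRL's actual code: when
# rollouts come from a quantized (FP8/INT8) copy of the policy, its token
# log-probs drift from the BF16 trainer, so the policy-gradient loss can be
# weighted by truncated importance-sampling ratios. All names are assumptions.
import torch

def tis_policy_loss(logp_train, logp_rollout, advantages, cap=2.0):
    """Policy-gradient loss with truncated importance sampling.

    logp_train   : log-probs of sampled tokens under the BF16 training policy
    logp_rollout : log-probs of the same tokens under the quantized rollout policy
    advantages   : per-token advantage estimates
    cap          : truncation threshold for the importance ratio
    """
    ratio = torch.exp(logp_train.detach() - logp_rollout.detach())
    ratio = torch.clamp(ratio, max=cap)  # truncate large ratios to bound variance
    return -(ratio * advantages * logp_train).mean()

# toy usage with fake per-token tensors
logp_train = torch.randn(8, requires_grad=True)
logp_rollout = logp_train.detach() + 0.05 * torch.randn(8)  # mimic quantization drift
advantages = torch.randn(8)
loss = tis_policy_loss(logp_train, logp_rollout, advantages)
loss.backward()
```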
Are you afraid of LLMs teaching people how to build bioweapons? Have you tried just... not teaching LLMs about bioweapons? @AIEleuther and AI Security Institute joined forces to see what would happen, pretraining three 6.9B models for 500B tokens and producing 15 total models to study
With fresh support of $75M from the U.S. National Science Foundation and $77M from @NVIDIA, we're set to scale our open model ecosystem, bolster the infrastructure behind it, and fast-track reproducible AI research to unlock the next wave of scientific discovery.