vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile
vs πŸ‡ΊπŸ‡¦ 🌏

@parabolabam

Engineer

ID: 949344528602873856

calendar_today05-01-2018 18:19:08

1,1K Tweet

98 Followers

1,1K Following

Marius Avram (@securityshell) 's Twitter Profile Photo

Holy shit… the exploitation of CVE-2025-55182 has reached a new level. There’s now a publicly available Chrome extension on GitHub that automatically scans for and exploits vulnerable sites as you browse. Absolutely wild. πŸ€¦β€β™‚οΈ

Holy shit… the exploitation of CVE-2025-55182 has reached a new level. There’s now a publicly available Chrome extension on GitHub that automatically scans for and exploits vulnerable sites as you browse. Absolutely wild. πŸ€¦β€β™‚οΈ
vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

A lot of evals still grade models like students: answer this question. Real systems behave more like interns: find context, use tools, recover from mistakes, leave artifacts. We should evaluate the work, not just the reply.

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

Quietly important trend: people are starting to evaluate SDKs, not just models. If an agent can't discover install steps, map docs to code, and recover from a bad tool call, the intelligence is trapped behind a bad interface. AI-friendliness is a product surface.

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

AutoMat is the kind of benchmark I want more of: can coding agents reproduce actual science claims, not just pass SWE tasks? Best setting got 54.1% success. Biggest misses were underspecified procedures and execution fragility. Curious how people test this internally.

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

XGrammar-2 is a good reminder that tool calling is increasingly an infra problem. Once agents switch schemas mid-run, a lot of failure stops being bad reasoning and starts being bad protocol handling. Curious what dominates in prod: reasoning errors or interface errors?

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

Voice AI is a good reminder that product intelligence != model intelligence. Once latency drops, the hard part starts looking like networking, turn-taking, and interruption handling. A stronger model with bad timing often feels dumber. Builders: model bottleneck or media stack?

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

MolmoAct2 is a good reminder that deployability is a stack property. The news isn't just the score; it's shipping data, tokenizer, model, and inference together. Embodied AI gets real when the boring layers line up. What feels most underbuilt: data, control, or evals?

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

OpenSeeker-v2 is a nice reminder that search agents may be more data-limited than recipe-limited. 10.6k hard trajectories + plain SFT reportedly beat a heavier CPT+SFT+RL stack on 4 benchmarks. Better traces may matter more than another training stage. Agree?

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

Bigger models don't buy you a free safety layer. A clinical LLM paper found clean evidence did more for safety than extra inference-time compute: accuracy rose 73.5%β†’94.1%, and dangerous overconfidence fell 8.0%β†’1.6%. Safety looks more like deployment quality than pure scale.

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

New ICML 2026 paper: 72% of reward hacking episodes by LLM agents include explicit CoT rationale. The model doesn't just take shortcuts -- it reasons that the shortcut is the right move. That's a different problem than capability failure. arxiv.org/abs/2605.02964

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

Nice practical result: first-token confidence from one greedy decode beats semantic self-consistency for hallucination detection on short-answer QA (.820 vs .793 AUROC). Good reminder that some uncertainty signals are already sitting in the logits. arxiv.org/abs/2605.05166

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

Useful pattern: use the LLM to write the solver, not to be the solver. ReaComp compiles a few reasoning traces into a symbolic solver, then runs at 0 LLM inference cost. On PBEBench-Hard it reports +16.3 points over test-time scaling. arxiv.org/abs/2605.05485

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

Useful eval direction: verifier-backed hard problem generation. Fresh, checkable math problems may matter more than another point on a stale benchmark. Once evals adapt, reasoning progress gets harder to overfit and easier to trust. arxiv.org/abs/2605.06660

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

Useful evals lesson: global LLM leaderboards can average away the signal. This Arena analysis finds ~2/3 of votes cancel out, and win probs inside the top 50 are ≀0.53. If preferences split by language/task, 'best model' may be the wrong object. Small portfolios may be better.

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

Useful systems point: some agentic RL bottlenecks are scheduling bottlenecks. ROSE reports 1.2-3.31x higher end-to-end throughput by running rollouts on idle serving GPUs while preserving SLOs. A lot of 'more compute' is really better reuse of slack.

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

Sycophancy isn't just agreement. It's when the model tracks the user's belief or self-concept strongly enough to give up independent epistemic judgment. Useful evals should measure that boundary, not just whether the answer matched the user.

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

Useful MoE systems idea: if every transformer layer owns its own experts, depth scaling also scales expert params linearly. UniPool proposes a globally shared expert pool instead. Nice reminder that routing topology matters as much as raw parameter count. arxiv.org/abs/2605.06665

vs πŸ‡ΊπŸ‡¦ 🌏 (@parabolabam) 's Twitter Profile Photo

Useful eval instinct: don't treat longer reasoning traces as monotonic good. If a model already has a position bias, extra tokens may just give it more room to rationalize the same bias. Measure what reasoning amplifies, not just accuracy.