Mitja Martini (@mitjamartini) 's Twitter Profile
Mitja Martini

@mitjamartini

AI Engineer and SaaS Builder on the side, Cloud Service Solution Designer by day. Sharing lessons learned along the way.

ID: 781089205

Link: https://mitja.dev · Joined: 25-08-2012 21:00:41

1.1K Tweets

222 Followers

326 Following

Hamel Husain (@hamelhusain) 's Twitter Profile Photo

How do I evaluate agentic workflows?

We recommend a two-phased approach, first do error analysis on end-to-end task success/failure.

1 of 5
Mitja Martini (@mitjamartini) 's Twitter Profile Photo

A friend of my daughter can talk at breakneck speed. Wanted to try if Wispr Flow can keep up with her. To my surprise: It can! (the stat is even watered down by my own slow tests)

Hrishi (@hrishioa) 's Twitter Profile Photo

Kimi is the real deal. Unless it's really Sonnet in a trench coat, this is the best agentic open-source model I've tested - BY A MILE. Here's a slice* of a 4 HOUR run (~1 second per minute) with not much more than 'keep going' from me every 90 minutes or so. The task involved

Mitja Martini (@mitjamartini) 's Twitter Profile Photo

Bach's Notebook for Anna Magdalena Bach is a nice companion soundtrack for agentic engineering with Claude Code. open.spotify.com/intl-de/artist…

Mitja Martini (@mitjamartini) 's Twitter Profile Photo

Prompting with JSON: Philipp's script uses Gemini 2.5 Pro to generate an elaborate JSON from a simple prompt and then uses just the JSON as the prompt for Veo 3.

Mitja Martini (@mitjamartini) 's Twitter Profile Photo

An interesting observation that's relevant for many use cases: "Current gen LLMs are still shit at high recall tasks. Not good." I think it was quite a tough task (count, wildcards, indirection) in a not-too-big context of 15k tokens. A good eval. Gemini did best, btw.

Mitja Martini (@mitjamartini) 's Twitter Profile Photo

I just finished - for the second time - the AI Evals for Engineers & PMs (maven.com/parlance-labs/…) course by Hamel Husain and Shreya Shankar about the principles of application-centric LLM evaluation. The course took me from "prompt-and-pray" to a systematic approach of measuring LLM

Mitja Martini (@mitjamartini) 's Twitter Profile Photo

This week, the first cohort of maven.com/kentro/context… by Eleanor Berger and Isaac Flath started. I've already learned a lot and can highly recommend this course.

Google for Developers (@googledevs) 's Twitter Profile Photo

Google Colab is officially coming to Visual Studio Code! ⚡️ You can now connect VS Code notebooks directly to Colaboratory runtimes. Get the best of both worlds: the editor you love, powered by the compute (GPUs/TPUs) you need. → goo.gle/47QTmnB

Mitja Martini (@mitjamartini) 's Twitter Profile Photo

"If a task/job is verifiable, then it is optimizable directly or via reinforcement learning, and a neural net can be trained to work extremely well." Homework assignment: How many of your job's tasks are or can be framed/made verifiable?

Peter Steinberger (@steipete) 's Twitter Profile Photo

npx -y mcporter list --verbose

mcporter 0.6.1 shows all your mcps and all copies. If you've done plenty of agentic engineering this year, you're in for a treat.