Karsten Held (@karstenheld)'s Twitter Profile
Karsten Held

@karstenheld

ID: 2214156872

Link: http://www.karstenheld.com

Date: 25-11-2013 14:30:20

10 Tweets

20 Followers

54 Following


LLM as a Judge: GPT-5 vs GPT-4o How prompt design impacts AI evaluation. I have tested 3 RAG evaluation prompt types and 4 OpenAI models. Simple prompts work best with GPT-4o, complex prompts with GPT-5. #AI #LLM #PromptEngineering #RAG Complete video: youtu.be/dxXzrMHNonE


One thing has become clear after 2 weeks of investigation and 1200 EUR spent on tokens: GPT-5 is worse than GPT-4o for "LLM as a judge" evaluations. It is more expensive, slower, and less stable. After 5 iterations, GPT-4o outperforms GPT-5 using the same optimized final prompt.


LLM as a Judge: New DataRobot study shows that larger models + simple prompts give best accuracy, cost, and stability. datarobot.com/blog/llm-judge… #AI #PromptEngineering #AIEvaluation


Lutz Roeder (Microsoft, #Netron, #Reflector): "We should measure it [AGI] not by imitation, but by the novelty, depth, and reliability of the knowledge it creates." lutzroeder.com/blog/2025-11-0… #AGI #DavidDeutsch