LLM as a Judge: GPT-5 vs GPT-4o
How prompt design impacts AI evaluation: I tested 3 RAG evaluation prompt types against 4 OpenAI models. Simple prompts work best with GPT-4o, complex prompts with GPT-5.
#AI #LLM #PromptEngineering #RAG
Complete video: youtu.be/dxXzrMHNonE
One thing became clear after 2 weeks of investigation and 1,200 EUR spent on tokens: GPT-5 is worse than GPT-4o for "LLM as a judge" evaluations. More expensive, slower, less stable.
After 5 iterations, GPT-4o outperforms GPT-5 even with the same optimized final prompt.
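For context, an "LLM as a judge" setup asks one model to score another model's RAG answer against the retrieved context. A minimal sketch of the idea, assuming a simple 1–5 faithfulness rubric and illustrative model names (the actual prompts and rubric from the tests are not shown here):

```python
# Minimal "LLM as a judge" sketch for RAG answer evaluation.
# The rubric, function names, and model name are illustrative
# assumptions, not the prompts used in the study above.

def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Assemble a simple judge prompt: rate answer faithfulness 1-5."""
    return (
        "You are an impartial judge. Rate the ANSWER for faithfulness "
        "to the CONTEXT on a scale of 1-5. Reply with a single digit.\n\n"
        f"QUESTION: {question}\n"
        f"CONTEXT: {context}\n"
        f"ANSWER: {answer}"
    )

def parse_score(reply: str) -> int:
    """Extract the first digit 1-5 from the judge's reply."""
    for ch in reply:
        if ch in "12345":
            return int(ch)
    raise ValueError(f"no score found in reply: {reply!r}")

if __name__ == "__main__":
    prompt = build_judge_prompt(
        "What is the capital of France?",
        "France's capital is Paris.",
        "Paris.",
    )
    print(prompt)
    # To actually run the judge (requires OPENAI_API_KEY):
    # from openai import OpenAI
    # client = OpenAI()
    # resp = client.chat.completions.create(
    #     model="gpt-4o",  # or "gpt-5"; swap to compare judges
    #     messages=[{"role": "user", "content": prompt}],
    # )
    # print(parse_score(resp.choices[0].message.content))
```

The point of the comparison above is that the judge model and the prompt complexity interact: the same rubric can score more consistently on one model family than another.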
LLM as a Judge: a new DataRobot study shows that larger models + simple prompts give the best accuracy, cost, and stability.
datarobot.com/blog/llm-judge…
#AI #PromptEngineering #AIEvaluation
Lutz Roeder (Microsoft, #Netron, #Reflector): "We should measure it [AGI] not by imitation, but by the novelty, depth, and reliability of the knowledge it creates."
lutzroeder.com/blog/2025-11-0…
#AGI #DavidDeutsch