Has anybody optimized LLM inference for MCTS? I often want to take an input prompt and get back the top 25 possible answers.
Yes, you can ask the LLM to output an array of 25 items, but that's slow. And just increasing the temperature doesn't surface the "top" leaf nodes by probability.
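As a point of comparison, here's a minimal sketch of getting K ranked candidates from one prompt in a single batched decode, rather than asking the model to emit a 25-item array. It uses Hugging Face transformers beam search with `num_return_sequences`; "gpt2" and the prompt are placeholders, and this approximates "top K by sequence likelihood", which isn't the same thing as MCTS leaf selection.

```python
# Sketch: one batched generate() call that returns 25 ranked candidates.
# Assumptions: Hugging Face transformers, "gpt2" as a stand-in model,
# and a toy prompt. Beam search scores approximate sequence likelihood.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in your own model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

# num_return_sequences <= num_beams returns the highest-scoring beams.
outputs = model.generate(
    **inputs,
    num_beams=25,
    num_return_sequences=25,
    max_new_tokens=32,
    early_stopping=True,
    output_scores=True,
    return_dict_in_generate=True,
    pad_token_id=tokenizer.eos_token_id,
)

prompt_len = inputs["input_ids"].shape[1]
for seq, score in zip(outputs.sequences, outputs.sequences_scores):
    text = tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
    print(f"{score.item():.3f}  {text.strip()}")
```

This gives one forward pass per beam step instead of 25 separate generations, but the beams tend to be near-duplicates; diverse beam search or parallel sampling plus reranking are the usual knobs if you need more variety.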
Information retrieval, in full generality, is beyond NP-hard; it's undecidable.
Proof
Consider a corpus C = {d₁, ..., dₙ} of documents, each containing a snippet of syntactically valid Python code.
Ask the query, "Which documents halt?"
Answering it exactly requires deciding, for every dᵢ, whether an arbitrary program halts, i.e. solving the halting problem, which no algorithm can do.
Q.E.D.
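To make the reduction concrete, here's a hypothetical corpus document whose halting behavior is genuinely open: the loop below terminates for every starting n if and only if the Collatz conjecture holds. A retrieval system that could answer "which documents halt?" exactly would have to settle questions like this one.

```python
# A hypothetical document d_i: syntactically valid Python whose termination
# for all n >= 1 is exactly the (open) Collatz conjecture.
def collatz_steps(n: int) -> int:
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps
```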