I wish they had some example completions in the paper and not just eval results. It would be really useful to see if there are any emergent linguistic tilts to the newly diverse responses...
Isn't the BoN RL formulation similar to DeepSeek's GRPO algorithm? The latter seems to already capture this implicitly.
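For reference, here's a minimal sketch of the group-relative advantage GRPO computes over the G completions sampled for the same prompt (just the normalisation step, not the full clipped objective; rewards are assumed to be scalars from a reward model):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalise each completion's reward against
    the mean/std of the G completions sampled for the same prompt, instead of
    using a learned value baseline."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. 4 completions for one prompt, scored by a reward model
print(grpo_advantages([0.2, 0.9, 0.4, 0.9]))
```

So GRPO already samples a group per prompt and ranks them against each other, which is where the resemblance to BoN comes from.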
Wouldn't RL training, with the goal of aligning the LLM with the reward function R(x, y), result in the outputs of the trained LLM maximizing said reward function? How different are the rewards of the N outputs in BoN sampling to justify its cost?
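To make the cost/benefit question concrete, here's a rough sketch of what BoN does at inference time, assuming a sampler and a reward model R(x, y); the function names are just placeholders, not anything from the paper:

```python
def best_of_n(prompt, generate, reward, n=8, temperature=1.0):
    """Sample n completions independently and return the one the reward model
    scores highest. `generate` and `reward` stand in for an LLM sampler and
    R(x, y); the spread of `scores` is what has to justify the n-fold cost."""
    candidates = [generate(prompt, temperature=temperature) for _ in range(n)]
    scores = [reward(prompt, y) for y in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores
```

If the trained policy already concentrates on high-reward outputs, the scores barely differ and the extra n-1 samples are wasted; the whole argument rests on there being useful variance left in that list.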
Is Best-of-N sampling standard practice these days in inference? It sounds expensive on the face of it. I am surprised because I thought the trend was towards cheaper inference.
For reasoning models, this would actually improve exploration efficiency and hence possibly allow higher performance for the same compute budget. As in, if you want to sample multiple rollouts for the same prompt, it's more efficient if the model can produce diverse thought directions and weigh them to find the best response, as opposed to going down similar trajectories and wasting compute.
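One way to see the "similar trajectories waste compute" point: a crude diversity check over sampled traces. This is purely an illustrative proxy (token-set Jaccard overlap), not a metric from the paper:

```python
def mean_pairwise_jaccard(rollouts):
    """Crude diversity proxy: average token-set overlap between every pair of
    sampled reasoning traces. Values near 1.0 mean the rollouts mostly retread
    the same trajectory, i.e. the extra samples buy little exploration."""
    token_sets = [set(r.split()) for r in rollouts]
    sims = []
    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            union = token_sets[i] | token_sets[j]
            sims.append(len(token_sets[i] & token_sets[j]) / (len(union) or 1))
    return sum(sims) / len(sims) if sims else 0.0
```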
Almost all of the efficiency gains have come from shedding bit precision, but the problem is that AI labs are now running out of bits to shed. The move to reduced precision inference has been masking the insane unsustainability of compute scaling as a model improvement paradigm.
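To put the "shedding bits" point in numbers, a back-of-the-envelope sketch for a hypothetical 70B-parameter model (weight memory only, figures purely illustrative):

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight-only footprint; ignores KV cache, activations and quantisation
    overhead such as scales and zero-points."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
# 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB; each halving is a one-off win,
# which is the sense in which labs "run out of bits to shed".
```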
Not standard, but it's one of several techniques; you can see them in our open-source inference proxy: https://github.com/codelion/optillm
Cerebras has used optillm for optimising inference with techniques like CePO and LongCePO.