What does this actually mean for LLMs? Cheaper training?
Yes. Provided it works as well as they claim.
Not only cheaper but also faster (since in this case money ≈ hardware-cost × time). They claim that training throughput can even approach that of pure batch inference:
> EGGROLL's efficiency results in a hundredfold increase in training throughput for billion-parameter models at large population sizes, nearly reaching the throughput of pure batch inference
It really depends on how it scales. If it can scale to LLM sizes via this training method (which is a big if), then in the most optimistic case it could mean fundamentally overturning the transformer architecture and replacing it with RNNs.
But if not, it could mean as little as some LLM-adjacent tools like vec2text getting reworked into RNNs, or at least some interesting fine-tuning.
Their technique does not claim to compete with gradient descent - it competes with techniques like Proximal Policy Optimization (PPO), so it's better suited for things like turning an existing pre-trained model into a reasoning model.
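For anyone unfamiliar with that family of methods, here is a minimal, generic evolution-strategies loop - a sketch only, assuming the paper's method is in the ES family (as the "large population sizes" quote suggests), and not EGGROLL's actual algorithm. The names `es_step`, `reward_fn`, and the toy target are all hypothetical, purely for illustration. The point is that each update only needs forward evaluations of a reward, no backpropagation, which is why it slots into the same niche as PPO-style RL fine-tuning.

```python
# Generic evolution-strategies (ES) sketch on a toy problem.
# NOT the paper's EGGROLL method; just the gradient-free family it belongs to.
import numpy as np

def es_step(params, reward_fn, pop_size=64, sigma=0.1, lr=0.03):
    """One ES update on a flat parameter vector.

    reward_fn(params) -> float stands in for whatever score PPO would
    otherwise optimise (e.g. a verifier's rating of a model's output).
    """
    noise = np.random.randn(pop_size, params.size)            # one Gaussian perturbation per population member
    rewards = np.array([reward_fn(params + sigma * eps) for eps in noise])
    centred = rewards - rewards.mean()                        # centre rewards to reduce variance
    grad_estimate = noise.T @ centred / (pop_size * sigma)    # reward-weighted average perturbation
    return params + lr * grad_estimate                        # move parameters toward higher reward

# Toy usage: pull a 10-dim parameter vector toward an arbitrary target.
target = np.linspace(-1.0, 1.0, 10)
params = np.zeros(10)
for _ in range(300):
    params = es_step(params, lambda p: -np.sum((p - target) ** 2))
print(np.round(params, 2))  # should closely match `target`
```

The practical appeal is that the inner loop is just batched forward passes over the population, which is what makes the "nearly reaching the throughput of pure batch inference" claim plausible in principle.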