So for those not familiar with machine learning, which was the practical business use case for “AI” before LLMs took the world by storm, that is what they are describing as reinforcement learning. Both are valid terms for it.
It’s how you can make an AI that plays Mario Kart. You establish goals that grant points, stuff to avoid that loses points, and what actions it can take each “step”. Then you give it the first frame of a Mario Kart race, have it try literally every input it can put in that frame, then evaluate the change in points that results. You branch out from that collection of “frame 2s” and do the same thing again and again, checking more and more possible future states.
At some point you use certain rules to eliminate certain branches on this tree of potential future states, like discarding branches where it’s driving backwards. That way you can start optimizing towards the options at any given time that get the most points in the end. Keep the number of options being evaluated to an amount you can push through your hardware.
Eventually you try enough things enough times that you can pretty consistently use the data you gathered to make the best choice on any given frame.
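To make the branch-and-prune loop above concrete, here is a minimal sketch. It assumes a made-up toy game (a position on a line instead of a Mario Kart frame), and the names `ACTIONS`, `score` and `beam_search` are invented for illustration. It only shows the "try every input, score it, drop bad branches, keep what fits on your hardware" idea, not a full training setup.

```python
from typing import List, Tuple

# Hypothetical toy "race": the state is just a position on a line.
ACTIONS = {"left": -1, "stay": 0, "right": 1}


def score(position: int) -> int:
    """Points for a state: further right stands in for race progress."""
    return position


def beam_search(start: int, steps: int, beam_width: int) -> Tuple[int, List[str]]:
    """Expand every input each step, prune, and keep the best few branches."""
    # Each branch is (current position, inputs taken so far).
    frontier: List[Tuple[int, List[str]]] = [(start, [])]
    for _ in range(steps):
        candidates = []
        for position, path in frontier:
            # Try literally every input available on this "frame".
            for name, delta in ACTIONS.items():
                new_position = position + delta
                # Pruning rule: discard branches that end up behind the start.
                if new_position < start:
                    continue
                candidates.append((new_position, path + [name]))
        # Keep only as many branches as the "hardware" can handle.
        candidates.sort(key=lambda branch: score(branch[0]), reverse=True)
        frontier = candidates[:beam_width]
    # The highest-scoring surviving branch.
    return frontier[0]


best_position, best_path = beam_search(start=0, steps=5, beam_width=3)
print(best_position, best_path)
```

In a real game the scoring and the pruning rules are where all the hard work goes; the loop itself stays about this simple.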
The jank comes from how the points are configured. For example, an AI for a delivery robot could prioritize jumping off balconies if it prioritizes speed over self-preservation.
Some of these pitfalls are easy to create rules around for training. Others are far more subtle and difficult to work around.
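Here is a tiny sketch of the delivery robot problem. Every number and name in it (`routes`, `speed_only_reward`, `safer_reward`, the `damage_weight` of 10) is made up for illustration; the point is only that the "best" choice flips depending on how the points are set up.

```python
# Hypothetical routes with invented costs.
routes = {
    "stairs": {"seconds": 60, "damage": 0},
    "jump_off_balcony": {"seconds": 5, "damage": 100},
}


def speed_only_reward(route: dict) -> float:
    """Only delivery time counts, so faster is always better."""
    return -route["seconds"]


def safer_reward(route: dict, damage_weight: float = 10.0) -> float:
    """Delivery time still counts, but damage to the robot costs points too."""
    return -route["seconds"] - damage_weight * route["damage"]


def best_route(reward_fn) -> str:
    """Pick the route name with the highest reward."""
    return max(routes, key=lambda name: reward_fn(routes[name]))


print(best_route(speed_only_reward))  # the balcony wins when only speed counts
print(best_route(safer_reward))       # the stairs win once damage is penalized
```

The subtle cases are the ones where nobody thought to add the second term in the first place.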
Some people in the video game TAS community (custom building a frame by frame list of the inputs needed to beat a game as fast as possible, human limits be damned) are already using this in limited capacities to automate testing approaches to particularly challenging sections of gameplay.
So it ends up coming down to complexity. Making an AI to play Pacman is relatively simple. There are only 4 options every step, the direction the joystick is held. So you have 4^n states to keep track of, where n is the number of steps forward you want to look.
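To put rough numbers on that growth, here is a quick sketch. The function name `sequences_to_check` is made up, and the 50,000 "options per step" for language is an invented round number for a vocabulary size, not a figure for any real model.

```python
def sequences_to_check(options_per_step: int, lookahead: int) -> int:
    """Number of distinct input sequences when looking `lookahead` steps ahead."""
    return options_per_step ** lookahead


# Pacman-style joystick: 4 options per step.
for n in (1, 5, 10, 20):
    print(f"4 options, {n} steps ahead: {sequences_to_check(4, n):,}")

# Assumed, illustrative vocabulary size for a language model.
print(f"50,000 options, 3 steps ahead: {sequences_to_check(50_000, 3):,}")
```

With 4 options you reach about a million sequences at 10 steps; with tens of thousands of options per step you blow past that after two or three.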
Trying to do that with language, and arguing that you can get reliable results with any kind of consistency, is blowing smoke. They can’t even clearly state what outcomes they are optimizing for with their “reward” function. God only knows what edge cases they’ve overlooked.
My completely out-of-my-ass guess is that they did some analysis on responses to previous GPT output and tried to distinguish between positive and negative responses (or at least to pick out responses indicating that the output was incorrect). They then used that as some sort of positive/negative points heuristic.
People have been speculating for a while that you could do that, crank up the “randomness”, have it generate multiple responses behind the scenes, then pit those “pre-responses” against each other and use those criteria to choose the best of them. They could even A/B test the responses over multiple users, and use the user responses as further “positive/negative points” reinforcement to feed back into it in a giant loop.
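For what that "pit the pre-responses against each other" idea could look like in the abstract, here is a toy sketch. Everything in it is hypothetical: `best_of_n`, `toy_generate`, `toy_score` and the canned responses are stand-ins I made up, and nothing here reflects how any actual product works.

```python
import random
from typing import Callable, List

CANNED: List[str] = [
    "I'm not sure.",
    "The answer is 4.",
    "The answer is 5, probably.",
]


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str], float],
              n: int) -> str:
    """Generate n candidate responses and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)


def toy_generate(prompt: str) -> str:
    """Stand-in for sampling a model with the randomness cranked up."""
    return random.choice(CANNED)


def toy_score(response: str) -> float:
    """Stand-in for a positive/negative points heuristic."""
    points = 0.0
    if "not sure" in response:
        points -= 1.0  # penalize non-answers
    if "4" in response:
        points += 2.0  # pretend past feedback favored this answer
    return points


print(best_of_n("What is 2 + 2?", toy_generate, toy_score, n=10))
```

The selection step is trivial. The whole question is whether the scoring function actually measures what you want, which is the same points problem as the delivery robot.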
Again, completely pulled from my ass. Take with a boulder of salt.
Sigma Star Saga is an odd RPG game where the random encounters are short side scrolling shmup segments. I really enjoyed the amount of it that I played, but you can get screwed in some encounters as it gives you a random ship each time, and some are worse than others.