Can you train a real-world model?
If we capture videos from the real world, label them with the corresponding "user input" (WASD etc.), and train a model the same way this one was trained, could we get an absolutely real-world e-print game? I think the real world can be considered a complex game.
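To make the idea concrete, here is a minimal sketch of how such labeled footage could be turned into training examples. Everything in it is an assumption for illustration: the function name `make_transitions` is hypothetical, frames are stand-in strings rather than images, and it assumes exactly one input label per step between consecutive frames. It only pairs the data up; it does not train anything.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Action = TypeVar("Action")


def make_transitions(
    frames: Sequence[Frame], actions: Sequence[Action]
) -> List[Tuple[Frame, Action, Frame]]:
    """Pair each frame with the input applied to it and the frame that follows.

    Assumption: actions[t] is the input recorded between frames[t] and
    frames[t + 1], so there is one fewer action than there are frames.
    """
    if len(actions) != len(frames) - 1:
        raise ValueError("expected exactly one action per frame transition")
    return [(frames[t], actions[t], frames[t + 1]) for t in range(len(actions))]


# Hypothetical clip: three frames, two labeled inputs.
clip_frames = ["frame0", "frame1", "frame2"]
clip_actions = ["W", "A"]
transitions = make_transitions(clip_frames, clip_actions)
```

A model trained on triples like these would learn to predict the next frame from the current frame and the input, which is the "game" part of the question.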
My guess is that it will be heavily limited by the environment they'd have to record the footage in. They probably can't just take random videos from the Internet, since the footage probably requires a fair level of consistency, something video games do have: think of consistent camera shake during movement, or a jump height that is always the same. So we're probably looking at very long camera rail systems across hundreds of places worldwide, which would cost quite a lot of money.
I think this is essentially possible. To some extent, it is what video-generating models already do.