“First text, then images, now OpenAI has a model for generating videos,” screamed Mashable the other day. The makers of ChatGPT and Dall-E had just announced Sora, a text-to-video diffusion model. Cue excited commentary all over the web about what will doubtless become known as T2V, covering the usual spectrum – from “Does this mark the end of [insert threatened activity here]?” to “meh” and everything in between.
Sora (the name is Japanese for “sky”) is not the first T2V tool, but it looks more sophisticated than earlier efforts like Meta’s Make-A-Video. It can turn a brief text description into a detailed, high-definition film clip up to a minute long. For example, the prompt “A cat waking up its sleeping owner, demanding breakfast. The owner tries to ignore the cat, but the cat tries new tactics, and finally, the owner pulls out his secret stash of treats from underneath the pillow to hold off the cat a little longer,” produces a slick video clip that would go viral on any social network.
Cute, eh? Well, up to a point. OpenAI seems uncharacteristically candid about the tool’s limitations. It may, for example, “struggle with accurately simulating the physics of a complex…