Text to Video Generative AI Is Finally Here and It’s Weird as Hell

I like my AI like I like my foreign cheese varieties: incredibly weird and full of holes, the kind that leaves most definitions of “good” up to individual taste. So color me surprised as I explored the next frontier of public AI models and found one of the strangest experiences I’ve had since the bizarre AI-generated Seinfeld knockoff Nothing, Forever was first released.

Runway, one of the two startups that helped give us the AI art generator Stable Diffusion, announced on Monday that its first public test of its Gen-2 AI video model was going live soon. The company made the stunning claim that it was the “first publicly available text-to-video model out there.” Unfortunately, a more obscure group with a much jankier initial text-to-video model may have beaten Runway to the punch.

Google and Meta are already working on their own text-to-video generators, but neither company has been very forthcoming with any news since those projects were first teased. Runway, a relatively small 45-person team, has been known for its online video editing tools, including the video-to-video Gen-1 AI model it released in February, which could create and transform existing videos based on text prompts or reference images. Gen-1 could transform a simple render of a stick figure swimming into a scuba diver, or turn a man walking on the street into a claymation nightmare with a generated overlay. Gen-2 is supposed to be the next big step up, allowing users to create 3-second videos from scratch based on simple text prompts. While it hasn’t let anybody get their hands on the model yet, the company has shared a few clips based on prompts like “a close up of an eye” and “an aerial shot of a mountain landscape.”

Few people outside the company have been able to experience Runway’s new model, but if you’re still hankering for AI video generation, there’s another option. An AI text-to-video system called ModelScope was released over the past weekend and has already caused some buzz for its occasionally awkward and often insane 2-second video clips. The DAMO Vision Intelligence Lab, a research division of e-commerce giant Alibaba, created the system as a kind of public test case. According to the page describing the model, the system uses a fairly basic diffusion model to create its videos.

ModelScope is open source and already available on Hugging Face, though it may be hard to get the system running without paying a small fee for a separate GPU server. Tech YouTuber Matt Wolfe has a good tutorial on how to set that up. Of course, you could also run the code yourself if you have the technical skill and the VRAM to support it.
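
If you want to try that route, here’s a minimal sketch of what running it locally might look like with Hugging Face’s diffusers library. It assumes the ModelScope checkpoint published on Hugging Face under the id damo-vilab/text-to-video-ms-1.7b and a reasonably recent diffusers install; the model id, prompt, and settings below are my assumptions, and exact argument names may differ depending on the version you pull down.

    # Hypothetical sketch: generate a short ModelScope clip locally with diffusers.
    # Assumes the "damo-vilab/text-to-video-ms-1.7b" checkpoint on Hugging Face
    # and a GPU with enough VRAM for fp16 inference.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained(
        "damo-vilab/text-to-video-ms-1.7b",
        torch_dtype=torch.float16,
        variant="fp16",
    )
    pipe.enable_model_cpu_offload()  # moves idle weights to CPU to ease VRAM pressure

    prompt = "Darth Vader shopping in a supermarket"
    video_frames = pipe(prompt, num_inference_steps=25).frames  # list of RGB frames
    video_path = export_to_video(video_frames)  # writes an .mp4 and returns its path
    print(video_path)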

ModelScope is pretty blatant about where its data comes from. Many of its generated videos contain the vague outline of the Shutterstock logo, meaning the training data likely included a sizable portion of videos and images taken from the stock photo site. It’s a similar issue with other AI image generators like Stable Diffusion. Getty Images has sued Stability AI, the company that brought the AI art generator into the public light, noting in its suit how many Stable Diffusion images recreate a corrupted version of the Getty watermark.

Of course, that still hasn’t stopped some users from making small movies with the rather awkward AI, like this pudgy-faced Darth Vader visiting a supermarket, or Spider-Man and a capybara teaming up to save the world.

As far as Runway goes, the group is looking to make a name for itself in the ever-more crowded world of AI research. In the paper describing the Gen-1 system, Runway researchers said the model was trained on a “large-scale dataset” of both images and videos, combining text-image data with uncaptioned videos. Those researchers found there was simply a lack of video-text datasets with the same quality as the image datasets scraped from the internet, which forced the company to derive its training data from the videos themselves. It will be interesting to see how Runway’s likely more-polished version of text-to-video stacks up, especially once heavy hitters like Google show off more of their longer-form narrative videos.

If Runway’s new Gen-2 waitlist is like the one for Gen-1, users can expect to wait a few weeks before they fully get their hands on the system. In the meantime, playing around with ModelScope may be a good first option for those looking for weirder AI interpretations. And of course, it’s only a matter of time before we’re having the same conversations about AI-generated videos that we now have about AI-created images.

The following slides are some of my attempts to compare Runway to ModelScope and to test the limits of what text-to-video can do. I transformed the clips into GIF format using the same parameters for each, so the framerate on the GIFs is close to that of the original AI-created videos.
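
For anyone curious about that conversion step, here’s a rough sketch of how frames exported from one of these models could be stitched into a GIF with Pillow. The directory name and framerate below are placeholders for illustration, not the exact parameters I used.

    # Hypothetical sketch: stitch exported video frames into a GIF with Pillow.
    # The frames/ directory and the ~8 fps rate are assumptions for illustration.
    import glob
    from PIL import Image

    frame_paths = sorted(glob.glob("frames/*.png"))
    frames = [Image.open(p) for p in frame_paths]
    frames[0].save(
        "clip.gif",
        save_all=True,
        append_images=frames[1:],
        duration=125,  # ms per frame, roughly 8 fps to stay close to the source clip
        loop=0,        # loop forever
    )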
