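This model is based on a multi-stage text-to-video generation diffusion model. It takes a text description as input and returns a video that matches the description. Only English input is supported.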
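The text-to-video generation diffusion model consists of three sub-networks: text feature extraction, a diffusion model from text features to the video latent space, and a mapping from the video latent space to the video visual space. The overall model has about 1.7 billion parameters. The diffusion model adopts a Unet3D structure and generates video by iteratively denoising, starting from a pure Gaussian noise video.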
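The three-stage flow described above can be illustrated with a toy sketch in pure Python. Everything here is an assumption made for illustration: the function names (`encode_text`, `predict_noise`, `denoise`, `decode_latent`, `text_to_video`) are hypothetical, the "denoiser" is a trivial stand-in rather than a Unet3D, and the update rule is a simplified schedule, not the sampler used by the actual model. It only shows the shape of the idea: encode the text, start from Gaussian noise in a latent space, remove predicted noise step by step, then map the latent to the visual space.

```python
import math
import random


def encode_text(prompt, dim=4):
    """Stage 1 (toy): turn a prompt into a fixed-length feature vector."""
    feats = [0.0] * dim
    for i, ch in enumerate(prompt):
        feats[i % dim] += ord(ch) / 1000.0
    return feats


def predict_noise(latent, text_feats):
    """Toy stand-in for the Unet3D denoiser (assumption, not the real network).

    It treats the text features as the clean target for every frame and
    reports the remaining offset as the predicted noise.
    """
    return [[x - c for x, c in zip(frame, text_feats)] for frame in latent]


def denoise(text_feats, num_frames=3, steps=10, seed=0):
    """Stage 2 (toy): iterative denoising from pure Gaussian noise."""
    rng = random.Random(seed)
    # Start from a pure Gaussian noise "video" in latent space.
    latent = [[rng.gauss(0.0, 1.0) for _ in text_feats] for _ in range(num_frames)]
    for t in range(steps, 0, -1):
        noise = predict_noise(latent, text_feats)
        # Remove a 1/t share of the predicted noise; at t == 1 all of it is removed.
        latent = [[x - n / t for x, n in zip(frame, noise_frame)]
                  for frame, noise_frame in zip(latent, noise)]
    return latent


def decode_latent(latent):
    """Stage 3 (toy): map latent values to pixel intensities in (0, 1)."""
    return [[1.0 / (1.0 + math.exp(-x)) for x in frame] for frame in latent]


def text_to_video(prompt, num_frames=3, steps=10, seed=0):
    """Chain the three toy stages: text -> latent video -> visual video."""
    feats = encode_text(prompt)
    latent = denoise(feats, num_frames=num_frames, steps=steps, seed=seed)
    return decode_latent(latent)


video = text_to_video("Robot dancing in times square.")
print(len(video), "frames,", len(video[0]), "values per frame")
```

In this sketch the latent converges to the text-derived target because the toy denoiser knows the target exactly; a trained network only estimates the noise, which is why real samplers need many steps and a carefully chosen schedule.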
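The model has a wide range of applications and can run inference on arbitrary English text descriptions to generate videos. Some example input texts are listed below; each was originally shown above its corresponding generated video: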
| Robot dancing in times square. | Clown fish swimming through the coral reef. | Melting ice cream dripping down the cone. |
| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| A waterfall flowing through glacier at night. | A cat eating food out of a bowl, in style of van Gogh. | Tiny plant sprout coming out of the ground. |
| Hyper-realistic photo of an abandoned industrial site during a storm. | Balloon full of water exploding in extreme slow motion. | Incredibly detailed science fiction scene set on an alien planet, view of a marketplace. Pixel art. |
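Prompts like the ones above could be passed to the model along the following lines. This is a sketch under stated assumptions, not a confirmed recipe: it assumes the `modelscope` package is installed, that its `pipeline` factory accepts the task name `text-to-video-synthesis` with the model id `damo/text-to-video-synthesis`, and that the result exposes the video under `OutputKeys.OUTPUT_VIDEO`. The helper names `build_input` and `generate_video` are hypothetical, and the ASCII check is only a rough proxy for "English input", not something the model itself enforces.

```python
def build_input(prompt):
    """Validate a prompt and wrap it in the dict form assumed by the pipeline."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    # Rough heuristic (assumption): treat non-ASCII text as non-English.
    if not prompt.isascii():
        raise ValueError("only English prompts are supported")
    return {"text": prompt.strip()}


def generate_video(prompt, model_id="damo/text-to-video-synthesis"):
    """Run text-to-video inference and return whatever the pipeline reports as the video."""
    # Imported lazily so that build_input can be used without modelscope installed.
    from modelscope.outputs import OutputKeys
    from modelscope.pipelines import pipeline

    pipe = pipeline("text-to-video-synthesis", model_id)
    return pipe(build_input(prompt))[OutputKeys.OUTPUT_VIDEO]


if __name__ == "__main__":
    print(build_input("Clown fish swimming through the coral reef."))
```

Calling `generate_video("Clown fish swimming through the coral reef.")` would download the model weights on first use, so it is best run on a machine with enough disk space and a GPU.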
Official website
https://modelscope.cn/models/damo/text-to-video-synthesis/summary