Encodes a prompt into the tensors {video_context, audio_context, attention_mask} and returns them in a single safetensors file, which the main generation Space consumes.
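As a rough illustration of how a consumer might sanity-check the returned file before using it, the sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table of tensor entries) and confirms the three expected names are present. It uses the standard library only. The helper names, the dtypes, and the shapes in the demo payload are hypothetical placeholders, not the encoder's actual output.

```python
import json
import struct

# Tensor names taken from the description above.
EXPECTED_KEYS = {"video_context", "audio_context", "attention_mask"}


def read_safetensors_header(blob: bytes) -> dict:
    """Return the JSON header of a safetensors payload.

    The format starts with an unsigned 64-bit little-endian integer giving
    the header size, followed by that many bytes of UTF-8 JSON.
    """
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + header_len].decode("utf-8"))


def check_encoder_output(blob: bytes) -> dict:
    """Map each tensor name to its shape, or raise if a name is missing."""
    header = read_safetensors_header(blob)
    # "__metadata__" is an optional free-form entry, not a tensor.
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    missing = EXPECTED_KEYS - set(tensors)
    if missing:
        raise ValueError(f"missing tensors: {sorted(missing)}")
    return {name: tuple(info["shape"]) for name, info in tensors.items()}


def make_demo_blob() -> bytes:
    """Build a tiny zero-filled payload for demonstration.

    Shapes and dtypes here are assumptions chosen only to keep the demo small.
    """
    specs = [
        ("video_context", "F32", [1, 4, 8]),
        ("audio_context", "F32", [1, 4, 8]),
        ("attention_mask", "I64", [1, 4]),
    ]
    itemsize = {"F32": 4, "I64": 8}
    header = {}
    offset = 0
    for name, dtype, shape in specs:
        nbytes = itemsize[dtype]
        for dim in shape:
            nbytes *= dim
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + nbytes],
        }
        offset += nbytes
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(offset)


shapes = check_encoder_output(make_demo_blob())
```

In practice the bytes would come from the file the encoder returns rather than from `make_demo_blob`, and the tensors themselves would be loaded with the safetensors library; the header check is just a cheap way to fail early if a key is absent.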