
Wan 2.7

Video generation and editing with multi-reference control and temporal transfer

About model

Wan 2.7 is Wan AI's video generation and editing model, introducing instruction-based and reference-based video editing alongside temporal feature transfer. It supports text-to-video, image-to-video, and reference-to-video workflows with up to 5 simultaneous reference inputs for multi-subject compositions. The model accepts joint image, video, and audio references for synchronized subject+voice control, supports real human inputs as references or first frames, and generates native 1080p video from 2 to 15 seconds (2 to 10 seconds in reference-to-video mode).

Max References

5

Multi-subject with mixed image/video/audio inputs

Native Resolution

1080p

Across all generation and editing modes

Flexible Duration

2-15s

For T2V and I2V; R2V supports 2-10s

Model key capabilities
  • Video Editing: Instruction-based and reference-based editing to modify subjects or scenes globally via text prompts or reference media
  • Temporal Feature Transfer: Clone motion, camera moves, effects, and style directly from a reference video into new generations
  • Multi-Reference Control: Up to 5 simultaneous references with joint subject+voice referencing via combined image, video, and audio inputs (see the request sketch after this list)
  • Flexible Generation: T2V, I2V, and R2V modes with first/last frame control, 3x3 grid-to-video, real human inputs, and native 1080p up to 15s
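
  A rough illustration of the multi-reference workflow is sketched below in Python. Only the URL, headers, model key, and prompt field appear in this page's quickstart; the reference_images and reference_audio field names are hypothetical placeholders, so consult the API reference for the actual request schema.

    # Hypothetical sketch of a multi-reference R2V request. Only the URL,
    # auth header, "model", and "prompt" come from the quickstart; the
    # reference_* field names are assumptions, not a documented schema.
    import os

    import requests

    resp = requests.post(
        "https://api.together.xyz/v2/videos",
        headers={
            "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
            "Content-Type": "application/json",
        },
        json={
            "model": "Wan-AI/Wan2.7",
            "prompt": "two hosts presenting a product demo in a studio",
            # Up to 5 mixed image/video/audio references per the model card.
            "reference_images": [
                "https://example.com/host_a.png",
                "https://example.com/host_b.png",
            ],
            # Joint subject+voice control via an audio reference.
            "reference_audio": "https://example.com/host_a_voice.wav",
        },
    )
    resp.raise_for_status()
    print(resp.json())
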
Quickstart guides
  • API usage

    • cURL
    • Python
    • TypeScript

    Endpoint:

    Wan-AI/Wan2.7

    curl --request POST \
      --url https://api.together.xyz/v2/videos \
      --header "Authorization: Bearer $TOGETHER_API_KEY" \
      --header "Content-Type: application/json" \
      --data '{
        "model": "Wan-AI/Wan2.7",
        "prompt": "some penguins building a snowman"
      }'
    
    from together import Together
    
    client = Together()
    
    # Create a video generation job
    job = client.videos.create(
        prompt="A serene sunset over the ocean with gentle waves",
        model="Wan-AI/Wan2.7"
    )
    
    import Together from "together-ai";
    
    const together = new Together();
    
    async function main() {
      // Create a video generation job
      const job = await together.videos.create({
        prompt: "A serene sunset over the ocean with gentle waves",
        model: "Wan-AI/Wan2.7"
      });
    
      console.log(job);
    }
    
    main();
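
    The SDK comments above describe the create call as starting a job, so the finished video is presumably fetched separately. The polling sketch below assumes the job can be read back with a GET on /v2/videos/{id} and that the response carries "id" and "status" fields; neither is confirmed on this page, so check the API reference before relying on it.

    # Polling sketch for an asynchronous video job. The GET path and the
    # "id"/"status" fields are assumptions modeled on common job-style
    # APIs, not confirmed by this page.
    import os
    import time

    import requests

    API = "https://api.together.xyz/v2/videos"
    HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

    job = requests.post(
        API,
        headers={**HEADERS, "Content-Type": "application/json"},
        json={"model": "Wan-AI/Wan2.7", "prompt": "some penguins building a snowman"},
    ).json()

    # Poll until the job leaves its in-progress states.
    while True:
        state = requests.get(f"{API}/{job['id']}", headers=HEADERS).json()
        if state.get("status") not in ("queued", "in_progress"):
            break
        time.sleep(5)

    print(state)  # final job payload, including the video output on success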
    
  • Model card

    Architecture Overview:
    • Unified video generation and editing model supporting text-to-video (T2V), image-to-video (I2V), and reference-to-video (R2V) workflows
    • Instruction-based and reference-based video editing: modify subjects or scenes globally via text prompts or reference media
    • Temporal feature transfer: clone motion, camera moves, effects, and style directly from a reference video
    • I2V supports first/last frame control and 3x3 grid-to-video generation (a first/last-frame request is sketched after this overview)
    • Multi-reference support with up to 5 references for multi-subject, image+voice, and mixed image/video modes
    • Joint subject+voice referencing via combined image, video, and audio inputs
    • Real human image and video inputs supported as references or first frames
    • Native 1080p output across all generation modes
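
    As a companion to the I2V controls listed above, a first/last-frame request might look like the sketch below; first_frame_image, last_frame_image, and duration_seconds are hypothetical field names used only for illustration, not a documented schema.

    # Hypothetical I2V request pinning the first and last frames. Only
    # "model" and "prompt" are confirmed fields; the rest are assumptions.
    import os

    import requests

    resp = requests.post(
        "https://api.together.xyz/v2/videos",
        headers={
            "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
            "Content-Type": "application/json",
        },
        json={
            "model": "Wan-AI/Wan2.7",
            "prompt": "a slow dolly-in on a lighthouse at dawn",
            "first_frame_image": "https://example.com/frame_start.png",  # hypothetical
            "last_frame_image": "https://example.com/frame_end.png",     # hypothetical
            "duration_seconds": 8,                                       # hypothetical
        },
    )
    resp.raise_for_status()
    print(resp.json())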

    Training Methodology:
    • Built on the Wan model family with expanded capabilities for video editing and temporal transfer
    • Trained for consistent subject identity preservation across multi-reference and multi-subject scenes
    • Optimized for real human inputs maintaining natural appearance and motion

    Performance Characteristics:
    • T2V and I2V support 2-15 second generation with flexible duration control
    • R2V supports 2-10 second generation (both ranges are encoded in the check sketched after this list)
    • Up to 5 simultaneous reference inputs for complex multi-subject compositions
    • Temporal feature transfer preserves motion dynamics, camera work, and visual effects from source video
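
    The duration limits above are easy to check client-side before submitting a job. The helper below is a hypothetical convenience, not part of any SDK; it encodes only the ranges stated in this model card (2-15s for T2V/I2V, 2-10s for R2V).

    # Client-side duration check mirroring the limits stated above.
    # Hypothetical helper, not part of the Together SDK.
    LIMITS = {"t2v": (2, 15), "i2v": (2, 15), "r2v": (2, 10)}

    def validate_duration(mode: str, seconds: float) -> None:
        lo, hi = LIMITS[mode.lower()]
        if not lo <= seconds <= hi:
            raise ValueError(f"{mode} supports {lo}-{hi}s, got {seconds}s")

    validate_duration("i2v", 12)      # within 2-15s, passes silently
    try:
        validate_duration("r2v", 12)  # R2V tops out at 10s
    except ValueError as err:
        print(err)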

  • Prompting

    Together AI API Access:
    • Access Wan 2.7 via Together AI APIs using the endpoint Wan-AI/Wan2.7
    • Authenticate using your Together AI API key in request headers
    • Supports text-to-video, image-to-video, and reference-to-video generation modes
    • Reference inputs accept image, video, and audio for joint subject+voice control
    • Available on Together AI serverless infrastructure

  • Applications & use cases

    Video Editing & Post-Production:
    • Instruction-based editing: modify subjects, scenes, and visual elements via text prompts
    • Reference-based editing: transfer style, motion, and camera work from source videos
    • Temporal feature transfer for replicating specific motion dynamics and effects

    Marketing & Brand Content:
    • Campaign video production with consistent character identity across multiple assets
    • Product videos and social media content at native 1080p
    • Brand mascot and spokesperson videos using real human reference inputs

    Creative Production:
    • Multi-subject scene composition with up to 5 reference inputs
    • First/last frame control for precise narrative sequencing
    • 3x3 grid-to-video generation for storyboard-to-video workflows
    • Subject+voice referencing for synchronized character performances

Model details
  • Model provider
    Alibaba
  • Type
    Video
  • Resolution/Duration
    1080p / 2-15s
  • Deployment
    Serverless
  • Input modalities
    Text
    Image
    Video
    Audio
  • Output modalities
    Video
  • Category
    Video