Models / ZAI
Chat
Reasoning

GLM-4.5-Air

An efficient 106B‑parameter MoE model (12B active) with a 128K‑token context window and hybrid reasoning modes, optimized for efficiency while maintaining competitive performance.

About model

GLM-4.5-Air delivers competitive performance with 106B total parameters and 12B activated per token, offering the same 128K context window and hybrid reasoning capabilities as GLM-4.5 while being optimized for efficiency. It is well suited to cost-conscious deployments that still require sophisticated AI capabilities.

Performance benchmarks

GLM-4.5-Air is evaluated on AIME 2025, GPQA Diamond, HLE, LiveCodeBench, MATH500, and SWE-bench Verified, alongside related open-source models and competitor closed-source models including Claude Opus 4.6, OpenAI o3, OpenAI o1, and GPT-4o.

  • API usage

    • cURL
    • Python
    • TypeScript

    Endpoint:

    zai-org/GLM-4.5-Air-FP8

    curl -X POST "https://api.together.xyz/v1/chat/completions" \
      -H "Authorization: Bearer $TOGETHER_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "zai-org/GLM-4.5-Air-FP8",
        "messages": [
          {
            "role": "user",
            "content": "What are some fun things to do in New York?"
          }
        ]
      }'
    
    from together import Together
    
    client = Together()
    
    response = client.chat.completions.create(
      model="zai-org/GLM-4.5-Air-FP8",
      messages=[
        {
          "role": "user",
          "content": "What are some fun things to do in New York?"
        }
      ]
    )
    print(response.choices[0].message.content)
    
    import Together from 'together-ai';
    const together = new Together();
    
    const completion = await together.chat.completions.create({
      model: 'zai-org/GLM-4.5-Air-FP8',
      messages: [
        {
          role: 'user',
          content: 'What are some fun things to do in New York?'
        }
      ],
    });
    
    console.log(completion.choices[0].message.content);
    
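    The cURL, Python, and TypeScript snippets above all send the same request. As a minimal standard-library sketch (the helper name `build_chat_request` is illustrative, not part of any SDK), the headers and JSON body can be assembled like this and sent with any HTTP client:

```python
import json
import os

API_URL = "https://api.together.xyz/v1/chat/completions"

def build_chat_request(content: str, model: str = "zai-org/GLM-4.5-Air-FP8"):
    """Assemble headers and a JSON body matching the cURL example above."""
    headers = {
        "Authorization": f"Bearer {os.environ.get('TOGETHER_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
    })
    return headers, body

headers, body = build_chat_request("What are some fun things to do in New York?")
```

    Passing the assembled `headers` and `body` to, for example, `requests.post(API_URL, headers=headers, data=body)` reproduces the cURL call.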
  • Model card

    Architecture Overview:
    • Compact Mixture-of-Experts design with 106B total parameters and 12B active parameters
    • 128K token context window matching full GLM-4.5 capabilities
    • Optimized MoE routing with reduced width and increased depth for efficiency
    • Grouped-Query Attention with Multi-Token Prediction layer support

    Training Methodology:
    • Shared training pipeline with GLM-4.5 using 15T general + 7T code & reasoning tokens
    • Specialized post-training for efficiency-performance balance
    • Reinforcement learning optimization for agentic task performance
    • FP8 and BF16 mixed precision training for accelerated inference

    Performance Characteristics:
    • Ranked 6th overall with a 59.8 average benchmark score, demonstrating competitive efficiency
    • Strong agentic performance with 69.4 on τ-bench and 76.4 on BFCL-v3
    • Solid coding capabilities with 57.6% on SWE-bench Verified
    • Sits on the Pareto frontier of the performance-scale trade-off
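
    The efficiency characteristics above follow from the MoE design: only 12B of the 106B parameters are activated per token. A back-of-the-envelope sketch (the 2 FLOPs per active parameter per token rule of thumb is a common approximation, not a published figure for this model):

```python
TOTAL_PARAMS = 106e9   # total parameters (from the model card)
ACTIVE_PARAMS = 12e9   # parameters activated per token

# Fraction of the network doing work on any given token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Rough decode cost: ~2 FLOPs per active parameter per generated token.
flops_per_token = 2 * ACTIVE_PARAMS

print(f"Active fraction: {active_fraction:.1%}")  # ~11.3%
```

    Roughly 11% of the parameters are exercised per token, which is where the cost and latency savings relative to dense models of similar total size come from.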

  • Applications & use cases

    Enterprise Applications:
    • Cost-effective conversational AI for high-volume deployments
    • Efficient intelligent agents for standard automation tasks
    • Resource-conscious development environments and coding assistance
    • Scalable customer support and virtual assistant implementations

    Development & Technical:
    • Lightweight coding assistance and software development support
    • Efficient reasoning for educational and training applications
    • Streamlined tool integration for standard agentic workflows
    • Multi-language processing for global accessibility requirements

    Business Solutions:
    • SME and startup-friendly AI integration with competitive performance
    • Batch processing and automated content generation at scale
    • Mobile and edge deployment scenarios requiring efficiency
    • Proof-of-concept and prototyping for AI-powered applications
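
    For the batch-processing and high-volume scenarios above, prompts are typically grouped into fixed-size batches and submitted concurrently. A minimal sketch (the helper and batch size are illustrative, not part of the Together SDK):

```python
from typing import Iterable, List

def chunk_prompts(prompts: Iterable[str], batch_size: int = 8) -> List[List[str]]:
    """Split prompts into fixed-size batches for concurrent submission."""
    items = list(prompts)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# 20 prompts with batch_size=8 -> batches of 8, 8, and 4
batches = chunk_prompts([f"Summarize document {i}" for i in range(20)], batch_size=8)
```

    Each batch can then be fanned out with `client.chat.completions.create(...)` calls as in the API usage example, for instance via `concurrent.futures.ThreadPoolExecutor`.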

Related models
  • Model provider
    ZAI
  • Type
    Chat
    Reasoning
  • Main use cases
    Chat
    Function Calling
  • Features
    Function Calling
    JSON Mode
  • Deployment
    Serverless
  • Parameters
    106B
  • Activated parameters
    12B
  • Context length
    128K
  • Input price

    $0.20 / 1M tokens

  • Output price

    $1.10 / 1M tokens

  • Input modalities
    Text
  • Output modalities
    Text
  • Released
    July 19, 2025
  • Quantization level
    FP8
  • Category
    Chat
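
Given the listed prices of $0.20 per 1M input tokens and $1.10 per 1M output tokens, per-request cost is straightforward to estimate. A small sketch (the token counts in the example are illustrative):

```python
INPUT_PRICE_PER_M = 0.20   # USD per 1M input tokens (from the pricing above)
OUTPUT_PRICE_PER_M = 1.10  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from prompt and completion token counts."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# e.g. a 2,000-token prompt with a 500-token completion
cost = estimate_cost(2_000, 500)
print(f"${cost:.6f}")  # $0.000950
```

At these rates, one million tokens in and one million tokens out together cost $1.30.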