
Managed API, Serverless, or GPU Cloud: Generative Media Deployment Architecture Guide

May 28, 2026

Teams choosing a generative media deployment architecture in 2026 tend to start by comparing hourly GPU rates against per-token API costs and assuming that whoever is cheaper wins. That framing solves the wrong problem. A team without ML engineers choosing self-hosted GPU infrastructure because the headline rate is lower will spend more money in engineering overhead than they save on compute. A team with fine-tuned models specific to their domain choosing a managed API because it requires less setup will discover the models they need are not available on any platform.

The architecture decision has three variables, and the right answer changes completely depending on where each team sits on each of them. This article provides a decision matrix across team capability, volume, and customization requirements, with specific product recommendations for each outcome.

The Three Variables That Determine Architecture

Generative media deployment architecture is determined by:

  • Team technical capability: Whether the team has ML engineers who can configure and maintain an inference stack. This is not about seniority or intelligence. It is about whether the daily work of configuring vLLM, managing CUDA versions, monitoring GPU utilization, and cycling through model updates is within the team's operational scope. Without that capacity, self-deployment does not save money. It transfers cost from a compute bill to an engineering bill.
  • Volume and utilization: How many images, video clips, or inference requests the workload produces per month, and how evenly that load is distributed. APIs scale to zero during idle periods and bill only for what is used. Self-deployed GPUs bill by the hour regardless of utilization. A GPU running at 10% utilization costs the same per hour as one at 90% but produces about one-ninth the output per dollar.
  • Customization requirements: Whether standard models from managed platforms are sufficient, or whether the workload requires fine-tuned weights, proprietary model architectures, or data handling that prevents sending inputs to third-party servers.
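The utilization point can be made concrete with a small calculation. The sketch below is illustrative only: it uses the $2.00/hr H100 rate quoted later in this article, and the helper name `cost_per_useful_gpu_hour` is an assumption introduced for this example.

```python
def cost_per_useful_gpu_hour(hourly_rate: float, utilization: float) -> float:
    """Dollars paid for each hour of GPU time that actually produces output."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


H100_RATE = 2.00  # on-demand H100 rate quoted in this article, USD per hour

low_util = cost_per_useful_gpu_hour(H100_RATE, 0.10)
high_util = cost_per_useful_gpu_hour(H100_RATE, 0.90)

print(f"10% utilization: ${low_util:.2f} per useful GPU-hour")
print(f"90% utilization: ${high_util:.2f} per useful GPU-hour")
```

The billed rate is identical in both cases; only the share of each billed hour that produces output changes.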

No single variable determines the answer. A team with ML engineers, low volume, and standard model requirements belongs on a managed API. A team without ML engineers, high volume, and compliance requirements that prevent third-party data processing has a harder problem with no clean answer. The matrix below resolves the combinations.

The Decision Matrix

Team capability | Monthly volume | Custom model or data compliance | Recommended architecture | Starting products
No ML engineers | Any | No | Managed API | gpt-image-2-generate, veo-3.1-lite-generate-001
No ML engineers | Any | Yes (compliance) | Managed API with private endpoints (e.g. Azure OpenAI, Vertex AI private) | Consult provider-specific DPA/BAA terms
Has ML engineers | Low (under 50M tokens equiv.) | No | Managed API | gpt-image-2-generate, wan2.7-i2v
Has ML engineers | Medium (50M to 500M tokens equiv.) | No | API-first, evaluate self-deployment | H100 at $2.00/hr for sustained workloads
Has ML engineers | High (above 500M tokens equiv.) | No | Hybrid: self-deploy baseline, API overflow | H100 for steady-state, API for spikes
Has ML engineers | Any | Yes (fine-tuned model) | Self-deployment required | H100 or H200 depending on model size
Has ML engineers | Any | Yes (data localization) | Self-deployment with network isolation | H200 at $2.60/hr for large-model inference
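The rows of the matrix can also be read as a lookup function. The following is a minimal sketch, not an official tool: the function name, the three `customization` labels, and the choice to fold "compliance" and "data localization" into one `data_compliance` label are assumptions made for illustration. The matrix has no row for a team without ML engineers that needs a fine-tuned model, so the sketch returns `None` for that case.

```python
from typing import Optional


def recommend_architecture(
    has_ml_engineers: bool,
    monthly_volume_millions: float,  # token-equivalent volume, in millions
    customization: str = "none",     # "none", "fine_tuned", or "data_compliance"
) -> Optional[str]:
    """Return the matrix's recommended architecture, or None if no row applies."""
    if customization not in {"none", "fine_tuned", "data_compliance"}:
        raise ValueError(f"unknown customization: {customization!r}")

    if not has_ml_engineers:
        if customization == "none":
            return "Managed API"
        if customization == "data_compliance":
            return "Managed API with private endpoints"
        return None  # fine-tuned model without ML engineers: no matrix row

    # With ML engineers, customization overrides volume.
    if customization == "fine_tuned":
        return "Self-deployment required"
    if customization == "data_compliance":
        return "Self-deployment with network isolation"

    # Volume only matters when nothing above forced an answer.
    if monthly_volume_millions < 50:
        return "Managed API"
    if monthly_volume_millions <= 500:
        return "API-first, evaluate self-deployment"
    return "Hybrid: self-deploy baseline, API overflow"


print(recommend_architecture(False, 900))
print(recommend_architecture(True, 200))
print(recommend_architecture(True, 10, "fine_tuned"))
```

The branch order mirrors the matrix: team capability first, customization second, and volume last.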

The most common miscalculation in this matrix is teams without ML engineers choosing self-deployment because the per-GPU rate looks cheaper than per-token API pricing. Independent analysis of real deployments shows total self-hosting cost runs 3-5x the GPU hourly rate once engineering time, model update cycles, and utilization efficiency are included. Below approximately 11 billion tokens per month for LLM equivalents, managed API almost always produces a lower total cost when these factors are counted.
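A rough way to apply the 3-5x multiplier is to compute the loaded monthly cost of a GPU and divide by the API's per-unit price. The sketch below is a back-of-the-envelope model under stated assumptions: the 730-hour month, the function names, and the use of the $0.006 low-quality image price as the API unit price are illustrative choices, and the result says nothing about whether a single GPU could actually serve that volume.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month (assumption)


def self_hosted_monthly_cost(
    gpu_hourly_rate: float, gpu_count: int, overhead_multiplier: float
) -> float:
    """GPU bill for an always-on fleet, scaled by the total-cost multiplier."""
    return gpu_hourly_rate * gpu_count * HOURS_PER_MONTH * overhead_multiplier


def break_even_units(
    gpu_hourly_rate: float,
    gpu_count: int,
    overhead_multiplier: float,
    api_price_per_unit: float,
) -> float:
    """Monthly API units at which API spend equals loaded self-hosting cost."""
    loaded = self_hosted_monthly_cost(gpu_hourly_rate, gpu_count, overhead_multiplier)
    return loaded / api_price_per_unit


# One H100 at $2.00/hr with the low and high ends of the 3-5x range.
low_loaded = self_hosted_monthly_cost(2.00, 1, 3.0)
high_loaded = self_hosted_monthly_cost(2.00, 1, 5.0)
units = break_even_units(2.00, 1, 3.0, api_price_per_unit=0.006)

print(f"Loaded monthly cost: ${low_loaded:,.0f} to ${high_loaded:,.0f}")
print(f"Break-even at $0.006 per unit: {units:,.0f} units per month")
```

Below the break-even volume the API is cheaper under this model; above it, self-hosting is worth evaluating only if the hardware can sustain the load.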

Walkthrough: three representative team types

Early-stage product team, 2-5 developers, no dedicated ML engineer, no fine-tuned models: Managed API is the only viable path. The engineering time required to provision and maintain a production GPU inference stack does not exist on this team, and the volume does not justify the investment regardless. gpt-image-2-generate for image generation and veo-3.1-lite-generate-001 for video cover the standard media types at per-request pricing with no infrastructure overhead.

Mid-size product company, dedicated ML engineer, 200M images per month, standard models: This is the decision point where self-deployment starts to make financial sense. An H100 at $2.00/hr running at 75% utilization across sustained workloads can undercut API pricing at this volume. The hybrid architecture, where the H100 handles baseline load and API handles overflow, is the standard pattern at this scale. The decision requires modeling actual utilization against the API per-request rate for the specific workload.
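Modeling the hybrid pattern comes down to splitting each hour's demand into what the GPU can absorb and what overflows to the API. The sketch below shows one way to do that. The demand profile, the per-hour GPU capacity, and the $0.01 API price are made-up inputs for illustration, and `hybrid_cost` is a hypothetical helper rather than part of any provider SDK.

```python
from typing import Dict, Sequence


def hybrid_cost(
    hourly_demand: Sequence[int],
    gpu_capacity_per_hour: int,
    gpu_hourly_rate: float,
    api_price_per_request: float,
) -> Dict[str, float]:
    """Compare a GPU-baseline-plus-API-overflow setup against API only."""
    hours = len(hourly_demand)
    if hours == 0 or gpu_capacity_per_hour <= 0:
        raise ValueError("need at least one hour of demand and positive capacity")

    # Requests served on the GPU are capped by its hourly capacity.
    on_gpu = sum(min(d, gpu_capacity_per_hour) for d in hourly_demand)
    # Anything above capacity spills over to the API.
    overflow = sum(max(0, d - gpu_capacity_per_hour) for d in hourly_demand)

    gpu_cost = gpu_hourly_rate * hours  # billed every hour, busy or idle
    api_cost = overflow * api_price_per_request
    return {
        "gpu_utilization": on_gpu / (gpu_capacity_per_hour * hours),
        "hybrid_total": gpu_cost + api_cost,
        "api_only_total": sum(hourly_demand) * api_price_per_request,
    }


# One day: 18 quiet hours at 600 requests, 6 peak hours at 1,500 requests.
demand = [600] * 18 + [1500] * 6
result = hybrid_cost(demand, gpu_capacity_per_hour=1000,
                     gpu_hourly_rate=2.00, api_price_per_request=0.01)
print(result)
```

The comparison covers compute spend only. The engineering overhead of running the GPU still has to be added before the two totals are comparable.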

Enterprise team, full MLOps capacity, proprietary fine-tuned image model, EU data residency requirement: Self-deployment is not optional here. The fine-tuned model does not exist in any managed API catalog. The data residency requirement prevents sending inputs to most commercial APIs. An H200 at $2.60/hr provides 141GB of VRAM, which accommodates larger model variants without tensor parallelism, and can be deployed in EU-region data centers to satisfy data localization requirements.

The API Entry Point: Three Models for Different Volume Profiles

For teams on the managed API path, three models on GMI Cloud cover the standard generative media categories at different price-quality positions.

gpt-image-2-generate (OpenAI, released April 21, 2026) is the premium image generation endpoint. Pricing runs approximately $0.006 per image at low quality to $0.211 at high quality, with token-based billing. The model includes reasoning capabilities and achieves 95%+ text-in-image accuracy. For early-stage teams generating product images, marketing assets, or branded content where quality matters and volume is moderate, this is the appropriate starting point.

veo-3.1-lite-generate-001 (Google, released March 31, 2026) is the budget-tier entry point for video generation. At $0.05 per second of output (720p), it is the lowest per-second rate in the Veo 3.1 family. A 5-second clip costs $0.25. The model generates native audio alongside video in a single API call, accepts both text and image inputs, and supports 16:9 and 9:16 aspect ratios for the two dominant social media formats. For teams testing video generation before committing to a higher-cost tier, this is the right model to prototype with.

wan2.7-i2v (Alibaba, image-to-video) covers the use case where teams have existing visual assets and need to add motion. At $0.625 per generation at 720p, with support for clips up to 15 seconds and first-and-last-frame composition control, it handles product animation, character motion, and B-roll generation from still photography. For teams with established image libraries looking to extend them into video, wan2.7-i2v provides the highest degree of input control among managed video APIs.
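For budgeting, the three price points above reduce to simple arithmetic. The sketch below uses only the prices quoted in this section. The function names and the example monthly quantities are assumptions, and the image estimate is returned as a low-to-high range because token-based billing makes the exact per-image cost vary.

```python
from typing import Tuple

# Prices quoted in this article (USD).
GPT_IMAGE_LOW = 0.006           # approximate per image, low quality
GPT_IMAGE_HIGH = 0.211          # approximate per image, high quality
VEO_LITE_PER_SECOND = 0.05      # per second of 720p output
WAN_I2V_PER_GENERATION = 0.625  # per 720p generation


def gpt_image_cost_range(images: int) -> Tuple[float, float]:
    """Lowest and highest plausible spend for a number of images."""
    return images * GPT_IMAGE_LOW, images * GPT_IMAGE_HIGH


def veo_lite_cost(clips: int, seconds_per_clip: float) -> float:
    """Spend for video clips billed per second of output."""
    return clips * seconds_per_clip * VEO_LITE_PER_SECOND


def wan_i2v_cost(generations: int) -> float:
    """Spend for image-to-video generations billed per generation."""
    return generations * WAN_I2V_PER_GENERATION


# Hypothetical month: 1,000 images, 200 five-second clips, 100 animations.
image_low, image_high = gpt_image_cost_range(1000)
video = veo_lite_cost(200, 5)
animation = wan_i2v_cost(100)

print(f"Images: ${image_low:.2f} to ${image_high:.2f}")
print(f"Veo Lite clips: ${video:.2f}")
print(f"wan2.7-i2v generations: ${animation:.2f}")
```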

The Self-Deployment Entry Point: H100 and H200 on GMI Cloud

For teams where the decision matrix points to self-deployment, the choice between H100 and H200 follows a straightforward rule based on model memory requirements.

H100 at $2.00/hr is the right starting point for generative media workloads running models up to 70B parameters at quantized precision. The 80GB of HBM3 at 3.35 TB/s handles the majority of open-source image and video generation models at production batch sizes. For teams deploying Stable Diffusion 3.5, Wan 2.7 variants, or custom 7B-30B parameter models, H100 provides the best cost-performance profile at this tier. Instances come pre-configured with CUDA 12.x, TensorRT-LLM, and vLLM, which reduces initial deployment time.

H200 at $2.60/hr is warranted when the workload hits two specific H100 limitations: models requiring more than 80GB of VRAM at the desired precision, or large-batch pipelines where KV cache pressure at 80GB reduces throughput. The H200's 141GB HBM3e at 4.80 TB/s accommodates larger model variants without quantization tradeoffs and supports batch sizes roughly double what the H100 can sustain. For generative media teams running 70B-parameter video generation models at full precision, or image generation pipelines requiring simultaneous multi-model loading, H200 is the minimum viable GPU tier.
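The memory rule can be checked with a weights-only estimate: parameter count multiplied by bytes per parameter. The sketch below is a first-pass filter, not a sizing tool. It treats 16-bit weights as "full precision" (an assumption), ignores activations and cache unless a `headroom_gb` value is supplied, and uses names introduced only for this example.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
                   "fp8": 1.0, "int8": 1.0, "int4": 0.5}

# Single-GPU memory capacities quoted in this article, in GB.
GPU_MEMORY_GB = {"H100": 80, "H200": 141}


def weights_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint: 1e9 params x bytes, divided by 1e9 bytes/GB."""
    return params_billions * BYTES_PER_PARAM[precision]


def pick_gpu(params_billions: float, precision: str, headroom_gb: float = 0.0) -> str:
    """Smallest single GPU whose memory fits the weights plus optional headroom."""
    required = weights_gb(params_billions, precision) + headroom_gb
    for name in ("H100", "H200"):  # ordered from cheapest to most expensive
        if required <= GPU_MEMORY_GB[name]:
            return name
    return "multi-GPU"  # needs tensor parallelism or a smaller precision


print(pick_gpu(30, "fp16"))  # 60 GB of weights
print(pick_gpu(70, "int8"))  # 70 GB of weights
print(pick_gpu(70, "fp16"))  # 140 GB of weights
```

Real deployments need memory beyond the weights, so passing a realistic `headroom_gb` gives a more conservative answer than the default of zero.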

Both GPUs are available on-demand with no minimum commitment at console.gmicloud.ai. GPU pricing and availability are at gmicloud.ai/en/pricing.

Start with the Variable That Constrains You First

The three variables in this framework are not equally weighted for every team. For most teams evaluating generative media deployment, the binding constraint is identifiable from a single question: does the team have an ML engineer available to own an inference stack?

If no, the architecture is managed API regardless of volume and cost projections. The math on GPU cost savings does not apply when the engineering time to realize those savings does not exist.

If yes, the second constraint is customization. If the workload requires fine-tuned models or specific data handling, self-deployment is necessary regardless of volume. The volume calculation becomes a question of which GPU tier and how many, not API versus self-deployment.

Volume enters the decision only when the first two constraints do not force a specific answer, which is actually the smaller slice of real-world decisions. For those cases, the matrix provides specific thresholds. For the larger population of decisions where one of the first two variables is binding, the matrix shortens to a single row.

Colin Mo
