AI/ML orchestration on Cloud Run documentation
Cloud Run is a fully managed platform that lets you run your containerized applications, including AI/ML workloads, directly on Google's scalable infrastructure. It handles the infrastructure for you, so you can focus on writing code instead of operating, configuring, and scaling Cloud Run resources. Cloud Run provides the following capabilities:
- Hardware accelerators: access and manage GPUs for inference at scale.
- Framework support: integrate with the model serving frameworks you already know and trust, such as Hugging Face TGI and vLLM.
- Managed platform: get all the benefits of a managed platform to automate, scale, and enhance the security of your entire AI/ML lifecycle while maintaining flexibility.
Explore our tutorials and best practices to see how Cloud Run can optimize your AI/ML workloads.
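To make the platform contract concrete, here is a minimal sketch of a containerized service Cloud Run can run: it serves HTTP on the port given by the PORT environment variable. Flask and the placeholder handler are assumptions, not requirements; any HTTP server works.

```python
import os

from flask import Flask  # assumption: Flask is installed in the container image

app = Flask(__name__)

@app.route("/")
def handle():
    # Placeholder: replace with your model inference call.
    return "ok"

if __name__ == "__main__":
    # Cloud Run tells the container which port to listen on via $PORT.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", "8080")))
```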
Documentation resources
The documentation is organized into three areas, each with a mix of concepts, how-to guides, tutorials, and best practices:
- Run AI solutions
- Inference with GPUs
- Troubleshoot
Related resources
Run your AI inference applications on Cloud Run with NVIDIA GPUs
Use NVIDIA L4 GPUs on Cloud Run for real-time AI inference, including fast cold-start and scale-to-zero benefits for Large Language Models (LLMs).
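A hedged sketch of the server-side pattern that article describes: load the model once at container startup so a scale-from-zero cold start pays the loading cost a single time, and detect the attached GPU at runtime. PyTorch and the toy linear model are assumptions; the L4 itself is attached when you deploy the service, not in code.

```python
import torch  # assumption: PyTorch is installed in the container image

# Detect the attached NVIDIA GPU; fall back to CPU for local testing.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load weights once at startup, not per request, so each cold start
# pays the model-loading cost exactly once.
model = torch.nn.Linear(16, 2).to(device).eval()  # stand-in for real weights

def predict(inputs: list[list[float]]) -> list[list[float]]:
    x = torch.tensor(inputs, device=device)
    with torch.no_grad():
        return model(x).cpu().tolist()
```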
Cloud Run: the fastest way to get your AI applications to production
Learn how to use Cloud Run for production-ready AI applications. This guide describes use cases such as traffic splitting for A/B testing prompts, RAG (Retrieval-Augmented Generation) patterns, and connectivity to vector stores.
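As a rough illustration of the RAG pattern mentioned there, the sketch below embeds the query, retrieves nearby chunks from a vector store, and grounds the prompt in them. `embed`, `vector_store`, and `llm` are hypothetical stand-ins for whichever embedding model, store, and LLM client you actually use.

```python
def answer(question: str, embed, vector_store, llm, k: int = 4) -> str:
    """Hypothetical RAG flow: retrieve context, then generate."""
    query_vec = embed(question)                       # 1. embed the query
    chunks = vector_store.search(query_vec, top_k=k)  # 2. nearest chunks
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)                       # 3. grounded generation
```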
AI deployment made easy: Deploy your app to Cloud Run from AI Studio or MCP-compatible AI agents
Deploy an app to Cloud Run with one click from Google AI Studio, or use the Cloud Run MCP (Model Context Protocol) server to let AI agents in IDEs and agent SDKs deploy apps.
Supercharging Cloud Run with GPU power: A new era for AI workloads
Integrate NVIDIA L4 GPUs with Cloud Run for cost-efficient LLM serving. This guide emphasizes scale-to-zero and provides deployment steps for models like Gemma 2 with Ollama.
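To show what calling such a service might look like, here is a hedged client sketch against Ollama's HTTP API; the service URL is hypothetical, and the model name assumes `gemma2` was pulled into the image.

```python
import requests

SERVICE_URL = "https://my-ollama-service-xyz-uc.a.run.app"  # hypothetical URL

resp = requests.post(
    f"{SERVICE_URL}/api/generate",  # Ollama's generate endpoint
    json={"model": "gemma2", "prompt": "Why is the sky blue?", "stream": False},
    timeout=300,  # generous: a scale-from-zero cold start may load the model
)
resp.raise_for_status()
print(resp.json()["response"])
```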
Still packaging AI models in containers? Do this instead on Cloud Run
Decouple large model files from the container image using Cloud Storage FUSE. Decoupling improves build times, simplifies updates, and creates a more scalable serving architecture.
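A minimal sketch of what the serving code gains from this, assuming the Cloud Run service is configured with a Cloud Storage volume mounted at /models (the mount path and file name are assumptions set in your service config, not fixed by Cloud Run):

```python
import os

# The mounted bucket behaves like a read-only local filesystem, so the
# image stays small and a new model version is a bucket upload, not a
# rebuild. The path and filename below are assumptions.
MODEL_PATH = os.environ.get("MODEL_PATH", "/models/weights.bin")

def load_model(path: str = MODEL_PATH) -> bytes:
    # Ordinary file I/O works on the FUSE mount; swap in your
    # framework's loader (torch.load, safetensors, etc.) here.
    with open(path, "rb") as f:
        return f.read()
```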
Package and deploy your machine learning models to Google Cloud with Cog
Use the Cog framework that is optimized for ML serving to simplify packaging and deployment of containers to Cloud Run.
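For orientation, a hedged sketch of a Cog predictor following the `BasePredictor` interface from Cog's documentation; the model itself is a placeholder.

```python
from cog import BasePredictor, Input  # assumption: cog is installed

class Predictor(BasePredictor):
    def setup(self):
        # Runs once per container: load weights here, not in predict().
        self.model = lambda text: text.upper()  # stand-in for a real model

    def predict(self, prompt: str = Input(description="Input text")) -> str:
        return self.model(prompt)
```

`cog build` then packages this predictor, along with its declared dependencies, into a container image you can push and deploy to Cloud Run.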
Deploying & monitoring ML models with Cloud Run: lightweight, scalable, and cost-efficient
Use Cloud Run for lightweight ML inference and build a cost-effective monitoring stack by using native GCP services like Cloud Logging and BigQuery.
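One building block of that stack: Cloud Run forwards anything a container writes to stdout to Cloud Logging, and single-line JSON entries are parsed as structured logs, so prediction metadata becomes queryable (and exportable to BigQuery through a log sink). A minimal sketch, with payload field names chosen here as an example:

```python
import json
import sys
import time

def log_prediction(model_version: str, latency_ms: float, label: str) -> None:
    entry = {
        "severity": "INFO",        # special field recognized by Cloud Logging
        "message": "prediction",
        # Example payload fields; name them however suits your analysis.
        "model_version": model_version,
        "latency_ms": latency_ms,
        "label": label,
        "unix_time": time.time(),
    }
    print(json.dumps(entry), file=sys.stdout, flush=True)
```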
Deploying a Google Cloud generative AI app in a website with Cloud Run
Deploy a simple Flask application that calls the Vertex AI Generative AI API as a scalable Cloud Run service.
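A hedged sketch of such a service, using the Vertex AI Python SDK; the model name, region, and the GOOGLE_CLOUD_PROJECT environment variable are assumptions you would set for your own project.

```python
import os

import vertexai
from flask import Flask, jsonify, request
from vertexai.generative_models import GenerativeModel

# Assumption: GOOGLE_CLOUD_PROJECT is set on the service.
vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
model = GenerativeModel("gemini-1.5-flash")  # example model name

app = Flask(__name__)

@app.post("/generate")
def generate():
    prompt = request.get_json()["prompt"]
    return jsonify({"text": model.generate_content(prompt).text})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", "8080")))
```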
Deploying Gemma directly from AI Studio to Cloud Run
Take the Gemma Python code from AI Studio and deploy it directly to Cloud Run, using Secret Manager for secure API key handling.
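The Secret Manager piece might look like the following sketch, which fetches the key once at startup; the project and secret IDs are hypothetical placeholders.

```python
from google.cloud import secretmanager

def get_api_key(project_id: str, secret_id: str, version: str = "latest") -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")

API_KEY = get_api_key("my-project", "gemini-api-key")  # hypothetical IDs
```

On Cloud Run you can instead mount the secret as an environment variable or file at deploy time, which keeps the lookup out of application code.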