Add runbook for scaling CustomersDot VMs
Summary
This MR adds a comprehensive runbook for scaling CustomersDot VMs, documenting the process discovered and tested in gitlab-com/gl-infra/production-engineering#27880 (closed).
What does this MR do?
Adds docs/customersdot/scaling-vms.md, which documents:
- Horizontal scaling (adding new VMs): Complete step-by-step process including infrastructure changes, Teleport setup, provisioning, and deployment
- Vertical scaling (resizing existing VMs): Process for changing machine types with minimal downtime
- Troubleshooting: Common issues encountered during the testing phase and their solutions
- Access and permissions: Who can perform these operations and what access is required
- Timeline expectations: Realistic time estimates for scaling operations
Why is this needed?
This capability is critical for:
- Usage Billing GA preparation
- Handling traffic spikes
- Emergency scaling during incidents
- Planned capacity increases
The testing in #27880 revealed several non-obvious steps and gotchas that need to be documented for future scaling operations.
Key learnings from testing
- Teleport tokens must be created in the same MR as the VM creation
- The pet_name=customers label is required for Ansible discovery
- VMs must use a specific Ubuntu 20.04 boot image
- Provisioning and deployment require Fulfillment team involvement
- Vertical scaling can be done with roughly 2-5 minutes of downtime per VM
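The first three learnings above lend themselves to a pre-flight check before an MR is opened. The following is only an illustrative sketch, not part of the runbook: `VmSpec`, `preflight_problems`, and the `ubuntu-2004` image-name marker are hypothetical names chosen for this example, and the real VM definitions live in the config-mgmt repository in a different form.

```python
from dataclasses import dataclass, field


# Hypothetical in-memory description of a new VM. The real definitions
# live in the config-mgmt repository and are not reproduced here.
@dataclass
class VmSpec:
    name: str
    labels: dict = field(default_factory=dict)
    boot_image: str = ""
    # True when the Teleport token is created in the same MR as the VM.
    has_teleport_token: bool = False


def preflight_problems(vm: VmSpec) -> list:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    if vm.labels.get("pet_name") != "customers":
        problems.append("missing label pet_name=customers (needed for Ansible discovery)")
    # Assumption: the required image can be recognised by an
    # "ubuntu-2004" marker in its name. The actual image name may differ.
    if "ubuntu-2004" not in vm.boot_image:
        problems.append("boot image is not the expected Ubuntu 20.04 image")
    if not vm.has_teleport_token:
        problems.append("Teleport token is not created in the same MR as the VM")
    return problems
```

Called on a spec that carries the label, an Ubuntu 20.04 image, and a token, the function returns an empty list. A spec with none of these returns three messages, one per missed learning.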
Related issues and MRs
- Discovery issue: gitlab-com/gl-infra/production-engineering#27880 (closed)
- Node map conversion (staging): ops.gitlab.net/gitlab-com/gl-infra/config-mgmt!12504
- Node map conversion (production): ops.gitlab.net/gitlab-com/gl-infra/config-mgmt!12530
- Example horizontal scaling: ops.gitlab.net/gitlab-com/gl-infra/config-mgmt!12567
- Example vertical scaling: ops.gitlab.net/gitlab-com/gl-infra/config-mgmt!12571