LabKit for Batch Jobs
This issue was opened after a discussion between @idawson, @mmishaev, @maw and me: https://docs.google.com/document/d/1SusqS4tNQhOPmqjFlDStSaJH2mW5I9K8yBtZ8TSSiis/edit?tab=t.qv0r0t5vmv2g and https://docs.google.com/document/d/1SusqS4tNQhOPmqjFlDStSaJH2mW5I9K8yBtZ8TSSiis/edit?tab=t.0.
Various teams at GitLab work with batch jobs. For instance, @idawson and the security teams are looking at running batch jobs for security scanning purposes. Additionally, GitLab Dedicated Provisioning runs as Kubernetes Jobs (apiVersion: batch/v1, kind: Job).
Batch jobs have some unique requirements. Scraped metric endpoints are a poor fit, since a job may appear and disappear within the window of a single scrape and so never be scraped at all. Jobs also produce end-of-run statistics that cover a whole invocation (max-memory, total-cpu-seconds, exit-value), which only exist once the job has finished.
As part of LabKit, we may want to consider handling batch jobs. This could take the form of a harness that runs a job and posts its metrics via a push gateway.
Over time, the harness could be extended to cover wider strategic initiatives, for instance NATS integration.