Large volume of jobs utilizing resource groups can result in jobs stuck waiting for resource

Summary

Initially reported by a customer in this ticket (internal use).

Jobs can be stuck "waiting for resource" despite their resource group being free. This issue is notably different from #425819 (closed). While similar in that jobs are stuck waiting on a resource group, most of the reported cases in that issue have different circumstances. This specific issue occurs when:

the relevant resource groups are operating in the unordered process mode
there are no manual jobs in the pipeline
the relevant "stuck" job's status is WAITING_FOR_RESOURCE, not created.

The pipeline that reproduces this behavior is somewhat unusual. It is a parent/child pipeline in which 6 child pipelines are spawned. Each child pipeline contains approximately 80 jobs. These jobs are assigned to resource groups in pairs: each resource group is assigned to one plan job and one apply job. The apply job depends on the plan job having completed first. No resource group is applied to more than its two respective jobs.
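To make the shape of one child pipeline concrete, here is a minimal sketch that builds an equivalent configuration as a Python dict. The job names, the `env-N` resource group keys, the `echo` scripts, and the use of `needs` to express the plan/apply dependency are all assumptions for illustration; the actual configuration lives in the example project and may differ.

```python
from collections import Counter


def build_child_pipeline(pairs: int = 40, prefix: str = "env") -> dict:
    """Return a dict shaped like a child pipeline's CI configuration.

    Each resource group is shared by exactly one plan job and one apply job.
    Names and scripts are hypothetical placeholders.
    """
    config = {"stages": ["plan", "apply"]}
    for i in range(pairs):
        group = f"{prefix}-{i}"
        config[f"plan-{group}"] = {
            "stage": "plan",
            "resource_group": group,
            "script": ["echo plan"],
        }
        config[f"apply-{group}"] = {
            "stage": "apply",
            "resource_group": group,
            # Assumption: the dependency is expressed with `needs`.
            "needs": [f"plan-{group}"],
            "script": ["echo apply"],
        }
    return config


child = build_child_pipeline()
jobs = {name: job for name, job in child.items() if name != "stages"}
group_usage = Counter(job["resource_group"] for job in jobs.values())

print(len(jobs))                 # 80 jobs per child pipeline
print(set(group_usage.values())) # every resource group is used by 2 jobs
```

With 40 pairs this yields the roughly 80 jobs per child pipeline described above; the parent pipeline would trigger 6 such children.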

When executing a pipeline configured this way, at least one apply job, seemingly at random, gets incorrectly stuck waiting for its resource group, despite that resource group being free. There doesn't seem to be any consistency with regard to which job is impacted: it can be any apply job within any one of the 6 child pipelines.

Steps to reproduce

Fork the example project
Run the pipeline.
One or more jobs in the subsequent child pipeline(s) will be stuck waiting on their respective resource group.

You may need to run a few pipelines for the behavior to occur, though it has been pretty consistent for me.

Example Project

https://gitlab.com/calebw/resource_group_lock

What is the current bug behavior?

Jobs are incorrectly stuck with a status of WAITING_FOR_RESOURCE despite their resource group being free, and the jobs being next in line as processables.

What is the expected correct behavior?

Jobs run as soon as their respective resource group is free.

Relevant logs and/or screenshots

This pipeline in the example project above demonstrates the issue. Within one of the child pipelines, an apply job is stuck waiting for its resource group.

Using the API, we can see that this resource group is using process_mode: unordered:

We can also see that the above job's status is waiting_for_resource, using the upcoming_jobs endpoint on that resource group:

I've checked the resources object via the rails console by calling Ci::ResourceGroup.find(RESOURCE_GROUP_ID).resources, and build.id: nil is returned, showing no jobs attached to the resource.
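The two API checks above could be scripted roughly as follows. This is a sketch, not the exact commands used for this report: it assumes the resource group endpoints of the GitLab REST API (`/projects/:id/resource_groups/:key` and its `upcoming_jobs` sub-resource), and the resource group key `env-0`, the helper names, and the sample payload are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

API = "https://gitlab.com/api/v4"


def resource_group_url(project: str, key: str, upcoming: bool = False) -> str:
    """Build the URL for a resource group, or for its upcoming jobs."""
    base = (
        f"{API}/projects/{quote(project, safe='')}"
        f"/resource_groups/{quote(key, safe='')}"
    )
    return base + "/upcoming_jobs" if upcoming else base


def fetch(url: str, token: str):
    """GET a URL with a personal access token and decode the JSON body."""
    request = Request(url, headers={"PRIVATE-TOKEN": token})
    with urlopen(request) as response:
        return json.load(response)


def waiting_job_ids(upcoming_jobs: list) -> list:
    """Return the IDs of upcoming jobs whose status is waiting_for_resource."""
    return [
        job["id"]
        for job in upcoming_jobs
        if job.get("status") == "waiting_for_resource"
    ]


# Hypothetical payload, shaped like an upcoming_jobs response.
sample = [
    {"id": 101, "name": "apply-env-0", "status": "waiting_for_resource"},
    {"id": 102, "name": "apply-env-1", "status": "created"},
]
print(resource_group_url("calebw/resource_group_lock", "env-0"))
print(waiting_job_ids(sample))
```

Against a live instance, `fetch(resource_group_url(project, key), token)["process_mode"]` would show the process mode, and `waiting_job_ids(fetch(resource_group_url(project, key, upcoming=True), token))` would list the jobs still queued on that resource group.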

Importantly, I have been unable to reproduce this when using the same pipeline configuration but with fewer jobs. I am wondering if this could be related to #420882 (closed), with relevant Sidekiq jobs getting lost and causing this behavior. Similar behavior can be seen in the Sidekiq logs (internal use) for this job.

Output of checks

This bug happens on GitLab.com

Edited Dec 12, 2023 by Caleb Williamson