Add method to remove runtime patterns after run #4236

ElenaKhaustova · 2024-10-17T13:09:35Z

Description

Since the pattern resolution logic was refactored, we process all patterns (dataset, default, runtime) together.

As a result of run() we return:

  datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset

kedro/kedro/runner/runner.py

Line 108 in ba98135

    
           # Check if there's any output datasets that aren't in the catalog and don't match a pattern

Before run() we add a runtime pattern to the catalog so that all intermediate outputs are processed as MemoryDataset.

So currently, when we do two consecutive runs (#4235), the runtime pattern {default} added during the first run affects the following runs: all datasets match it, so run() returns nothing because we treat these datasets as being in the catalog.
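The effect can be reproduced with a toy model. The sketch below is not Kedro's code: `ToyResolver`, `ToyCatalog`, and `toy_run` are hypothetical stand-ins that assume only what the description states, namely that a catch-all `{default}` runtime pattern is stored before the run and that membership checks consult explicit entries and patterns together. The MemoryDataset exclusion is left out to keep the example short.

```python
class ToyResolver:
    """Hypothetical stand-in for the object that stores dataset patterns."""

    def __init__(self):
        self.runtime_patterns = {}

    def matches(self, name):
        # A "{default}" pattern is a catch-all: it matches every dataset name.
        return "{default}" in self.runtime_patterns


class ToyCatalog:
    """Hypothetical catalog: explicit entries plus a resolver for patterns."""

    def __init__(self, datasets, resolver):
        self.datasets = set(datasets)
        self.resolver = resolver

    def __contains__(self, name):
        return name in self.datasets or self.resolver.matches(name)


def toy_run(catalog, pipeline_datasets, pipeline_outputs):
    # Registered datasets are collected before the runtime pattern is added...
    registered = {ds for ds in pipeline_datasets if ds in catalog}
    # ...but the pattern is then stored on the resolver and never removed.
    catalog.resolver.runtime_patterns["{default}"] = {"type": "MemoryDataset"}
    return pipeline_outputs - registered


resolver = ToyResolver()
catalog = ToyCatalog({"model_input_table"}, resolver)
datasets = {"model_input_table", "X_test", "regressor"}
outputs = {"X_test", "regressor"}

first = toy_run(catalog, datasets, outputs)   # both outputs are returned
second = toy_run(catalog, datasets, outputs)  # empty: everything now matches
```

On the second call every name passes the membership check because of the leftover pattern, so nothing is left to return.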

Development notes

We think that the resolution logic is correct and that all patterns should be processed together, as we do now. To avoid this behaviour we added a method to remove runtime patterns after the run, so they live only within the run and do not affect other runs.
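A minimal sketch of the idea, under stated assumptions: `ToyResolver` and `run_with_scoped_patterns` are hypothetical names, the method names only mirror the wording of this PR, and using `try`/`finally` is an illustration choice rather than a description of the actual change.

```python
class ToyResolver:
    """Hypothetical pattern store; not Kedro's CatalogConfigResolver."""

    def __init__(self):
        self.runtime_patterns = {}

    def add_runtime_patterns(self, patterns):
        self.runtime_patterns.update(patterns)

    def remove_runtime_patterns(self, patterns):
        for name in patterns:
            # pop with a default so a missing pattern is not an error
            self.runtime_patterns.pop(name, None)


def run_with_scoped_patterns(resolver, patterns, body):
    """Register patterns, call body(), and always remove the patterns again."""
    resolver.add_runtime_patterns(patterns)
    try:
        return body()
    finally:
        # Runs even if body() raises, so patterns never leak into later runs.
        resolver.remove_runtime_patterns(patterns)


resolver = ToyResolver()
default = {"{default}": {"type": "MemoryDataset"}}

# Snapshot the patterns visible while the "run" is in progress.
seen_during_run = run_with_scoped_patterns(
    resolver, default, lambda: dict(resolver.runtime_patterns)
)
```

After the call, `resolver.runtime_patterns` is empty again, while `seen_during_run` still shows the pattern that was active inside the run.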

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

noklam · 2024-10-17T13:56:04Z

datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset

If a dataset is defined in the catalog or matches a pattern, it would not be returned, with the exception of MemoryDataset. In this case it should still be returned, because the runtime pattern adds a memory dataset. Can you clarify why changing the pattern resolution breaks this?

#3475

ElenaKhaustova · 2024-10-17T14:01:34Z

datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset

If a dataset is defined in the catalog or matches a pattern, it would not be returned, with the exception of MemoryDataset. In this case it should still be returned, because the runtime pattern adds a memory dataset. Can you clarify why changing the pattern resolution breaks this?

#3475

As I tried to explain above, after we add the runtime pattern, which matches all datasets, it remains in the catalog for the second run. So this part of the condition no longer holds: "don't match a pattern in the catalog". Previously the logic was different: resolution and runtime patterns were processed separately.

ElenaKhaustova · 2024-10-17T14:06:24Z

datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset

If a dataset is defined in the catalog or matches a pattern, it would not be returned, with the exception of MemoryDataset. In this case it should still be returned, because the runtime pattern adds a memory dataset. Can you clarify why changing the pattern resolution breaks this?

#3475

The check "if ds in catalog" is True for all datasets on the second run, as they all match the runtime pattern.

kedro/kedro/runner/runner.py

Line 90 in ba98135

registered_ds = [ds for ds in pipeline.datasets() if ds in catalog]

noklam · 2024-10-17T14:14:46Z

I get that in the second run it will be in registered_ds. But the logic here is:
free_outputs = outputs - registered_ds (excluding memory datasets)

So my expectation is that, even if it's registered, it will still be returned.

kedro/kedro/runner/runner.py

Line 110 in ba98135

free_outputs = pipeline.outputs() - (set(registered_ds) - memory_datasets)

Trace:
I ran this twice in the test and added some print statements in runner.py:

pipeline.outputs()={'y_test', 'X_test', 'regressor'}
registered_ds=['params:model_options', 'model_input_table']
memory_datasets={'model_input_table', 'params:model_options'}
free_outputs={'y_test', 'X_test', 'regressor'}


pipeline.outputs()={'y_test', 'X_test', 'regressor'}
registered_ds=['X_test', 'params:model_options', 'model_input_table', 'X_train', 'regressor', 'y_test', 'y_train']
memory_datasets={'model_input_table', 'params:model_options'}
free_outputs=set()
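Both traces can be checked by hand by applying the set expression from line 110 of runner.py to the printed values. The helper name `free_outputs_for` is only for illustration.

```python
def free_outputs_for(outputs, registered_ds, memory_datasets):
    # Same expression as in runner.py, applied to plain Python collections.
    return set(outputs) - (set(registered_ds) - set(memory_datasets))


outputs = {"y_test", "X_test", "regressor"}
memory_datasets = {"model_input_table", "params:model_options"}

# First run: only the two inputs are registered, and both are memory datasets.
run1 = free_outputs_for(
    outputs,
    ["params:model_options", "model_input_table"],
    memory_datasets,
)

# Second run: every pipeline dataset is registered via the runtime pattern,
# but memory_datasets has not grown, so nothing is exempted.
run2 = free_outputs_for(
    outputs,
    ["X_test", "params:model_options", "model_input_table", "X_train",
     "regressor", "y_test", "y_train"],
    memory_datasets,
)

print(run1 == outputs)  # True
print(run2)             # set()
```

The difference between the runs comes entirely from registered_ds growing while memory_datasets stays the same.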

This is the result I get with the other test. I think the problem is where we identify the memory datasets: in the 2nd run I expected y_test, X_test, and regressor to be in memory_datasets.

    # Identify MemoryDataset in the catalog

    memory_datasets = {
        ds_name
        for ds_name, ds in catalog._datasets.items()
        if isinstance(ds, MemoryDataset)
    }

ElenaKhaustova · 2024-10-17T15:03:37Z

This is the result I get with the other test. I think the problem is where we identify the memory datasets: in the 2nd run I expected y_test, X_test, and regressor to be in memory_datasets.

They are not in the catalog because we call

        catalog = catalog.shallow_copy(
            extra_dataset_patterns=self._extra_dataset_patterns
        )

before calling _run(), so we modify a different catalog object. But the runtime pattern is stored in CatalogConfigResolver, so it persists between runs.

In the new catalog, it will be as you expect.
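To illustrate the point about the copy, here is a toy model. It assumes only what this comment says: the copy gets its own dataset mapping, while the pattern store is shared. `ToyResolver` and `ToyCatalog` are hypothetical classes and do not reproduce Kedro's actual `shallow_copy` implementation.

```python
class ToyResolver:
    """Hypothetical pattern store shared between catalog copies."""

    def __init__(self):
        self.runtime_patterns = {}


class ToyCatalog:
    def __init__(self, datasets, resolver):
        self.datasets = dict(datasets)  # each catalog owns its own mapping
        self.resolver = resolver        # the resolver object is shared

    def shallow_copy(self, extra_dataset_patterns=None):
        # The extra patterns are written to the shared resolver...
        if extra_dataset_patterns:
            self.resolver.runtime_patterns.update(extra_dataset_patterns)
        # ...while the copy receives a fresh dataset mapping.
        return ToyCatalog(self.datasets, self.resolver)


original = ToyCatalog({"model_input_table": "csv"}, ToyResolver())
run_catalog = original.shallow_copy({"{default}": {"type": "MemoryDataset"}})

# A dataset materialized during the run lands only in the copy.
run_catalog.datasets["X_test"] = "memory"

print("X_test" in original.datasets)                     # False
print("{default}" in original.resolver.runtime_patterns)  # True
```

So the materialized memory datasets disappear with the copy, while the pattern that produced them stays visible to the original catalog.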

ElenaKhaustova · 2024-10-17T15:31:43Z

Please note that shallow_copy will be removed from the new catalog, but the mechanism to remove runtime patterns after the run will still be useful so that subsequent runs are not affected.

ankatiyar

Given that we are removing shallow_copy stuff in the future, this solution makes sense to me! Thanks @ElenaKhaustova 👍🏾

ElenaKhaustova · 2024-10-17T22:46:27Z

Moved to draft for now, as I want to double-check some cases that @noklam shared.

noklam · 2024-10-18T13:35:35Z

For reference I shared my edge cases here: https://github.com/noklam/kedro-runner-bug-investigation/commits/main/

ElenaKhaustova · 2024-10-18T13:59:56Z

For reference I shared my edge cases here: https://github.com/noklam/kedro-runner-bug-investigation/commits/main/

The case shared above was also not handled by previous Kedro versions. It relates to the issue raised by the user, so an alternative solution was added to fix both.

The issue is that we were processing runtime patterns separately from others and never returning them in the run() output, even if they were persistent, like in the example above.

In the solution we move the logic that checks which datasets are in the catalog to after the runtime pattern is added, and the logic that identifies MemoryDatasets to after the run, so that they appear in the catalog if they were added via a pattern. The same is done for SharedMemoryDataset, because that's the default dataset pattern (runtime pattern) for ParallelRunner.
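The reordering can be sketched with a toy model. This is a simplified illustration under assumptions, not the merged code: `ToyMemoryDataset`, `ToyCatalog`, and `toy_run` are hypothetical, persistent datasets are represented by plain strings, and the "run" only materializes outputs.

```python
class ToyMemoryDataset:
    """Hypothetical marker class standing in for an in-memory dataset."""


class ToyCatalog:
    def __init__(self, datasets, patterns):
        self.datasets = dict(datasets)  # own mapping: name -> dataset
        self.patterns = patterns        # shared pattern store

    def __contains__(self, name):
        return name in self.datasets or "{default}" in self.patterns

    def materialize(self, name):
        # Names without an explicit entry fall back to the default pattern.
        if name not in self.datasets:
            self.datasets[name] = ToyMemoryDataset()


def toy_run(catalog, pipeline_datasets, pipeline_outputs):
    run_catalog = ToyCatalog(catalog.datasets, catalog.patterns)
    run_catalog.patterns["{default}"] = "MemoryDataset"
    # Step 1: collect registered datasets AFTER the runtime pattern is added.
    registered = {ds for ds in pipeline_datasets if ds in run_catalog}
    # Stand-in for the run: saving outputs materializes pattern-based datasets.
    for ds in pipeline_outputs:
        run_catalog.materialize(ds)
    # Step 2: identify memory datasets AFTER the run.
    memory = {
        name
        for name, ds in run_catalog.datasets.items()
        if isinstance(ds, ToyMemoryDataset)
    }
    return pipeline_outputs - (registered - memory)


catalog = ToyCatalog({"model_input_table": "csv", "regressor": "pickle"}, {})
datasets = {"model_input_table", "X_test", "regressor"}
outputs = {"X_test", "regressor"}

first = toy_run(catalog, datasets, outputs)
second = toy_run(catalog, datasets, outputs)
print(first == second)  # True
```

Because both sets are now computed from the same state of the catalog, a leftover pattern no longer changes the result: in this toy, both runs return only the in-memory output `X_test`.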

noklam

Thanks for the change! I think this looks good now. I think we need a regression test case here; the original one shared by the user is a good one, where running the same catalog twice should give the same result. I can approve this quickly once the test is added.

ElenaKhaustova · 2024-10-21T09:31:33Z

@noklam

Added a test to run a node twice and double-checked that it fails without the changes made.

merelcht

LGTM!

ankatiyar

Thanks @ElenaKhaustova 👍🏾

Added method to remove runtime patterns

5813d01

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

ElenaKhaustova self-assigned this Oct 17, 2024

ElenaKhaustova added 2 commits October 17, 2024 14:29

Added test for remove_runtime_pattern

3bc8831

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

Fixed types match

d0717ae

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

ElenaKhaustova marked this pull request as ready for review October 17, 2024 13:44

ElenaKhaustova requested a review from merelcht as a code owner October 17, 2024 13:44

ElenaKhaustova requested review from noklam and ankatiyar October 17, 2024 13:44

ankatiyar approved these changes Oct 17, 2024

ElenaKhaustova marked this pull request as draft October 17, 2024 22:20

ElenaKhaustova added 2 commits October 18, 2024 14:21

Implemented alternative solution

908bfd1

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

Merge branch 'main' into fix/4235-run-output

3bfa5de

ElenaKhaustova added 2 commits October 18, 2024 14:40

Moved catalog validation before it extended with runtime patter

63fab49

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

Removed debug output

519bd47

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

ElenaKhaustova marked this pull request as ready for review October 18, 2024 14:00

ElenaKhaustova requested a review from ankatiyar October 18, 2024 14:00

noklam reviewed Oct 18, 2024

ElenaKhaustova added 2 commits October 21, 2024 10:19

Merge branch 'main' into fix/4235-run-output

d8c1611

Added test to call run twice

31ed1c5

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

ElenaKhaustova requested a review from noklam October 21, 2024 09:32

noklam approved these changes Oct 21, 2024

merelcht approved these changes Oct 21, 2024

ankatiyar approved these changes Oct 21, 2024

ElenaKhaustova merged commit 3818a2a into main Oct 21, 2024
28 checks passed

ElenaKhaustova deleted the fix/4235-run-output branch October 21, 2024 13:17

ElenaKhaustova mentioned this pull request Oct 21, 2024

0.19.9 introduced error, output is saved only once when running the pipeline #4235

Closed
