[go: up one dir, main page]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add method to remove runtime patterns after run #4236

Merged
merged 9 commits into from
Oct 21, 2024

Conversation

ElenaKhaustova
Copy link
Contributor
@ElenaKhaustova ElenaKhaustova commented Oct 17, 2024

Description

After the pattern resolution logic refactoring we process all the patterns (dataset, default, runtime) together.

As a result of run() we return:

  datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset

# Check if there's any output datasets that aren't in the catalog and don't match a pattern

Before the run() we add a runtime pattern to the catalog, so we could process all intermediate outputs as MemoryDataset.

So currently when we do two consecutive runs #4235 runtime pattern {default} added after the first run affects the next runs, so that all datasets match it and we do not return anything as a result of run() cause we think these datasets are in the catalog.

Development notes

We think that the resolution logic is correct and all patterns should be processed together as we do now. To avoid this behaviour we added method to remove runtime patters after the run, so they only live within the run and do not affect other runs.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
@ElenaKhaustova ElenaKhaustova self-assigned this Oct 17, 2024
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review October 17, 2024 13:44
@noklam
Copy link
Contributor
noklam commented Oct 17, 2024

datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset

If a dataset is defined in catalog or as a pattern, the catalog would not return as, with the exception of MemoryDataset. In this case it should be still return because the runtime pattern add a memory dataset. Can you clarify why changing the pattern resolution somehow breaks this?

#3475

@ElenaKhaustova
Copy link
Contributor Author
ElenaKhaustova commented Oct 17, 2024

datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset

If a dataset is defined in catalog or as a pattern, the catalog would not return as, with the exception of MemoryDataset. In this case it should be still return because the runtime pattern add a memory dataset. Can you clarify why changing the pattern resolution somehow breaks this?

#3475

As I tried to explain above after we add runtime pattern - which matches all datasets it remains in the catalog when the second run. So this part of the condition is not valid - don't match a pattern in the catalog. Previously there was a different logic when resolution and runtime patterns were processed separately.

@ElenaKhaustova
Copy link
Contributor Author

datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset

If a dataset is defined in catalog or as a pattern, the catalog would not return as, with the exception of MemoryDataset. In this case it should be still return because the runtime pattern add a memory dataset. Can you clarify why changing the pattern resolution somehow breaks this?

#3475

if ds in catalog is True for all datasets as they match runtime pattern on the second run

registered_ds = [ds for ds in pipeline.datasets() if ds in catalog]

@noklam
Copy link
Contributor
noklam commented Oct 17, 2024

I get this in the second run it will be in the registered_ds. But the logic here is:
free_output = output - register_ds (excluding if is a memory dataset)

So my expectation here is that, even if it's registered, it will still be return.

free_outputs = pipeline.outputs() - (set(registered_ds) - memory_datasets)

Trace:
I run this twice in the test and add some printing statement in runner.py

pipeline.outputs()={'y_test', 'X_test', 'regressor'}
registered_ds=['params:model_options', 'model_input_table']
memory_datasets={'model_input_table', 'params:model_options'}
free_outputs={'y_test', 'X_test', 'regressor'}


pipeline.outputs()={'y_test', 'X_test', 'regressor'}
registered_ds=['X_test', 'params:model_options', 'model_input_table', 'X_train', 'regressor', 'y_test', 'y_train']
memory_datasets={'model_input_table', 'params:model_options'}
free_outputs=set()

This is the result I get with other test, I think the problem is where we define Memory dataset, in 2nd run I expected, y_test, X_test, regressor in the memory_datasets

Identify MemoryDataset in the catalog

    memory_datasets = {
        ds_name
        for ds_name, ds in catalog._datasets.items()
        if isinstance(ds, MemoryDataset)
    }

@ElenaKhaustova
Copy link
Contributor Author
ElenaKhaustova commented Oct 17, 2024

This is the result I get with other test, I think the problem is where we define Memory dataset, in 2nd run I expected, y_test, X_test, regressor in the memory_datasets

They are not in the catalog because we make

        catalog = catalog.shallow_copy(
            extra_dataset_patterns=self._extra_dataset_patterns
        )

before calling _run(), so we modify different catalog object. But runtime pattern is stored in CatalogConfigResolver, so it remains the same between the runs.

In the new catalog, it will be as you expect.

@ElenaKhaustova
Copy link
Contributor Author

Please note that shallow_copy will be removed from the new catalog, but the mechanism to remove runtime patterns after the run will be helpful anyway to not affect consequent runs.

Copy link
Contributor
@ankatiyar ankatiyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we are removing shallow_copy stuff in the future, this solution makes sense to me! Thanks @ElenaKhaustova 👍🏾

@ElenaKhaustova ElenaKhaustova marked this pull request as draft October 17, 2024 22:20
@ElenaKhaustova
Copy link
Contributor Author

Moved to draft now as want to double-check some cases that @noklam shared.

@noklam
Copy link
Contributor
noklam commented Oct 18, 2024

For reference I shared my edge cases here: https://github.com/noklam/kedro-runner-bug-investigation/commits/main/

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
@ElenaKhaustova
Copy link
Contributor Author

For reference I shared my edge cases here: https://github.com/noklam/kedro-runner-bug-investigation/commits/main/

The case shared above was also not handled by the previous kendo versions. It relates to the issue raised by the user, so an alternative solution was added to fix both.

The issue is that we were processing runtime patterns separately from others and never returning them in the run() output, even if they were persistent, like in the example above.

In the solution we move the logic of accessing datasets in the catalog after runtime pattern is added, and the logic of accessing MemoryDataset's after the run, so they appear in the catalog in case they were added with a pattern. The same is done for SharedMemoryDataset cause that's the default dataset pattern (runtime pattern) for ParallelRunner.

@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review October 18, 2024 14:00
Copy link
Contributor
@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change! I think this looks good now. I think we need a regression test case here, the original one shared by the user is a good one, where we run the same catalog twice should get the same result. I can approve this quickly with the test.

@ElenaKhaustova
Copy link
Contributor Author

@noklam

Added test to run node twice and double-checked that it failed without changes made.

Copy link
Member
@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor
@ankatiyar ankatiyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ElenaKhaustova 👍🏾

@ElenaKhaustova ElenaKhaustova merged commit 3818a2a into main Oct 21, 2024
28 checks passed
@ElenaKhaustova ElenaKhaustova deleted the fix/4235-run-output branch October 21, 2024 13:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants