Handle duplicate documents during processing #2487
Comments
Hi @JaCoB1123, this is possible by reducing the number of parallel processes to 1. Otherwise there will always be a race condition, which is of course more likely to occur on larger files. This cannot "really" be fixed, imho, without doing some locking to first obtain the checksum of a file, which would slow down processing quite a lot. It is often not so hard to make the ingestion part more robust, so that duplicate files are excluded before they are even transferred to the server.
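Not part of the original thread, but a minimal sketch of what excluding duplicates on the client side before upload could look like, as suggested above. The object name `DedupBeforeUpload`, the directory scan, and the hand-off to `dsc` are all hypothetical; only the general idea (hash each file, skip ones whose checksum was already seen) follows the comment.

```scala
import java.nio.file.{Files, Path, Paths}
import java.security.MessageDigest
import scala.jdk.CollectionConverters._

object DedupBeforeUpload {

  // Hex-encoded SHA-256 of a file's contents.
  // Reading the whole file is fine for moderately sized documents.
  def sha256(file: Path): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    val bytes  = Files.readAllBytes(file)
    digest.digest(bytes).map("%02x".format(_)).mkString
  }

  def main(args: Array[String]): Unit = {
    val dir  = Paths.get(args.headOption.getOrElse("."))
    val seen = scala.collection.mutable.Set.empty[String]

    Files.list(dir).iterator().asScala
      .filter(p => Files.isRegularFile(p))
      .foreach { file =>
        val hash = sha256(file)
        if (seen.add(hash))
          println(s"upload: $file")        // hand off to dsc or the HTTP API here
        else
          println(s"skip duplicate: $file") // same checksum already queued for upload
      }
  }
}
```

To survive across runs, the set of seen hashes would have to be persisted (or looked up on the server) instead of kept in memory.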
I guessed it wouldn't be that easy. I thought that if there were a table of running jobs that contained the hash, you could check it before starting a job (or even make it the primary key?). That would probably break down when a document consists of multiple files, though. Another solution might be to check for duplicates after the job, just before inserting the result. The job's work might then have been wasted, but at least no duplicates would be introduced. Would that maybe be a minor change and still handle most problems? Decreasing the number of parallel processes to 1 is way easier, though. I'll try it and see how much it affects the time when adding documents.
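A hedged sketch of the "check for duplicates just before inserting, after the job has run" idea, pushed down to the database so the race between parallel jobs is resolved there. The table and column names (`item`, `checksum`, `name`), the PostgreSQL `ON CONFLICT` clause, and the use of plain JDBC are assumptions for illustration only; Docspell's actual schema and persistence layer differ.

```scala
import java.sql.Connection

object InsertIfNew {

  // Insert the processed document only if no row with this checksum exists yet.
  // A unique index on the checksum column lets the database arbitrate the race:
  // whichever parallel job inserts first wins, the others simply do nothing.
  def insertIfNew(conn: Connection, checksum: String, name: String): Boolean = {
    val sql =
      """INSERT INTO item (checksum, name)
        |VALUES (?, ?)
        |ON CONFLICT (checksum) DO NOTHING""".stripMargin
    val ps = conn.prepareStatement(sql)
    try {
      ps.setString(1, checksum)
      ps.setString(2, name)
      ps.executeUpdate() == 1 // 0 affected rows means a duplicate already existed
    } finally ps.close()
  }
}
```

With this shape, the duplicate's processing work is still wasted, but no second item is created, which matches the trade-off described above.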
The hash is computed, but this can take a while, and there will be race conditions if multiple parallel processes are doing it. A job could be anything and could be tasked with processing multiple files (which means the check must be part of the task doing the work). But you are right, there are of course better ways to prevent this.
I had duplicates in Docspell on some occasions because the duplicate check only handles processed documents. When I have a big document that takes a long time to process and I rerun dsc to upload local files, it is added to the processing queue again.