-
Notifications
You must be signed in to change notification settings - Fork 259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add upstream changes and clean up where possible #42
Conversation
Conflicts: chardet/compat.py
- Switch to using nose - Calculate coverage with coveralls - Use conda for setting up Python since Travis is flakey about that
Before we were cleaning up the known encoding string for each file of that encoding. This was completely unnecessary.
Much like "filterWithoutEnglishLetters", it doesn't actually strip out non-English characters. Instead, it's a filter that is used by Latin1Prober to inaccurately remove HTML tags. In reality, it removes a lot of other stuff too, but I'm not sure if that was intentional yet. Also cleaned up some of the pylint warnings for charsetprober.py.
- Made all classes use super() instead of explicitly listing calling parent __init__ - Moved latin1 test file that's actually ascii into the correct dir - Removed some commented out lines of code that have been there forever - Made sure all attributes are defined in __init__ instead of at will - Other pylint warning fixes
…uff. Clean up stuff includes: - Stop importing chardet from setup.py now that we have a dependency (enum34 on everything other than Python 3.4). - Add a lot of notes to NOTES.rst about how chardet actually works. - Removed sections of frequency rank tables that we do not actually use. It was just wasting memory. - Removed "m" prefix from attributes all over. Will fix snake case and things like that in a later commit. - Added a lot of comments to UniversalDetector. - Added the ability to ignore certain encodings when running unit tests. This was necessary because we don't actually support some of the encodings we were being tested on! - Removed constants.py now that we have enums.py - Switched to using logging instead of printing to sys.stderr. This actually should help a lot for debugging failed unit tests. - Made a CLI sub-package for chardetect. In the future, we'll reorganize more things into sub-packages.
I'm sure there are plenty of line length and alignment issues after all the find-and-replace stuff I've done. Will fix those next.
@sigmavirus24 I know this is a total bear of a PR at this point, but most of the changes are purely cosmetic. The main substantive changes are:
I'd like to propose a more substantial refactoring of things to make it simpler to retrain some of the models, which were never updated after they were established ~15 years ago, but I'll file a separate issue about that. If you get a chance to look this over, I'd appreciate your thoughts. |
Wow, I hadn't seen landscape.io before. That seems pretty nice. I'm fine with using that if you are. |
That was a mishap of all the massive find-and-replacing I was doing. That said, I'd honestly prefer to make everything public that doesn't currently need/use a property. |
@sigmavirus24 I believe I've addressed all of your comments at this point. |
With the Hungarian probers commented out, and the missing encodings list updated to include ISO-8859-6 (since we don't detect that), only one unit test failure remains, and it's a mix up for a document where the difference between Windows-1253 and ISO-8859-7 would lead to a single character being improperly decoded.
I've gone ahead and disabled our Hungarian probers (like the Mozilla authors did). This fixes all but one of our unit test failures. The remaining failure is for a Greek XML file where we misidentify ISO-8859-7 as Windows-1253, which leads to a one character difference in the decoded output. Without retraining our models, I'm not sure how we could actually fix that one. The difference is: - δραστήριου τετραπληγικού. Η εργασία είναι στην Αθήνα, στον Άγιο Νικόλαο (Αχαρνών). Τα καθήκοντά αφορούν στην υποστήριξη των υλικών καθημερινών αναγκών, την
? ^
+ δραστήριου τετραπληγικού. Η εργασία είναι στην Αθήνα, στον ¶γιο Νικόλαο (Αχαρνών). Τα καθήκοντά αφορούν στην υποστήριξη των υλικών καθημερινών αναγκών, την
? ^ |
I'm going to merge this because I believe I've addressed all of the objections @sigmavirus24 and we have much bigger future direction things to think about at the moment. |
…haul Add upstream changes and clean up where possible
So I just realized, @dan-blanchard that you added a reliance on |
@sigmavirus24 why is the enum34 reliance an issue for requests? Since requests is probably the main way that chardet gets used, we can change things for it if necessary, but IMHO the code is a lot cleaner using enums. |
Because Kenneth insists on vendoring requests' dependencies. Which means dependencies with dependencies need to either vendor them or they're axed. Maybe once we can convince Kenneth to stop vendoring it won't be a problem, but this will prevent requests users from getting any of the bug fixes introduced in this mega pull request. Disclaimer: I think Enums are superior too. |
@sigmavirus24 I just made our enums inherit from |
Attribute '_mCharSetProbers' removed from 'UniversalDetector' in 'cchardet' v3.0.0, replaced by '_charset_probers'. Version check adjusted to cover also version 5 in addition to 3 and 4. References: chardet/chardet@be09612 chardet/chardet#42
This is very much a work in progress at the moment, and I'm just creating the PR to make it easier for me to keep track of Travis results.
I have a few goals for this branch:
So far, I've done a little bit of point 1 and updated the Travis testing setup to use nose and report test coverage via Coveralls.