Add upstream changes and clean up where possible #42

dan-blanchard · 2014-12-02T21:05:08Z

This is very much a work in progress at the moment, and I'm just creating the PR to make it easier for me to keep track of Travis results.

I have a few goals for this branch:

Pull in changes from Mozilla's upstream code. There aren't as many as I had initially expected but there are some.
Improve PEP8 compliance all over the place. The previous maintainers tried to keep variable names identical to the C code, presumably to ease the comparison with the Mozilla code, but we're going to be diverging from upstream after pulling in the changes mentioned in 1. Basically, Mozilla seems very likely to abandon their character encoding detector in the near future and switch to using ICU, but ICU doesn't support all of the codecs we currently do, because it is more web-focused. If our goal here is to be a truly universal character encoding detector, we'll need to go our own way in the future in that respect.
Make the unit tests pass, or at the very least make it obvious that the tests are actually failing (instead of ignoring the failures like our current Travis build does).

So far, I've done a little bit of point 1 and updated the Travis testing setup to use nose and report test coverage via Coveralls.

Conflicts: chardet/compat.py

- Switch to using nose - Calculate coverage with coveralls - Use conda for setting up Python since Travis is flakey about that

…uff.

Before we were cleaning up the known encoding string for each file of that encoding. This was completely unnecessary.

Much like "filterWithoutEnglishLetters", it doesn't actually strip out non-English characters. Instead, it's a filter that is used by Latin1Prober to inaccurately remove HTML tags. In reality, it removes a lot of other stuff too, but I'm not sure if that was intentional yet. Also cleaned up some of the pylint warnings for charsetprober.py.

- Made all classes use super() instead of explicitly listing calling parent __init__ - Moved latin1 test file that's actually ascii into the correct dir - Removed some commented out lines of code that have been there forever - Made sure all attributes are defined in __init__ instead of at will - Other pylint warning fixes

…ulous.

…uff. Clean up stuff includes: - Stop importing chardet from setup.py now that we have a dependency (enum34 on everything other than Python 3.4). - Add a lot of notes to NOTES.rst about how chardet actually works. - Removed sections of frequency rank tables that we do not actually use. It was just wasting memory. - Removed "m" prefix from attributes all over. Will fix snake case and things like that in a later commit. - Added a lot of comments to UniversalDetector. - Added the ability to ignore certain encodings when running unit tests. This was necessary because we don't actually support some of the encodings we were being tested on! - Removed constants.py now that we have enums.py - Switched to using logging instead of printing to sys.stderr. This actually should help a lot for debugging failed unit tests. - Made a CLI sub-package for chardetect. In the future, we'll reorganize more things into sub-packages.

I'm sure there are plenty of line length and alignment issues after all the find-and-replace stuff I've done. Will fix those next.

dan-blanchard · 2015-01-05T21:57:22Z

@sigmavirus24 I know this is a total bear of a PR at this point, but most of the changes are purely cosmetic. The main substantive changes are:

PEP8 compliance as much as possible (especially with variable names).
Implemented filter_english_letters (filter_with_english_letters #46)
Switched to using enums (via enum34 and Python 3.4 standard library) instead of constants in a few places.
Added a bunch of comments to UniversalDetector (I'll try to add more to others later).
Added the same language filtering support present in the upstream Mozilla code (which appears to be the only substantive change they made other than commenting out the Hungarian detectors).

I'd like to propose a more substantial refactoring of things to make it simpler to retrain some of the models, which were never updated after they were established ~15 years ago, but I'll file a separate issue about that.

If you get a chance to look this over, I'd appreciate your thoughts.

dan-blanchard · 2015-01-06T16:04:47Z

Further, if we're going to care about style, we should add it as a Travis CI job to at least catch regressions. Or we could add landscape.io. If we want to do the former, I'll send a PR to this branch with a set-up I use on pretty much every other project that makes it significantly easier.

Wow, I hadn't seen landscape.io before. That seems pretty nice. I'm fine with using that if you are.

dan-blanchard · 2015-01-06T16:06:02Z

So I have a few thoughts that I've left as comments. I noticed you selectively made one attribute of the probers public. Given the defined API for the library, why not make everything public? Or leave everything private as it was? I'm not sure I understand the point of it. Seems like unnecessary churn.

That was a mishap of all the massive find-and-replacing I was doing. That said, I'd honestly prefer to make everything public that doesn't currently need/use a property.

dan-blanchard · 2015-01-06T18:24:19Z

@sigmavirus24 I believe I've addressed all of your comments at this point.

With the Hungarian probers commented out, and the missing encodings list updated to include ISO-8859-6 (since we don't detect that), only one unit test failure remains, and it's a mix up for a document where the difference between Windows-1253 and ISO-8859-7 would lead to a single character being improperly decoded.

dan-blanchard · 2015-01-07T15:43:53Z

I've gone ahead and disabled our Hungarian probers (like the Mozilla authors did). This fixes all but one of our unit test failures. The remaining failure is for a Greek XML file where we misidentify ISO-8859-7 as Windows-1253, which leads to a one character difference in the decoded output. Without retraining our models, I'm not sure how we could actually fix that one.

The difference is:

-  δραστήριου τετραπληγικού. Η εργασία είναι στην Αθήνα, στον Άγιο Νικόλαο (Αχαρνών). Τα καθήκοντά αφορούν στην υποστήριξη των υλικών καθημερινών αναγκών, την 
?                                                             ^

+  δραστήριου τετραπληγικού. Η εργασία είναι στην Αθήνα, στον ¶γιο Νικόλαο (Αχαρνών). Τα καθήκοντά αφορούν στην υποστήριξη των υλικών καθημερινών αναγκών, την 
?                                                             ^

dan-blanchard · 2015-01-09T19:01:43Z

I'm going to merge this because I believe I've addressed all of the objections @sigmavirus24 and we have much bigger future direction things to think about at the moment.

…haul Add upstream changes and clean up where possible

sigmavirus24 · 2015-03-29T15:15:07Z

So I just realized, @dan-blanchard that you added a reliance on enum34 which is going to pose problems for requests. Any chance we can revert that work since it isn't necessary or a public API change?

dan-blanchard · 2015-03-30T13:21:01Z

@sigmavirus24 why is the enum34 reliance an issue for requests? Since requests is probably the main way that chardet gets used, we can change things for it if necessary, but IMHO the code is a lot cleaner using enums.

sigmavirus24 · 2015-03-30T13:34:11Z

Because Kenneth insists on vendoring requests' dependencies. Which means dependencies with dependencies need to either vendor them or they're axed. Maybe once we can convince Kenneth to stop vendoring it won't be a problem, but this will prevent requests users from getting any of the bug fixes introduced in this mega pull request.

Disclaimer: I think Enums are superior too.

dan-blanchard · 2015-09-19T15:20:32Z

@sigmavirus24 I just made our enums inherit from object instead of IntEnum because we weren't actually using any of the fancier features. That should eliminate the problem. 02edb35

Attribute '_mCharSetProbers' removed from 'UniversalDetector' in 'cchardet' v3.0.0, replaced by '_charset_probers'. Version check adjusted to cover also version 5 in addition to 3 and 4. References: chardet/chardet@be09612 chardet/chardet#42

dan-blanchard added 7 commits October 10, 2014 20:52

Comment out sections of tables that weren't used to save memory.

8bc4b89

Add 3.4 to list for Travis testing and remove 3.2

20cad49

Bunch of little clean up things

44752d7

Merge branch 'master' into feature/upstream-changes-and-overhaul

7251430

Conflicts: chardet/compat.py

Add if __name__... to test.py and a break to speed things up in loop.

0ecaf05

Modernize testings

9b8b12c

- Switch to using nose - Calculate coverage with coveralls - Use conda for setting up Python since Travis is flakey about that

Fix missing req_path in setup.py

3e13cc7

dan-blanchard added the enhancement label Dec 2, 2014

dan-blanchard added 2 commits December 2, 2014 16:19

Simplify Travis setup and just use pip. conda was overkill for our st…

59f30b7

…uff.

Make tests slightly more efficient.

a5c7484

Before we were cleaning up the known encoding string for each file of that encoding. This was completely unnecessary.

dan-blanchard mentioned this pull request Dec 19, 2014

filter_with_english_letters #46

Closed

dan-blanchard added 11 commits December 20, 2014 22:50

Merge branch 'master' into feature/upstream-changes-and-overhaul

07a5849

Switch to new Travis docker VMs and add PyPy testing.

267c5d8

Fix some pylint warnings in universaldetector.py

7cfa45c

Made latin1 equivalent to windows-1252 when running unit tests.

125575f

Comment out pypy line in .travis.yml. It's 10x slower, which is ridic…

04398ff

…ulous.

Re-enable PyPy on Travis, but disable coverage for it

475ffa6

Fix syntax error in .travis.yml

2eae0d6

Fix coverage logic reversal in .travis.yml

b45c331

Fix TypeError on PyPy in utf8prober.py

b382f22

dan-blanchard mentioned this pull request Jan 1, 2015

Use travis' new build workers #47

Merged

dan-blanchard added 5 commits January 2, 2015 15:39

Get rid of set literal to appease Python 2.6

431bd39

Some minor PEP8 name changes

3fb82c9

Merge branch 'master' into feature/upstream-changes-and-overhaul

6058456

Loads of PEP8 naming convention fixes.

bd9951f

I'm sure there are plenty of line length and alignment issues after all the find-and-replace stuff I've done. Will fix those next.

dan-blanchard assigned sigmavirus24 Jan 5, 2015

dan-blanchard added 8 commits January 6, 2015 10:43

Fix some NOTES.rst formatting issues

4317be7

Update MANIFEST.in to include test files and docs

01e82e3

Remove PyCharm stuff from .gitignore

b8f8b24

Remove flake8: noqa lines.

50f701c

Add missing __version__ import to __init__.py

e42c4d1

Remove unnecessary import sys import from conf.py

0913a91

Switch to using pip for installation in .travis.yml

c7f01c1

Rename SMState to MachineState

1e0f1a5

dan-blanchard added this to the 3.0 milestone Jan 6, 2015

dan-blanchard mentioned this pull request Jan 6, 2015

Merge with cChardet? #19

Open

dan-blanchard added 4 commits January 6, 2015 11:54

Get rid of messy ternary operator in charsetprober.py

4a8084d

Fix __version typo in __init__.py

5449248

Add comment about why we're slicing in filter_with_english_letters

369875d

Made more attributes public.

8e3fc03

dan-blanchard mentioned this pull request Jan 7, 2015

Adding an interface so that we can know the confidence of all encodings? #45

Open

dan-blanchard added a commit that referenced this pull request Jan 9, 2015

Merge pull request #42 from chardet/feature/upstream-changes-and-over…

c8c752a

…haul Add upstream changes and clean up where possible

dan-blanchard merged commit c8c752a into master Jan 9, 2015

dan-blanchard deleted the feature/upstream-changes-and-overhaul branch January 9, 2015 19:01

dan-blanchard mentioned this pull request Jan 9, 2015

Fix failing unit tests. #13

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add upstream changes and clean up where possible #42

Add upstream changes and clean up where possible #42

Add upstream changes and clean up where possible #42

Add upstream changes and clean up where possible #42

Conversation