Add upstream changes and clean up where possible #42
…uff. Clean up stuff includes:

- Stop importing chardet from setup.py now that we have a dependency (enum34
  on everything other than Python 3.4).
- Add a lot of notes to NOTES.rst about how chardet actually works.
- Removed sections of frequency rank tables that we do not actually use. It
  was just wasting memory.
- Removed the "m" prefix from attributes all over. Will fix snake case and
  things like that in a later commit.
- Added a lot of comments to UniversalDetector.
- Added the ability to ignore certain encodings when running unit tests. This
  was necessary because we don't actually support some of the encodings we
  were being tested on!
- Removed constants.py now that we have enums.py.
- Switched to using logging instead of printing to sys.stderr. This should
  help a lot for debugging failed unit tests.
- Made a CLI sub-package for chardetect. In the future, we'll reorganize more
  things into sub-packages.
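Since the commit switches debug output from printing to sys.stderr over to logging, callers can now surface that output with the standard library alone. A minimal sketch (the sample bytes and printed result are illustrative; chardet.detect is the library's real entry point):

```python
import logging

import chardet

# chardet's probers now log their debugging output instead of printing to
# sys.stderr, so the usual logging configuration controls whether you see it.
logging.basicConfig(level=logging.DEBUG)

# Any detection run will now emit the probers' debug output via logging.
result = chardet.detect("こんにちは".encode("utf-8"))
print(result)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
```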
Diff of NOTES.rst:

@@ -1,4 +1,122 @@
Class Hierarchy for chardet
===========================

Universal Detector
------------------
Has a list of probers.

CharSetProber
-------------
Mostly abstract parent class.

CharSetGroupProber
------------------
Runs a bunch of related probers at the same time and decides which is best.
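As a sketch of that "group" idea (feed everyone, keep the best guess): the feed/get_confidence interface matches chardet's probers, but this class is invented here for illustration and is not the real CharSetGroupProber:

```python
class GroupProberSketch:
    """Illustrative only: run several probers and report the best guess."""

    def __init__(self, probers):
        self.probers = probers

    def feed(self, data):
        # Hand the same chunk of bytes to every child prober.
        for prober in self.probers:
            prober.feed(data)

    def best_guess(self):
        # The real CharSetGroupProber also tracks each prober's state
        # (detecting / found-it / not-me); here we only compare confidences.
        best = max(self.probers, key=lambda p: p.get_confidence())
        return best.charset_name, best.get_confidence()
```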

SBCSGroupProber
---------------
SBCS = Single-Byte Char Set. Runs a bunch of SingleByteCharSetProbers. Always
contains the same SingleByteCharSetProbers.

SingleByteCharSetProber
-----------------------
A CharSetProber that is used for detecting single-byte encodings by using
a "precedence matrix" (i.e., a character bigram model).

MBCSGroupProber
---------------
Runs a bunch of MultiByteCharSetProbers. It also uses a UTF8Prober, which is
essentially a MultiByteCharSetProber that only has a state machine. Always
contains the same MultiByteCharSetProbers.

MultiByteCharSetProber
----------------------
A CharSetProber that uses both a character unigram model (or "character
distribution analysis") and an independent state machine for trying to
detect an encoding.

CodingStateMachine
------------------
Used for "coding scheme" detection, where we just look for either invalid
byte sequences or sequences that only occur for that particular encoding.
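As a sketch of the idea (not chardet's table-driven implementation): a state machine walks the bytes one at a time and reports an error on sequences that are illegal in the target encoding. Here is a deliberately simplified slice of UTF-8, handling only ASCII and 2-byte sequences:

```python
START, ERROR, ITS_ME = "start", "error", "its-me"

def utf8_machine_sketch(data: bytes) -> str:
    """Deliberately simplified: only ASCII and 2-byte UTF-8 sequences."""
    expecting_continuation = False
    for byte in data:
        if expecting_continuation:
            if 0x80 <= byte <= 0xBF:       # valid continuation byte
                expecting_continuation = False
            else:
                return ERROR               # illegal in UTF-8
        elif byte <= 0x7F:                 # plain ASCII
            continue
        elif 0xC2 <= byte <= 0xDF:         # lead byte of a 2-byte sequence
            expecting_continuation = True
        else:
            return ERROR                   # 3/4-byte leads omitted here
    # Mid-sequence at end of input means "still detecting", not an error.
    return START if expecting_continuation else ITS_ME

print(utf8_machine_sketch("é".encode("utf-8")))  # its-me
print(utf8_machine_sketch(b"\xc3\x28"))          # error
```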

CharDistributionAnalysis
------------------------
Used for character unigram distribution encoding detection. Takes a mapping
from characters to a "frequency order" (i.e., what frequency rank that byte
has in the given encoding) and a "typical distribution ratio", which is the
number of occurrences of the 512 most frequently used characters divided by
the number of occurrences of the rest of the characters for a typical
document. The "characters" in this case are 2-byte sequences, and they are
first converted to an "order" (the name comes from the ord() function, I
believe). This "order" is used to index into the frequency order table to
determine the frequency rank of that byte sequence. This extra step is
necessary because the frequency rank table is language-specific (and not
encoding-specific).
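A sketch of computing that ratio, with invented stand-ins for the language tables:

```python
def distribution_ratio(orders, rank_table, common_cutoff=512):
    """Occurrences of the `common_cutoff` most frequent characters divided by
    occurrences of all the rest -- the "typical distribution ratio" above.

    `orders` is the sequence of character orders seen in a document, and
    `rank_table[order]` gives that character's frequency rank in the
    language.  Both are hypothetical stand-ins for chardet's real tables.
    """
    common = sum(
        1 for order in orders
        if rank_table.get(order, common_cutoff) < common_cutoff
    )
    rare = len(orders) - common
    return common / rare if rare else float("inf")

# Toy example: three characters with rank 1, one rare character of rank 900.
print(distribution_ratio([0, 0, 0, 700], {0: 1, 700: 900}))  # 3.0
```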

What's where
============

Bigram files
------------
- hebrewprober.py
- jpcntx.py
- langbulgarianmodel.py
- langcyrillicmodel.py
- langgreekmodel.py
- langhebrewmodel.py
- langhungarianmodel.py
- langthaimodel.py
- latin1prober.py
- sbcharsetprober.py
- sbcsgroupprober.py

Coding Scheme files
-------------------
- escprober.py
- escsm.py
- utf8prober.py
- codingstatemachine.py
- mbcssm.py

Unigram files
-------------
- big5freq.py
- chardistribution.py
- euckrfreq.py
- euctwfreq.py
- gb2312freq.py
- jisfreq.py

Multibyte probers
-----------------
- big5prober.py
- cp949prober.py
- eucjpprober.py
- euckrprober.py
- euctwprober.py
- gb2312prober.py
- mbcharsetprober.py
- mbcsgroupprober.py
- sjisprober.py

Misc files
----------
- __init__.py (currently has the detect function in it; see the usage sketch
  after this list)

  Inline review thread on this line:

  Reviewer: Will this display funky? Should we escape it like:
  ``__init__.py``?

  Author: Yes, but I really was just updating that file to make some notes
  about things for myself. I didn't think anyone else was likely to read it.
  I'll fix it either way, though. :)

  Author: Actually, I just checked here and it renders fine as is.

  Reviewer: GitHub's rendering (of reStructuredText) is historically broken.
  Relying on that has bitten me in the past.

  Author: Okay, I'll make all of the filenames have surrounding backticks
  then for consistent formatting.
- compat.py
- enums.py
- universaldetector.py
- version.py
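For reference, the detect function that __init__.py exposes is the public API most callers use. A quick sketch (the detected encoding and confidence shown are illustrative, not guaranteed output):

```python
import chardet

raw = "Ceci n'est pas de l'ASCII : café".encode("iso-8859-1")
result = chardet.detect(raw)
# Returns a dict along the lines of:
#   {'encoding': 'ISO-8859-1', 'confidence': 0.73}
print(result["encoding"], result["confidence"])
```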

Useful links
============

This is just a collection of information that I've found useful or thought
might be useful in the future:

- `BOM by Encoding`_
@@ -8,8 +126,8 @@ might be useful in the future:

- `What Every Programmer Absolutely...`_

- The actual `source`_

.. _BOM by Encoding:
   https://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
.. _A Composite Approach to Language/Encoding Detection:
A second review thread, on a file whose diff is not rendered above:

Reviewer: Aren't there supposed to be new-lines between the "---"s and the
list? Really, between any header and the text below?

Author: The blank lines are optional according to the RST spec.