Add upstream changes and clean up where possible #42

Merged
merged 38 commits on Jan 9, 2015
8bc4b89
Comment out sections of tables that weren't used to save memory.
dan-blanchard Oct 11, 2014
20cad49
Add 3.4 to list for Travis testing and remove 3.2
dan-blanchard Oct 11, 2014
44752d7
Bunch of little clean up things
dan-blanchard Dec 1, 2014
7251430
Merge branch 'master' into feature/upstream-changes-and-overhaul
dan-blanchard Dec 2, 2014
0ecaf05
Add if __name__... to test.py and a break to speed things up in loop.
dan-blanchard Dec 2, 2014
9b8b12c
Modernize testings
dan-blanchard Dec 2, 2014
3e13cc7
Fix missing req_path in setup.py
dan-blanchard Dec 2, 2014
59f30b7
Simplify Travis setup and just use pip. conda was overkill for our s…
dan-blanchard Dec 2, 2014
a5c7484
Make tests slightly more efficient.
dan-blanchard Dec 2, 2014
07a5849
Merge branch 'master' into feature/upstream-changes-and-overhaul
dan-blanchard Dec 21, 2014
267c5d8
Switch to new Travis docker VMs and add PyPy testing.
dan-blanchard Dec 29, 2014
c665459
Add C-equivalent implementation of filter_english_letters.
dan-blanchard Dec 30, 2014
7cfa45c
Fix some pylint warnings in universaldetector.py
dan-blanchard Dec 30, 2014
125575f
Made latin1 equivalent to windows-1252 when running unit tests.
dan-blanchard Dec 30, 2014
d9c42c7
A bunch of little clean up changes.
dan-blanchard Dec 30, 2014
04398ff
Comment out pypy line in .travis.yml. It's 10x slower, which is ridi…
dan-blanchard Dec 30, 2014
475ffa6
Re-enable PyPy on Travis, but disable coverage for it
dan-blanchard Dec 30, 2014
2eae0d6
Fix syntax error in .travis.yml
dan-blanchard Dec 30, 2014
b45c331
Fix coverage logic reversal in .travis.yml
dan-blanchard Dec 30, 2014
b382f22
Fix TypeError on PyPy in utf8prober.py
dan-blanchard Dec 30, 2014
be09612
Switch to using enums instead of constants, and a bunch of cleanup st…
dan-blanchard Jan 2, 2015
431bd39
Get rid of set literal to appease Python 2.6
dan-blanchard Jan 2, 2015
3fb82c9
Some minor PEP8 name changes
dan-blanchard Jan 5, 2015
6058456
Merge branch 'master' into feature/upstream-changes-and-overhaul
dan-blanchard Jan 5, 2015
bd9951f
Loads of PEP8 naming convention fixes.
dan-blanchard Jan 5, 2015
4317be7
Fix some NOTES.rst formatting issues
dan-blanchard Jan 6, 2015
01e82e3
Update MANIFEST.in to include test files and docs
dan-blanchard Jan 6, 2015
b8f8b24
Remove PyCharm stuff from .gitignore
dan-blanchard Jan 6, 2015
50f701c
Remove flake8: noqa lines.
dan-blanchard Jan 6, 2015
e42c4d1
Add missing __version__ import to __init__.py
dan-blanchard Jan 6, 2015
0913a91
Remove unnecessary import sys import from conf.py
dan-blanchard Jan 6, 2015
c7f01c1
Switch to using pip for installation in .travis.yml
dan-blanchard Jan 6, 2015
1e0f1a5
Rename SMState to MachineState
dan-blanchard Jan 6, 2015
4a8084d
Get rid of messy ternary operator in charsetprober.py
dan-blanchard Jan 6, 2015
5449248
Fix __version typo in __init__.py
dan-blanchard Jan 6, 2015
369875d
Add comment about why we're slicing in filter_with_english_letters
dan-blanchard Jan 6, 2015
8e3fc03
Made more attributes public.
dan-blanchard Jan 6, 2015
da6c0a0
Temporarily disable Hungarian probers, and update missing encodings list
dan-blanchard Jan 7, 2015
Switch to using enums instead of constants, and a bunch of cleanup stuff.

Clean up stuff includes:

-  Stop importing chardet from setup.py now that we have a dependency
   (enum34 on everything other than Python 3.4).
-  Add a lot of notes to NOTES.rst about how chardet actually works.
-  Removed sections of frequency rank tables that we do not actually use.
   It was just wasting memory.
-  Removed "m" prefix from attributes all over.  Will fix snake case and
   things like that in a later commit.
-  Added a lot of comments to UniversalDetector.
-  Added the ability to ignore certain encodings when running unit tests.
   This was necessary because we don't actually support some of the
   encodings we were being tested on!
-  Removed constants.py now that we have enums.py
-  Switched to using logging instead of printing to sys.stderr.  This
   actually should help a lot for debugging failed unit tests.
-  Made a CLI sub-package for chardetect.  In the future, we'll
   reorganize more things into sub-packages.
dan-blanchard committed Jan 2, 2015
commit be09612a5779a51695c5a69b7b73fd1b4ba12a3e
124 changes: 121 additions & 3 deletions NOTES.rst
@@ -1,4 +1,122 @@
This is just a collection of information that I've found useful or thought
Class Hierarchy for chardet
===========================

Universal Detector
------------------
Has a list of probers.

CharSetProber
-------------
Mostly abstract parent class.

CharSetGroupProber
------------------
Runs a bunch of related probers at the same time and decides which is best.
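The "decides which is best" step can be sketched roughly like this. This is a toy model: `SketchProber` and `best_prober` are stand-ins invented for illustration, not chardet classes.

```python
# Toy sketch of a group prober: feed the same bytes to several child
# probers and report the one with the highest confidence.

class SketchProber(object):
    """Stand-in for a real CharSetProber (illustrative only)."""

    def __init__(self, charset_name, score):
        self.charset_name = charset_name
        self._score = score

    def feed(self, byte_str):
        # A real prober would update its internal state here.
        pass

    def get_confidence(self):
        return self._score


def best_prober(probers, byte_str):
    """Feed all probers, then pick the most confident one."""
    for prober in probers:
        prober.feed(byte_str)
    return max(probers, key=lambda p: p.get_confidence())


probers = [SketchProber('windows-1251', 0.4), SketchProber('KOI8-R', 0.7)]
print(best_prober(probers, b'...').charset_name)  # KOI8-R
```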

SBCSGroupProber
---------------
SBCS = Single-Byte Character Set. Runs a bunch of SingleByteCharSetProbers.
Always contains the same SingleByteCharSetProbers.

SingleByteCharSetProber
-----------------------
A CharSetProber that is used for detecting single-byte encodings by using
a "precedence matrix" (i.e., a character bigram model).
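A "precedence matrix" in miniature: the bigram set below is invented for illustration; chardet's real tables are far larger and operate on per-language byte orders, not characters.

```python
# Toy bigram model: confidence is the fraction of adjacent character
# pairs that are "likely" for the target language.

LIKELY_BIGRAMS = {('t', 'h'), ('h', 'e'), ('i', 'n'), ('e', 'r')}  # invented

def bigram_confidence(text):
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for pair in pairs if pair in LIKELY_BIGRAMS)
    return hits / len(pairs)

print(bigram_confidence('the'))  # ('t','h') and ('h','e') both hit -> 1.0
```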

MBCSGroupProber
---------------
Runs a bunch of MultiByteCharSetProbers. It also uses a UTF8Prober, which is
essentially a MultiByteCharSetProber that only has a state machine. Always
contains the same MultiByteCharSetProbers.

MultiByteCharSetProber
----------------------
A CharSetProber that uses both a character unigram model (or "character
distribution analysis") and an independent state machine for trying to
detect an encoding.
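In outline, the two signals combine like this (an illustrative sketch, not chardet's actual control flow):

```python
# The state machine can veto an encoding outright on an invalid byte
# sequence, while the character distribution analysis supplies the
# graded confidence.

def multi_byte_confidence(state_machine_error, distribution_confidence):
    if state_machine_error:
        return 0.0  # invalid byte sequence rules the encoding out
    return distribution_confidence

print(multi_byte_confidence(False, 0.82))  # 0.82
print(multi_byte_confidence(True, 0.82))   # 0.0
```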

CodingStateMachine
------------------
Used for "coding scheme" detection, where we just look for either invalid
byte sequences or sequences that only occur for that particular encoding.
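For example, UTF-8 coding-scheme detection boils down to walking a byte state machine and rejecting invalid sequences. The function below mirrors the idea only; chardet's real CodingStateMachine is table-driven and this sketch skips some overlong-sequence checks.

```python
# Minimal sketch of coding-scheme detection for UTF-8: track how many
# continuation bytes we still expect and fail on anything invalid.

def looks_like_utf8(data):
    remaining = 0  # continuation bytes still expected
    for byte in data:
        if remaining:
            if 0x80 <= byte <= 0xBF:
                remaining -= 1  # valid continuation byte
            else:
                return False
        elif byte < 0x80:
            continue  # ASCII
        elif 0xC2 <= byte <= 0xDF:
            remaining = 1  # 2-byte sequence
        elif 0xE0 <= byte <= 0xEF:
            remaining = 2  # 3-byte sequence
        elif 0xF0 <= byte <= 0xF4:
            remaining = 3  # 4-byte sequence
        else:
            return False  # never a valid UTF-8 lead byte
    return remaining == 0

print(looks_like_utf8('héllo'.encode('utf-8')))  # True
print(looks_like_utf8(b'\xff\xfe'))              # False
```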

CharDistributionAnalysis
------------------------
Used for character unigram distribution encoding detection. Takes a mapping
from characters to a "frequency order" (i.e., what frequency rank that byte has
in the given encoding) and a "typical distribution ratio", which is the number
of occurrences of the 512 most frequently used characters divided by the number
of occurrences of the rest of the characters for a typical document.
The "characters" in this case are 2-byte sequences and they are first converted
to an "order" (name comes from ord() function, I believe). This "order" is used
to index into the frequency order table to determine the frequency rank of that
byte sequence. The reason this extra step is necessary is that the frequency
rank table is language-specific (and not encoding-specific).
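As a worked example of that ratio: with a made-up typical distribution ratio of 0.75 (not one of chardet's real per-language constants), confidence is the count of high-frequency characters divided by the weighted count of the rest.

```python
# Worked sketch of the unigram-distribution confidence described above.
# freq_chars counts characters whose frequency rank is under 512.

def distribution_confidence(freq_chars, total_chars, typical_ratio=0.75):
    if total_chars <= 0 or freq_chars <= 0:
        return 0.01  # effectively "no"
    if total_chars == freq_chars:
        return 0.99  # every character was high-frequency
    r = freq_chars / ((total_chars - freq_chars) * typical_ratio)
    return min(r, 0.99)

# 60 high-frequency characters out of 500 total:
print(round(distribution_confidence(60, 500), 3))  # 0.182
```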


What's where
============


Bigram files
------------
Member

Aren't there supposed to be new-lines between the ---s and the list? Really between any header and the text below?

Member Author

The blank lines are optional according to the RST spec.

- hebrewprober.py
- jpcntxprober.py
- langbulgarianmodel.py
- langcyrillicmodel.py
- langgreekmodel.py
- langhebrewmodel.py
- langhungarianmodel.py
- langthaimodel.py
- latin1prober.py
- sbcharsetprober.py
- sbcsgroupprober.py


Coding Scheme files
-------------------
- escprober.py
- escsm.py
- utf8prober.py
- codingstatemachine.py
- mbcssmprober.py


Unigram files
-------------
- big5freqprober.py
- chardistribution.py
- euckrfreqprober.py
- euctwfreqprober.py
- gb2312freqprober.py
- jisfreqprober.py

Multibyte probers
-----------------
- big5prober.py
- cp949prober.py
- eucjpprober.py
- euckrprober.py
- euctwprober.py
- gb2312prober.py
- mbcharsetprober.py
- mbcsgroupprober.py
- sjisprober.py

Misc files
----------
- __init__.py (currently has detect function in it)
Member

Will this display funky? Should we escape it like:

- ``__init__.py``

Member Author

Yes, but I really was just updating that file to make some notes about things for myself. I didn't think anyone else was likely to read it. I'll fix it either way though. :)

Member Author

Actually, I just checked here and it renders fine as is.

Member

GitHub's rendering (of reStructuredText) is historically broken. Relying on that has bitten me in the past.

Member Author

Okay, I'll make all of the filenames have surrounding backticks then for consistent formatting.

- compat.py
- enums.py
- universaldetector.py
- version.py


Useful links
============


This is just a collection of information that I've found useful or thought
might be useful in the future:

- `BOM by Encoding`_
@@ -8,8 +126,8 @@ might be useful in the future:
- `What Every Programmer Absolutely...`_

- The actual `source`_


.. _BOM by Encoding:
https://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
.. _A Composite Approach to Language/Encoding Detection:
7 changes: 2 additions & 5 deletions chardet/__init__.py
@@ -15,20 +15,17 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

__version__ = "2.3.0"
from sys import version_info

from .compat import PY2, PY3
from .universaldetector import UniversalDetector


def detect(aBuf):
if (PY2 and isinstance(aBuf, unicode)) or (PY3 and
not isinstance(aBuf, bytes)):
raise ValueError('Expected a bytes object, not a unicode object')

from . import universaldetector
u = universaldetector.UniversalDetector()
u.reset()
u = UniversalDetector()
u.feed(aBuf)
u.close()
return u.result
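The new up-front type check in detect() behaves like this stand-alone sketch, where the real UniversalDetector machinery is replaced by a stub result so the snippet runs on its own:

```python
# Stand-alone sketch of the ValueError guard added to detect() above.
import sys

PY2 = sys.version_info[0] == 2
PY3 = sys.version_info[0] == 3

def detect(byte_str):
    # Short-circuiting keeps the Python-2-only `unicode` name from being
    # evaluated on Python 3.
    if (PY2 and isinstance(byte_str, unicode)) or (PY3 and
            not isinstance(byte_str, bytes)):
        raise ValueError('Expected a bytes object, not a unicode object')
    return {'encoding': 'ascii', 'confidence': 1.0}  # stub result

print(detect(b'hello')['encoding'])  # ascii
try:
    detect(u'hello')
except ValueError as exc:
    print(exc)  # Expected a bytes object, not a unicode object
```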
545 changes: 1 addition & 544 deletions chardet/big5freq.py


4 changes: 2 additions & 2 deletions chardet/big5prober.py
@@ -34,8 +34,8 @@
class Big5Prober(MultiByteCharSetProber):
def __init__(self):
super(Big5Prober, self).__init__()
self._mCodingSM = CodingStateMachine(Big5SMModel)
self._mDistributionAnalyzer = Big5DistributionAnalysis()
self._CodingSM = CodingStateMachine(Big5SMModel)
self._DistributionAnalyzer = Big5DistributionAnalysis()
self.reset()

def get_charset_name(self):
72 changes: 36 additions & 36 deletions chardet/chardistribution.py
@@ -48,26 +48,26 @@ class CharDistributionAnalysis(object):
def __init__(self):
# Mapping table to get frequency order from char order (get from
# GetOrder())
self._mCharToFreqOrder = None
self._mTableSize = None # Size of above table
self._CharToFreqOrder = None
self._TableSize = None # Size of above table
# This is a constant value which varies from language to language,
# used in calculating confidence. See
# http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
# for further detail.
self._mTypicalDistributionRatio = None
self._mDone = None
self._mTotalChars = None
self._mFreqChars = None
self._TypicalDistributionRatio = None
self._Done = None
self._TotalChars = None
self._FreqChars = None
self.reset()

def reset(self):
"""reset analyser, clear any state"""
# If this flag is set to True, detection is done and conclusion has
# been made
self._mDone = False
self._mTotalChars = 0 # Total characters encountered
self._Done = False
self._TotalChars = 0 # Total characters encountered
# The number of characters whose frequency order is less than 512
self._mFreqChars = 0
self._FreqChars = 0

def feed(self, aBuf, aCharLen):
"""feed a character with known length"""
@@ -77,22 +77,22 @@ def feed(self, aBuf, aCharLen):
else:
order = -1
if order >= 0:
self._mTotalChars += 1
self._TotalChars += 1
# order is valid
if order < self._mTableSize:
if 512 > self._mCharToFreqOrder[order]:
self._mFreqChars += 1
if order < self._TableSize:
if 512 > self._CharToFreqOrder[order]:
self._FreqChars += 1

def get_confidence(self):
"""return confidence based on existing data"""
# if we didn't receive any character in our consideration range,
# return negative answer
if self._mTotalChars <= 0 or self._mFreqChars <= MINIMUM_DATA_THRESHOLD:
if self._TotalChars <= 0 or self._FreqChars <= MINIMUM_DATA_THRESHOLD:
return SURE_NO

if self._mTotalChars != self._mFreqChars:
r = (self._mFreqChars / ((self._mTotalChars - self._mFreqChars)
* self._mTypicalDistributionRatio))
if self._TotalChars != self._FreqChars:
r = (self._FreqChars / ((self._TotalChars - self._FreqChars)
* self._TypicalDistributionRatio))
if r < SURE_YES:
return r

@@ -102,7 +102,7 @@ def get_confidence(self):
def got_enough_data(self):
# It is not necessary to receive all data to draw conclusion.
# For charset detection, certain amount of data is enough
return self._mTotalChars > ENOUGH_DATA_THRESHOLD
return self._TotalChars > ENOUGH_DATA_THRESHOLD

def get_order(self, aBuf):
# We do not handle characters based on the original encoding string,
@@ -115,9 +115,9 @@ def get_order(self, aBuf):
class EUCTWDistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
super(EUCTWDistributionAnalysis, self).__init__()
self._mCharToFreqOrder = EUCTWCharToFreqOrder
self._mTableSize = EUCTW_TABLE_SIZE
self._mTypicalDistributionRatio = EUCTW_TYPICAL_DISTRIBUTION_RATIO
self._CharToFreqOrder = EUCTWCharToFreqOrder
self._TableSize = EUCTW_TABLE_SIZE
self._TypicalDistributionRatio = EUCTW_TYPICAL_DISTRIBUTION_RATIO

def get_order(self, aBuf):
# for euc-TW encoding, we are interested
@@ -134,9 +134,9 @@ def get_order(self, aBuf):
class EUCKRDistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
super(EUCKRDistributionAnalysis, self).__init__()
self._mCharToFreqOrder = EUCKRCharToFreqOrder
self._mTableSize = EUCKR_TABLE_SIZE
self._mTypicalDistributionRatio = EUCKR_TYPICAL_DISTRIBUTION_RATIO
self._CharToFreqOrder = EUCKRCharToFreqOrder
self._TableSize = EUCKR_TABLE_SIZE
self._TypicalDistributionRatio = EUCKR_TYPICAL_DISTRIBUTION_RATIO

def get_order(self, aBuf):
# for euc-KR encoding, we are interested
@@ -153,9 +153,9 @@ def get_order(self, aBuf):
class GB2312DistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
super(GB2312DistributionAnalysis, self).__init__()
self._mCharToFreqOrder = GB2312CharToFreqOrder
self._mTableSize = GB2312_TABLE_SIZE
self._mTypicalDistributionRatio = GB2312_TYPICAL_DISTRIBUTION_RATIO
self._CharToFreqOrder = GB2312CharToFreqOrder
self._TableSize = GB2312_TABLE_SIZE
self._TypicalDistributionRatio = GB2312_TYPICAL_DISTRIBUTION_RATIO

def get_order(self, aBuf):
# for GB2312 encoding, we are interested
@@ -172,9 +172,9 @@ def get_order(self, aBuf):
class Big5DistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
super(Big5DistributionAnalysis, self).__init__()
self._mCharToFreqOrder = Big5CharToFreqOrder
self._mTableSize = BIG5_TABLE_SIZE
self._mTypicalDistributionRatio = BIG5_TYPICAL_DISTRIBUTION_RATIO
self._CharToFreqOrder = Big5CharToFreqOrder
self._TableSize = BIG5_TABLE_SIZE
self._TypicalDistributionRatio = BIG5_TYPICAL_DISTRIBUTION_RATIO

def get_order(self, aBuf):
# for big5 encoding, we are interested
@@ -194,9 +194,9 @@ def get_order(self, aBuf):
class SJISDistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
super(SJISDistributionAnalysis, self).__init__()
self._mCharToFreqOrder = JISCharToFreqOrder
self._mTableSize = JIS_TABLE_SIZE
self._mTypicalDistributionRatio = JIS_TYPICAL_DISTRIBUTION_RATIO
self._CharToFreqOrder = JISCharToFreqOrder
self._TableSize = JIS_TABLE_SIZE
self._TypicalDistributionRatio = JIS_TYPICAL_DISTRIBUTION_RATIO

def get_order(self, aBuf):
# for sjis encoding, we are interested
@@ -219,9 +219,9 @@ def get_order(self, aBuf):
class EUCJPDistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
super(EUCJPDistributionAnalysis, self).__init__()
self._mCharToFreqOrder = JISCharToFreqOrder
self._mTableSize = JIS_TABLE_SIZE
self._mTypicalDistributionRatio = JIS_TYPICAL_DISTRIBUTION_RATIO
self._CharToFreqOrder = JISCharToFreqOrder
self._TableSize = JIS_TABLE_SIZE
self._TypicalDistributionRatio = JIS_TYPICAL_DISTRIBUTION_RATIO

def get_order(self, aBuf):
# for euc-JP encoding, we are interested