-
Notifications
You must be signed in to change notification settings - Fork 259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add type annotations to the project and run mypy on CI #261
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a great contribution, but I think some of the types aren't quite right. Once those are fixed, we should be good to go though.
chardet/__init__.py
Outdated
from .universaldetector import UniversalDetector | ||
from .version import VERSION, __version__ | ||
|
||
__all__ = ["UniversalDetector", "detect", "detect_all", "__version__", "VERSION"] | ||
|
||
|
||
def detect(byte_str): | ||
def detect(byte_str: str) -> ResultDict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be Union[bytes, bytearray]
, not str
. This function explicitly does not accept str
.
chardet/chardistribution.py
Outdated
@@ -83,7 +86,7 @@ def reset(self): | |||
# The number of characters whose frequency order is less than 512 | |||
self._freq_chars = 0 | |||
|
|||
def feed(self, char, char_len): | |||
def feed(self, char: Sequence[int], char_len: int) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think most of the places you have Sequence[int]
should be Union[bytes, bytearray]
.
chardet/charsetgroupprober.py
Outdated
if not self._best_guess_prober: | ||
self.get_confidence() | ||
if not self._best_guess_prober: | ||
return None | ||
return self._best_guess_prober.language | ||
|
||
def feed(self, byte_str): | ||
def feed(self, byte_str: bytes) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every byte_str
argument should be Union[bytes, bytearray]
.
chardet/charsetprober.py
Outdated
self._state = None | ||
_state: int | ||
|
||
def __init__(self, lang_filter: int = 0) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lang_filter
should be LanguageFilter
. Although, the default of 0 probably makes that tricky. Honestly, all the enums in the chardet.enums
module should probably be overhauled and made into proper Python 3 Enum
or Flag
types.
chardet/chardistribution.py
Outdated
self._freq_chars = None | ||
# Mapping table to get frequency order from char order (get from | ||
# GetOrder()) | ||
_char_to_freq_order: Tuple[int, ...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you getting rid of all the attribute initializations from __init__
(here and elsewhere)? In local testing, that would make it an AttributeError
to try to access any of those.
chardet/eucjpprober.py
Outdated
return "Japanese" | ||
|
||
def feed(self, byte_str): | ||
def feed(self, byte_str: bytes) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This int
is really a ProbingState
.
Thanks for the thorough review and great feedback! I believe I have applied all suggestions and this is ready for another round. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting so much time into this project! This is mostly looking good. Just a few minor complaints at this point.
Also, you probably want to rebase from master, because I added a new prober file.
chardet/charsetgroupprober.py
Outdated
self.probers = [] | ||
self._best_guess_prober = None | ||
self.probers: List[CharSetProber] = [] | ||
self.active: Dict[CharSetProber, bool] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this dictionary when the probers already all have an active
property?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One was not defined in charsetprober.py
, so mypy reported an error. I reverted this change and added the missing property to the parent class.
chardet/charsetprober.py
Outdated
@@ -113,14 +118,13 @@ def remove_xml_tags(buf): | |||
filtered = bytearray() | |||
in_tag = False | |||
prev = 0 | |||
buf = memoryview(buf).cast("c") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this was before was added specifically as a performance improvement in #252. Calling ord()
every time will slow this down a little.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks. This was done to workaround typeshed bug python/typeshed#8182
I'll revert the change and then ignore the false positive.
chardet/enums.py
Outdated
@@ -31,7 +34,7 @@ class LanguageFilter: | |||
CJK = CHINESE | JAPANESE | KOREAN | |||
|
|||
|
|||
class ProbingState: | |||
class ProbingState(Flag): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and the other ones in this file except for LanguageFilter
) should inherit from Enum
, because they're not meaningfully combinable by bitwise operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, I changed this to an Enum in the latest revision.
This helps consumers of chardet ensure that they are using the provided API correctly. The project includes a py.typed file for PEP-561 compliance. This also helps ensure internal correctness and consistency. Adding the static type checking caught at least one suspect pattern in chardet/hebrewprober.py where an int value was compared to string (probably leftover from Python 2 support). Type checking will run on all pull requests through GitHub actions and pre-commit.
This helps consumers of chardet ensure that they are using the provided
API correctly. The project includes a py.typed file for PEP-561
compliance.
This also helps ensure internal correctness and consistency. Adding the
static type checking caught at least one suspect pattern in
chardet/hebrewprober.py where an int value was compared to
string (probably leftover from Python 2 support).
Type checking will run on all pull requests through GitHub actions and
pre-commit.