Add type annotations to the project and run mypy on CI #261

jdufresne · 2022-06-26T17:09:52Z

This helps consumers of chardet ensure that they are using the provided
API correctly. The project includes a py.typed file for PEP 561
compliance.

This also helps ensure internal correctness and consistency. Adding the
static type checking caught at least one suspect pattern in
chardet/hebrewprober.py where an int value was compared to a
string (probably a leftover from Python 2 support).

Type checking will run on all pull requests through GitHub Actions and
pre-commit.

dan-blanchard

This would be a great contribution, but I think some of the types aren't quite right. Once those are fixed, we should be good to go though.

dan-blanchard · 2022-06-27T15:18:37Z

chardet/__init__.py

 from .universaldetector import UniversalDetector
 from .version import VERSION, __version__

 __all__ = ["UniversalDetector", "detect", "detect_all", "__version__", "VERSION"]


-def detect(byte_str):
+def detect(byte_str: str) -> ResultDict:


This should be Union[bytes, bytearray], not str. This function explicitly does not accept str.
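
For illustration, a minimal sketch of what the suggested signature could look like. The `ResultDict` fields, the name `detect_sketch`, and the placeholder body are assumptions made for this example, not the code in this PR:

```python
from typing import Optional, TypedDict, Union


class ResultDict(TypedDict):
    # Assumed shape, for illustration only.
    encoding: Optional[str]
    confidence: float
    language: Optional[str]


def detect_sketch(byte_str: Union[bytes, bytearray]) -> ResultDict:
    """Hypothetical stand-in for detect(); performs no real detection."""
    # Reject str (and anything else that is not bytes-like) at runtime too,
    # so the annotation and the behavior agree.
    if not isinstance(byte_str, (bytes, bytearray)):
        raise TypeError(
            f"Expected bytes or bytearray, got {type(byte_str).__name__}"
        )
    # Placeholder result; a real implementation would run the probers here.
    return {"encoding": None, "confidence": 0.0, "language": None}
```

With this annotation, a call such as `detect_sketch("text")` is flagged by mypy before it ever reaches the runtime check.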

dan-blanchard · 2022-06-27T15:22:34Z

chardet/chardistribution.py

@@ -83,7 +86,7 @@ def reset(self):
        # The number of characters whose frequency order is less than 512
        self._freq_chars = 0

-    def feed(self, char, char_len):
+    def feed(self, char: Sequence[int], char_len: int) -> None:


I think most of the places you have Sequence[int] should be Union[bytes, bytearray].

dan-blanchard · 2022-06-27T15:24:29Z

chardet/charsetgroupprober.py

        if not self._best_guess_prober:
            self.get_confidence()
            if not self._best_guess_prober:
                return None
        return self._best_guess_prober.language

-    def feed(self, byte_str):
+    def feed(self, byte_str: bytes) -> int:


Every byte_str argument should be Union[bytes, bytearray].

dan-blanchard · 2022-06-27T15:25:07Z

chardet/charsetprober.py

-        self._state = None
+    _state: int
+
+    def __init__(self, lang_filter: int = 0) -> None:


lang_filter should be LanguageFilter. Although, the default of 0 probably makes that tricky. Honestly, all the enums in the chardet.enums module should probably be overhauled and made into proper Python 3 Enum or Flag types.

dan-blanchard · 2022-06-27T15:53:46Z

chardet/chardistribution.py

-        self._freq_chars = None
+    # Mapping table to get frequency order from char order (get from
+    # GetOrder())
+    _char_to_freq_order: Tuple[int, ...]


Why are you getting rid of all the attribute initializations from __init__ (here and elsewhere)? In local testing, that would make it an AttributeError to try to access any of those.
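
A small sketch of the behavior described here. The class names are hypothetical; only the attribute name is taken from the diff above:

```python
class AnnotatedOnly:
    # A bare annotation is recorded in __annotations__ but creates no
    # attribute on the class or on its instances.
    _freq_chars: int


class AnnotatedAndInitialized:
    _freq_chars: int

    def __init__(self) -> None:
        # Keeping the assignment in __init__ makes the attribute exist.
        self._freq_chars = 0


def read_freq_chars(obj: object) -> str:
    """Report whether reading obj._freq_chars succeeds."""
    try:
        return f"ok: {obj._freq_chars}"  # type: ignore[attr-defined]
    except AttributeError:
        return "AttributeError"
```

Reading the attribute from `AnnotatedOnly()` raises `AttributeError`, while `AnnotatedAndInitialized()` returns the value set in `__init__`.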

dan-blanchard · 2022-06-27T15:55:57Z

chardet/eucjpprober.py

        return "Japanese"

-    def feed(self, byte_str):
+    def feed(self, byte_str: bytes) -> int:


This int is really a ProbingState.

chardet/hebrewprober.py

chardet/codingstatemachinedict.py

chardet/enums.py

jdufresne · 2022-06-28T03:24:47Z

Thanks for the thorough review and great feedback! I believe I have applied all suggestions and this is ready for another round.

dan-blanchard

Thanks for putting so much time into this project! This is mostly looking good. Just a few minor complaints at this point.

Also, you probably want to rebase from master, because I added a new prober file.

dan-blanchard · 2022-06-28T15:13:00Z

chardet/charsetgroupprober.py

-        self.probers = []
-        self._best_guess_prober = None
+        self.probers: List[CharSetProber] = []
+        self.active: Dict[CharSetProber, bool] = {}


Why do we need this dictionary when the probers already all have an active property?

One was not defined in charsetprober.py, so mypy reported an error. I reverted this change and added the missing property to the parent class.

dan-blanchard · 2022-06-28T21:14:54Z

chardet/charsetprober.py

@@ -113,14 +118,13 @@ def remove_xml_tags(buf):
        filtered = bytearray()
        in_tag = False
        prev = 0
-        buf = memoryview(buf).cast("c")


The way this was before was added specifically as a performance improvement in #252. Calling ord() every time will slow this down a little.

I see, thanks. This was done to work around typeshed bug python/typeshed#8182.

I'll revert the change and then ignore the false positive.
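
For context, a minimal sketch of what the `cast("c")` call changes at runtime. The sample data and variable names are made up for this example:

```python
data = b"<a>"

# Iterating over bytes yields ints.
as_ints = list(data)

# Iterating over a memoryview cast to format "c" yields length-1 bytes
# objects, so each item can be compared directly against a literal such
# as b"<" without calling ord().
as_chars = list(memoryview(data).cast("c"))

first_is_open_tag = as_chars[0] == b"<"
```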

dan-blanchard · 2022-06-28T21:17:22Z

chardet/enums.py

@@ -31,7 +34,7 @@ class LanguageFilter:
    CJK = CHINESE | JAPANESE | KOREAN


-class ProbingState:
+class ProbingState(Flag):


This (and the other ones in this file except for LanguageFilter) should inherit from Enum, because they're not meaningfully combinable by bitwise operations.

That makes sense, I changed this to an Enum in the latest revision.
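
A minimal sketch of the distinction. The class names and values below are illustrative and are not copied from chardet/enums.py:

```python
from enum import Enum, Flag


class SketchFilter(Flag):
    # Flag members are bit values that are meant to be combined.
    CHINESE = 0x01
    JAPANESE = 0x02
    KOREAN = 0x04
    CJK = CHINESE | JAPANESE | KOREAN


class SketchState(Enum):
    # Plain Enum members are mutually exclusive states.
    DETECTING = 0
    FOUND_IT = 1
    NOT_ME = 2


def can_combine(left: Enum, right: Enum) -> bool:
    """Return True if the two members support the | operator."""
    try:
        left | right  # type: ignore[operator]
    except TypeError:
        return False
    return True
```

Here `can_combine(SketchFilter.CHINESE, SketchFilter.JAPANESE)` succeeds, while combining two `SketchState` members raises `TypeError`, which matches the point that probing states are not meaningfully combinable.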

dan-blanchard requested changes Jun 27, 2022

jdufresne commented Jun 28, 2022

dan-blanchard requested changes Jun 28, 2022

dan-blanchard approved these changes Jun 29, 2022

dan-blanchard merged commit c4f7057 into chardet:master Jun 29, 2022

jdufresne deleted the mypy branch July 8, 2022 12:48
