The codecs library exposes a series of functions from the built-in _codecs library, which maintains a registry of search functions (a simple list). Each search function maps encoding names to the right encode/decode functions by returning a CodecInfo object, and the first search function that matches wins.

codext hooks codecs's functions to insert its own proxy registry between the function calls and the native registry, so that new encodings can be added, or can replace existing ones, while still using code[cs|xt].open. Because the proxy registry is called first, a custom codec is matched first; if no custom codec matches, the native registry is used.
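
The lookup order described above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not codext's actual implementation: the names proxy_registry, proxy_lookup and custom_search are hypothetical.

```python
import codecs

# Hypothetical proxy registry: a plain list of search functions (illustrative only).
proxy_registry = []


def proxy_lookup(name):
    """Ask the proxy registry first, then fall back to the native registry."""
    for search_function in proxy_registry:
        info = search_function(name)
        if info is not None:
            return info
    return codecs.lookup(name)


def custom_encode(text, errors="strict"):
    return text.upper().encode("ascii", errors), len(text)


def custom_decode(data, errors="strict"):
    return bytes(data).decode("ascii", errors), len(data)


def custom_search(name):
    # Deliberately shadows the native "ascii" codec.
    if name == "ascii":
        return codecs.CodecInfo(custom_encode, custom_decode, name="custom-ascii")
    return None


proxy_registry.append(custom_search)
overridden = proxy_lookup("ascii").name   # served by the proxy registry
fallback = proxy_lookup("utf-8").name     # served by the native registry

proxy_registry.remove(custom_search)      # a plain list entry is easy to remove
restored = proxy_lookup("ascii").name     # the native codec is visible again
```

Because the proxy is an ordinary list in this sketch, removing an entry is enough to make the native codec reappear.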

The open built-in function

Two behaviors are to be considered when using codext:

  1. Encodings added from codext are only added to the proxy codecs registry of codext and are NOT available using open(...) (but they are available using code[cs|xt].open(...)).
  2. Encodings added from codecs are added to the proxy registry AND ALSO to the native registry and are therefore available using open(...).

This difference keeps encodings added from codext removable, while those added from codecs are not. This is a consequence of the native _codecs library historically providing no unregister function (codecs.unregister was only added in Python 3.10).


Add a custom encoding

New codecs can be added easily using the new function add.

>>> import codext
>>> help(codext.add)
Help on function add in module codext.__common__:

add(ename, encode=None, decode=None, pattern=None, text=True, add_to_codecs=False)
    This adds a new codec to the codecs module setting its encode and/or decode
     functions, eventually dynamically naming the encoding with a pattern and
     with file handling (if text is True).

    :param ename:           encoding name
    :param encode:          encoding function or None
    :param decode:          decoding function or None
    :param pattern:         pattern for dynamically naming the encoding
    :param text:            specify whether the codec is a text encoding
    :param add_to_codecs:   also add the search function to the native registry
                            NB: this will make the codec available in the
                                 built-in open(...) but will make it impossible
                                 to remove the codec later

Here is a simple example of how to add a basic codec:

import codext

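def mycodec_encode(text, errors="strict"):
    # placeholder: replace this line with the actual encoding logic
    encoded = text
    return encoded, len(text)

def mycodec_decode(text, errors="strict"):
    # placeholder: replace this line with the actual decoding logic
    decoded = text
    return decoded, len(text)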

codext.add("mycodec", mycodec_encode, mycodec_decode)

In this first example, we can see that:

  • The decode/encode functions have a signature holding a keyword argument "errors" for error handling. This comes from the convention for writing a codec for the native codecs library. This argument can take multiple values: "strict" raises an exception when a de/encoding error occurs, "replace" replaces the character at the position of the error with a generic character, and "ignore" simply skips the error and continues without adding anything to the resulting string.
  • These functions always return a pair with the resulting string and the length of consumed input text.
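
As a concrete illustration of the signature and return value described above, here is a minimal sketch of an encode/decode pair that honours the three "errors" values. The codec name "asciihex" and both function names are hypothetical; the final codext.add call is left as a comment and simply follows the signature shown in the help output.

```python
def asciihex_encode(text, errors="strict"):
    """Encode ASCII text as lowercase hexadecimal digits."""
    out = []
    for pos, char in enumerate(text):
        if ord(char) < 128:
            out.append(format(ord(char), "02x"))
        elif errors == "replace":
            out.append("3f")  # hexadecimal code of the generic character "?"
        elif errors == "ignore":
            continue          # skip the offending character
        else:
            # any other value is handled like "strict" in this sketch
            raise ValueError(f"cannot encode {char!r} at position {pos}")
    # a pair: the resulting string and the length of consumed input
    return "".join(out), len(text)


def asciihex_decode(text, errors="strict"):
    """Decode hexadecimal digits back to ASCII text."""
    return bytes.fromhex(text).decode("ascii", errors), len(text)


# With codext installed, the pair would be registered like this:
# codext.add("asciihex", asciihex_encode, asciihex_decode)
```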

Another example for a more complex and dynamic codec:

import codext

def mydyncodec_encode(i):
    def encode(text, errors="strict"):
        # placeholder: do something depending on i
        result = text
        return result, len(text)
    return encode

codext.add("mydyncodec", mydyncodec_encode, pattern=r"mydyn-(\d+)$")

In this second example, we can see that:

  • Only the encoding function is defined.
  • A pattern is defined to match the prefix "mydyn-" and then an integer which is captured and used with mydyncodec_encode(i).
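
The interplay between the pattern and the encoding function factory can be sketched without codext. The names PATTERN, myrot_encode and resolve are hypothetical, and the way the captured value is converted before the call is an assumption of this sketch (it calls int() itself so that it works with either a string or an integer).

```python
import re

# Hypothetical pattern with one capture group holding the rotation amount.
PATTERN = re.compile(r"myrot-(\d+)$")


def myrot_encode(i):
    """Factory: build an encode function for the captured parameter i."""
    shift = int(i) % 26

    def encode(text, errors="strict"):
        # rotate lowercase ASCII letters, leave everything else untouched
        result = "".join(
            chr((ord(c) - 97 + shift) % 26 + 97) if "a" <= c <= "z" else c
            for c in text
        )
        return result, len(text)

    return encode


def resolve(ename):
    """Return the encode function for an encoding name, or None if no match."""
    match = PATTERN.match(ename)
    if match is None:
        return None
    return myrot_encode(match.group(1))
```

Here resolve("myrot-3") yields an encoder that shifts letters by three positions, while a name that does not match the pattern resolves to None.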

Pattern capture group

A capture group means that the captured value is passed as a parameter to a dynamic (decorated) encoding function. To avoid this, i.e. to match multiple names leading to the same encoding while calling a static encoding function, simply define a non-capturing group, e.g. "(?:my|special_)codec".
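
The difference between the two kinds of groups can be checked directly with the re module. This small sketch only demonstrates the regex behaviour, not how codext consumes the match.

```python
import re

capturing = re.compile(r"mydyn-(\d+)$")
non_capturing = re.compile(r"(?:my|special_)codec$")

# A capture group yields a value that can be handed to an encoder factory.
dynamic_groups = capturing.match("mydyn-12").groups()

# A non-capturing group matches several names but yields no parameter.
static_groups = [
    non_capturing.match(name).groups() for name in ("mycodec", "special_codec")
]
```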


Add a custom map encoding

New codecs using encoding maps can be added easily using the new function add_map.

>>> import codext
>>> help(codext.add_map)
Help on function add_map in module codext.__common__:

add_map(ename, encmap, repl_char='?', sep='', ignore_case=None, no_error=False, intype=None, outype=None, **kwargs)
    This adds a new mapping codec (that is, declarable with a simple character mapping dictionary) to the codecs module
     dynamically setting its encode and/or decode functions, eventually dynamically naming the encoding with a pattern
     and with file handling (if text is True).

    :param ename:         encoding name
    :param encmap:        characters encoding map ; can be a dictionary of encoding maps (for use with the first capture
                           group of the regex pattern) or a function building the encoding map
    :param repl_char:     replacement char (used when errors handling is set to "replace")
    :param sep:           string of possible character separators (hence, only single-char separators are considered) ;
                           - while encoding, the first separator is used
                           - while decoding, separators can be mixed in the input text
    :param ignore_case:   ignore text case while encoding and/or decoding
    :param no_error:      this encoding triggers no error (hence, always in "leave" errors handling)
    :param intype:        specify the input type for pre-transforming the input text
    :param outype:        specify the output type for post-transforming the output text
    :param pattern:       pattern for dynamically naming the encoding
    :param text:          specify whether the codec is a text encoding
    :param add_to_codecs: also add the search function to the native registry
                           NB: this will make the codec available in the built-in open(...) but will make it impossible
                                to remove the codec later

This relies on the add function and simplifies creating new encodings when they can be described as a mapping dictionary.

Here is a simple example of how to add a map codec:

import codext

ENCMAP = {'a': "A", 'b': "B", 'c': "C"}

codext.add_map("mycodec", ENCMAP)

In this first example, we can see that:

  • The decode/encode functions do not have to be declared anymore.
  • ENCMAP is the mapping between characters; it is also used to compute the decoding function.
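
How a single mapping can serve both directions may be sketched as follows. This is an illustration only, not codext's implementation; translate, map_encode and map_decode are hypothetical names, and repl_char mirrors the parameter of the same name in the help output.

```python
ENCMAP = {"a": "A", "b": "B", "c": "C"}
# The decoding map is obtained by inverting the encoding map.
DECMAP = {value: key for key, value in ENCMAP.items()}


def translate(text, mapping, errors="strict", repl_char="?"):
    """Apply a character mapping, honouring the errors handling mode."""
    out = []
    for pos, char in enumerate(text):
        if char in mapping:
            out.append(mapping[char])
        elif errors == "replace":
            out.append(repl_char)
        elif errors == "ignore":
            continue
        else:
            # any other value is handled like "strict" in this sketch
            raise ValueError(f"cannot map {char!r} at position {pos}")
    return "".join(out), len(text)


def map_encode(text, errors="strict"):
    return translate(text, ENCMAP, errors)


def map_decode(text, errors="strict"):
    return translate(text, DECMAP, errors)
```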

Another example for a more complex and dynamic codec:

import codext

ENCMAP = [
    {'00': "A", '01': "B", '10': "C", '11': "D"},
    {'00': "D", '01': "C", '10': "B", '11': "A"},
]

codext.add_map("mydyncodec", ENCMAP, "#", ignore_case=True, intype="bin", pattern=r"mydyn-(\d+)$")

In this second example, we can see that:

  • ENCMAP is now a list of mappings. The capture group in the pattern is used to select the right encoding map. Consequently, using encoding "mydyn-8" will fail with a LookupError, as the only possibilities are "mydyn-1" and "mydyn-2". Note that the index begins at 1 in the encoding name.
  • Instead of using the default character "?" for replacements, we use "#".
  • The case is ignored ; decoding either "abcd" or "ABCD" will succeed.
  • The binary mode is enabled, meaning that the input text is converted to a binary string for encoding, while it is converted from binary to text when decoding.
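
The map selection and the pre-conversion to bits can be sketched with plain Python. This only illustrates the 1-based index and the bit chunking described above; select_map, to_bits and encode_sketch are hypothetical helpers, and post-processing of the output (outype) is not modelled.

```python
ENCMAPS = [
    {"00": "A", "01": "B", "10": "C", "11": "D"},
    {"00": "D", "01": "C", "10": "B", "11": "A"},
]


def select_map(index):
    """Pick a map from the 1-based index captured in the encoding name."""
    if not 1 <= index <= len(ENCMAPS):
        raise LookupError(f"unknown encoding: mydyn-{index}")
    return ENCMAPS[index - 1]


def to_bits(text):
    """Convert text to a string of bits, 8 bits per byte."""
    return "".join(format(byte, "08b") for byte in text.encode("utf-8"))


def encode_sketch(text, index):
    """Map each 2-bit chunk of the input to a character of the selected map."""
    mapping = select_map(index)
    bits = to_bits(text)
    return "".join(mapping[bits[i:i + 2]] for i in range(0, len(bits), 2))
```

For instance, "t" is 01110100 in binary, so the first map encodes it chunk by chunk (01, 11, 01, 00) as "BDBA".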

Input/Output types

By default, when intype is defined, outype takes the same value. So, if the new encoding uses a pre-conversion to bits (intype="bin") but maps bits to characters (therefore binary conversion to text is not needed), outype shall then be set to "str" (or if it maps bits to ordinals, use outype="ord").


List codecs

Codecs can be listed with the list function, either all of them or only those of selected categories.

>>> codext.list()
['affine', 'ascii', 'ascii85', 'atbash', 'bacon', ..., 'base36', 'base58', 'base62', 'base64', 'base64_codec', ..., 'baudot-tape', 'bcd', 'bcd-extended0', 'bcd-extended1', 'big5', 'big5hkscs', 'braille', 'bz2_codec', 'capitalize', 'cp037', ...]

Codecs categories

  • native: the built-in codecs from the original codecs package
  • non-native: this special category regroups all the categories mentioned hereafter
  • base: baseX codecs (e.g. base, base100)
  • binary: codecs working on strings but applying their algorithms on their binary forms (e.g. baudot, manchester)
  • common: common codecs not included in the native ones or simply added for the purpose of standardization (e.g. octal, ordinal)
  • crypto: codecs related to cryptography algorithms (e.g. barbie, rot, xor)
  • language: language-related codecs (e.g. morse, navajo)
  • other: uncategorized codecs (e.g. letters, url)
  • stegano: steganography-related codecs (e.g. sms, resistor)

Except for the native and non-native categories, the category names are simply the names of the subdirectories (with "s" right-stripped) of the codext package.

>>> codext.list("binary")
['baudot', 'baudot-spaced', 'baudot-tape', 'bcd', 'bcd-extended0', 'bcd-extended1', 'excess3', 'gray', 'manchester', 'manchester-inverted']
>>> codext.list("language")
['braille', 'leet', 'morse', 'navajo', 'radio', 'southpark', 'southpark-icase', 'tom-tom']
>>> codext.list("native")
['ascii', 'base64_codec', 'big5', 'big5hkscs', 'bz2_codec', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp775', 'cp850', 'cp852', 'cp855', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', ...]

Codecs listed, not encodings

Beware that this function only lists the codecs, not the encodings. This means that, for instance, it only lists base (the codec's name) instead of base17, base61, base97, ... (the valid encoding names related to the base codec).


Search for encodings

Natively, codecs provides a lookup function that returns the CodecInfo object for the desired encoding. This performs a lookup in the registry based on an exact match. Sometimes, it can be useful to search for available encodings based on a regular expression. Therefore, codext adds a search function that returns a list of encoding names matching the input regex.

>>> codext.search("baudot")
['baudot', 'baudot_spaced', 'baudot_tape']
>>> codext.search("al")
['capitalize', 'octal', 'octal_spaced', 'ordinal', 'ordinal_spaced', 'radio']
>>> codext.search("white")
['whitespace', 'whitespace_after_before']
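
The idea of a regex-based search, as opposed to an exact-match lookup, can be sketched as below. This is an assumption-laden illustration: it only matches against a hypothetical, hard-coded list of names, whereas codext's own search may consider more than names, so results can differ.

```python
import re

# Hypothetical sample of encoding names, taken from the outputs shown above.
KNOWN_NAMES = [
    "baudot", "baudot_spaced", "baudot_tape", "capitalize", "octal",
    "octal_spaced", "ordinal", "ordinal_spaced", "whitespace",
    "whitespace_after_before",
]


def search_names(regex, names=KNOWN_NAMES):
    """Return the sorted names in which the regular expression is found."""
    pattern = re.compile(regex)
    return sorted(name for name in names if pattern.search(name))
```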

Also, codext provides an examples function to get some examples of valid encoding names. This is especially useful for dynamically named encodings (e.g. rot, shift or dna).

>>> codext.examples("rot")
['rot-14', 'rot-24', 'rot-7', 'rot18', 'rot3', 'rot4', 'rot6', 'rot_1', 'rot_12', 'rot_2']
>>> codext.examples("dna")
['dna-1', 'dna-2', 'dna-5', 'dna1', 'dna4', 'dna5', 'dna6', 'dna8', 'dna_3', 'dna_5']
>>> codext.examples("barbie", 5)
['barbie-1', 'barbie1', 'barbie4', 'barbie_2', 'barbie_4']

Remove a custom encoding

New codecs can be removed easily using the new function remove, which removes every codec matching the given encoding name from the proxy codecs registry only, NOT from the native one.

>>> codext.encode("test", "bin")
'01110100011001010111001101110100'
>>> codext.remove("bin")
>>> codext.encode("test", "bin")

Traceback (most recent call last):
  [...]
LookupError: unknown encoding: bin

Trying to remove a codec that is in the native registry does not raise a LookupError; the codec simply remains available.

>>> codext.remove("utf-8")
>>> codext.encode("test", "utf-8")
b'test'

Remove or restore codext encodings

While playing with encodings, e.g. from IDLE, it can be useful to remove or restore codext's encodings. This can be achieved using the new clear and reset functions, respectively.

>>> codext.clear()
>>> codext.encode("test", "bin")

Traceback (most recent call last):
  [...]
LookupError: unknown encoding: bin
>>> codext.reset()
>>> codext.encode("test", "bin")
'01110100011001010111001101110100'

Guess-decode an arbitrary input

This is done by trying encodings using the breadth-first tree search algorithm. It stops when a given condition (by default, all characters must be printable), in the form of a function applied to the decoded string at the current depth, is met. It returns two results: the decoded string and a tuple with the related encoding names in order of application. The following parameters can be entered:

  • stop_func: can be a function or a regular expression to be matched (automatically converted to a function that uses the re module) ; by default, checks if all input characters are printable.
  • max_depth: the maximum depth for the tree search ; by default 5.
  • codec_categories: a string indicating a codec category or a list of category strings ; by default, None, meaning all categories (very slow).
  • found: a list or tuple of currently found encodings, this can be used to save time if the first decoding steps are known ; by default, an empty tuple.
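
The breadth-first search described above can be sketched with two standard-library decoders. This is a simplified illustration, not codext's implementation: guess_sketch, try_decode and DECODERS are hypothetical names, only base16 and base64 are tried, and stop_func must already be a function.

```python
import base64
from collections import deque

# Only two standard-library decoders are tried here (assumption for brevity).
DECODERS = {
    "base16": base64.b16decode,
    "base64": lambda data: base64.b64decode(data, validate=True),
}


def try_decode(name, text):
    """Return the decoded text, or None if this decoder rejects the input."""
    try:
        return DECODERS[name](text).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return None


def guess_sketch(text, stop_func, max_depth=5, found=()):
    """Breadth-first search over chains of decoders."""
    path = tuple(found)
    for name in path:  # replay the decoding steps that are already known
        text = try_decode(name, text)
        if text is None:
            return None, ()
    queue = deque([(text, path)])
    while queue:
        current, path = queue.popleft()
        if len(path) >= max_depth:
            continue
        for name in DECODERS:
            decoded = try_decode(name, current)
            if decoded is None:
                continue
            new_path = path + (name,)
            if stop_func(decoded):
                return decoded, new_path
            queue.append((decoded, new_path))
    return None, ()


one_stage = guess_sketch("VGhpcyBpcyBhIHRlc3Q=", lambda s: "test" in s)
two_stage_input = base64.b16encode(base64.b64encode(b"This is a test")).decode("ascii")
two_stage = guess_sketch(two_stage_input, lambda s: "test" in s)
```

Every decoder is tried at a given depth before the search goes one level deeper, which is why shorter decoding chains are always found first.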

A simple example for a 1-stage base64-encoded string:

>>> codext.guess("VGhpcyBpcyBhIHRlc3Q=")
('This is a test', ('base64',))

An example of a 2-stage string, encoded with base64 and then base62:

>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7")
('FKU2Ng7lJbR>.IHuzLDv17eLhE6', ('barbie',))

In the second example, we can see that the given encoded string is not decoded as expected. This happens because the (default) stop condition is too broad: it stops as soon as all the characters of the output are printable. If we have prior knowledge of what we should expect, we can input a simple string or a regex:

>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test")
('This is a test', ('base62', 'base64'))
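
The automatic conversion of a string or regex into a stop function could look like the following sketch. The helper name make_stop_func is hypothetical and the exact matching rules used by codext are an assumption here.

```python
import re


def make_stop_func(stop):
    """Return stop unchanged if it is callable, else wrap it as a regex test."""
    if callable(stop):
        return stop
    pattern = re.compile(stop)
    return lambda text: pattern.search(text) is not None


contains_test = make_stop_func("test")
ends_with_digits = make_stop_func(r"\d+$")
```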

Instead of a string, we can also pass a function. For this purpose, standard stop functions are predefined in the stopfunc submodule. So, we can for instance use stopfunc.lang_en to stop when we find something that is English (this only works if langdetect is installed, which is deliberately NOT in the requirements of this package). Note that working this way gives lots of false positives if the text is very short, as in the example case. That is why the codec_categories argument is used to only consider baseX codecs. This is also demonstrated in the next examples.

>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", codext.stopfunc.lang_en, codec_categories="base")
('This is a test', ('base62', 'base64'))

If we know the first encoding, we can set this in the found parameter to save time:

>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test", found=["base62"])
('This is a test', ('base62', 'base64'))

If we are sure that only base (which is a valid category) encodings are used, we can restrict the tree search using the codec_categories parameter to save time:

>>> codext.guess("CJG3Ix8bVcSRMLOqwDUg28aDsT7", "test", codec_categories="base")
('This is a test', ('base62', 'base64'))

Another example of a 2-stage encoded string:

>>> codext.guess("LSAuLi4uIC4uIC4uLiAvIC4uIC4uLiAvIC4tIC8gLSAuIC4uLiAt", "test")
('this is a test', ('base64', 'morse'))
>>> codext.guess("LSAuLi4uIC4uIC4uLiAvIC4uIC4uLiAvIC4tIC8gLSAuIC4uLiAt", "test", codec_categories=["base", "language"])
('this is a test', ('base64', 'morse'))

When multiple results are expected, the stop and show arguments can be used, respectively, to avoid stopping when a result is found and to display the intermediate results.

Computation time

Note that, in the very last examples, the first call takes much longer than the second one but requires no knowledge about the possible categories of encodings.

Stop functions

Currently, a few standard stop functions are provided with the stopfunc submodule:

  • flag: searches for the pattern "[Ff][Ll1][Aa4@][Gg9]" (either UTF-8 or UTF-16)
  • lang_**: checks if the given lang (any from the PROFILES_DIRECTORY of the langdetect module if it is installed) is detected (note that it first checks if all characters are printable)
  • printables: checks that every output character is in the set of printables
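
Two of these stop functions are simple enough to sketch with the standard library. These are illustrative approximations working on str input only; the UTF-16 handling mentioned for flag is not covered, and the set of printable characters is assumed to be string.printable.

```python
import re
import string

FLAG_PATTERN = re.compile(r"[Ff][Ll1][Aa4@][Gg9]")
PRINTABLE = set(string.printable)  # assumption: the ASCII printable set


def printables(text):
    """True if every character of the text is printable."""
    return all(char in PRINTABLE for char in text)


def flag(text):
    """True if the text contains something that looks like the word 'flag'."""
    return FLAG_PATTERN.search(text) is not None
```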

Hooked codecs functions

In order to select the right de/encoding function and avoid any conflict, the native codecs library registers search functions (using the register(search_function) function), called in order of registration while searching for a codec.
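
The native behaviour can be observed with codecs alone. In this sketch, two search functions answer to the same hypothetical encoding name "orderdemo"; the one registered first is the one that serves the lookup.

```python
import codecs


def passthrough_encode(text, errors="strict"):
    return text.encode("ascii", errors), len(text)


def passthrough_decode(data, errors="strict"):
    return bytes(data).decode("ascii", errors), len(data)


def first_search(name):
    if name == "orderdemo":
        return codecs.CodecInfo(passthrough_encode, passthrough_decode, name="first")
    return None


def second_search(name):
    if name == "orderdemo":
        return codecs.CodecInfo(passthrough_encode, passthrough_decode, name="second")
    return None


# Search functions are tried in order of registration.
codecs.register(first_search)
codecs.register(second_search)

winner = codecs.lookup("orderdemo").name
```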

While being imported, codext hooks the following base functions of codecs dealing with the codecs registry: encode, decode, lookup and register. This way, codext holds a private registry that is called before reaching out to the native one, causing the codecs defined in codext to override native codecs with a matching registry search function.