String Ambiguity in Python

In Python there are three semantic types when handling character strings:

  1. Text: strings that by nature can take part in internationalization processes. See Unicode, a standard for the representation and handling of text in most of the world’s writing systems.

    In Python 2 there is a special type, unicode, to process text, although str is sometimes also used to hold the encoded content; in Python 3, str is always Unicode text.

  2. Technical Strings: those used for some special object names (classes, functions, modules, …), the __all__ definition in modules, identifiers, etc. Most of the time these values must be instances of the str type. Try the following in Python 2:

    >>> class Foobar(object):
    ...    pass
    >>> Foobar.__name__ = u'foobar'
    TypeError: can only assign string to xxx.__name__, not 'unicode'
    

    In Python 2 str and bytes are synonymous; in Python 3 they are different types, and bytes is used exclusively for binary strings.

  3. Binary Strings: binary data (normally not readable by humans) represented as a character string. In Python 3 the main built-in type for this concept is bytes.
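
The distinction between the three kinds of strings can be illustrated with a minimal Python 3 sketch that uses only built-ins. The variable names are illustrative and not part of xoutil:

```python
# Python 3: text (str) and binary data (bytes) are distinct types.
text = 'λ-calculus'             # text: a sequence of Unicode code points
data = text.encode('utf-8')     # binary: the UTF-8 bytes for that text

assert isinstance(text, str)
assert isinstance(data, bytes)

# The two never compare equal, even when they represent the same content.
assert text != data

# Converting between them always goes through an explicit encoding.
assert data.decode('utf-8') == text


# Technical strings, such as class names, are plain str instances.
class Foobar(object):
    pass


assert isinstance(Foobar.__name__, str)
```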

Mismatch Semantics Comparison

In the Python 2 series, an equality comparison between unicode and str values fails to match whenever the str value contains non-ASCII bytes. The following comparison returns False in that version:

>>> s = 'λ'
>>> u = u'λ'
>>> u == s
False

Also, a UnicodeWarning is issued with the message “Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal”.

To correctly compare, use the same type. For example:

>>> from xoutil.eight.text import force
>>> force(s) == force(u)
True
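
The idea of converting both operands to the same type before comparing can also be sketched without xoutil. The helper to_text below is hypothetical; it is not the implementation of xoutil.eight.text.force, and it assumes Python 3 and UTF-8 encoded input:

```python
def to_text(value, encoding='utf-8'):
    """Return `value` as text, decoding it first if it is binary.

    Hypothetical helper for illustration only; the UTF-8 default is an
    assumption, not something the library prescribes.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


s = 'λ'.encode('utf-8')   # binary form, comparable to a Python 2 str
u = 'λ'                   # text form, comparable to a Python 2 unicode

# Once both sides are text, the comparison behaves as expected.
same = to_text(s) == to_text(u)
```

Choosing text as the common type keeps the comparison independent of any particular byte encoding.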

Compatibility Modules

Xoutil has a Python 2 and 3 compatibility package named eight. The issues related to ambiguity when handling strings (see Text versus binary data) are dealt with in its sub-modules xoutil.eight.string (technical strings) and xoutil.eight.text (text).

These modules can be used transparently in both Python versions.

Encoding Hell

Representing the entire range of characters requires some kind of encoding system. The most widely used today is probably the UTF family.

This complex diversity, even when strictly necessary for most applications, represents an actual “hell” for programmers.

For more references, see the codecs standard module, and also the xoutil.future.codecs and xoutil.eight.text extension modules.
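
Why the diversity of encodings is painful can be seen in a short sketch using only the standard library (Python 3 assumed; the variable names are illustrative):

```python
import codecs

char = 'λ'  # U+03BB

# The same character yields different bytes under different encodings.
utf8 = char.encode('utf-8')         # two bytes
utf16 = char.encode('utf-16-le')    # two different bytes

# Decoding with the wrong codec may succeed silently and produce garbage...
wrong = utf8.decode('latin-1')

# ...or fail outright, depending on the codec.
try:
    utf8.decode('ascii')
except UnicodeDecodeError:
    failed = True
else:
    failed = False

# codecs.lookup() resolves aliases to a canonical codec name.
canonical = codecs.lookup('UTF8').name
```

Only a decode with the encoding that produced the bytes is guaranteed to round-trip, which is why the encoding has to be tracked alongside any binary string.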

Changes in 1.8.0 in xoutil.string.

  • xoutil.future.codecs: The functions force_encoding(), safe_decode(), and safe_encode() were moved here.

  • xoutil.eight.string: Technical string handling. In this module:

    • force(): Replaces the old safe_str and force_str versions.

    • safe_join(): Replaces the old version in the future module. This function is useless; it’s equivalent to:

      force(vale).join(force(item) for item in iterator)
      
    • force_ascii(): Replaces the old normalize_ascii. This function is safe, and the result is of the standard str type, containing only the ASCII equivalents of the characters in the argument.

  • xoutil.eight.text: Text handling, for strings that can be part of internationalization processes. In this module:

    • force(): Replaces the old safe_str and force_str versions, but always returns the text type.

    • safe_join(): Replaces the old version in the future module, but in this case always returns the text type. This function is useless; it’s equivalent to:

      force(vale).join(force(item) for item in iterator)
      
  • The capitalize_word function was completely removed; use the standard method word.capitalize() instead.

  • The functions capitalize, normalize_name, normalize_title, normalize_str, parse_boolean, and parse_url_int were completely removed.

  • normalize_unicode was completely removed; it is now replaced by xoutil.eight.text.force().

  • hyphen_name was moved to xoutil.cli.tools.

  • strfnumber was moved as an internal function of the xoutil.future.datetime module.

  • The function normalize_slug is now deprecated; use slugify() instead.
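
As a rough illustration of two items in the list above, the sketch below shows the standard capitalize() method and one common way to reduce text to ASCII equivalents. The helper ascii_only is hypothetical: it is not the implementation of force_ascii(), and it assumes that stripping combining marks after NFKD normalization is an acceptable approximation:

```python
import unicodedata


def ascii_only(text):
    """Return `text` reduced to plain ASCII characters.

    Hypothetical helper: decomposes accented characters (NFKD) and then
    drops whatever cannot be encoded as ASCII.
    """
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


plain = ascii_only('Canción')

# The standard replacement for the removed capitalize_word function.
word = 'hello'
capitalized = word.capitalize()
```

Characters with no ASCII decomposition are dropped entirely by this approach, so it suits identifiers and slugs better than text meant for display.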