String Ambiguity in Python
In Python there are three semantic types when handling character strings:
Text: by nature can be part of internationalization processes. See Unicode, a standard for the representation and handling of text in most of the world’s writing systems.
  In Python 2 there is a special type, unicode, to process text, although str is sometimes also used to hold the encoded content; in Python 3 str is always represented as Unicode.
Technical Strings: those used for some special object names (classes, functions, modules, …), the __all__ definition in modules, identifiers, etc. Most of the time these values must be instances of the str type. Try the following in Python 2:
  >>> class Foobar(object):
  ...     pass
  >>> Foobar.__name__ = u'foobar'
  TypeError: can only assign string to xxx.__name__, not 'unicode'
  In Python 2 str and bytes are synonymous; in Python 3 they are different types, and bytes is used exclusively for binary strings.
Binary Strings: binary data (normally not readable by humans) represented as a character string. In Python 3 the main built-in type for this concept is bytes.
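The same restriction on technical strings can be observed in Python 3, where the offending type is bytes instead of unicode. The following is a minimal sketch, reusing the Foobar class from the example above; the variable message is introduced here only for illustration.

```python
class Foobar(object):
    pass


# In Python 3, str and bytes are distinct types.
assert type(b'foobar') is not type('foobar')

# Assigning a binary string as a class name is rejected.
try:
    Foobar.__name__ = b'foobar'
except TypeError as error:
    message = str(error)
else:
    message = None

# A standard str value is accepted.
Foobar.__name__ = 'foobar'
```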
Mismatch Semantics Comparison
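In the Python 2 series, equality comparisons between unicode and str values do not always match: when the str value holds non-ASCII bytes, the implicit conversion to Unicode fails. The following comparison yields False in that version: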
>>> s = 'λ'
>>> u = u'λ'
>>> u == s
False
A UnicodeWarning is also issued with the message “Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal”.
To correctly compare, use the same type. For example:
>>> from xoutil.eight.text import force
>>> force(s) == force(u)
True
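In Python 3 the analogous mismatch is between bytes and str. The sketch below uses only built-in methods (not the xoutil API) and assumes the binary value is UTF-8 encoded; it brings both operands to the same type before comparing, in either direction.

```python
text = '\u03bb'             # Greek small letter lambda, a str value
raw = text.encode('utf-8')  # the same character as UTF-8 bytes

# Compare as text: decode the binary operand first.
same_as_text = raw.decode('utf-8') == text

# Compare as binary: encode the text operand first.
same_as_bytes = raw == text.encode('utf-8')
```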
Compatibility Modules
Xoutil has a Python 2 and 3 compatibility package named eight. The issues related to ambiguity when handling strings (see Text versus binary data) are dealt with in these sub-modules:
text: tools related to text handling. In Python 2 values are processed with unicode, and in Python 3 with the standard str type.
string: tools that, in both versions of Python, always use the standard str type to fulfill technical-string semantics.
These modules can be used transparently in both Python versions.
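The split between the two sub-modules can be illustrated with a small version switch. This is only a sketch of the general technique, not the xoutil implementation: the names PY3, text_type, force_text and force_native are hypothetical, and UTF-8 is assumed as the default encoding.

```python
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (only evaluated on Python 2)


def force_text(value, encoding='utf-8'):
    """Return value as the text type (unicode in Python 2, str in 3)."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)


def force_native(value, encoding='utf-8'):
    """Return value as the standard str type of the running version."""
    if isinstance(value, str):
        return value
    if PY3:
        # Python 3: str is text, so binary values must be decoded.
        if isinstance(value, bytes):
            return value.decode(encoding)
        return str(value)
    # Python 2: str is binary, so text values must be encoded.
    if isinstance(value, text_type):
        return value.encode(encoding)
    return str(value)
```

With helpers of this kind, text handling always goes through the text type, while names and identifiers always end up as standard str values, whichever Python version runs the code.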
Encoding Hell
Representing an entire range of characters requires some kind of encoding system. The most popular today is probably the UTF family.
This complex diversity, even when strictly necessary for most applications, represents an actual “hell” for programmers.
For more references see the standard codecs module, as well as the xoutil.future.codecs and xoutil.eight.text extension modules.
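A short Python 3 sketch of why this diversity is painful: the same text maps to different byte sequences under different encodings, and decoding with the wrong codec either produces garbage silently or fails outright. Only built-in codecs are used; the variable names are illustrative.

```python
text = 'caf\u00e9'  # "café", with a precomposed e-acute

utf8 = text.encode('utf-8')        # two bytes for the accented letter
latin1 = text.encode('latin-1')    # one byte for the accented letter
utf16 = text.encode('utf-16-le')   # two bytes for every character here

# Wrong codec, silent damage: UTF-8 bytes read as Latin-1.
mojibake = utf8.decode('latin-1')

# Wrong codec, loud failure: Latin-1 bytes are not valid UTF-8.
try:
    latin1.decode('utf-8')
except UnicodeDecodeError:
    decode_failed = True
else:
    decode_failed = False
```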
Changes in 1.8.0 in xoutil.string
xoutil.future.codecs: The functions force_encoding(), safe_decode(), and safe_encode() were moved here.
xoutil.eight.string: Technical string handling. In this module:
  force(): Replaces the old safe_str and force_str versions.
  safe_join(): Replaces the old version in the future module. This function is useless; it is equivalent to: force(vale).join(force(item) for item in iterator)
  force_ascii(): Replaces the old normalize_ascii. This function is safe, and the result is of the standard str type, containing only the ASCII equivalents of the characters in the argument.
xoutil.eight.text: Text handling; strings can be part of internationalization processes. In this module:
  force(): Replaces the old safe_str and force_str versions, but always returns the text type.
  safe_join(): Replaces the old version in the future module, but in this case always returns the text type. This function is useless; it is equivalent to: force(vale).join(force(item) for item in iterator)
The capitalize_word function was completely removed; use the standard method word.capitalize() instead.
The functions capitalize, normalize_name, normalize_title, normalize_str, parse_boolean, and parse_url_int were completely removed.
normalize_unicode was completely removed; it is now replaced by xoutil.eight.text.force().
hyphen_name was moved to xoutil.cli.tools.
strfnumber was moved to the xoutil.future.datetime module as an internal function.
The function normalize_slug is now deprecated; use slugify() instead.
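The equivalences mentioned in this list can be sketched in plain Python 3. The functions force, join and ascii_fold below are hypothetical stand-ins written for illustration; they are not the xoutil implementations and their signatures are assumptions. The ASCII folding relies on Unicode NFKD decomposition, which is one common way to obtain ASCII equivalents.

```python
import unicodedata


def force(value, encoding='utf-8'):
    """Hypothetical stand-in for force(): coerce any value to str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


def join(separator, items):
    """The expression that safe_join() is said to be equivalent to."""
    return force(separator).join(force(item) for item in items)


def ascii_fold(value):
    """Keep only ASCII equivalents, dropping what cannot be mapped."""
    decomposed = unicodedata.normalize('NFKD', force(value))
    return decomposed.encode('ascii', 'ignore').decode('ascii')


joined = join(b', ', ['a', b'b', 3])
folded = ascii_fold('caf\u00e9')
capitalized = 'word'.capitalize()  # replacement for capitalize_word
```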