[vtk-developers] Python 3 and unicode

Sun Aug 23 20:59:18 EDT 2015

Hi All,

I'd like to briefly describe the way that the Python 3 wrapping currently
handles unicode, before soliciting advice on how things could be improved.

First, note that there are two ways you can provide a string to the
wrappers: as a unicode str() object, or as an 8-bit bytes() object.  These
are illustrated in the following example:

    a = vtkStringArray()
    a.InsertNextValue("ç")
    a.InsertNextValue("ç".encode('latin1'))

In the first case, the wrappers silently convert the Python unicode str()
to a C++ utf-8 string, which is stored in the array.  In the second case,
the str() is explicitly encoded as a latin1 bytes() object, whose bytes are
stored as a C++ string.

When you try to get the strings out of the array, Python has no way of
knowing what encoding was used.  Right now, the wrappers assume utf-8:

    a.GetValue(0)
    => 'ç'
    a.GetValue(1)
    => UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 0: unexpected end of data
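
The same failure can be reproduced without VTK.  The following is only a
plain-Python sketch of the decode step, under the assumption that the
wrappers do the equivalent of bytes.decode('utf-8') when reading a value
back; the variable names are made up for illustration:

```python
# Plain-Python illustration; no VTK required.
utf8_bytes = "ç".encode("utf-8")     # b'\xc3\xa7', what gets stored for a str()
latin1_bytes = "ç".encode("latin1")  # b'\xe7', what the explicit bytes() stores

# Reading back as utf-8 works for the first value...
assert utf8_bytes.decode("utf-8") == "ç"

# ...but a lone 0xe7 byte is the lead byte of a multi-byte utf-8 sequence
# with nothing following it, so decoding the second value fails.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(decode_failed)
```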

This isn't a good situation.  If the string isn't stored as utf-8 (or
ascii), then retrieving it is impossible.  I have an easy fix that provides
a 95% solution, but I'd like advice before implementing it.  Here's the fix:

The wrappers will always try to decode VTK 8-bit strings as utf-8 and
return a Python 3 str() object.  But if that fails, then rather than
raising an exception, the wrappers will keep the string in its original
encoding and return it as a bytes() object.  Going back to the example:

    a.GetValue(0)
    => 'ç'
    a.GetValue(1)
    => b'\xe7'
    b'\xe7'.decode('latin1')
    => 'ç'
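
In pure Python, the proposed behaviour amounts to something like the
sketch below.  The name decode_or_bytes is hypothetical and is not part of
the wrappers (the real change would live in the C++ wrapper code); it only
illustrates the "try utf-8, fall back to bytes()" rule:

```python
def decode_or_bytes(raw: bytes):
    """Hypothetical helper: return a str if raw is valid utf-8,
    otherwise return the original bytes unchanged."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw

# The two values from the example array.
first = decode_or_bytes("ç".encode("utf-8"))
second = decode_or_bytes("ç".encode("latin1"))
print(repr(first), repr(second))
```

The caller sees a str() whenever utf-8 decoding succeeds and gets the
untouched bytes() otherwise, so no information is lost in either case.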

In other words, if you use something other than ascii or utf-8, then it's
your responsibility to do the encoding and decoding.  Or, you can leave the
value as an 8-bit bytes() object and simply pass it along (for example if
you are getting a string from an array, and then setting it as the filename
for a reader).

I think this is a good solution, but here is why it's only a 95% solution.
In the above scenario, the wrappers will always attempt to decode 8-bit
strings as utf-8, and only if that fails will they return a raw bytes()
object.  Some byte sequences written in another encoding happen to be valid
utf-8, so the wrappers might return a unicode str() for something that was
not stored as utf-8!  To handle this case, you would have to:

a) check to see if the value is a str() or a bytes() object
b) if it is a bytes object, decode as e.g. latin1
c) if it is a str object, encode as utf-8 and then decode as e.g. latin1
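
Steps (a) to (c) could be wrapped in a small helper on the user side.  This
is only a sketch: as_text is a hypothetical name, and it assumes you know
which encoding (latin1 here) the strings were originally stored in:

```python
def as_text(value, encoding="latin1"):
    """Hypothetical helper: recover text known to be stored in `encoding`."""
    # (a) check whether the wrappers returned bytes() or str()
    if isinstance(value, bytes):
        # (b) utf-8 decoding failed, so decode the raw bytes directly
        return value.decode(encoding)
    # (c) the bytes happened to be valid utf-8: undo that decode, then
    # decode again with the encoding that was actually used
    return value.encode("utf-8").decode(encoding)

# "Ã§" in latin1 is b'\xc3\xa7', which is also the utf-8 encoding of "ç",
# so decoding it as utf-8 succeeds and yields the wrong text.
stored = "Ã§".encode("latin1")
misread = stored.decode("utf-8")
print(repr(misread), repr(as_text(misread)), repr(as_text(b"\xe7")))
```

Pure ascii values pass through step (c) unchanged, since ascii bytes mean
the same thing in both utf-8 and latin1.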

Because I'm a utf-8 chauvinist, my main concern is that people who use
utf-8 don't have to do anything special.  It bothers me a bit that people
who use other encodings need to jump through these hoops, but I can live
with that.

Any opinions, advice, or requests for clarification?

 - David