[vtk-developers] Python 3 and unicode
david.gobbi at gmail.com
Sun Aug 23 20:59:18 EDT 2015
I'd like to briefly describe the way that the Python 3 wrapping currently
handles unicode, before soliciting advice on how things could be improved.
First, note that there are two ways you can provide a string to the
wrappers: as a unicode str() object, or as an 8-bit bytes() object. These
are illustrated in the following example:
a = vtkStringArray()
In the first case, the wrappers silently convert the Python unicode str()
to a C++ utf-8 string, which is stored in the array. In the second case, a
unicode object is explicitly converted into a latin1-encoded bytes()
object, which is stored as a C++ string.
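
To make the difference concrete, here is a plain-Python sketch (no VTK involved) of the bytes that would end up in the C++ string in each case. The sample text "ç" is an assumption chosen for illustration; any non-ascii character shows the same effect.

```python
# Assumed sample text; not taken from the VTK wrappers themselves.
text = "ç"

# Case 1: a str is passed in, so its utf-8 encoding is what gets stored.
stored_from_str = text.encode("utf-8")

# Case 2: the caller encodes to latin1 first, and those bytes are stored as-is.
stored_from_bytes = text.encode("latin1")

print(stored_from_str)    # b'\xc3\xa7'
print(stored_from_bytes)  # b'\xe7'
```

The same character ends up as two different byte sequences, and nothing in the stored string records which encoding produced it.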
When you try to get the strings out of the array, the wrappers have no way
of knowing what encoding was used. Right now, they assume utf-8:
=> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 0: unexpected end of data
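
The same failure can be reproduced in plain Python, without VTK, by decoding a latin1-encoded byte string as utf-8. The value b'\xe7' (latin1 for "ç") is an assumed sample that matches the byte named in the error message.

```python
# Assumed sample: "ç" encoded as latin1 is the single byte 0xe7.
raw = "ç".encode("latin1")

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # 0xe7 announces a three-byte utf-8 sequence, but no bytes follow it.
    error = exc

print(error)
```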
This isn't a good situation. If the string isn't stored as utf-8 (or
ascii), then retrieving it is impossible. I have an easy fix that provides
a 95% solution, but I'd like advice before implementing it. Here's the fix:
The wrappers will always try to decode VTK 8-bit strings as utf-8 and
return a Python 3 str() object. But if that fails, then rather than
raising an exception, the wrappers will keep the string in its original
encoding and return it as a bytes() object. Going back to the example:
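
A minimal sketch of that fallback in plain Python. The helper name decode_vtk_string is hypothetical and only stands in for what the wrappers would do internally; it is not an existing VTK function.

```python
def decode_vtk_string(raw: bytes):
    """Hypothetical stand-in for the proposed wrapper behavior.

    Return a str if the bytes are valid utf-8, otherwise return the
    original bytes object unchanged instead of raising.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw


# Stored as utf-8: comes back as a str.
utf8_value = decode_vtk_string("ç".encode("utf-8"))

# Stored as latin1: comes back as the raw bytes, with no exception.
latin1_value = decode_vtk_string("ç".encode("latin1"))

print(repr(utf8_value))    # 'ç'
print(repr(latin1_value))  # b'\xe7'
```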
In other words, if you use something other than ascii or utf-8, then it's
your responsibility to do the encoding and decoding. Or, you can leave the
value as an 8-bit bytes() object and simply pass it along (for example if
you are getting a string from an array, and then setting it as the filename
for a reader).
I think this is a good solution, but here is why it's only a 95% solution.
In the above scenario, the wrappers will always attempt to decode 8-bit
strings as utf-8, and only if that fails will they return a raw bytes()
object. So the wrappers might return a unicode str() for something that was
not stored as utf-8! In this case, you would have to:
a) check to see if the value is a str() or a bytes() object
b) if it is a bytes object, decode as e.g. latin1
c) if it is a str object, encode as utf-8 and then decode as e.g. latin1
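
Steps a) to c) can be sketched as a small helper. The name as_latin1_text is hypothetical, and latin1 is only the example encoding; the sample bytes b'\xc3\xa9' are an assumption chosen because they are valid in both encodings.

```python
def as_latin1_text(value):
    """Hypothetical helper: recover latin1 text from a wrapper return value."""
    # a) check whether the wrappers returned bytes or str
    if isinstance(value, bytes):
        # b) utf-8 decoding failed in the wrappers, so decode as latin1
        return value.decode("latin1")
    # c) utf-8 decoding succeeded by accident: undo it, then decode as latin1
    return value.encode("utf-8").decode("latin1")


# b'\xc3\xa9' is "Ã©" in latin1, but it is also valid utf-8 for "é".
stored = "Ã©".encode("latin1")
returned = stored.decode("utf-8")      # what the wrappers would hand back
recovered = as_latin1_text(returned)   # the text that was actually stored

print(repr(returned))                  # 'é'
print(repr(recovered))                 # 'Ã©'
print(repr(as_latin1_text(b"\xe7")))   # 'ç'
```

Step c) works because encoding the returned str as utf-8 reproduces the stored bytes exactly, so no information is lost by the accidental decode.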
Because I'm a utf-8 chauvinist, my main concern is that people who use
utf-8 don't have to do anything special. It bothers me a bit that people
who use other encodings need to jump through these hoops, but I can live
with it.
Any opinions, advice, or requests for clarification?