[vtk-developers] Python 3 and unicode

Mon Aug 24 08:27:17 EDT 2015

This seems reasonable to me.

On Sun, Aug 23, 2015 at 8:59 PM, David Gobbi <david.gobbi at gmail.com> wrote:

> Hi All,
>
> I'd like to briefly describe the way that the Python 3 wrapping currently
> handles unicode, before soliciting advice on how things could be improved.
>
> First, note that there are two ways you can provide a string to the
> wrappers: as a unicode str() object, or as an 8-bit bytes() object.  These
> are illustrated in the following example:
>
>     a = vtkStringArray()
>     a.InsertNextValue("ç")
>     a.InsertNextValue("ç".encode('latin1'))
>
> In the first case, the wrappers silently convert the Python unicode str()
> to a C++ utf-8 string, which is stored in the array.  In the second case, a
> unicode object is explicitly converted into a latin1-encoded bytes()
> object, which is stored as a C++ string.
>
> When you try to get the strings out of the array, Python has no way of
> knowing what encoding was used.  Right now, it assumes utf-8:
>
>     a.GetValue(0)
>     => 'ç'
>     a.GetValue(1)
>     => UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in
>        position 0: unexpected end of data
>
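The failure above can be reproduced without VTK, since it comes down to how the two byte strings decode. This is a minimal sketch in plain Python 3; the variable names are illustrative and nothing here calls the wrappers themselves.

```python
# The same character, encoded two ways.
utf8_bytes = "ç".encode("utf-8")     # two bytes: b'\xc3\xa7'
latin1_bytes = "ç".encode("latin1")  # one byte:  b'\xe7'

# utf-8 bytes round-trip cleanly when decoded as utf-8.
decoded = utf8_bytes.decode("utf-8")

# latin1 bytes are not valid utf-8: 0xe7 announces a multi-byte
# sequence, but no continuation bytes follow it.
error = None
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

print(repr(decoded))
print(error)
```

The lone 0xe7 byte is what triggers the exception in the GetValue(1) call.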
> This isn't a good situation.  If the string isn't stored as utf-8 (or
> ascii), then retrieving it is impossible.  I have an easy fix that provides
> a 95% solution, but I'd like advice before implementing it.  Here's the fix:
>
> The wrappers will always try to decode VTK 8-bit strings as utf-8 and
> return a Python 3 str() object.  But if that fails, then rather than
> raising an exception, the wrappers will keep the string in its original
> encoding and return it as a bytes() object.  Going back to the example:
>
>     a.GetValue(0)
>     => 'ç'
>     a.GetValue(1)
>     => b'\xe7'
>     b'\xe7'.decode('latin1')
>     => 'ç'
>
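The proposed behaviour can be sketched in pure Python as a try/except around the utf-8 decode. The function name `decode_or_bytes` is hypothetical and only stands in for what the wrappers would do internally; the real change would live in the C++ wrapper code.

```python
def decode_or_bytes(raw):
    """Hypothetical stand-in for the proposed wrapper behaviour:
    return str if the bytes are valid utf-8, else the bytes unchanged."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid utf-8: hand back the original 8-bit data untouched.
        return raw


first = decode_or_bytes("ç".encode("utf-8"))    # stored as utf-8
second = decode_or_bytes("ç".encode("latin1"))  # stored as latin1

print(repr(first))
print(repr(second))
print(repr(second.decode("latin1")))
```

Under this sketch the caller sees a str for the first value and a bytes object for the second, matching the example above.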
> In other words, if you use something other than ascii or utf-8, then it's
> your responsibility to do the encoding and decoding.  Or, you can leave the
> value as an 8-bit bytes() object and simply pass it along (for example if
> you are getting a string from an array, and then setting it as the filename
> for a reader).
>
> I think this is a good solution, but here is why it's only a 95%
> solution.  In the above scenario, the wrappers will always attempt to
> decode 8-bit strings as utf-8, and only if that fails will they return a
> raw bytes() object.  Some byte sequences in other encodings happen to be
> valid utf-8, so the wrappers might return a unicode str() for something
> that was not stored as utf-8!  In this case, you would have to:
>
> a) check to see if the value is a str() or a bytes() object
> b) if it is a bytes object, decode as e.g. latin1
> c) if it is a str object, encode as utf-8 and then decode as e.g. latin1
>
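Steps a) to c) can be sketched as a small helper. The names `wrapper_get_value` and `as_text` are hypothetical, the first only simulates the proposed fallback, and latin1 is assumed as the "other" encoding.

```python
def wrapper_get_value(raw):
    """Simulates the proposed wrapper: str if valid utf-8, else bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw


def as_text(value, encoding="latin1"):
    """Recover text that was stored in `encoding` (assumed latin1 here)."""
    if isinstance(value, bytes):
        # b) the utf-8 decode failed, so decode the raw bytes directly
        return value.decode(encoding)
    # c) the bytes were decoded as utf-8 by accident: undo, then redo
    return value.encode("utf-8").decode(encoding)


# The latin1 text "Ã§" is stored as b'\xc3\xa7', which is also valid
# utf-8 for "ç", so the simulated wrapper returns the wrong str.
stored = "Ã§".encode("latin1")
misread = wrapper_get_value(stored)
recovered = as_text(misread)

print(repr(misread), repr(recovered))
```

Step c) works because encoding the str back to utf-8 reproduces the original bytes exactly, so no information is lost by the mistaken decode.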
> Because I'm a utf-8 chauvinist, my main concern is that people who use
> utf-8 don't have to do anything special.  It bothers me a bit that people
> who use other encodings need to jump through these hoops, but I can live
> with that.
>
> Any opinions, advice, or requests for clarification?
>
>  - David

-- 
William J. Schroeder, PhD
Kitware, Inc.
28 Corporate Drive
Clifton Park, NY 12065
will.schroeder at kitware.com
http://www.kitware.com
(518) 881-4902