<div dir="ltr"><div>Hi All,</div><div><br></div><div>I'd like to briefly describe how the Python 3 wrapping currently handles unicode, before soliciting advice on how things could be improved.</div><div><br></div><div>First, note that there are two ways to provide a string to the wrappers: as a unicode str() object, or as an 8-bit bytes() object.  Both are illustrated in the following example:</div><br><div>    from vtk import vtkStringArray</div><div>    a = vtkStringArray()</div><div>    a.InsertNextValue("ç")</div><div>    a.InsertNextValue("ç".encode('latin1'))</div><div><br></div><div>In the first case, the wrappers silently convert the Python unicode str() to a C++ utf-8 string, which is stored in the array. In the second case, a unicode object is explicitly converted into a latin1-encoded bytes() object, whose bytes are stored unchanged as a C++ string.</div><div><br></div><div>When you try to get the strings out of the array, Python has no way of knowing which encoding was used.  Right now, it assumes utf-8:</div><div><br></div><div>    a.GetValue(0)</div><div>    => 'ç'</div><div>    a.GetValue(1)</div><div>    => UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 0: unexpected end of data</div><div><br></div><div>This isn't a good situation.  If the string isn't stored as utf-8 (or ASCII), then it cannot be retrieved through the wrappers at all.  I have an easy fix that provides a 95% solution, but I'd like advice before implementing it.  Here's the fix:</div><div><br></div><div>The wrappers will always try to decode VTK 8-bit strings as utf-8 and return a Python 3 str() object.  But if that fails, then rather than raising an exception, the wrappers will keep the string in its original encoding and return it as a bytes() object.  
Going back to the example:</div><div><br></div><div><div>    a.GetValue(0)</div><div>    => 'ç'</div><div>    a.GetValue(1)</div><div>    => b'\xe7'</div><div>    b'\xe7'.decode('latin1')</div><div>    => 'ç'</div><div><br></div><div>In other words, if you use something other than ASCII or utf-8, then it's your responsibility to do the encoding and decoding.  Or, you can leave the value as an 8-bit bytes() object and simply pass it along (for example, if you are getting a string from an array and then setting it as the filename for a reader).</div><div><br></div><div>I think this is a good solution, but here is why it's only a 95% solution.  In the above scenario, the wrappers will always attempt to decode 8-bit strings as utf-8, and only if that fails will they return a raw bytes() object. So the wrappers might return a unicode str() for something that was not stored as utf-8!  For example, the latin1 bytes b'\xc3\xa7' (the text 'Ã§') happen to be valid utf-8, so they would come back as the str 'ç'.  To be safe, you would have to:</div><div><br></div><div>a) check to see if the value is a str() or a bytes() object</div>
</div><div>b) if it is a bytes() object, decode it as e.g. latin1</div><div>c) if it is a str() object, encode it as utf-8 to get the original bytes back, and then decode those as e.g. latin1</div><div><br></div><div>Because I'm a utf-8 chauvinist, my main concern is that people who use utf-8 don't have to do anything special.  It bothers me a bit that people who use other encodings need to jump through these hoops, but I can live with that.</div><div><br></div><div>Any opinions, advice, or requests for clarification?</div><div><br></div><div> - David</div>
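<div><br></div><div>P.S. As a rough illustration of steps a) to c), here is a minimal pure-Python sketch that does not need VTK. Both function names are hypothetical helpers, not part of the wrappers: simulated_get_value() only mimics the proposed "try utf-8, else return bytes" behaviour, and latin1 is just the assumed example encoding.</div>
<pre>
```python
def simulated_get_value(raw):
    """Hypothetical stand-in for the proposed wrapper behaviour:
    try utf-8 first, and fall back to returning the raw bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw


def recover_text(value, encoding="latin1"):
    """Hypothetical helper for callers who stored non-utf-8 strings."""
    # a) check whether we got a str or a bytes object
    if isinstance(value, bytes):
        # b) utf-8 decoding failed in the wrapper, so decode it ourselves
        return value.decode(encoding)
    # c) the wrapper decoded the bytes as utf-8; undo that to get the
    #    original bytes back, then decode with the real encoding
    return value.encode("utf-8").decode(encoding)


# latin1 bytes that are not valid utf-8 come back as bytes (step b)
first = recover_text(simulated_get_value("ç".encode("latin1")))

# latin1 bytes that happen to be valid utf-8 come back as str (step c)
second = recover_text(simulated_get_value("Ã§".encode("latin1")))
```
</pre>
<div>Here first is 'ç' and second is 'Ã§', so both latin1 strings survive the round trip, whichever type the wrappers hand back.</div>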




</div>