What does unicodedata.normalize do in Python?

import unicodedata
my_var = "this is a string"
my_var2 = " Esta es una oración que está en español "
my_var3 = unicodedata.normalize('NFKD', my_var2).encode('ascii', 'ignore')
output = my_var + my_var3
print(output)
Python then finishes with the following error:
**File "C:/path/to/my/file/testing_file.py", line 5, in <module>
    output = my_var + my_var3
TypeError: Can't convert 'bytes' object to str implicitly
Process finished with exit code 1**
I would like to know what this code does. This logic was implemented in another project by another developer, and I can't understand it at all.
How can I solve this problem? I need a string that I can manipulate afterwards.
                Please tag appropriately, I'm assuming this is python 2.x? BTW what normalize does is convert characters that can be represented by more than one sequence of code points to a single standard representation, e.g. the Spanish ñ can be U+00F1 or a regular n followed by U+0303. The composed forms (NFC, NFKC) convert all instances of the latter to the former; the decomposed forms (NFD, NFKD) do the reverse.
– Jared Smith
                Aug 6, 2018 at 15:25
                However, now my script is up and running. I also thank you for your explanation, quite helpful for me.
– Javier Ramirez
                Aug 6, 2018 at 15:40
                No worries. I added the python2.7 tag for future readers. Glad the explanation was helpful, I once spent an embarrassing amount of time tracking down a bug in an application that was related to un-normalized unicode content.
– Jared Smith
                Aug 6, 2018 at 15:42
In Python 3, str.encode() creates a bytes object, which cannot be concatenated with a regular string. You have to convert the result back to a string again; the method is predictably called decode.
my_var3 = unicodedata.normalize('NFKD', my_var2).encode('ascii', 'ignore').decode('ascii')
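Here is a minimal sketch of the same fix split into separate steps, assuming Python 3, to show which type each call returns. The names decomposed, ascii_bytes and ascii_text are illustrative only and do not appear in the question.

```python
import unicodedata

my_var = "this is a string"
my_var2 = " Esta es una oración que está en español "

# Step 1: normalize() takes a str and returns a str; with NFKD,
# accented letters are split into a base letter plus combining marks.
decomposed = unicodedata.normalize('NFKD', my_var2)

# Step 2: encode() turns the str into bytes; 'ignore' silently drops
# every character that has no ASCII encoding (here, the combining marks).
ascii_bytes = decomposed.encode('ascii', 'ignore')

# Step 3: decode() turns the bytes back into a str, which can then be
# concatenated with my_var without a TypeError.
ascii_text = ascii_bytes.decode('ascii')

output = my_var + ascii_text
print(type(decomposed).__name__, type(ascii_bytes).__name__, type(ascii_text).__name__)
print(output)
```

The TypeError in the question comes from stopping after step 2 and trying to add bytes to a str.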
In Python 2, there was no hard distinction between Unicode strings and "regular" (byte) strings, and the two were converted implicitly, but that meant many hard-to-catch bugs were introduced when programmers made careless assumptions about the encoding of the strings they were manipulating.
As for what the normalization does, it makes sure characters which look identical actually are identical. For example, ñ can be represented either as the single code point U+00F1 LATIN SMALL LETTER N WITH TILDE or as the combining sequence U+006E LATIN SMALL LETTER N followed by U+0303 COMBINING TILDE. Normalization converts these so that every variation is coerced into the same representation (the D normalization prefers the decomposed, combining sequence) so that strings which represent the same text are also guaranteed to contain exactly the same code points.
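The two representations of ñ described above can be compared directly. This is a small sketch, assuming Python 3; the variable names are illustrative.

```python
import unicodedata

composed = "\u00f1"      # ñ as the single code point U+00F1
decomposed = "n\u0303"   # n followed by U+0303 COMBINING TILDE

# The two strings render the same but are not equal as code point sequences.
print(composed == decomposed)

# NFD coerces both to the decomposed sequence; NFC coerces both to the composed one.
nfd = unicodedata.normalize('NFD', composed)
nfc = unicodedata.normalize('NFC', decomposed)
print(nfd == decomposed)
print(nfc == composed)
print(len(nfc), len(nfd))
```

Normalizing both sides to the same form before comparing is what makes the equality check reliable.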
Because decomposed characters in many Latin-based languages are often a sequence of a plain ASCII character followed by a number of combining diacritics which are not legacy ASCII characters, converting the string to 7-bit ASCII with the 'ignore' error handler will often strip accents but leave the text almost readable. Götterdämmerung gets converted to Gotterdammerung etc.
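As a rough sketch of that accent-stripping idiom, assuming Python 3, the chain can be wrapped in a helper. The function name strip_accents is hypothetical and not part of unicodedata. The second call shows the limitation hinted at by "often": a character with no decomposition, such as ß, is dropped entirely rather than replaced.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: decompose, then discard every non-ASCII code point."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# Letters with combining diacritics keep their base letter.
print(strip_accents("Götterdämmerung"))

# ß has no decomposition, so the 'ignore' handler removes it altogether.
print(strip_accents("Straße"))
```

For text where losing characters is unacceptable, a transliteration table or library would be needed instead; this idiom only removes what it cannot encode.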
                Your explanation was very clear. Now I think I can put normalize into practice; I consider it quite a powerful method. Thank you!
– Javier Ramirez
                Aug 6, 2018 at 16:28
                It worked perfectly for Portuguese, the way @tripleee described above. Thanks! For example: unicodedata.normalize('NFKD', 'José João Caminhão Cachaçaria Pêssegó').encode('ascii', 'ignore').decode('utf8').upper() gives the output: JOSE JOAO CAMINHAO CACHACARIA PESSEGO
– Bitart
                May 7, 2021 at 12:59
You need to declare the encoding of the source file.
Then you need to pass unicode objects instead of byte strings as arguments to normalize().
# -*- coding: utf-8 -*-
import unicodedata
my_var = u"this is a string"
my_var2 = u" Esta es una oración que está en español "
my_var3 = unicodedata.normalize(u'NFKD', my_var2).encode('ascii', 'ignore').decode('utf8')
output = my_var + my_var3
print(output)
                Thanks for your time and help. I do not know why, but PyCharm wasn't displaying the decode method. It is working now. Thanks again.
– Javier Ramirez
                Aug 6, 2018 at 15:38