Convert a Unicode string to a string in Python (containing extra symbols)

We need to know what Python version you are using, and what it is that you are calling a Unicode string. Do the following on a short unicode_string that includes the currency symbols that are causing the bother:

Python 2.x :


    print type(unicode_string), repr(unicode_string)

Python 3.x :


    print(type(unicode_string), ascii(unicode_string))

Then edit your question and copy/paste the results of the above print statement. DON'T retype the results. Also look up near the top of your HTML and see if you can find something like this: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859
– John Machin Jul 30, 2009 at 16:13

You should really clarify what you mean by "Unicode string" and "Python string" (concrete examples would be best), as it's clear from the comments that there are different interpretations of your question. I wonder why you haven't done this although it's over 3.5 years since you asked this question.
– Piotr Dobrogost Jan 21, 2013 at 12:45

@jalf: If it is encoded, it is no longer Unicode, e.g.,


    unicode_string = u"I'm unicode string"; bytestring = unicode_string.encode('utf-8'); unicode_again = bytestring.decode('utf-8')

– jfs Dec 21, 2013 at 1:47

title = u"Klüft skräms inför på fédéral électoral große"
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
'Kluft skrams infor pa federal electoral groe'
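For readers on Python 3, where every str is already Unicode, the same idea might look like the sketch below. The helper name strip_to_ascii is made up for illustration. Note that this is lossy: characters with no ASCII decomposition, such as ß, are dropped entirely.

```python
import unicodedata

def strip_to_ascii(text):
    """Hypothetical helper: decompose with NFKD, then drop anything non-ASCII."""
    # NFKD splits 'ü' into 'u' plus a combining diaeresis; the combining
    # mark is not ASCII, so the 'ignore' handler discards it.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

title = "Klüft skräms inför på fédéral électoral große"
ascii_title = strip_to_ascii(title)
print(ascii_title)  # Kluft skrams infor pa federal electoral groe
```

If the goal is only to save the text to a file, encoding to UTF-8 keeps every character and avoids this loss.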
                +1 answers the question as worded, @williamtroup's problem of not being able to save unicode to a file sounds like an entirely different issue worthy of a separate question
– Mark Roddy
                Jul 30, 2009 at 16:03
                @Mark Roddy: His question as written is how to convert a "Unicode string" (whatever he means by that) containing some currency symbols to a "Python string" (whatever ...)  and you think that a remove-some-diacritics delete-other-non-ascii characters kludge answers his question???
– John Machin
                Jul 30, 2009 at 16:25
                @JohnMachin This answers the question word for word:  The only way to convert a unicode string to a str is to either drop or convert the characters that cannot be represented in ASCII.  So +1 from me.
– Izkata
                Oct 14, 2013 at 21:45
                @Izkata: no, it does not. type(title) == unicode and type(title.encode('utf-8')) == str. There is no need to corrupt the input to get a bytestring that can be saved to a file.
– jfs
                Dec 21, 2013 at 1:53
You can use encode to ASCII if you don't need to translate the non-ASCII characters:
>>> a=u"aaaàçççñññ"
>>> type(a)
<type 'unicode'>
>>> a.encode('ascii','ignore')
'aaa'
>>> a.encode('ascii','replace')
'aaa???????'
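As a rough Python 3 sketch of the same comparison, the snippet below runs the sample string through several of the standard error handlers accepted by str.encode. The variable names are illustrative only, and the string is written with escapes so that the accented characters are definitely precomposed.

```python
# 'aaaàçççñññ' written with escapes so each accented letter is one code point
text = "aaa\u00e0\u00e7\u00e7\u00e7\u00f1\u00f1\u00f1"

ignored = text.encode('ascii', 'ignore')              # drop what cannot be encoded
replaced = text.encode('ascii', 'replace')            # substitute '?'
xmlrefs = text.encode('ascii', 'xmlcharrefreplace')   # numeric character references
escaped = text.encode('ascii', 'backslashreplace')    # Python-style escapes

# The default handler, 'strict', raises instead of losing data.
strict_failed = False
try:
    text.encode('ascii')
except UnicodeEncodeError:
    strict_failed = True

print(ignored)   # b'aaa'
print(replaced)  # b'aaa???????'
print(xmlrefs)   # b'aaa&#224;&#231;&#231;&#231;&#241;&#241;&#241;'
print(escaped)   # b'aaa\\xe0\\xe7\\xe7\\xe7\\xf1\\xf1\\xf1'
```

Only 'xmlcharrefreplace' and 'backslashreplace' keep enough information to recover the original characters later; 'ignore' and 'replace' are one-way.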
                Awesome answer. Exactly what I needed. Also, great presentation to show the effect of ignore vs replace
– Jonny Brooks
                Apr 11, 2017 at 12:19
                or a.encode('ascii', 'xmlcharrefreplace') gives 'aaa&#224;&#231;&#231;&#231;&#241;&#241;&#241;'.
– Bob Stein
                Apr 10, 2019 at 17:22
                This breaks if the content of the string is actually unicode, not just ascii characters in a unicode string. Don't do this, you'll get random UnicodeEncodeError exceptions all over the place.
– Doug
                Oct 9, 2013 at 7:31
                This answer helped me. If you know that your string is ascii and you need to cast it back to a non-unicode string, this is very useful.
– VedTopkar
                Oct 16, 2014 at 16:04
If you have a Unicode string, and you want to write this to a file, or other serialised form, you must first encode it into a particular representation that can be stored.  There are several common Unicode encodings, such as UTF-16 (uses two bytes for most Unicode characters) or UTF-8 (1-4 bytes / codepoint depending on the character), etc. To convert that string into a particular encoding, you can use:
>>> s= u'£10'
>>> s.encode('utf8')
'\xc2\xa310'
>>> s.encode('utf16')
'\xff\xfe\xa3\x001\x000\x00'
This raw string of bytes can be written to a file. However, note that when reading it back, you must know what encoding it is in and decode it using that same encoding.
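To illustrate why the decoding side has to use the same encoding, here is a small Python 3 sketch (the variable names are just for illustration). Decoding UTF-8 bytes as Latin-1 does not raise an error; it silently produces mojibake.

```python
s = "\u00a310"  # the string '£10'

utf8_bytes = s.encode("utf-8")       # b'\xc2\xa310'
utf16_bytes = s.encode("utf-16-le")  # b'\xa3\x001\x000\x00' (no byte order mark)

right = utf8_bytes.decode("utf-8")    # same encoding: the original text comes back
wrong = utf8_bytes.decode("latin-1")  # wrong encoding: no error, just garbage

print(right)  # £10
print(wrong)  # Â£10
```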
When writing to files, you can get rid of this manual encode/decode process by using the codecs module. So, to open a file that encodes all Unicode strings into UTF-8, use:
import codecs
f = codecs.open('path/to/file.txt','w','utf8')
f.write(my_unicode_string)  # Stored on disk as UTF-8
Do note that anything else that is using these files must understand what encoding the file is in if they want to read them. If you are the only one doing the reading/writing this isn't a problem, otherwise make sure that you write in a form understandable by whatever else uses the files.
In Python 3, this form of file access is the default, and the built-in open function will take an encoding parameter and always translate to/from Unicode strings (the default string object in Python 3) for files opened in text mode.
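A minimal Python 3 sketch of that, using a temporary file in place of 'path/to/file.txt' (the sample text and variable names are assumptions for the example):

```python
import os
import tempfile

text = "\u00a310 and \u20ac20"  # '£10 and €20'

# A throwaway file stands in for a real path.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Text mode with an explicit encoding: str in, UTF-8 bytes on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode shows what was actually stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading back in text mode with the same encoding yields a str again.
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()
finally:
    os.remove(path)

print(raw)            # b'\xc2\xa310 and \xe2\x82\xac20'
print(round_tripped)  # £10 and €20
```

Passing encoding explicitly is worth the extra typing, since the default otherwise depends on the platform and locale.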
                Can anyone explain why, when I encode the Euro symbol to utf8 as shown here, the result is only question marks? Here is an image of my Python, version 2.7.13. (I can encode other unicode objects like u"Klüft", but not the Euros?)
– Nate Anderson
                Apr 4, 2019 at 16:20
The file contains a unicode-escaped string:
\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0437\\u0430\\u0446\\u0438\\u044f .....\",
This is what worked for me:
 f = open("56ad62-json.log", encoding="utf-8")
 qq=f.readline() 
 print(qq)                          
 {"log":\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0440\\u0438\\u0437\\u0430\\u0446\\u0438\\u044f \\u043f\\u043e\\u043b\\u044c\\u0437\\u043e\\u0432\\u0430\\u0442\\u0435\\u043b\\u044f\"}
(qq.encode().decode("unicode-escape").encode().decode("unicode-escape")) 
# '{"log":"message": "Авторизация пользователя"}\n'
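For a single level of escaping, the decoding step can be sketched on its own as below. The sample value is a shortened, made-up fragment rather than the log line above. Encoding to ASCII first is a deliberate choice: the unicode_escape codec treats non-escaped bytes as Latin-1, so it is only safe when the input is pure ASCII.

```python
import json

# Literal backslash-u sequences, as they would appear inside a text file.
escaped = r"\u0410\u0432\u0442\u043e"

# Option 1: the unicode_escape codec. encode('ascii') raises if the input
# contains real non-ASCII characters, rather than silently mangling them.
decoded = escaped.encode("ascii").decode("unicode_escape")

# Option 2: let the JSON parser interpret the escapes. This assumes the
# value contains no unescaped double quotes or control characters.
via_json = json.loads('"' + escaped + '"')

print(decoded)   # Авто
print(via_json)  # Авто
```

If the whole line is valid JSON, parsing it with json.loads directly is usually simpler than decoding escapes by hand.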
Well, if you're willing/ready to switch to Python 3 (which you may not be due to the backwards incompatibility with some Python 2 code), you don't have to do any converting; all text in Python 3 is represented with Unicode strings, which also means that there's no more usage of the u'<text>' syntax. You also have what are, in effect, strings of bytes, which are used to represent data (which may be an encoded string).
http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
(Of course, if you're currently using Python 3, then the problem is likely something to do with how you're attempting to save the text to a file.)
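A short sketch of that text-versus-data split in Python 3 (the sample word is arbitrary):

```python
text = "caf\u00e9"           # str: a sequence of Unicode code points ('café')
data = text.encode("utf-8")  # bytes: one particular encoding of that text

print(type(text).__name__, len(text))  # str 4
print(type(data).__name__, len(data))  # bytes 5

# The two types do not mix implicitly; Python 3 raises instead of guessing.
mixing_error = None
try:
    text + data
except TypeError as exc:
    mixing_error = exc

# Going back from data to text is always an explicit decode.
restored = data.decode("utf-8")
print(restored == text)  # True
```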
                In Python 3 strings are Unicode strings. They are never encoded. I found the following text useful: joelonsoftware.com/articles/Unicode.html
– lutz
                Jul 30, 2009 at 16:14
                @lutz: Right, I'd forgotten that Unicode is a character map rather than an encoding.   @John: There isn't enough information at the moment to know what the problem with saving it is. Is he getting an error? Is he not getting any errors, but when opening the file externally he gets mojibake? Without that information, there are far too many possible solutions that could be provided.
– JAB
                Jul 30, 2009 at 16:24
                @Cat: There isn't any information at the moment to know what he's got, let alone what his saving problem is. I've asked him to provide some facts -- see my answer.
– John Machin
                Jul 30, 2009 at 16:35
There is a library called ftfy that can help with Unicode issues. It has made my life easier.
Example 1
import ftfy
print(ftfy.fix_text('uÌˆnicode'))
output -->
ünicode
Example 2 - UTF-8
import ftfy
print(ftfy.fix_text('\xe2\x80\xa2'))
output -->
Example 3 - Unicode code point
import ftfy
print(ftfy.fix_text(u'\u2026'))
output -->
https://ftfy.readthedocs.io/en/latest/
pip install ftfy
https://pypi.org/project/ftfy/
No answer worked for my case, where I had a string variable containing Unicode characters, and none of the encode/decode approaches explained here did the job.
If I do in a Terminal
echo "no me llama mucho la atenci\u00f3n"
python3
>>> print("no me llama mucho la atenci\u00f3n")
The output is correct:
output: no me llama mucho la atención
But working with scripts loading this string variable didn't work.
This is what worked in my case, in case it helps anybody:
import json
string_to_convert = "no me llama mucho la atenci\u00f3n"
print(json.dumps(json.loads(r'"%s"' % string_to_convert), ensure_ascii=False))
output: "no me llama mucho la atención"
import unicodedata
def unicode_to_ascii(note):
    str_map = {'Š' : 'S', 'š' : 's', 'Đ' : 'D', 'đ' : 'd', 'Ž' : 'Z', 'ž' : 'z', 'Č' : 'C', 'č' : 'c', 'Ć' : 'C', 'ć' : 'c', 'À' : 'A', 'Á' : 'A', 'Â' : 'A', 'Ã' : 'A', 'Ä' : 'A', 'Å' : 'A', 'Æ' : 'A', 'Ç' : 'C', 'È' : 'E', 'É' : 'E', 'Ê' : 'E', 'Ë' : 'E', 'Ì' : 'I', 'Í' : 'I', 'Î' : 'I', 'Ï' : 'I', 'Ñ' : 'N', 'Ò' : 'O', 'Ó' : 'O', 'Ô' : 'O', 'Õ' : 'O', 'Ö' : 'O', 'Ø' : 'O', 'Ù' : 'U', 'Ú' : 'U', 'Û' : 'U', 'Ü' : 'U', 'Ý' : 'Y', 'Þ' : 'B', 'ß' : 'Ss', 'à' : 'a', 'á' : 'a', 'â' : 'a', 'ã' : 'a', 'ä' : 'a', 'å' : 'a', 'æ' : 'a', 'ç' : 'c', 'è' : 'e', 'é' : 'e', 'ê' : 'e', 'ë' : 'e', 'ì' : 'i', 'í' : 'i', 'î' : 'i', 'ï' : 'i', 'ð' : 'o', 'ñ' : 'n', 'ò' : 'o', 'ó' : 'o', 'ô' : 'o', 'õ' : 'o', 'ö' : 'o', 'ø' : 'o', 'ù' : 'u', 'ú' : 'u', 'û' : 'u', 'ý' : 'y', 'ý' : 'y', 'þ' : 'b', 'ÿ' : 'y', 'Ŕ' : 'R', 'ŕ' : 'r'}
    for key, value in str_map.items():
        note = note.replace(key, value)
    asciidata = unicodedata.normalize('NFKD', note).encode('ascii', 'ignore')
    return asciidata.decode('UTF-8')
I have made the following function which lets you control what to keep according to the General_Category_Values in Unicode (https://www.unicode.org/reports/tr44/#General_Category_Values)
def FormatToNameList(name_str):
    import unicodedata
    clean_str = ''
    for c in name_str:
        if unicodedata.category(c) in ['Lu','Ll']:
            clean_str += c.lower()
            print('normal letter: ',c)
        elif unicodedata.category(c) in ['Lt','Lm','Lo']:
            clean_str += c
            print('special letter: ',c)
        elif unicodedata.category(c) in ['Nd']:
            clean_str += c
            print('normal number: ',c)
        elif unicodedata.category(c) in ['Nl','No']:
            clean_str += c
            print('special number: ',c)
        elif unicodedata.category(c) in ['Cc','Sm','Zs','Zl','Zp','Pc','Pd','Ps','Pe','Pi','Pf','Po']:
            clean_str += ' '
            print('space or symbol: ',c)
        else:
            print('other: ',' : ',c,' unicodedata.category: ',unicodedata.category(c))    
    name_list = clean_str.split(' ')
    return clean_str, name_list
if __name__ == '__main__':
     u = 'some3^?"Weirdstr '+ chr(231) + chr(0x0af4)
     [clean_str, name_list] = FormatToNameList(u)
     print(clean_str)
     print(name_list)
See also https://docs.python.org/3/howto/unicode.html