print u"This does work β".encode('utf-8')
It's as if the json package ignores the encoding option if stdout is not a terminal.
How can I get the json package to do what I want?
JSON is a text serialization format (that incidentally has a recommended binary encoding), not a binary serialization format. The json module itself only cares about encoding to the extent that it would like to know what Python 2's terrible str type is supposed to represent (is it ASCII bytes? UTF-8 bytes? latin-1 bytes?).
Since Python 2 text handling is, as stated, terrible, the json module is happy to return either str or unicode depending on circumstances: str when ensure_ascii is true, or when the stars otherwise align (it's convinced you've told it str is compatible with your expected encoding, and none of the inputs are actually unicode); unicode when ensure_ascii is false, most of the time.
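A quick illustration of that type-switching on Python 2.7 (a minimal sketch of the cases described above):

# -*- coding: utf-8 -*-
import json

print type(json.dumps({'k': u'β'}))                      # <type 'str'>, \u-escaped
print type(json.dumps({'k': u'β'}, ensure_ascii=False))  # <type 'unicode'>
print type(json.dumps({'k': 'v'}, ensure_ascii=False))   # <type 'str'>, all inputs ASCII str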
Like the rest of Python 2, sys.stdout is a bit wishy-washy. Even if its encoding is set to 'ascii' by your locale settings, it ignores that when you write a str to it (sys.stdout.write('\xe9') should fail, but instead it treats the str as pre-encoded raw binary data and doesn't bother to verify it matches the expected encoding). But when unicode comes in, it doesn't have that option; unicode is text (not UTF-8 text, not ASCII text, etc.), from the ideal text world of unicorns and rainbows, and that world isn't expressed in tawdry bytes.
So sys.stdout must encode the result, and it does so with the locale-determined encoding (sys.stdout.encoding will tell you what it is). When that's ASCII, and it receives something that can't be encoded to ASCII, it explodes (as it should).
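A minimal demonstration of both behaviors, assuming an ASCII locale such as LANG=C:

import sys

# str bytes pass through unchecked, even though they aren't valid ASCII:
sys.stdout.write('\xc3\xa9\n')   # the raw UTF-8 bytes of e-acute; no error

# unicode must be encoded first; with an ASCII encoding that fails:
try:
    sys.stdout.write(u'\xe9\n')  # e-acute as unicode
except UnicodeEncodeError as exc:
    sys.stderr.write('as expected: %s\n' % exc)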
The point is, the json module is always returning text (either unicode, or str that it's convinced is effectively text in the wishy-washy Python 2 world), and sometimes you get lucky and that text happens to be in a format that bypasses checks in sys.stdout.
But you shouldn't be relying on that. If your output must be in a specific encoding, use that encoding. The simplest way to do this (simplest in the sense that it pushes most of the work onto the interpreter) is to avoid sys.stdout (explicitly, or implicitly via print) and write your data to files opened with io.open (a backport of Python 3's open that properly handles encodings), explicitly specifying encoding='utf-8' (a sketch follows below). If you must use sys.stdout, and you insist on ignoring the locale encoding, you can rewrap it, e.g.:
import io, json, sys

with io.open(sys.stdout.fileno(), 'w', encoding='utf-8', closefd=False) as encodedout:
    # json.dumps yields text; unicode() also covers the all-ASCII case where it
    # returns str, since Python 2's io text layer only accepts unicode
    encodedout.write(unicode(json.dumps(x, ensure_ascii=False, encoding='utf-8')))
which temporarily wraps the stdout file descriptor in a modern file-like object (closefd=False keeps the underlying descriptor open when the wrapper is closed).
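The file-based approach mentioned earlier looks much the same. A minimal sketch, where output.json and the sample data are placeholders:

# -*- coding: utf-8 -*-
import io, json

x = {'message': u'This works β'}  # placeholder data

with io.open('output.json', 'w', encoding='utf-8') as f:
    # produce unicode and let io's text layer encode it to UTF-8 on write
    f.write(unicode(json.dumps(x, ensure_ascii=False)))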
TL;DR: Switch to Python 3. Python 2 is awful when it comes to non-ASCII text, and its modules are often even worse (json should really return one consistent type, or at least a single type for each setting of ensure_ascii, rather than choosing dynamically based on the inputs and encoding; and it's not even the worst offender, the csv module is absolutely awful). Python 2 has also reached end-of-life and will not be patched for anything from here on out, so continuing to use it leaves you vulnerable to any security problems found between the beginning of this year and the end of time. Among other things, Python 3 uses str exclusively for text (with the full Unicode support of Py2's unicode type), and modern Python 3 (3.7+) will coerce ASCII locales to UTF-8 (because basically all systems can actually handle the latter), which should fix all your problems: non-ASCII text will behave the same as ASCII text, and weirdo locales like yours that insist they're ASCII (and therefore won't handle non-ASCII output) will be "fixed" to work as you desire, without manually encoding and decoding, rewrapping file handles, etc.
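For comparison, the same thing on Python 3 (note the Python 3 syntax) needs no wrapping or manual encoding at all; a quick sketch:

# Python 3: json.dumps always returns str (real text); print and sys.stdout
# handle the encoding for you (UTF-8 even under a C locale on 3.7+)
import json
print(json.dumps({'message': 'This works β'}, ensure_ascii=False))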
Consolidating all the comments and answers into one final answer:
Note: this answer is for Python 2.7. Python 3 is likely to be different.
The JSON spec says that JSON files are UTF-8 encoded. However, the Python 2 json package doesn't like to take chances, so by default it writes pure ASCII, escaping any non-ASCII characters (\uXXXX) in the output.
You can set the ensure_ascii flag to False, in which case the json package will generate unicode output instead of str; encoding that unicode output is then your problem.
There is no way to make the json package generate UTF-8 or any other encoding on output. It's either ASCII or unicode; take your pick.
The encoding argument was a red herring: that option tells the json package how the input str objects are encoded.
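A short demonstration of those points on Python 2.7 (a minimal sketch):

# -*- coding: utf-8 -*-
import json

print repr(json.dumps(u'β'))                      # '"\\u03b2"'  (ASCII str, escaped)
print repr(json.dumps(u'β', ensure_ascii=False))  # u'"\u03b2"'  (unicode out)
# encoding= only says how input str bytes should be decoded:
print repr(json.dumps('\xce\xb2', encoding='utf-8'))  # '"\\u03b2"'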
Here's what finally worked for me:
import codecs, json, sys

# getwriter wraps stdout in a writer that UTF-8-encodes everything written to it
ofile = codecs.getwriter('utf-8')(sys.stdout)
json.dump(x, ofile, ensure_ascii=False)
tl;dr: the real mystery was why it didn't barf when output went straight to the terminal. It turned out that stdout.write() detects when output is a terminal and encodes per the $LANG environment variable; when output goes to a file, the unicode is encoded to ASCII instead, and an error results as soon as a non-encodable character is encountered.
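You can watch this happen by printing sys.stdout.encoding to stderr, with and without redirection (shown under a UTF-8 $LANG; the exact value depends on your locale):

$ python -c 'import sys; sys.stderr.write("%s\n" % sys.stdout.encoding)'
UTF-8
$ python -c 'import sys; sys.stderr.write("%s\n" % sys.stdout.encoding)' > /dev/null
None

When the encoding is None, unicode written to stdout falls back to the ASCII default, which is exactly what produced the error on redirection.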
There is an environment variable Python uses that can override the encoding it picks for stdout, whether output goes to the terminal or is redirected, so this should work without wrapping stdout inside the script:
$ export PYTHONIOENCODING=utf8
$ ./tester.py > output.json