I'm writing a Python program which upper-cases all input (a replacement for the non-working tr '[:lower:]' '[:upper:]'). The locale is ru_RU.UTF-8 and I use PYTHONIOENCODING=UTF-8 to set the STDIN/STDOUT encodings. This correctly sets sys.stdin.encoding. So why do I still need to explicitly create a decoding wrapper if sys.stdin already knows the encoding? If I don't create the wrapping reader, the .upper() function doesn't work correctly (it does nothing for non-ASCII characters).

import sys, codecs
sys.stdin = codecs.getreader(sys.stdin.encoding)(sys.stdin) #Why do I need this?
for line in sys.stdin:
    sys.stdout.write(line.upper())

Why does stdin have .encoding if it doesn't use it?

@Ark-kun Because Python 2.x uses bytes for strings... So you need to convert to unicode (using decode) so "upper" can work beyond the ASCII range. Using Python 3.x should not show this problem because all strings are unicode – JBernardo Apr 3, 2013 at 3:31

@JBernardo Data which is not used by the class methods shouldn't be in a class. It's as if stdin had a .currentphaseofmoon or .numberoffilesondisk property. My point is that this property does nothing (if this is the case) and is useless and confusing. To compare, .Net's Stream class (byte-based) doesn't have the .Encoding property - only the StreamReader has it. – Ark-kun Apr 3, 2013 at 4:44

@Ark-kun The property by itself does nothing. But it is there because you may need it if you want to convert the data to text. Just because you don't use something all the time, or because it is not automatic, doesn't mean it is not useful. If the incoming data were binary and it tried to convert to unicode automatically, you would be much more pissed... – JBernardo Apr 3, 2013 at 4:47

Those are very different things. If the process is printing into a terminal, it will try to discover the encoding used by it -- and you can configure your terminal to use any encoding you want (Python gets that information, but .Net may not). The locale module uses system-wide information – JBernardo Apr 3, 2013 at 8:13

To answer "why", we need to understand Python 2.x's built-in file type, file.encoding, and their relationship.

The built-in file object deals with raw bytes: it always reads and writes raw bytes.

The encoding attribute describes the encoding of the raw bytes in the stream. This attribute may or may not be present, and may not even be reliable (e.g. if PYTHONIOENCODING is set incorrectly for the standard streams).

The only time a file object performs any automatic conversion is when a unicode object is written to the stream. In that case it uses file.encoding, if available, to perform the conversion.

In the case of reading data, the file object will not do any conversion because it returns raw bytes. The encoding attribute in this case is a hint for the user to perform conversions manually.
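As a rough illustration of that manual conversion, here is a minimal sketch written for Python 3, where bytes plays the role of Python 2's str. The sample text and the encoding_hint variable are assumptions made for the example; they stand in for data read from the stream and for sys.stdin.encoding.

```python
# Raw bytes as they would arrive from a byte stream (UTF-8 encoded Cyrillic).
raw = "привет\n".encode("utf-8")
encoding_hint = "utf-8"  # hypothetical stand-in for sys.stdin.encoding

# Upper-casing the bytes only touches ASCII letters, so nothing changes here.
unchanged = raw.upper()

# Decoding first yields text, and upper() then works on every letter.
text = raw.decode(encoding_hint)
shouted = text.upper()
```

The encoding hint is only consulted in the explicit decode call; the byte-level upper() never looks at it, which matches the behaviour described in the question.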

In your case file.encoding is set because you set the PYTHONIOENCODING variable, and sys.stdin's encoding attribute was set accordingly. To get a text stream, we have to wrap the byte stream manually, as you have done in your example code.
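The wrapping step can be sketched without a real stdin. In the example below an in-memory io.BytesIO object stands in for the byte-oriented sys.stdin (an assumption for illustration), and the encoding is hard-coded where the question's code uses sys.stdin.encoding.

```python
import codecs
import io

# An in-memory byte stream stands in for the byte-oriented sys.stdin.
byte_stream = io.BytesIO("привет\nмир\n".encode("utf-8"))

# codecs.getreader returns a StreamReader class for the named encoding;
# calling it wraps the byte stream so that every read returns decoded text.
text_stream = codecs.getreader("utf-8")(byte_stream)

# Iterating the wrapper yields text lines, so upper() handles non-ASCII too.
upper_lines = [line.upper() for line in text_stream]
```

The wrapper is the object that actually uses the encoding; the underlying byte stream is unchanged and still returns raw bytes if read directly.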

To think about it another way, imagine that we didn't have a separate text type (like Python 2.x's unicode or Python 3's str). We could still work with text by using raw bytes and keeping track of the encoding ourselves. This is roughly how file.encoding is meant to be used: as a record of the encoding. The reader wrappers that we create do the tracking and the conversions for us automatically.

Of course, automatically wrapping sys.stdin would be nicer (and that is what Python 3.x does), but changing the default behaviour of sys.stdin in Python 2.x would break backwards compatibility.
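For comparison, a Python 3 version of the upper-casing program needs no explicit wrapper. The sketch below is one possible shape for it; the function name upper_case_stream and the in-memory demo streams are hypothetical, chosen so the example runs without real standard streams.

```python
import io

def upper_case_stream(source, sink):
    """Copy text lines from source to sink, upper-casing each one."""
    for line in source:
        sink.write(line.upper())

# In-memory text streams stand in for sys.stdin and sys.stdout here.
demo_in = io.StringIO("привет\nhello\n")
demo_out = io.StringIO()
upper_case_stream(demo_in, demo_out)
result = demo_out.getvalue()

# As a script, the call would be: upper_case_stream(sys.stdin, sys.stdout)
```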

The following is a comparison of sys.stdin in Python 2.x and 3.x:

# Python 2.7.4
>>> import sys
>>> type(sys.stdin)
<type 'file'>
>>> sys.stdin.encoding
'UTF-8'
>>> w = sys.stdin.readline()
## ... type stuff - enter
>>> type(w)
<type 'str'>           # In Python 2.x str is just raw bytes
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')

The io.TextIOWrapper class has been part of the standard library since Python 2.6. It has an encoding attribute that it uses to convert raw bytes to and from Unicode. In Python 3.x, sys.stdin is an instance of this class:

# Python 3.3.1
>>> import sys
>>> type(sys.stdin)
<class '_io.TextIOWrapper'>
>>> sys.stdin.encoding
'UTF-8'
>>> w = sys.stdin.readline()
## ... type stuff - enter
>>> type(w)
<class 'str'>        # In Python 3.x str is Unicode
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')

The buffer attribute provides access to the raw byte stream backing stdin; this is usually a BufferedReader. Note below that it does not have an encoding attribute.

# Python 3.3.1 again
>>> type(sys.stdin.buffer)
<class '_io.BufferedReader'>
>>> w = sys.stdin.buffer.readline()
## ... type stuff - enter
>>> type(w)
<class 'bytes'>      # bytes is (kind of) equivalent to Python 2 str
>>> sys.stdin.buffer.encoding
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: '_io.BufferedReader' object has no attribute 'encoding'

In Python 3 the presence or absence of the encoding attribute is consistent with the type of stream used.
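The same layering can be reproduced by hand. This is a small sketch, assuming an in-memory io.BytesIO as the byte stream in place of a real stdin; it shows that the text wrapper carries the encoding while the byte stream underneath does not.

```python
import io

# Byte layer: knows nothing about encodings.
byte_stream = io.BytesIO("привет\n".encode("utf-8"))

# Text layer: owns the encoding and decodes on read.
text_stream = io.TextIOWrapper(byte_stream, encoding="utf-8")

line = text_stream.readline()          # a str, already decoded
wrapper_encoding = text_stream.encoding

# .buffer exposes the wrapped byte stream, which has no encoding attribute.
bytes_has_encoding = hasattr(text_stream.buffer, "encoding")
```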

Thanks for the answer. It confirmed my belief that Python 2's design was bad. Objects (unless they are simple data-storage-only structures) shouldn't contain data that they don't use. The file class should either use its .encoding property or remove it. This design flaw was fixed in Python 3. .Net handles this the same way: there are byte-based Streams (you can only write bytes to them) and encoding-aware TextReader/TextWriter-derived wrapper classes. You can use StreamReader.BaseStream to gain access to the underlying bytes. Console.In is a TextReader (encoding-aware). – Ark-kun Jul 6, 2013 at 20:07

Good to know that Python 3 also seems to have got rid of the non-Unicode strings (though there are still things like BufferedReader.readline()). – Ark-kun Jul 6, 2013 at 20:25
