I am reading mainly HEX input into a Python 3 script. However, the system is set to use UTF-8, and when piping from a Bash shell into the script, I keep getting the following UnicodeDecodeError:
UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)
I'm using sys.stdin.read() in Python 3 to read the piped input, following other SO answers, like this:

import sys

isPipe = False
if not sys.stdin.isatty():
    isPipe = True
    try:
        inpipe = sys.stdin.read().strip()
    except UnicodeDecodeError as e:
        err_unicode(e)
It works when piping this way:
# echo "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
<output all ok!>
However, using the raw format doesn't:
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1"
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)
I also tried other promising SO answers:
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "open(1,'w').write(open(0).read())"
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "from io import open; open(1,'w').write(open(0).read())"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
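(A likely explanation for why the first echo worked, sketched here as my own note: without -e, Bash's echo passes the backslash sequences through literally as ASCII text, which decodes fine; with -en the shell expands them into raw bytes that are not valid UTF-8.)

```python
# Without -e, echo sends the escape sequences literally as ASCII text;
# with -en, the shell expands them into raw bytes that are not valid UTF-8.
literal = rb"\xed\xff\xff\x0b"   # bytes of the literal text `\xed\xff\xff\x0b`
raw = b"\xed\xff\xff\x0b"        # the four raw bytes 0xED 0xFF 0xFF 0x0B

text = literal.decode("utf-8")   # succeeds: plain ASCII
try:
    raw.decode("utf-8")
    raw_ok = True
except UnicodeDecodeError:
    raw_ok = False

print(text, raw_ok)
```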
From what I have learned so far, when the UTF-8 decoder encounters a lead byte, it expects it to be followed by 1-3 continuation bytes:

UTF-8 is a variable-width character encoding capable of encoding all valid code points in Unicode using one to four 8-bit bytes.

So any lead byte of a multi-byte sequence (in the range 0xC2 - 0xF4) must be followed by 1-3 continuation bytes, each in the range 0x80 - 0xBF.
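That rule can be checked directly in Python; here the lead byte 0xED is followed by 0xFF, which is outside the continuation range (a minimal sketch):

```python
# 0xED is a UTF-8 lead byte expecting continuation bytes in 0x80-0xBF;
# 0xFF is outside that range, so decoding fails.
data = b"\xed\xff\xff"
try:
    data.decode("utf-8")
    reason = None
except UnicodeDecodeError as e:
    reason = e.reason

print(reason)
```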
However, I cannot always be sure where my input stream comes from, and it may very well be raw data rather than the ASCII-hex version above. So I need to deal with this raw input somehow.
I've looked at a few alternatives, like:

- using codecs.decode
- using open("myfile.jpg", "rb", buffering=0) with raw I/O
- using bytes.decode(encoding="utf-8", errors="ignore") on bytes
- or just using open(...)

But I don't know if or how they could read a piped input stream, like I want.
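For reference, the errors= handlers from the third option behave like this once you already have the bytes (a sketch using the example bytes above):

```python
raw = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"

# "ignore" drops undecodable bytes; "replace" substitutes U+FFFD for them.
ignored = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```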
How can I make my script handle also a raw byte stream?
PS. Yes, I have read loads of similar SO questions, but none of them adequately deal with this UTF-8 input error. The best one is this one.
This is not a duplicate.
I finally managed to work around this issue by not using sys.stdin! Instead I used open(0, 'rb'), where:

- 0 is the file descriptor for stdin.
- 'rb' opens it in binary mode for reading.
This seems to circumvent the system's attempt to interpret your locale characters in the pipe. I got the idea after seeing that the following worked, and returned the correct (non-printable) characters:
echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "with open(0, 'rb') as f: x=f.read(); import sys; sys.stdout.buffer.write(x);"
So to correctly read any pipe data, I used:
if not sys.stdin.isatty():
    try:
        with open(0, 'rb') as f:
            inpipe = f.read()
    except Exception as e:
        err_unknown(e)
    # This can't happen in binary mode:
    # except UnicodeDecodeError as e:
    #     err_unicode(e)
That will read your pipe data into a python byte string.
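For completeness, sys.stdin.buffer is the documented binary layer underneath the text wrapper and reads the same bytes; a sketch (with an io.BytesIO stand-in so it can run without a real pipe):

```python
import io
import sys

def read_pipe_bytes(stream=None):
    """Read all raw bytes from binary stdin (or a supplied binary stream)."""
    stream = stream if stream is not None else sys.stdin.buffer
    return stream.read()

# Stand-in for a pipe carrying raw, non-UTF-8 bytes:
fake_pipe = io.BytesIO(b"\xed\xff\xff\x0b")
data = read_pipe_bytes(fake_pipe)
print(data)
```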
The next problem was to determine whether the pipe data was coming from a character string (like echo "BADDATA0") or from a binary stream. The latter can be emulated by echo -ne "\xBA\xDD\xAT\xA0", as shown in the question. In my case I just used a regex to look for out-of-bounds, non-hex characters.
import re

if inpipe:
    rx = re.compile(b'[^0-9a-fA-F ]+')
    r = rx.findall(inpipe.strip())
    if r == []:
        print("is probably a HEX ASCII string")
    else:
        print("is something else, possibly binary")
Surely this could be done better and smarter. (Feel free to comment!)
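One smarter variant (my suggestion, not part of the original answer) is to let bytes.fromhex do the validation: it accepts only hex digits and ASCII whitespace, so anything else gets classified as binary:

```python
def classify(pipe_bytes: bytes) -> str:
    """Heuristically classify piped bytes as hex-ASCII text or raw binary."""
    try:
        # bytes.fromhex accepts only hex digit pairs and ASCII whitespace;
        # non-ASCII bytes fail the decode, non-hex text fails fromhex.
        bytes.fromhex(pipe_bytes.strip().decode("ascii"))
        return "hex"
    except (UnicodeDecodeError, ValueError):
        return "binary"

print(classify(b"ed ff ff 0b 04 00 a0 e1"))  # hex-string style input
print(classify(b"\xed\xff\xff\x0b"))         # raw binary input
```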
Addendum: (from here)

mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r', which means open for reading in text mode. In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.) The default mode is 'r' (open for reading text, synonym of 'rt'). For binary read-write access, the mode 'w+b' opens and truncates the file to 0 bytes. 'r+b' opens the file without truncation.

... Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.

If closefd is False and a file descriptor rather than a filename was given, the underlying file descriptor will be kept open when the file is closed. If a filename is given, closefd must be True (the default), otherwise an error will be raised.
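The text/binary distinction quoted above can be seen directly: the same file yields bytes in 'rb' mode and str in 'r' mode (a small sketch using a temp file):

```python
import os
import tempfile

# Write three bytes, then read them back in binary mode and in text mode.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"abc")

with open(path, "rb") as f:
    binary_contents = f.read()   # bytes, no decoding applied
with open(path, "r") as f:
    text_contents = f.read()     # str, decoded with the locale encoding

os.remove(path)
print(type(binary_contents).__name__, type(text_contents).__name__)
```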
import sys

with open(sys.stdin.fileno(), mode='rb', closefd=False) as stdin_binary:
    raw_input = stdin_binary.read()

try:
    # text is the string formed by decoding raw_input as unicode
    text = raw_input.decode('utf-8')
except UnicodeDecodeError:
    # raw_input is not valid unicode, do something else with it
    pass