I am reading mostly hex input into a Python 3 script. However, the system locale is set to UTF-8, and when piping from a Bash shell into the script I keep getting the following UnicodeDecodeError:

UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)

I'm using sys.stdin.read() in Python 3 to read the piped input, as suggested in other SO answers, like this:

import sys
isPipe = 0
if not sys.stdin.isatty() :
    isPipe = 1
    try:
        # Text mode: the piped bytes are decoded using the locale encoding (UTF-8 here)
        inpipe = sys.stdin.read().strip()
    except UnicodeDecodeError as e:
        err_unicode(e)

It works when piping this way (without -e, Bash's echo sends the backslash escapes as literal ASCII text):

# echo "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
<output all ok!>

However, sending the raw bytes with echo -en doesn't work:

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1"
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)

I also tried other promising SO answers:

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "open(1,'w').write(open(0).read())"
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "from io import open; open(1,'w').write(open(0).read())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

From what I have learned so far, when a UTF-8 decoder encounters a leading byte, it expects that byte to be followed by 1-3 continuation bytes, like this:

UTF-8 is a variable width character encoding capable of encoding all valid code points in Unicode using one to four 8-bit bytes. So a leading byte (the first byte of a multi-byte UTF-8 sequence, in the range 0xC2 - 0xF4) must be followed by 1-3 continuation bytes, in the range 0x80 - 0xBF.
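
To see this rule in action, here is a minimal sketch that feeds the example bytes from this question to Python's UTF-8 decoder. 0xED announces a three-byte sequence, but the next byte, 0xFF, lies outside the 0x80 - 0xBF continuation range, so decoding fails. The variable names are only illustrative.

```python
# 0xED is a leading byte that promises two continuation bytes (0x80-0xBF),
# but the byte that follows it, 0xFF, is not in that range.
data = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"

try:
    data.decode("utf-8")
    reason = None
except UnicodeDecodeError as e:
    reason = e.reason      # short description of the failure
    position = e.start     # index of the offending byte
    print(f"byte 0x{data[position]:02x} at position {position}: {reason}")

# For contrast, a well-formed three-byte sequence decodes fine:
# 0xE2 0x82 0xAC is the euro sign, U+20AC.
euro = b"\xe2\x82\xac".decode("utf-8")
```

The failure therefore comes from the decoding step and not from the pipe itself.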

However, I cannot always be sure where my input stream comes from, and it may very well be raw data and not the ASCII hex-encoded version shown above. So I need to deal with this raw input somehow.

I've looked at a few alternatives, like:

  • to use codecs.decode

  • to use open("myfile.jpg", "rb", buffering=0) with raw i/o

  • using bytes.decode(encoding="utf-8", errors="ignore") from bytes

  • or just using open(...)

    But I don't know if or how they could read a piped input stream, like I want.

    How can I make my script also handle a raw byte stream?

    PS. Yes, I have read loads of similar SO questions, but none of them adequately deals with this UTF-8 input error. The best one is this one.

    This is not a duplicate.

    It doesn’t matter if (some of) your input happens to be hexadecimal numbers. But by “raw” you mean arbitrary binary input, right? – Davis Herring Oct 27, 2018 at 21:57

    @DavisHerring Yes, binary. However, I don't agree that my question is a duplicate just because there may be an embedded answer remotely related to mine within it. The question (you linked), as formulated, is completely different from mine, and it's very unlikely anyone would search for those words when encountering my problem or error. – not2qubit Oct 28, 2018 at 5:42

    It's hardly "remotely related": that question concerns reading and writing binary data, but the first three sentences of the one answer answer this question entirely. And I found it by searching for terms related to this question, although I agree that its title is a bit lacking for a "canonical buffer question". – Davis Herring Oct 28, 2018 at 14:55

    I finally managed to work around this issue by not using sys.stdin!

    Instead I used with open(0, 'rb'), where:

  • 0 is the file descriptor of stdin.
  • 'rb' opens it for reading in binary mode.

    This seems to circumvent the problem of Python trying to decode the piped bytes using your locale's encoding. I got the idea after seeing that the following worked, and returned the correct (non-printable) characters:

    echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "with open(0, 'rb') as f: x=f.read(); import sys; sys.stdout.buffer.write(x);"
    

    So to correctly read any pipe data, I used:

    import sys

    if not sys.stdin.isatty() :
        try:
            # File descriptor 0 is stdin; 'rb' returns bytes without decoding
            with open(0, 'rb') as f:
                inpipe = f.read()
        except Exception as e:
            err_unknown(e)
        # This can't happen in binary mode:
        #except UnicodeDecodeError as e:
        #    err_unicode(e)
    

    That will read your pipe data into a Python bytes object.
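
As a side note, the text-mode sys.stdin also exposes its underlying binary stream as sys.stdin.buffer, which returns undecoded bytes in the same way. A minimal sketch, where read_binary is a hypothetical helper name and the optional stream parameter exists only so that it can be tried without a real pipe:

```python
import io
import sys

def read_binary(stream=None):
    """Read everything from a binary stream and return it as bytes.

    Hypothetical helper: defaults to sys.stdin.buffer, the binary layer
    underneath the text-mode sys.stdin, so no decoding takes place.
    """
    if stream is None:
        stream = sys.stdin.buffer
    return stream.read()

# Stand-in for a pipe, so the sketch runs without piped input.
fake_pipe = io.BytesIO(b"\xed\xff\xff\x0b\x04\x00\xa0\xe1")
inpipe = read_binary(fake_pipe)
print(inpipe)
```

In a real script, calling read_binary() with no argument would read the piped data from stdin.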

    The next problem was to determine whether the pipe data came from a character string (like echo "BADDATA0") or from a binary stream. The latter can be emulated with echo -ne "\xBA\xDD\xAT\xA0", as shown in the question. In my case I just used a regex to look for any byte that is not an ASCII hex digit or a space.

    import re

    if inpipe :
        # Matches any run of bytes that are neither hex digits nor spaces
        rx = re.compile(b'[^0-9a-fA-F ]+')
        r = rx.findall(inpipe.strip())
        if not r :
            print("is probably a HEX ASCII string")
        else:
            print("is something else, possibly binary")
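
Building on that check, here is one possible sketch of turning either kind of input into raw bytes. The helper name to_raw_bytes and the rule "hex digits and spaces only, with an even number of digits" are assumptions made for illustration. The check is only a heuristic, since binary data may happen to consist solely of hex characters.

```python
import re

# Only ASCII hex digits and spaces, the inverse of the regex used above.
HEX_ONLY = re.compile(rb'[0-9a-fA-F ]+')

def to_raw_bytes(data: bytes) -> bytes:
    """Return the bytes that piped input most likely represents.

    Hypothetical helper: if the input looks like an ASCII hex string,
    convert it with bytes.fromhex(); otherwise return it unchanged.
    """
    stripped = data.strip()
    if stripped and HEX_ONLY.fullmatch(stripped):
        digits = stripped.replace(b" ", b"").decode("ascii")
        if len(digits) % 2 == 0:       # fromhex() needs whole bytes
            return bytes.fromhex(digits)
    return data

print(to_raw_bytes(b"edffff0b0400a0e1\n"))   # hex text, converted
print(to_raw_bytes(b"\xed\xff\xff\x0b"))     # already binary, unchanged
```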
    

    Surely this could be done better and smarter. (Feel free to comment!)

    Addendum: (from here)

    mode is an optional string that specifies the mode in which the file is opened. It defaults to r which means open for reading in text mode. In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.) The default mode is 'r' (open for reading text, synonym of 'rt'). For binary read-write access, the mode w+b opens and truncates the file to 0 bytes. r+b opens the file without truncation.

    ... Python distinguishes between binary and text I/O. Files opened in binary mode (including b in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when t is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.

    If closefd is False and a file descriptor rather than a filename was given, the underlying file descriptor will be kept open when the file is closed. If a filename is given, closefd must be True (the default) otherwise an error will be raised.
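
The effect of closefd can be observed without touching the real stdin. The following sketch uses a throwaway pipe from os.pipe() as a stand-in; fd_is_open and read_pipe are hypothetical helper names used only for this demonstration.

```python
import os

def fd_is_open(fd):
    """Return True if the OS still considers fd an open descriptor."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False

def read_pipe(closefd):
    """Write a few bytes into a fresh pipe, read them back via open(fd, 'rb'),
    and report whether the read descriptor survived the with block."""
    read_fd, write_fd = os.pipe()
    os.write(write_fd, b"\xed\xff\x00")
    os.close(write_fd)                 # signal end of input to the reader
    with open(read_fd, "rb", closefd=closefd) as f:
        data = f.read()
    still_open = fd_is_open(read_fd)
    if still_open:
        os.close(read_fd)              # clean up the descriptor ourselves
    return data, still_open

print(read_pipe(closefd=False))
print(read_pipe(closefd=True))
```

With closefd=False the descriptor is still usable after the with block. With the default closefd=True it has been closed, which for descriptor 0 would mean stdin is no longer available to the rest of the script.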

    You should pass closefd=False into open so that the with statement doesn't close stdin when it finishes. Also, opening and reading a file in binary mode can't raise a UnicodeDecodeError. That gets thrown when a bytes object is decoded into a string, which occurs when you read a file as text (use open without 'b' and read the file) or when you use the bytes.decode function. – daz Oct 29, 2018 at 20:56

    @daz Yes, I see now that my first open() trials were not using the b flag. Also, using closefd=False doesn't seem to make any difference. So why do you think that's important? Then again, I haven't tried interrupting the flow from the input. – not2qubit Oct 29, 2018 at 23:02

    It probably won't matter for whatever script you are using. But without closefd=False, stdin is closed by the with statement, so you wouldn't be able to use it afterwards. The standard streams are always expected to be open. – daz Oct 30, 2018 at 0:49

    import sys

    with open(sys.stdin.fileno(), mode='rb', closefd=False) as stdin_binary:
        raw_input = stdin_binary.read()
    try:
        # text is the string formed by decoding raw_input as unicode
        text = raw_input.decode('utf-8')
    except UnicodeDecodeError:
        # raw_input is not valid unicode, do something else with it
        pass

    Adding some information to your answer, for example why it is better to use sys.stdin.buffer.raw over sys.stdin, would make this a much better answer. From review – Pranav Hosangadi Jul 8, 2020 at 20:32