
Note: The possible duplicate concerns an older version of Python and this question has already generated unique answers.

I have been working on a script to process Project Gutenberg texts into an internal file format for an application I am developing. In the script I process chapter headings with the re module. This works very well except in one case: the first line. My regex will always fail on the first Chapter marker on the first line if it includes the ^ caret to anchor the match to the beginning of the line, because the BOM is consumed as the first character of that line. (Example regex: ^Chapter ).

What I've discovered is that if I do not include the caret, it won't fail on the first line, and then <feff> is included in the heading after I've processed it. An example:

<h1><feff>Chapter I</h1>

The advice according to this SO question (from which I learned of the BOM) is to fix your script so it does not consume/corrupt the BOM. Other SO questions talk about decoding the file with a codec, but they discuss errors I never encounter and do not show the syntax for opening a file with the desired decoder.

To be clear:

I generally use pipelines of the following format:

cat -s <filename> | <other scripts> | python <scriptname> [options] > <outfile>

And I am opening the file with the following syntax:

import sys
fin = sys.stdin
if '-i' in sys.argv: # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt')
for line in fin:
    ...Processing here...

My question is: what is the proper way to handle this? Do I remove the BOM before processing the text? If so, how? Or do I use a decoder on the file before processing it (and since I am reading from stdin, how would I accomplish that)?

The files are stored in UTF-8 encoding with DOS endings (\r\n). I convert them in vim to UNIX file format before processing using set ff=unix (I have to do several manual pre-processing tasks before running the script).
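A minimal reproduction of the failure described above, using in-memory bytes in place of a real Gutenberg file (the chapter text here is made up): decoding with plain UTF-8 keeps U+FEFF at the start of the first line, so the anchored regex misses the first chapter heading.

```python
import re

raw = 'Chapter I\nChapter II\n'.encode('utf-8-sig')  # bytes begin with EF BB BF
first_line = raw.decode('utf-8').splitlines()[0]     # decoded as plain UTF-8

print(repr(first_line))                    # '\ufeffChapter I'
print(re.match(r'^Chapter ', first_line))  # None -- the anchored match fails
```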

Hmm, fin = sys.argv[sys.argv.index('-i') + 1] should give you a filename in fin. It should then be opened with an open call that you have not shown, and that is the place where you could declare that you want to filter the BOM out. Could you please show your open instruction? – Serge Ballesta Jul 23, 2018 at 14:04

@Serge I apologize. I typed it from memory and forgot to include the open. However, I mostly use sys.stdin because I've been using it in pipelines. I would especially like to know how to declare it with sys.stdin. – mas Jul 23, 2018 at 15:21

Python 3 should transparently normalize the line endings with text files (Python 2 had 'Ur' for opening a file for reading with line-ending normalization). The gist of the proposed duplicate is to use the utf-8-sig encoding when opening the file to transparently ignore the BOM, too. – tripleee Jul 23, 2018 at 15:32

If you are preprocessing the files anyway, it might be easiest to chop it off in that process. Check the first character and remove it if it is the "zero-width non-breaking space". – lenz Jul 23, 2018 at 15:36

As a complement to the existing answer, it is possible to filter the UTF-8 BOM from stdin with the codecs module. You simply use sys.stdin.buffer to access the underlying byte stream and decode it with a StreamReader:

import sys
import codecs
# trick to process sys.stdin with a custom encoding
fin = codecs.getreader('utf_8_sig')(sys.stdin.buffer, errors='replace')
if '-i' in sys.argv: # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt',
               encoding='utf_8_sig', errors='replace')
for line in fin:
    ...Processing here...
This actually seems to be the most elegant solution, as it seems to me to be more portable than the other solutions. How will it handle non-UTF-8-encoded scripts? Will it choke? – mas Jul 23, 2018 at 16:17

My comment seems to be due to a misunderstanding of BOMs and character encoding. After perusing this Unix.SE discussion as well as this Quora question, I have come to the conclusion that I will probably never need to worry about the BOM except to remove it, and I therefore take this answer as the final, most elegant and portable solution. – mas Jul 23, 2018 at 17:08
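The StreamReader approach from this answer can be exercised without a terminal by substituting an in-memory byte stream for sys.stdin.buffer (the sample bytes are made up):

```python
import codecs
import io

# Stand-in for sys.stdin.buffer: raw UTF-8 bytes carrying a BOM.
fake_stdin = io.BytesIO(b'\xef\xbb\xbfChapter I\n')

# Wrap the byte stream in a StreamReader for the utf_8_sig codec.
reader = codecs.getreader('utf_8_sig')(fake_stdin, errors='replace')

print(repr(reader.readline()))  # 'Chapter I\n' -- BOM filtered out
```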

In Python 3, stdin should be auto-decoded properly, but if it's not working for you (and for Python 2) you need to set the PYTHONIOENCODING environment variable before invoking your script, like

PYTHONIOENCODING="UTF-8-SIG" python <scriptname> [options] > <outfile>

Notice that this setting also makes stdout work with UTF-8-SIG, so your <outfile> will maintain the original encoding.

For your -i parameter, just do open(path, 'rt', encoding="UTF-8-SIG")
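To see why the UTF-8-SIG codec solves the problem in both directions, a quick check of its decode/encode behaviour (sample bytes are made up): it strips a leading BOM on decode and prepends one on encode, which is also why stdout would keep the BOM under this setting.

```python
data = b'\xef\xbb\xbfChapter I\n'      # UTF-8 bytes with a leading BOM

print(repr(data.decode('utf-8')))      # '\ufeffChapter I\n' -- BOM survives
print(repr(data.decode('utf-8-sig')))  # 'Chapter I\n' -- BOM removed
print('x'.encode('utf-8-sig'))         # b'\xef\xbb\xbfx' -- BOM added back
```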

Can I just export an environment variable, or does your solution require PYTHONIOENCODING="UTF-8-SIG" to be declared while I'm running the script? – mas Jul 23, 2018 at 15:41

If you set it earlier in the script, it will remain set for the duration of the script (unless you explicitly unset or change it, of course). You need to export it so it's visible to subprocesses such as Python. – tripleee Jul 23, 2018 at 15:46

@tripleee: When you say "set it earlier in the script", is the script you are referring to my Python script or my pipeline command? I am curious because if I can just write this into my Python script toward the top, that might be the simplest solution. – mas Jul 23, 2018 at 15:53

The shell script containing the pipeline. If the Python script is simple you can embed it in the shell script, but I would probably look into doing the preprocessing in Python too instead. – tripleee Jul 23, 2018 at 15:55

You really don't need to import codecs or anything else to deal with this. As lenz suggested in the comments, just check for the BOM and throw it out.

for line in fin:  # fin is your input file object (e.g. sys.stdin)
    if line.startswith('\ufeff'):
        line = line[1:]  # trim the BOM away
    # the rest of your code goes here as usual

(Using startswith rather than line[0] avoids an IndexError if a line is ever empty.)

In Python 3.9 the default encoding for standard input seems to be utf-8, at least on Linux:

In [2]: import sys
In [3]: sys.stdin
Out[3]: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>

sys.stdin has the method reconfigure() (note that encoding is a keyword-only argument):

sys.stdin.reconfigure(encoding="utf-8-sig")

which should be called before any attempt to read the standard input. This will decode the BOM, which will no longer appear when reading sys.stdin.
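A self-contained check of reconfigure(), assuming Python 3.7 or later (where TextIOWrapper.reconfigure was added), using an in-memory stream in place of real stdin; since sys.stdin is the same TextIOWrapper type, the behaviour matches.

```python
import io

# A text wrapper over raw bytes that carry a BOM, initially plain UTF-8.
stream = io.TextIOWrapper(io.BytesIO(b'\xef\xbb\xbfChapter I\n'),
                          encoding='utf-8')
stream.reconfigure(encoding='utf-8-sig')  # must happen before any read

print(repr(stream.readline()))  # 'Chapter I\n' -- the BOM is gone
```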
