Note: The possible duplicate concerns an older version of Python and this question has already generated unique answers.
I have been working on a script to process Project Gutenberg texts into an internal file format for an application I am developing. In the script I process chapter headings with the re module. This works very well except in one case: the first line. If the regex includes the ^ caret to anchor the match at the beginning of the line (example regex: ^Chapter), it always fails on a chapter marker on the first line, because the BOM is read as the first character.
What I've discovered is that if I do not include the caret, the regex does not fail on the first line, but then <feff> is included in the heading after I've processed it. An example:
<h1><feff>Chapter I</h1>
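The behaviour described above can be reproduced with a few lines. This is only a sketch: the sample bytes are made up for illustration, and it assumes the input is decoded as plain UTF-8, which keeps the BOM as the character U+FEFF.

```python
import re

# Hypothetical first line of a file that starts with a UTF-8 BOM (EF BB BF).
first_line = b"\xef\xbb\xbfChapter I\n".decode("utf-8")

# The anchored pattern fails because the line really starts with U+FEFF.
anchored = re.match(r"^Chapter", first_line)

# Without the caret the pattern matches, but the BOM is still in the text.
unanchored = re.search(r"Chapter", first_line)

print(repr(first_line[0]), anchored, unanchored is not None)
```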
The advice according to this SO question (from which I learned of the BOM) is to fix your script to not consume/corrupt the BOM. Other SO questions talk about decoding the file with a codec but discuss errors I never encounter and do not discuss the syntax for opening a file with the template decoder.
To be clear:
I generally use pipelines of the following format:
cat -s <filename> | <other scripts> | python <scriptname> [options] > <outfile>
And I am opening the file with the following syntax:
import sys
fin = sys.stdin
if '-i' in sys.argv: # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt')
for line in fin:
    ...Processing here...
My question is what is the proper way to handle this? Do I remove the BOM before processing the text? If so, how? Or do I use a decoder on the file before processing it (I am reading from stdin, so how would I accomplish this?)
The files are stored in UTF-8 encoding with DOS line endings (\r\n). I convert them in vim to UNIX file format before processing using set ff=unix (I have to do several manual pre-processing tasks before running the script).
As a complement to the existing answer, it is possible to filter the UTF-8 BOM from stdin with the codecs module. Simply use sys.stdin.buffer to access the underlying byte stream and decode it with a StreamReader:
import sys
import codecs
# trick to process sys.stdin with a custom encoding
fin = codecs.getreader('utf_8_sig')(sys.stdin.buffer, errors='replace')
if '-i' in sys.argv: # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt',
               encoding='utf_8_sig', errors='replace')
for line in fin:
    ...Processing here...
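To see what the StreamReader does without a real pipeline, here is a minimal sketch that swaps sys.stdin.buffer for an in-memory byte stream. The sample text and the variable names are assumptions made for illustration only.

```python
import codecs
import io
import re

# In-memory stand-in for sys.stdin.buffer (hypothetical sample data with a BOM).
raw = io.BytesIO("\ufeffChapter I\nIt was a dark night.\n".encode("utf-8"))

# Same call as in the answer, applied to the stand-in byte stream.
reader = codecs.getreader("utf_8_sig")(raw, errors="replace")

lines = list(reader)

# With the BOM stripped, the anchored pattern from the question matches.
first_line_matches = re.match(r"^Chapter", lines[0]) is not None
```

The same reader also accepts input that has no BOM at all, so it should be safe to use when only some of the files carry one.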
In Python 3, stdin is decoded automatically, but if that is not working for you (and for Python 2), set the PYTHONIOENCODING environment variable when invoking your script, like this:
PYTHONIOENCODING="UTF-8-SIG" python <scriptname> [options] > <outfile>
Notice that this setting also makes stdout use UTF-8-SIG, so your <outfile> will keep the original encoding.
For your -i parameter, just use open(path, 'rt', encoding="UTF-8-SIG").
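As a rough check of the open() variant, the sketch below writes sample bytes to a temporary file and reads the first line back with the utf-8-sig codec. The helper name read_first_line and the sample data are hypothetical.

```python
import os
import tempfile


def read_first_line(data: bytes) -> str:
    """Write data to a temporary file and read its first line as UTF-8-SIG."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        with open(path, "rt", encoding="utf-8-sig") as fin:
            return fin.readline()
    finally:
        os.remove(path)


# The codec strips a BOM when present and leaves BOM-less files untouched.
with_bom = read_first_line(b"\xef\xbb\xbfChapter I\n")
without_bom = read_first_line(b"Chapter I\n")
```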
You really don't need to import codecs or anything else to deal with this. As lenz suggested in the comments, just check for the BOM and throw it out.
for line in fin:  # fin is the input stream from the question (sys.stdin or an opened file)
    if line.startswith("\ufeff"):
        line = line[1:] # trim the BOM away
    # the rest of your code goes here as usual
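Since a BOM can only appear at the very start of the stream, the check can be limited to the first line. A possible sketch, wrapped in a generator so it works for both stdin and an opened file; the name strip_bom is hypothetical:

```python
def strip_bom(lines):
    """Yield lines unchanged, except that a BOM leading the first line is dropped."""
    for index, line in enumerate(lines):
        if index == 0 and line.startswith("\ufeff"):
            line = line[1:]
        yield line


# Usage with any iterable of decoded lines, for example sys.stdin.
sample = ["\ufeffChapter I\n", "Some text\n"]
cleaned = list(strip_bom(sample))
```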
In Python 3.9 the default encoding for standard input seems to be utf-8, at least on Linux:
In [2]: import sys
In [3]: sys.stdin
Out[3]: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
sys.stdin has the method reconfigure() (available since Python 3.7), which takes the encoding as a keyword argument:
sys.stdin.reconfigure(encoding="utf-8-sig")
This should be called before any attempt to read standard input. The utf-8-sig codec consumes the BOM while decoding, so it no longer appears when reading sys.stdin.
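The effect of reconfigure() can be tried without touching the real standard input. In this sketch an io.TextIOWrapper over an in-memory byte stream stands in for sys.stdin; the sample bytes are made up, and the encoding is passed by keyword because reconfigure() does not accept positional arguments.

```python
import io

# Stand-in for standard input: a UTF-8 text stream whose bytes start with a BOM.
raw = io.BytesIO(b"\xef\xbb\xbfChapter I\n")
stream = io.TextIOWrapper(raw, encoding="utf-8")

# Switch codecs before the first read, as the answer recommends for sys.stdin.
stream.reconfigure(encoding="utf-8-sig")

first_line = stream.readline()
```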