I have a huge text file which I want to open. I'm reading the file in chunks to avoid memory issues from reading too much of it at once.
Code snippet:
import re

def open_delimited(fileName, args):
    with open(fileName, args, encoding="UTF16") as infile:
        chunksize = 10000
        remainder = ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.findall(r"(\d+)\s+(\d+_\d+)", remainder + chunk)
            for piece in pieces[:-1]:
                yield piece
            remainder = '{} {} '.format(*pieces[-1])
        if remainder:
            yield remainder
The code throws the error UnicodeDecodeError: 'utf16' codec can't decode bytes in position 8190-8191: unexpected end of data.
I tried UTF8 and got the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte.
latin-1 and iso-8859-1 raised the error IndexError: list index out of range.
A sample of the input file:
b'\xff\xfe1\x000\x000\x005\x009\x00\t\x001\x000\x000\x005\x009\x00_\x009\x007\x004\x007\x001\x007\x005\x003\x001\x000\x009\x001\x00\t\x00\t\x00P\x00o\x00s\x00t\x00\t\x001\x00\t\x00H\x00a\x00p\x00p\x00y\x00 \x00B\x00i\x00r\x00t\x00h\x00d\x00a\x00y\x00\t\x002\x000\x001\x001\x00-\x000\x008\x00-\x002\x004\x00 \x00'
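For what it's worth, the first two bytes of that sample look like a UTF-16 little-endian byte order mark, which would also explain why UTF8 fails on 0xff at position 0. A quick check on just the first few bytes of the sample (illustration only):

sample = b'\xff\xfe1\x000\x000\x005\x009\x00\t\x00'   # start of the line shown above
sample[:2]               # b'\xff\xfe', the UTF-16-LE byte order mark
sample.decode("utf16")   # '10059\t', decodes cleanly as UTF-16
sample.decode("utf8")    # UnicodeDecodeError: invalid start byte 0xff in position 0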
I will also mention that I have several of these huge text files. UTF16 works fine for many of them, but fails on this specific file.
Is there any way to resolve this issue?
To ignore the corrupted data (which can lead to data loss), set errors='ignore' on the open() call:
with open(fileName, args, encoding="UTF16", errors='ignore') as infile:
The open() function documentation states:
'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
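As a quick illustration (made-up bytes, not from your file) of what errors='ignore' does with an incomplete UTF-16 code unit:

b'\xff\xfeH\x00i\x00\xff'.decode('utf16')                    # UnicodeDecodeError: truncated data
b'\xff\xfeH\x00i\x00\xff'.decode('utf16', errors='ignore')   # 'Hi', the stray trailing byte is dropped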
This does not mean you can recover from the apparent data corruption you are experiencing.
To illustrate, imagine a byte was dropped or added somewhere in your file. UTF-16 is a codec that uses 2 bytes per character. If there is one byte missing or surplus then all byte-pairs following the missing or extra byte are going to be out of alignment.
That can lead to problems decoding further down the line, not necessarily immediately. There are some codepoints in UTF-16 that are illegal, but usually because they are only valid in combination with another byte-pair; your exception was thrown for such an invalid codepoint. But there may have been hundreds or thousands of byte-pairs preceding that point that were valid UTF-16, if not legible text.
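Here is a small sketch of that misalignment (an arbitrary string, purely for illustration):

data = "Happy Birthday".encode("utf-16-le")
data.decode("utf-16-le")         # 'Happy Birthday'
data[1:-1].decode("utf-16-le")   # '\u6100\u7000\u7000...', every pair shifted by one byte: valid UTF-16, but garbage
data[1:].decode("utf-16-le")     # UnicodeDecodeError: truncated data (odd number of bytes left)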
I was doing the same thing (reading many large text files in chunks) and ran into the same error with one of the files:
Traceback (most recent call last):
  File "wordcount.py", line 128, in <module>
    decodedtext = rawtext.decode('utf8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 9999999: unexpected end of data
Here's what I found: the problem was a particular byte sequence (\xc2\xa0\xc2\xa0, two UTF-8-encoded non-breaking spaces) spanning two chunks. The sequence was split at the chunk boundary and became undecodable. Here's how I solved it:
# read text
rawtext = file.read(chunksize)

# fix split end: keep reading one byte at a time until we hit a space,
# so a multi-byte sequence (or a word) is never cut at the chunk boundary
if chunknumber < totalchunks:
    while rawtext[-1] != ' ':
        rawtext = rawtext + file.read(1)

# decode text
decodedtext = rawtext.decode('utf8')
This also solves the more general problem of words being cut in half when they span two chunks.
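To see that failure mode in isolation (an illustrative snippet, not from my word-count script):

text = u'caf\xe9 bar'.encode('utf8')   # b'caf\xc3\xa9 bar', the accented character takes two bytes
text[:4].decode('utf8')                # UnicodeDecodeError: unexpected end of data (the pair is split after \xc3)
text[:5].decode('utf8')                # u'caf\xe9', fine once both bytes land in the same chunk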