I have a huge text file that I want to open.
I'm reading the file in chunks to avoid the memory issues that come from reading too much of it at once.

code snippet:

import re

def open_delimited(fileName, args):
    with open(fileName, args, encoding="UTF16") as infile:
        chunksize = 10000
        remainder = ''
        # read the file in fixed-size chunks until read() returns ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.findall(r"(\d+)\s+(\d+_\d+)", remainder + chunk)
            # yield every complete match; the last one may be cut off by the
            # chunk boundary, so keep it as the remainder for the next pass
            for piece in pieces[:-1]:
                yield piece
            remainder = '{} {} '.format(*pieces[-1])
        if remainder:
            yield remainder

The code throws the error UnicodeDecodeError: 'utf16' codec can't decode bytes in position 8190-8191: unexpected end of data.

I tried UTF8 and got the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte.

Trying latin-1 and iso-8859-1 raised the error IndexError: list index out of range.

A sample of the input file:

b'\xff\xfe1\x000\x000\x005\x009\x00\t\x001\x000\x000\x005\x009\x00_\x009\x007\x004\x007\x001\x007\x005\x003\x001\x000\x009\x001\x00\t\x00\t\x00P\x00o\x00s\x00t\x00\t\x001\x00\t\x00H\x00a\x00p\x00p\x00y\x00 \x00B\x00i\x00r\x00t\x00h\x00d\x00a\x00y\x00\t\x002\x000\x001\x001\x00-\x000\x008\x00-\x002\x004\x00 \x00'

I will also mention that I have several of these huge text files.
UTF16 works fine for many of them, but fails on one specific file.

Is there any way to resolve this issue?

If your inputfile is UTF-16 (albeit truncated), then Latin1 or UTF-8 will certainly not work. – Martijn Pieters Aug 21, 2013 at 12:41

Can we see a sample of your inputfile? Then at least we can take a stab at guessing the encoding used. Read the file as binary, and print that. print(open(fileName, 'rb').read(120)) should give us enough to work with. – Martijn Pieters Aug 21, 2013 at 12:43

That is most definitely UTF16. If that data is corrupted somewhere, there is little we can do to fix that. You could try a different chunk size; perhaps there is a bug in TextIOWrapper.read() where it ends up with a partial read of a surrogate pair. I recommend a power of 2. 16384 is 2**14, for example. – Martijn Pieters Aug 21, 2013 at 13:02
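
Putting those comments into practice, a quick sanity check might look like this (a sketch; the file path is a placeholder):

# Peek at the first raw bytes to guess the encoding, as suggested above.
fileName = "huge_file.txt"   # placeholder path
with open(fileName, 'rb') as f:
    print(f.read(120))       # b'\xff\xfe...' is the UTF-16 little-endian BOM

# Retry the chunked read with a power-of-two chunk size, e.g. 2**14.
chunksize = 16384

That leading 0xff byte of the byte order mark is also why decoding as UTF-8 fails immediately with "invalid start byte" at position 0.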

To ignore corrupted data (which can lead to data loss), set errors='ignore' on the open() call:

with open(fileName, args, encoding="UTF16", errors='ignore') as infile:

The open() function documentation states:

  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.

This does not mean you can recover from the apparent data corruption you are experiencing.

To illustrate, imagine a byte was dropped or added somewhere in your file. UTF-16 is a codec that uses 2 bytes per character. If there is one byte missing or surplus, then all byte-pairs following the missing or extra byte are going to be out of alignment.

That can lead to problems decoding further down the line, not necessarily immediately. There are some codepoints in UTF-16 that are illegal, but usually only because they are used in combination with another byte-pair; your exception was thrown for such an invalid codepoint. But there may have been hundreds or thousands of byte-pairs preceding that point that were valid UTF-16, if not legible text.
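
A minimal sketch of that misalignment, using a plain ASCII string rather than the file from the question:

text = "Happy Birthday"
data = text.encode("utf-16-le")     # 2 bytes per character, no BOM
broken = data[:5] + data[6:]        # simulate one dropped byte
# Everything before the dropped byte still decodes fine; every byte-pair
# after it is shifted by one and comes out as nonsense characters.
print(broken.decode("utf-16-le", errors="replace"))

With strict error handling, such a misaligned stream can keep decoding into garbage for a long stretch before it happens to hit an illegal codepoint, which is why the exception can surface far from the actual corruption.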

I tried errors='ignore' and now get remainder = '{} {} '.format(*pieces[-1]) IndexError: list index out of range – Presen Aug 21, 2013 at 14:45

Right, because now you are apparently ending up with a chunk where re.findall() returns no matches at all. That is the risk of ignoring invalid characters; if one byte is missing in your file, the UTF-16 decoding may be unreadable now. It is effectively not detectable which byte is missing in that case, and the exception you saw could be well past the actual file corruption. – Martijn Pieters Aug 21, 2013 at 14:54
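
If you want the generator from the question to survive such a chunk instead of crashing, one option is to guard against an empty match list; a sketch (it skips past unreadable stretches, it does not repair them):

import re

def open_delimited(fileName, args):
    with open(fileName, args, encoding="UTF16", errors="ignore") as infile:
        chunksize = 10000
        remainder = ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.findall(r"(\d+)\s+(\d+_\d+)", remainder + chunk)
            if not pieces:
                # nothing matched in this chunk (possibly mangled by the
                # ignored bytes); carry the text over and try the next chunk
                remainder = remainder + chunk
                continue
            for piece in pieces[:-1]:
                yield piece
            remainder = '{} {} '.format(*pieces[-1])
        if remainder:
            yield remainder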

I was doing the same thing (reading many large text files in chunks) and ran into the same error with one of the files:

Traceback (most recent call last):
  File "wordcount.py", line 128, in <module>
    decodedtext = rawtext.decode('utf8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 9999999: unexpected end of data

Here's what I found: the problem was a particular Unicode sequence (\xc2\xa0\xc2\xa0) spanning two chunks. Because that sequence was split across the chunk boundary, it became undecodable. Here's how I solved it:

# read text
rawtext = file.read(chunksize)
# fix split end: extend the chunk until it ends on a space so the last
# multi-byte sequence is not cut in half
if chunknumber < totalchunks:
    while rawtext[-1] != ' ':
        rawtext = rawtext + file.read(1)
# decode text
decodedtext = rawtext.decode('utf8')

This also solves the more general problem of words being cut in half when they span two chunks.
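
The snippet above is Python 2 (rawtext is a byte string); a rough Python 3 sketch of the same idea, wrapped in a generator so it needs no chunknumber/totalchunks bookkeeping:

def read_chunks(fileName, chunksize=2 ** 20):
    # Yield decoded UTF-8 chunks without splitting a multi-byte sequence.
    with open(fileName, 'rb') as infile:
        while True:
            raw = infile.read(chunksize)
            if not raw:
                break
            # extend the chunk until it ends on a space byte; a space (0x20)
            # can never be part of a multi-byte UTF-8 sequence, so the
            # boundary is always safe to decode
            while not raw.endswith(b' '):
                extra = infile.read(1)
                if not extra:          # end of file: decode whatever is left
                    break
                raw += extra
            yield raw.decode('utf8')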
