python - How can I read tar.gz file using pandas read_csv with gzip compression option?

I have a very simple CSV file with the following data, compressed inside a tar.gz file. I need to read it into a DataFrame using pandas.read_csv.

0 1 4
1 2 5
2 3 6

import pandas as pd
pd.read_csv("sample.tar.gz", compression='gzip')

However, I am getting error:

CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2
The following are the read_csv commands I tried and the error each one produces:
pd.read_csv("sample.tar.gz",compression='gzip',  engine='python')
Error: line contains NULL byte
pd.read_csv("sample.tar.gz",compression='gzip', header=0)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2
pd.read_csv("sample.tar.gz",compression='gzip', header=0, sep=" ")
CParserError: Error tokenizing data. C error: Expected 2 fields in line 94, saw 14    
pd.read_csv("sample.tar.gz",compression='gzip', header=0, sep=" ", engine='python')
Error: line contains NULL byte
What's going wrong here? How can I fix this?
                If it is a single file, why are you tar-ing it? Why not just gzip it? That way you can use pd.read_csv() on it directly.
– Nehal J Wani
                Sep 1, 2016 at 6:32
                I am not tar-ing it. It was given to me that way, and I can't extract the original file because it's more than 100 GB.
– Geet
                Sep 1, 2016 at 6:36
df = pd.read_csv('sample.tar.gz', compression='gzip', header=0, sep=' ', quotechar='"', error_bad_lines=False)
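Note: error_bad_lines=False skips the offending rows instead of raising an error. In pandas 1.3 and later this option is deprecated in favor of on_bad_lines='skip'.
                My pandas version is 0.18.1. The updated code gives me the error "CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2".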
– Geet
                Sep 1, 2016 at 6:53
                This worked for me on a sample CSV file. Your link makes me download 40 GB. Don't you have a small sample of it for me to test?
– Marlon Abeykoon
                Sep 1, 2016 at 6:58
You can use the tarfile module to read a particular file from the tar.gz archive (as discussed in this resolved issue).
If there is only one file in the archive, then you can do this:
import tarfile
import pandas as pd
with tarfile.open("sample.tar.gz", "r:*") as tar:
    csv_path = tar.getnames()[0]
    df = pd.read_csv(tar.extractfile(csv_path), header=0, sep=" ")
The read mode r:* handles the gz extension (or other kinds of compression) transparently. If there are multiple files in the gzipped tar archive, you could use a line such as csv_path = [n for n in tar.getnames() if n.endswith('.csv')][-1] to get the last CSV file in the archive.
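For the multi-file case, here is a minimal sketch of how the selection could be wrapped in a helper. The function name read_csv_from_tar, the suffix parameter, and the in-memory test archive (with made-up member names and column names a and b) are assumptions for illustration only; they are not part of pandas or tarfile.

```python
import io
import tarfile

import pandas as pd


def read_csv_from_tar(archive, suffix=".csv", **read_csv_kwargs):
    """Read the last regular file in the archive whose name ends with `suffix`.

    `archive` is either a path or a seekable binary file object.
    This helper is hypothetical; only tarfile.open, getmembers,
    extractfile and pd.read_csv are real library calls.
    """
    source = {"name": archive} if isinstance(archive, str) else {"fileobj": archive}
    with tarfile.open(mode="r:*", **source) as tar:
        members = [
            m for m in tar.getmembers()
            if m.isfile() and m.name.endswith(suffix)
        ]
        if not members:
            raise FileNotFoundError(f"no member ending with {suffix!r} in archive")
        # extractfile returns a binary file-like object that read_csv accepts.
        return pd.read_csv(tar.extractfile(members[-1]), **read_csv_kwargs)


# Build a tiny tar.gz in memory so the sketch can be run without any files.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    samples = [
        ("readme.txt", "not a csv\n"),
        ("data/sample.csv", "a b\n1 4\n2 5\n3 6\n"),
    ]
    for name, text in samples:
        payload = text.encode("utf-8")
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

df = read_csv_from_tar(buf, header=0, sep=" ")
print(df)
```

Filtering on m.isfile() skips directory entries, which getnames() would otherwise also return. With a real archive on disk, the call would look like read_csv_from_tar("sample.tar.gz", header=0, sep=" ").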
                Isn't r:* (or equivalently r) the default? I don't see what benefit it has to specify it explicitly.
– Asclepius
                Feb 22, 2021 at 23:16
                @tmthyjames Maybe you would like to program in C instead where everything is as explicit as it can be.
– Asclepius
                Sep 25, 2022 at 15:15