
I'm trying to write a program that reads a .CSV file (input.csv) and writes to a new file (corrected.csv) only the rows that do not begin with any of the elements listed in a text file (output.txt).

This is what my program looks like right now:

import csv
lines = []
with open('output.txt','r') as f:
    for line in f.readlines():
        lines.append(line[:-1])
with open('corrected.csv','w') as correct:
    writer = csv.writer(correct, dialect = 'excel')
    with open('input.csv', 'r') as mycsv:
        reader = csv.reader(mycsv)
        for row in reader:
            if row[0] not in lines:
                writer.writerow(row)

Unfortunately, I keep getting this error, and I have no clue what it's about.

Traceback (most recent call last):
  File "C:\Python32\Sample Program\csvParser.py", line 12, in <module>
    for row in reader:
_csv.Error: line contains NULL byte

Credit to all the people here for even getting me to this point.

Just a guess, but it sounds like your input.csv file contains a blank line (maybe at the end?). Try looking in the csvParser.py file for that exception text. – Sam Axe Oct 25, 2011 at 19:43

I actually just went through the input.csv file and got rid of any and all blank space... still no luck (same error). – James Roseman Oct 25, 2011 at 19:50

To pinpoint the line number, I suggest you introduce a counter variable and increment it within the for row in reader loop. – codeape Oct 25, 2011 at 19:50

I'm not sure how I'm supposed to do that when the program itself won't execute. I tried adding a counter and nothing different showed up, just the same traceback error. – James Roseman Oct 25, 2011 at 19:54

Do you have a NULL byte in your .csv? open('input.csv').read().index('\0') will give you the offset of the first one if you do. – retracile Oct 25, 2011 at 19:55

I'm guessing you have a NUL byte in input.csv. You can test that with

if '\0' in open('input.csv').read():
    print("you have null bytes in your input file")
else:
    print("you don't")

if you do,

reader = csv.reader(x.replace('\0', '') for x in mycsv)

may get you around that. Or it may indicate you have utf16 or something 'interesting' in the .csv file.
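One way to tell, as a sketch: files saved as UTF-16 (as some Excel and Analytics exports are) usually start with a byte-order mark, FF FE or FE FF. The sample file below is generated just for illustration in place of the real input.csv:

```python
# Create a sample UTF-16 file (Python's 'utf-16' codec writes a BOM for us).
with open('input.csv', 'w', encoding='utf-16') as f:
    f.write('Page,Visits\n/home,123\n')

# Peek at the first two bytes: FF FE (little-endian) or FE FF marks UTF-16.
with open('input.csv', 'rb') as f:
    bom = f.read(2)

if bom in (b'\xff\xfe', b'\xfe\xff'):
    # Reopen in text mode with the right encoding; the BOM is consumed for us.
    with open('input.csv', 'r', encoding='utf-16') as f:
        header = f.readline()
    print(header)  # Page,Visits
else:
    header = None
    print('no UTF-16 BOM found')
```

If the BOM is there, stripping NUL bytes is treating a symptom; reopening with the correct encoding is the actual fix.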

+1 on finding NULL bytes in the file... unfortunately my 'corrected.csv' file now reads in Japanese... – James Roseman Oct 25, 2011 at 20:09

Sounds like your .csv isn't in ascii. I think further help is going to require a bit more information about your .csv's actual content. Have you tried opening it in a text editor like vim or notepad? Or running file input.csv to identify the file type? – retracile Oct 25, 2011 at 20:14

I've opened it in Notepad and it looks fine. What should a csv look like? It reads the same as it does on Google Analytics, but with huge tabs between the data. – James Roseman Oct 25, 2011 at 20:16

Damn... is there any way to replace the tabs with commas and have it work with the Python program? – James Roseman Oct 25, 2011 at 20:21

If your csv is tab delimited you need to specify so: reader = csv.reader(mycsv, delimiter='\t'). I imagine that the csv reader is gobbling up your whole file looking for the commas and getting all the way to EOF. But you definitely have an encoding issue. You need to specify the encoding when opening the file. – Steven Rumbalski Oct 25, 2011 at 20:26

I had this same problem with a CSV file created from LibreOffice, which had been originally opened from an Excel .xls file. For some reason, LibreOffice had saved the CSV file as UTF-16. You can tell by looking at the first 2 bytes of the file; if it's FF FE, then it's a good indicator that it's UTF-16. – Tom Dalton Nov 1, 2013 at 9:43

Note that if your file contains UTF-16 data that is outside of the ASCII range, csv.reader() will not be able to handle it, and you'll get UnicodeEncodeErrors instead. – Martijn Pieters Sep 4, 2014 at 9:30

This just caused a different error to be raised, UnicodeError: UTF-16 stream does not start with BOM – Cerin Sep 5, 2017 at 21:13

Replacing null with a space won't be a good choice. It worked for me to replace it with an empty string. – Marcelo Assis Jun 7, 2018 at 17:05

I have a question about how you have used yield. Given that this is in a loop, does it mean that it will still read the file line by line, or would it load it into memory all at once? – Mansour.M Oct 27, 2020 at 14:31

Thanks, Claudiu. This is an elegant, easy-to-adapt solution. However, what would be the difference if I replaced the yield with return? Could you please explain the difference for this particular case? – madhur Jan 11 at 16:23

Or why even return or yield it, i.e. def fix_nulls(s): for line in s: line.replace('\0', ' ') – madhur Jan 11 at 16:31
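Putting the two comment suggestions together (pass delimiter='\t' to csv.reader and specify the encoding when opening the file), a minimal sketch; the generated sample file stands in for the real Analytics export:

```python
import csv

# Generate a hypothetical tab-delimited UTF-16 export for illustration.
with open('input.csv', 'w', encoding='utf-16') as f:
    f.write('Page\tVisits\n/home\t123\n')

# Decode as UTF-16 and tell the csv module the delimiter is a tab.
with open('input.csv', 'r', encoding='utf-16', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    rows = list(reader)

print(rows)  # [['Page', 'Visits'], ['/home', '123']]
```

With the right encoding and delimiter, no NUL-stripping workaround is needed at all.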

You could just inline a generator to filter out the null values if you want to pretend they don't exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.

See the (line.replace('\0','') for line in mycsv) generator expression below. Note that on Python 3 the file should stay in text mode (on Python 2 you would open it with mode 'rb' instead).

import csv
lines = []
with open('output.txt', 'r') as f:
    for line in f:
        lines.append(line.rstrip('\n'))
with open('corrected.csv', 'w', newline='') as correct:
    writer = csv.writer(correct, dialect='excel')
    with open('input.csv', 'r', newline='') as mycsv:
        reader = csv.reader(line.replace('\0', '') for line in mycsv)
        for row in reader:
            if row[0] not in lines:
                writer.writerow(row)
Thanks! This worked for the NC election results files, which do indeed (!) use a null byte in place of a "0" byte in one column. See dl.ncsbe.gov/ENRS/resultsPCT20161108.zip – nealmcb Dec 7, 2016 at 22:07

This solution worked for me. I replaced the NUL characters before reading each line in the processing section. Thanks! – eduardosufan May 20, 2022 at 17:16
To pinpoint the offending line, count rows with enumerate and catch the csv.Error:

with open('corrected.csv', 'w') as correct:
    writer = csv.writer(correct, dialect='excel')
    with open('input.csv', 'r') as mycsv:
        reader = csv.reader(mycsv)
        try:
            for i, row in enumerate(reader):
                if row[0] not in lines:
                    writer.writerow(row)
        except csv.Error:
            print('csv choked on line %s' % (i + 1))
            raise

Perhaps this from daniweb would be helpful:

I'm getting this error when reading from a csv file: "Runtime Error! line contains NULL byte". Any idea about the root cause of this error?

Ok, I got it and thought I'd post the solution. Simple, yet it caused me grief... The file in question had been saved in .xls format instead of .csv. I didn't catch this because the file name itself had the .csv extension while the type was still .xls.

Traceback (most recent call last): File "C:\Python32\Sample Program\csvParser.py", line 17, in <module> print('csv choked on line %s' % (i+1)) NameError: name 'i' is not defined – James Roseman Oct 25, 2011 at 20:03

Ok. Then it's choking on the very first line. Run this and post what you see: print(open('input.csv', 'r').readlines()[0]) – Steven Rumbalski Oct 25, 2011 at 20:08

Something funky... but it's running. ÿþ/ < That's all it would paste (it's mostly blocks and numbers) – James Roseman Oct 25, 2011 at 20:10

Oh shoot, that could completely be it; how might I go about fixing this? I saved it straight from Google Analytics too... – James Roseman Oct 25, 2011 at 20:12

If you develop under Linux, you can use the full power of sed:

from subprocess import check_call, CalledProcessError

PATH_TO_FILE = '/home/user/some/path/to/file.csv'

try:
    check_call("sed -i -e 's|\\x0||g' {}".format(PATH_TO_FILE), shell=True)
except CalledProcessError as err:
    print(err)

The most efficient solution for huge files.

Checked for Python3, Kubuntu

with open(csv_file, 'r', encoding="utf-8") as f:
    reader = csv.reader(fix_nulls(f))
    for line in reader:
        # do something
        pass

This way works for me.
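The fix_nulls used above comes from an answer that seems to have been trimmed from this page; here is a reconstruction based on the comments (the filename and data below are made up for the demo):

```python
import csv

def fix_nulls(s):
    # A generator: yields one cleaned line at a time, so the file is
    # still streamed line by line rather than read into memory at once.
    for line in s:
        yield line.replace('\0', '')

# Hypothetical file containing a stray NUL byte.
with open('dirty.csv', 'w') as f:
    f.write('a,b\0,c\n1,2,3\n')

with open('dirty.csv', 'r') as f:
    rows = list(csv.reader(fix_nulls(f)))

print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```

On the yield-vs-return question from the comments: with return, the function would exit after processing the first line, and the no-return variant discards the replacement entirely (str.replace returns a new string rather than modifying in place). yield is what makes this a lazy line-by-line filter.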

Turning my linux environment into a clean complete UTF-8 environment made the trick for me. Try the following in your command line:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
For me also, changing to UTF-8 solved the problem. On Windows I used Notepad++ to change the format from UTF-16 to UTF-8. I then opened the file with LibreOffice Calc and cleared extra lines, etc. – Yuval Harpaz Jun 11, 2018 at 11:31

This is long settled, but I ran across this answer because I was experiencing an unexpected error while reading a CSV to process as training data in Keras and TensorFlow.

In my case, the issue was much simpler, and is worth being conscious of. The data being produced into the CSV wasn't consistent, resulting in some columns being completely missing, which seems to end up throwing this error as well.

The lesson: If you're seeing this error, verify that your data looks the way that you think it does!
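A cheap way to run that sanity check with the stdlib (a sketch; the filename and data are invented): collect the set of row widths and confirm there is only one.

```python
import csv

# Hypothetical training file with one row missing a column.
with open('train.csv', 'w', newline='') as f:
    f.write('a,b,c\n1,2,3\n4,5\n')

# Gather the distinct number of fields per row.
with open('train.csv', newline='') as f:
    widths = {len(row) for row in csv.reader(f)}

consistent = len(widths) == 1
print(consistent)  # False: one row is missing a column
```

If this prints False, the columns are ragged and the file deserves a closer look before it goes anywhere near a training pipeline.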

pandas.read_csv now handles different UTF encodings when reading/writing, and can therefore deal with null bytes directly:

data = pd.read_csv(file, encoding='utf-16')

see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

It is very simple.

Don't make the CSV file by creating a new Excel workbook or using "Save As .csv" from Windows.

Simply import the csv module, write a dummy CSV file with it, and then paste your data into that.

A CSV made by Python's csv module itself will no longer show you encoding or blank-line errors.

This answer does not provide any solution for handling the input data you are given, but rather for "fixing" how the input data is produced. More often than not, the input data is not something you control. – despina Sep 1, 2021 at 11:10
