.../site-packages/pandas/io/parsers.py:1130:
DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype
option on import or set low_memory=False.
Why is the dtype option related to low_memory, and why might low_memory=False help?
The deprecated low_memory option
The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently [source].
The reason you get this low_memory warning is that guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.
Dtype Guessing (very bad)
Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.
Consider the example of a file with a column called user_id. It contains 10 million rows where user_id is always a number. Since pandas cannot know in advance that the column contains only numbers, it will probably keep the values as the original strings until it has read the whole file.
Specifying dtypes (should always be done)
Adding dtype={'user_id': int} to the pd.read_csv() call tells pandas, as soon as it starts reading the file, that this column contains only integers.
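As a minimal sketch (using a small in-memory CSV rather than a real file), specifying the dtype on clean data works as expected and the column comes back as int64:

import pandas as pd
from io import StringIO

# clean data: user_id really is always an integer
clean = StringIO("user_id,username\n1,Alice\n2,Bob\n3,Caesar\n")
df = pd.read_csv(clean, dtype={"user_id": int, "username": "string"})
print(df.dtypes)  # user_id int64, username string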
Also worth noting: if the last line in the file had "foobar" written in the user_id column, loading would crash if the above dtype was specified.
Example of broken data that breaks when dtypes are defined
import pandas as pd
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""

sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})

ValueError: invalid literal for long() with base 10: 'foobar'
dtypes are typically a numpy thing, read more about them here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
What dtypes exist?
We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.
Pandas extends this set of dtypes with its own:
'datetime64[ns, <tz>]', which is a time zone aware timestamp.
'category', which is essentially an enum (strings represented by integer keys to save space).
'period[<freq>]', not to be confused with a timedelta; these objects are anchored to specific time periods.
'Sparse', 'Sparse[int]', 'Sparse[float]' are for sparse data, or "data that has a lot of holes in it". Instead of saving the NaN or None in the dataframe, it omits those objects, saving space.
'Interval' is a topic of its own, but its main use is for indexing. See more here.
'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64' are all pandas-specific integers that are nullable, unlike the numpy variants.
'string' is a specific dtype for working with string data and gives access to the .str attribute on the series.
'boolean' is like the numpy 'bool' but it also supports missing data.
Read the complete reference here:
Pandas dtype reference
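A minimal sketch illustrating a few of these extension dtypes, in particular that the nullable 'Int64', 'string' and 'boolean' dtypes can hold missing values, which their numpy counterparts cannot:

import pandas as pd

# nullable integer: can hold missing values, unlike numpy's int64
s_int = pd.Series([1, 2, None], dtype="Int64")

# category: the distinct strings are stored once, rows hold integer codes
s_cat = pd.Series(["a", "b", "a", "a"], dtype="category")

# string and boolean extension dtypes also support missing data
s_str = pd.Series(["x", None, "z"], dtype="string")
s_bool = pd.Series([True, None, False], dtype="boolean")

print(s_int.dtype, s_cat.dtype, s_str.dtype, s_bool.dtype)
# Int64 category string boolean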
Gotchas, caveats, notes
Setting dtype=object will silence the above warning, but it will not make the load more memory efficient, only more process efficient, if anything.

Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.
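A minimal sketch of the first gotcha (the file name is hypothetical): dtype=object silences the DtypeWarning, but every value is kept as a plain Python object, so memory use does not improve:

import pandas as pd

# hypothetical large file with mixed-type columns; no DtypeWarning is raised,
# but every column comes back as 'object' and every cell is a Python string
df = pd.read_csv("big_mixed_file.csv", dtype=object)
print(df.dtypes)  # object for every column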
Usage of converters
@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.

CSV files can be processed line by line, so they could be handled much more efficiently by multiple converters in parallel, simply by cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story, roughly sketched below.
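For what it's worth, here is a rough, hypothetical sketch of that idea (this is not pandas functionality; parse_segment and parallel_read_csv are made-up names): the file is split into segments of raw lines and each segment is parsed by pd.read_csv in its own process, then the pieces are concatenated:

import multiprocessing as mp
from io import StringIO

import pandas as pd

def parse_segment(args):
    # each worker parses its own chunk of raw CSV text; any converters or
    # dtypes you need would be passed to read_csv here
    header, lines = args
    return pd.read_csv(StringIO(header + "".join(lines)))

def parallel_read_csv(path, n_workers=4):
    with open(path) as f:
        header = f.readline()  # keep the header so every segment parses the same way
        lines = f.readlines()
    step = max(1, len(lines) // n_workers)
    segments = [(header, lines[i:i + step]) for i in range(0, len(lines), step)]
    with mp.Pool(n_workers) as pool:
        frames = pool.map(parse_segment, segments)
    return pd.concat(frames, ignore_index=True)

# usage (hypothetical file), guarded for spawn-based platforms:
# if __name__ == "__main__":
#     df = parallel_read_csv("big_file.csv")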
dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
According to the pandas documentation:
dtype : Type name or dict of column -> type
As for low_memory, it's True by default and isn't yet documented. I don't think it's relevant though. The error message is generic, so you shouldn't need to mess with low_memory anyway. Hope this helps, and let me know if you have further problems.
As mentioned earlier by firelynx, if dtype is explicitly specified and there is mixed data that is not compatible with that dtype, then loading will crash. I used a converter like this as a workaround to replace the values with an incompatible data type so that the data could still be loaded.
import numpy as np
import pandas as pd

def conv(val):
    if not val:
        return 0
    try:
        return np.float64(val)
    except ValueError:
        return np.float64(0)

df = pd.read_csv(csv_file, converters={'COL_A': conv, 'COL_B': conv})
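For example, with this converter, missing and unparseable values collapse to zero while valid numbers parse normally, so the columns load with a consistent numeric dtype:

conv("")        # 0
conv("foobar")  # 0.0
conv("3.14")    # 3.14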
I was facing a similar issue when processing a huge csv file (6 million rows). I had three issues:
the file contained strange characters (fixed using encoding)
the datatype was not specified (fixed using dtype property)
using the above, I still faced an issue related to the file_format, which could not be determined from the filename (fixed using try..except, as shown below)
from pathlib import Path

import pandas as pd

df = pd.read_csv(csv_file, sep=';', encoding='ISO-8859-1',
                 names=['permission','owner_name','group_name','size','ctime','mtime','atime','filename','full_filename'],
                 dtype={'permission':str,'owner_name':str,'group_name':str,'size':str,'ctime':object,'mtime':object,'atime':object,'filename':str,'full_filename':str,'first_date':object,'last_date':object})
try:
    df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
except Exception:
    df['file_format'] = ''
It worked for me with low_memory=False while importing a DataFrame. That is the only change that worked for me:
df = pd.read_csv('export4_16.csv',low_memory=False)
According to the pandas documentation, specifying low_memory=False, as long as engine='c' (which is the default), is a reasonable solution to this problem.

If low_memory=False, then whole columns will be read in first, and then the proper types determined. For example, a column will be kept as objects (strings) as needed to preserve information.

If low_memory=True (the default), then pandas reads the data in chunks of rows, then appends them together. Then some of the columns might look like chunks of integers and strings mixed up, depending on whether, during a chunk, pandas encountered anything that couldn't be cast to integer (say). This could cause problems later. The warning is telling you that this happened at least once in the read, so you should be careful. Setting low_memory=False will use more memory but will avoid the problem.

Personally, I think low_memory=True is a bad default, but I work in an area that uses many more small datasets than large ones, and so convenience is more important than efficiency.
The following code illustrates an example where low_memory=True is set and a column comes in with mixed types. It builds off the answer by @firelynx.
import pandas as pd
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

# make a big csv data file, following earlier approach by @firelynx
csvdata = """1,Alice
2,Bob
3,Caesar
"""

# we have to replicate the "integer column" user_id many many times to get
# pd.read_csv to actually chunk read. otherwise it just reads
# the whole thing in one chunk, because it's faster, and we don't get any
# "mixed dtype" issue. the 100000 below was chosen by experimentation.
csvdatafull = ""
for i in range(100000):
    csvdatafull = csvdatafull + csvdata
csvdatafull = csvdatafull + "foobar,Cthlulu\n"
csvdatafull = "user_id,username\n" + csvdatafull

sio = StringIO(csvdatafull)
# the following line gives me the warning:
# C:\Users\rdisa\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3072: DtypeWarning: Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
#   interactivity=interactivity, compiler=compiler, result=result)
# but it does not always give me the warning, so i guess the internal workings of read_csv depend on background factors
x = pd.read_csv(sio, low_memory=True)  #, dtype={"user_id": int, "username": "string"})

x.dtypes
# this gives:
# Out[69]:
# user_id     object
# username    object
# dtype: object

type(x['user_id'].iloc[0])       # int
type(x['user_id'].iloc[1])       # int
type(x['user_id'].iloc[2])       # int
type(x['user_id'].iloc[10000])   # int
type(x['user_id'].iloc[299999])  # str !!!! (even though it's a number! so this chunk must have been read in as strings)
type(x['user_id'].iloc[300000])  # str !!!!!
Aside: To give an example where this is a problem (and where I first encountered it as a serious issue), imagine you ran pd.read_csv() on a file, then wanted to drop duplicates based on an identifier. Say the identifier is sometimes numeric, sometimes string. One row might be "81287", another might be "97324-32". Still, they are unique identifiers.

With low_memory=True, pandas might read in the identifier column like this:
81287
81287
81287
81287
81287
"81287"
"81287"
"81287"
"81287"
"97324-32"
"97324-32"
"97324-32"
"97324-32"
"97324-32"
Just because it chunks things, sometimes the identifier 81287 is a number and sometimes a string. When I try to drop duplicates based on this column, well,
81287 == "81287"
Out[98]: False
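One sketch of a workaround for this (column and file names are hypothetical): force the identifier column to a single string dtype at read time, so every value compares consistently:

import pandas as pd

# hypothetical file whose 'identifier' column mixes values like 81287 and "97324-32"
df = pd.read_csv("records.csv", dtype={"identifier": str})

# every id is now a string, e.g. "81287", so drop_duplicates behaves as expected
df = df.drop_duplicates(subset="identifier")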
As the error says, you should specify the datatypes when using the read_csv() method. So, you should write:
file = pd.read_csv('example.csv', dtype='unicode')
Sometimes, when all else fails, you just want to tell pandas to shut up about it:
import warnings

# Ignore DtypeWarnings from pandas' read_csv
warnings.filterwarnings('ignore', message="^Columns.*")
I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first: I would check that your dataframe isn't bigger than your system memory, reboot, and clear the RAM before proceeding. If you're still running into errors, it's also worth making sure your .csv file is okay; take a quick look in Excel and make sure there's no obvious corruption. Broken original data can wreak havoc...
Building on the answer given by Jerald Achaibar, we can detect the mixed dtypes warning and only use the slower python engine when the warning occurs:
import warnings

import pandas

# Force the mixed datatype warning to be a python error so we can catch it and
# reattempt the load using the slower python engine
warnings.simplefilter('error', pandas.errors.DtypeWarning)
try:
    df = pandas.read_csv(path, sep=sep, encoding=encoding)
except pandas.errors.DtypeWarning:
    df = pandas.read_csv(path, sep=sep, encoding=encoding, engine="python")