Pandas read_csv: low_memory and dtype options

Question 1

df = pd.read_csv('somefile.csv')
...gives an error:
.../site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types.  Specify dtype option on import or set low_memory=False.
Why is the dtype option related to low_memory, and why might low_memory=False help?

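Question 2

The deprecated low_memory option

The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently [source].

The reason you get this low_memory warning is that guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.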
          
          
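Dtype Guessing (very bad)

Pandas can only determine what dtype a column should have once the whole file has been read. This means nothing can really be parsed before the whole file is read, unless you risk having to change the dtype of that column when you read the last value.

Consider the example of a file that has a column called user_id. It contains 10 million rows where the user_id is always a number. Since pandas cannot know that it is only numbers, it will probably keep the values as the original strings until it has read the whole file.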
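To make the guessing behaviour concrete, here is a minimal sketch. It assumes two tiny in-memory CSVs (hypothetical stand-ins for a real file) that differ only in their last row, and compares the dtype pandas infers for user_id in each case.

```python
from io import StringIO

import pandas as pd

# Hypothetical in-memory CSVs standing in for a real file.
clean = "user_id,username\n1,Alice\n3,Bob\n"
dirty = clean + "foobar,Caesar\n"

clean_df = pd.read_csv(StringIO(clean))
dirty_df = pd.read_csv(StringIO(dirty))

# With only numeric values the column is inferred as an integer dtype.
print(clean_df["user_id"].dtype)

# A single non-numeric value in the last row changes the inferred dtype
# of the whole column, and every value is kept as a string.
print(dirty_df["user_id"].dtype)
print(dirty_df["user_id"].tolist())
```

The exact name of the non-numeric dtype depends on the pandas version, so the sketch prints it rather than assuming it.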
          
          
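Specifying dtypes (should always be done)

Adding
dtype={'user_id': int}
to the pd.read_csv() call lets pandas know, when it starts reading the file, that this column contains only integers.
Also worth noting: if the last line in the file had "foobar" written in the user_id column, loading would crash with the above dtype specified.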
Example of broken data that breaks when dtypes are defined
import pandas as pd
try:
    # Python 2
    from StringIO import StringIO
except ImportError:
    # Python 3
    from io import StringIO
csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})
ValueError: invalid literal for long() with base 10: 'foobar'
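When a load fails like this, it can help to find the offending rows before deciding how to handle them. The following is only a sketch of one possible approach, not something the answer prescribes: read the suspect column as strings, then use pd.to_numeric with errors="coerce" to flag values that are not numbers. The variable names are illustrative.

```python
from io import StringIO

import pandas as pd

csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""

# Read the suspect column as plain strings so loading cannot fail on it.
raw = pd.read_csv(StringIO(csvdata), dtype={"user_id": str})

# Values that cannot be parsed as numbers become NaN under errors="coerce".
as_number = pd.to_numeric(raw["user_id"], errors="coerce")

# Keep rows that had a value which failed to parse (genuinely empty
# fields are excluded by the notna() check).
bad_rows = raw[as_number.isna() & raw["user_id"].notna()]
print(bad_rows)
```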
dtypes are typically a numpy thing; read more about them here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
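What dtypes exist?
We have access to the numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.
Pandas extends this set of dtypes with its own:
'datetime64[ns, <tz>]', which is a time zone aware timestamp.
'category', which is essentially an enum (strings represented by integer keys to save space).
'period[]', not to be confused with a timedelta; these objects are actually anchored to specific time periods.
'Sparse', 'Sparse[int]', 'Sparse[float]' are for sparse data, or "data that has a lot of holes in it". Instead of saving the NaN or None in the dataframe, it omits those objects, saving space.
'Interval' is a topic of its own, but its main use is for indexing. See more here.
'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64' are all pandas-specific integers that are nullable, unlike the numpy variants.
'string' is a specific dtype for working with string data and gives access to the .str attribute on the Series.
'boolean' is like the numpy 'bool', but it also supports missing data.
Read the complete reference here:
Pandas dtype reference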
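As an illustration of a few of the pandas-specific dtypes listed above, here is a small sketch that passes them to read_csv by name. The column names and data are made up for the example, and it assumes a pandas version that has the nullable dtypes (1.0 or later).

```python
from io import StringIO

import pandas as pd

# Hypothetical data with deliberately missing fields.
csvdata = """user_id,plan,username,active
1,free,Alice,True
,paid,Bob,False
3,free,,
"""

df = pd.read_csv(
    StringIO(csvdata),
    dtype={
        "user_id": "Int64",    # nullable integer: the gap becomes <NA>, not a float NaN
        "plan": "category",    # repeated labels stored once, referenced by integer codes
        "username": "string",  # dedicated string dtype
        "active": "boolean",   # nullable boolean
    },
)

print(df.dtypes)
print(df)
```

With a plain numpy int, the missing user_id in the second row would make the load fail; the nullable 'Int64' is what allows the gap.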
Gotchas, caveats, notes
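Setting dtype=object will silence the above warning, but will not make loading more memory efficient, only more process efficient, if anything.
Setting dtype=unicode will not do anything, since to numpy a unicode is represented as object.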
Usage of converters
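@sparrow correctly points out the usage of converters to avoid pandas blowing up when it encounters 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.
CSV files can be processed line by line, and so could be handled by multiple converters in parallel more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.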
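One alternative to a per-value converter, offered here only as a sketch and not as part of the original answer, is to load the column as strings and convert it afterwards in a single vectorized call. It uses pd.to_numeric with errors="coerce"; substituting 0 for bad values is an assumption chosen to mirror a converter that does the same.

```python
from io import StringIO

import pandas as pd

csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""

# Load the column as strings first so parsing cannot fail.
df = pd.read_csv(StringIO(csvdata), dtype={"user_id": str})

# errors="coerce" turns unparseable values into NaN; fillna(0) then
# substitutes 0 for them before casting back to integers.
df["user_id"] = (
    pd.to_numeric(df["user_id"], errors="coerce").fillna(0).astype("int64")
)
print(df)
```

Whether this is actually faster than a converter on a given file is worth measuring rather than assuming.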

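Question 3

Try:
# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on_bad_lines='skip' is the replacement there.
dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
According to the pandas documentation:
  dtype : Type name or dict of column -> type
As for low_memory, it is True by default and isn't yet documented. I don't think it's relevant, though. The error message is generic, so you shouldn't need to mess with low_memory anyway. Hope this helps, and let me know if you have further problems.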

Question 4

df = pd.read_csv('somefile.csv', low_memory=False)
This should solve the issue. I got exactly the same error when reading 1.8M rows from a CSV.

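Question 5

As firelynx mentioned earlier, if dtype is explicitly specified and there is mixed data that is not compatible with that dtype, loading will crash. I used a converter like this as a workaround to change the values with an incompatible data type, so that the data could still be loaded.
import numpy as np
import pandas as pd

def conv(val):
    # Empty fields become 0
    if not val:
        return 0
    try:
        return np.float64(val)
    except ValueError:
        # Non-numeric values also become 0
        return np.float64(0)

df = pd.read_csv(csv_file, converters={'COL_A': conv, 'COL_B': conv})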
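To show what such a converter does to actual input, here is a self-contained sketch that runs it against a small in-memory CSV. The data is hypothetical; COL_A contains one empty field and one non-numeric value.

```python
from io import StringIO

import numpy as np
import pandas as pd


def conv(val):
    """Return val as a float, or 0 when it is empty or not numeric."""
    if not val:
        return 0
    try:
        return np.float64(val)
    except ValueError:
        return np.float64(0)


# Hypothetical data: one empty field and one non-numeric value in COL_A.
csvdata = """COL_A,COL_B
1.5,2
,3
foobar,4
"""

df = pd.read_csv(StringIO(csvdata), converters={"COL_A": conv, "COL_B": conv})
print(df)
```

Note that after loading, a real 0 in the file can no longer be told apart from a value that was replaced, which may or may not matter for your data.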

Question 6

This worked for me!
file = pd.read_csv('example.csv', engine='python')

Question 7

I faced a similar issue when processing a huge csv file (6 million rows). I had three issues:
1. the file contained strange characters (fixed using encoding)
2. the datatype was not specified (fixed using dtype property)
3. Using the above I still faced an issue which was related with the file_format that could not be defined based on the filename (fixed using try .. except..)
from pathlib import Path
import pandas as pd

df = pd.read_csv(csv_file, sep=';', encoding='ISO-8859-1',
                 names=['permission','owner_name','group_name','size','ctime','mtime','atime','filename','full_filename'],
                 dtype={'permission':str,'owner_name':str,'group_name':str,'size':str,'ctime':object,'mtime':object,'atime':object,'filename':str,'full_filename':str,'first_date':object,'last_date':object})
try:
    # Extension without the leading dot, e.g. 'csv'
    df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
except Exception:
    df['file_format'] = ''
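One thing to be aware of in the snippet above: the try/except wraps the whole list comprehension, so a single unusable filename leaves file_format empty for every row. A possible variant, sketched here with a hypothetical helper named suffix_or_empty, handles each value on its own.

```python
from pathlib import Path

import pandas as pd


def suffix_or_empty(name):
    """Return the extension without the dot, or '' if name is not a path string."""
    try:
        return Path(name).suffix[1:]
    except TypeError:
        # Path() rejects non-string values such as NaN from an empty field.
        return ""


# Hypothetical frame with one missing filename.
df = pd.DataFrame(
    {"filename": ["report.csv", float("nan"), "archive.tar.gz", "README"]}
)
df["file_format"] = df["filename"].map(suffix_or_empty)
print(df)
```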

Question 8

Using low_memory=False while importing the DataFrame worked for me. That is the only change I needed:
df = pd.read_csv('export4_16.csv', low_memory=False)

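Question 9

According to the pandas documentation, specifying low_memory=False, as long as engine='c' (which is the default), is a reasonable solution to this problem.

If low_memory=False, then whole columns are read in first, and then the proper types are determined. For example, a column will be kept as objects (strings) as needed to preserve information.

If low_memory=True (the default), then pandas reads in the data in chunks of rows and appends them together. Some of the columns might then look like chunks of integers and strings mixed up, depending on whether pandas encountered anything in a given chunk that couldn't be cast to integer (say). This could cause problems later. The warning is telling you that this happened at least once during the read, so you should be careful. Setting low_memory=False will use more memory but will avoid the problem.

Personally, I think low_memory=True is a bad default, but I work in an area that uses many more small datasets than large ones, so convenience is more important than efficiency.

The following code illustrates an example where low_memory=True is set and a column comes in with mixed types. It builds on the answer by @firelynx.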
                  
import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
# make a big csv data file, following earlier approach by @firelynx
csvdata = """1,Alice
2,Bob
3,Caesar
"""
# we have to replicate the "integer column" user_id many many times to get
# pd.read_csv to actually chunk read. otherwise it just reads 
# the whole thing in one chunk, because it's faster, and we don't get any 
# "mixed dtype" issue. the 100000 below was chosen by experimentation.
csvdatafull = ""
for i in range(100000):
    csvdatafull = csvdatafull + csvdata
csvdatafull =  csvdatafull + "foobar,Cthlulu\n"
csvdatafull = "user_id,username\n" + csvdatafull
sio = StringIO(csvdatafull)
# the following line gives me the warning:
    # C:\Users\rdisa\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3072: DtypeWarning: Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
    # interactivity=interactivity, compiler=compiler, result=result)
# but it does not always give me the warning, so i guess the internal workings of read_csv depend on background factors
x = pd.read_csv(sio, low_memory=True) #, dtype={"user_id": int, "username": "string"})
x.dtypes
# this gives:
# Out[69]: 
# user_id     object
# username    object
# dtype: object
type(x['user_id'].iloc[0]) # int
type(x['user_id'].iloc[1]) # int
type(x['user_id'].iloc[2]) # int
type(x['user_id'].iloc[10000]) # int
type(x['user_id'].iloc[299999]) # str !!!! (even though it's a number! so this chunk must have been read in as strings)
type(x['user_id'].iloc[300000]) # str !!!!!
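Aside: To give an example of where this is a problem (and where I first encountered this as a serious issue), imagine you ran pd.read_csv() on a file and then wanted to drop duplicates based on an identifier. Say the identifier is sometimes numeric and sometimes a string. One row might be "81287", another might be "97324-32". Still, they are unique identifiers.
With low_memory=True, pandas might read in the identifier column like this: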
81287
81287
81287
81287
81287
"81287"
"81287"
"81287"
"81287"
"97324-32"
"97324-32"
"97324-32"
"97324-32"
"97324-32"
Just because it chunks things, the identifier 81287 is sometimes a number and sometimes a string. When I try to drop duplicates based on this, well:
81287 == "81287"
Out[98]: False
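One way to sidestep this for identifier columns, shown here as a sketch on hypothetical data, is to read the column with an explicit string dtype so that every value has the same Python type regardless of how the file is chunked.

```python
from io import StringIO

import pandas as pd

# Hypothetical identifier column mixing purely numeric and non-numeric IDs.
csvdata = """identifier,value
81287,a
81287,b
97324-32,c
97324-32,d
"""

df = pd.read_csv(StringIO(csvdata), dtype={"identifier": str})

# Every identifier is a str, so equal IDs compare equal and
# drop_duplicates behaves as expected.
deduped = df.drop_duplicates(subset="identifier")
print(deduped)
```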

Question 10

As the error says, you should specify the datatypes when using the read_csv() method.
So, you should write
file = pd.read_csv('example.csv', dtype='unicode')

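Question 11

I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first: I would check that your dataframe isn't bigger than your system memory, reboot, and clear the RAM before proceeding. If you're still running into errors, it's worth making sure your .csv file is OK; take a quick look in Excel and make sure there's no obvious corruption. Broken original data can wreak havoc...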
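For the "is the dataframe bigger than memory" check, a rough sketch using DataFrame.memory_usage is below. The data is a hypothetical stand-in for a large file; on a real file you might load only part of it first (for example with the nrows argument) to get an estimate.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a large file.
csvdata = "user_id,username\n" + "1,Alice\n2,Bob\n" * 1000

df = pd.read_csv(StringIO(csvdata))

# deep=True also counts the Python string objects held in object columns,
# which a shallow measurement would miss.
total_bytes = int(df.memory_usage(deep=True).sum())
print(f"{total_bytes / 1024**2:.2f} MiB in memory for {len(df)} rows")
```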

Question 12

Sometimes, when all else fails, you just want to tell pandas to shut up about it:
import warnings

# Ignore DtypeWarnings from pandas' read_csv
warnings.filterwarnings('ignore', message="^Columns.*")
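A narrower variant, sketched here as one possible refinement, filters by warning category instead of by message text and limits the filter to a single block using warnings.catch_warnings. The CSV content is a hypothetical placeholder.

```python
import warnings
from io import StringIO

import pandas as pd

csvdata = "user_id,username\n1,Alice\n3,Bob\n"

# Suppress only DtypeWarning, and only while this block runs; the
# previous warning filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=pd.errors.DtypeWarning)
    df = pd.read_csv(StringIO(csvdata))

print(df.shape)
```

Silencing the warning does not change how the data was parsed, so mixed-type columns can still be present afterwards.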

Question 13

Building on the answer given by Jerald Achaibar, we can detect the mixed dtypes warning and use the slower python engine only when the warning occurs.
import warnings
import pandas

# Force mixed datatype warning to be a python error so we can catch it and reattempt the
# load using the slower python engine
warnings.simplefilter('error', pandas.errors.DtypeWarning)
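The snippet above only sets up the filter. A minimal sketch of how the catch-and-retry step might look follows; the function name read_csv_with_fallback and the temporary demo file are assumptions made for illustration, not part of the original answer.

```python
import os
import tempfile
import warnings

import pandas as pd


def read_csv_with_fallback(path, **kwargs):
    """Read with the default engine; retry with engine='python' on DtypeWarning."""
    with warnings.catch_warnings():
        # Inside this block a DtypeWarning is raised as an exception.
        warnings.simplefilter("error", pd.errors.DtypeWarning)
        try:
            return pd.read_csv(path, **kwargs)
        except pd.errors.DtypeWarning:
            pass
    # Second attempt with the slower python engine.
    return pd.read_csv(path, engine="python", **kwargs)


# Demo on a small temporary file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    with open(path, "w") as fh:
        fh.write("user_id,username\n1,Alice\n3,Bob\n")
    df = read_csv_with_fallback(path)

print(df.shape)
```

The function takes a path rather than an open file object so that the second read starts from the beginning of the file. Using catch_warnings keeps the "error" filter from leaking into the rest of the program.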