Lazy method for reading a big file in Python?

Posted on 2015-06-18

Accepted

0 upvotes

To write a lazy function, just use yield:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)


         
Using open('really_big_file.dat', 'rb') is good practice, for compatibility with our POSIX-challenged, Windows-using colleagues.


         
Missing the rb that @Tal Weiss mentioned, and missing a file.close() statement (a with open('really_big_file.dat', 'rb') as f: block would accomplish the same); see here for another concise implementation.


         
@cod3monk3y: text and binary files are different things. Both types are useful, but in different cases. The default (text) mode may be useful here, i.e. 'rb' is not missing.


         
@j-f-sebastian: true, the OP did not specify whether he was reading textual or binary data. But if he is using Python 2.7 on Windows and is reading binary data, it is certainly worth noting that if he forgets the 'b', his data will very likely be corrupted. From the docs: "Python on Windows makes a distinction between text and binary files; [...] it'll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files."


         
Here is a generator that returns 1k chunks: buf_iter = (x for x in iter(lambda: buf.read(1024), '')). Then use for chunk in buf_iter: to loop through the chunks.
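The two-argument form of iter() used in the comment above keeps calling a function until it returns the sentinel value. Here is a minimal sketch of that idiom, assuming an in-memory io.BytesIO as a stand-in for a real file; the helper name iter_chunks is hypothetical. Note that the sentinel has to match the file mode: b"" for binary files, "" for text files.

```python
import io

def iter_chunks(file_object, chunk_size=1024, sentinel=b""):
    """Return an iterator of chunks that stops when read() returns the sentinel."""
    # iter(callable, sentinel) calls the callable repeatedly until it returns sentinel.
    return iter(lambda: file_object.read(chunk_size), sentinel)

buf = io.BytesIO(b"x" * 2500)  # stand-in for open(path, 'rb')
sizes = [len(chunk) for chunk in iter_chunks(buf)]
print(sizes)  # [1024, 1024, 452]
```

With the wrong sentinel (for example '' on a binary file) the loop would never end, because b'' never compares equal to ''.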


0 upvotes


        
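file.readlines() takes an optional size argument that approximates the amount of data read: no further lines are read once the total size of the lines returned so far exceeds it.

BUF_SIZE = 65536  # size hint; an assumed value, tune as needed
with open('bigfilename', 'r') as bigfile:
    tmp_lines = bigfile.readlines(BUF_SIZE)
    while tmp_lines:
        process(tmp_lines)
        tmp_lines = bigfile.readlines(BUF_SIZE)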
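To see how the size hint behaves, here is a small sketch that wraps the loop above in a generator. It assumes an io.StringIO in place of a real file, and the name batches_of_lines is hypothetical. As far as I know, readlines() only ever returns whole lines, so a line longer than the hint comes back intact rather than cut in half.

```python
import io

def batches_of_lines(file_object, size_hint=64 * 1024):
    """Yield lists of complete lines, each list roughly size_hint in total size."""
    while True:
        lines = file_object.readlines(size_hint)
        if not lines:  # an empty list means end of file
            break
        yield lines

text = "".join(f"row {i}\n" for i in range(1000))
batches = list(batches_of_lines(io.StringIO(text), size_hint=256))
print(len(batches), "batches,", sum(len(b) for b in batches), "lines in total")
```

The exact number of lines per batch depends on the line lengths and on the hint, so treat the hint as a memory budget rather than a line count.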


         
          
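This is a really great idea, especially when combined with defaultdict to split big data into smaller pieces.

Myers Carpenter:

I would recommend using .read() rather than .readlines(). If the file is binary, it will not have line breaks.

What if the file is one huge string?

This solution is buggy. If one of the lines is larger than your BUF_SIZE, you are going to process an incomplete line. @MattSom is correct.

@MyersCarpenter Will that line be repeated twice? tmp_lines = bigfile.readlines(BUF_SIZE)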


        
         
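There are already many good answers, but if your entire file is on a single line and you still want to process "rows" (as opposed to fixed-size blocks), these answers will not help you.

99% of the time, it is possible to process files line by line. Then, as suggested in this answer, you can use the file object itself as a lazy generator:

with open('big.csv') as f:
    for line in f:
        process(line)

However, one may run into very big files where the row separator is not '\n' (a common case is '|').
Converting '|' to '\n' before processing may not be an option, because it can mess up fields that legitimately contain '\n' (e.g. free-text user input).
Using the csv library is also ruled out because, at least in early versions of the library, it is hardcoded to read the input line by line.

For these kinds of situations, I created the following snippet [updated in May 2021 for Python 3.8+]: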

def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:
    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    row = ''
    while (chunk := f.read(chunksize)) != '':   # Loop until end of file
        while (i := chunk.find(sep)) != -1:     # Loop while a separator is found
            yield row + chunk[:i]
            chunk = chunk[i+1:]                 # Assumes a single-character separator
            row = ''
        row += chunk
    yield row

[For older versions of Python]:

def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:
    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    curr_row = ''
    while True:
        chunk = f.read(chunksize)
        if chunk == '':  # End of file
            yield curr_row
            break
        while True:
            i = chunk.find(sep)
            if i == -1:  # No separator left in this chunk
                break
            yield curr_row + chunk[:i]
            curr_row = ''
            chunk = chunk[i+1:]
        curr_row += chunk
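
I was able to use it successfully to solve various problems. It has been extensively tested with various chunk sizes. Here is the test suite I am using, for those who need to convince themselves:

import os

test_file = 'test_file'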
def cleanup(func):
    def wrapper(*args, **kwargs):
        func(*args, **kwargs)
        os.unlink(test_file)
    return wrapper
@cleanup
def test_empty(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1
@cleanup
def test_1_char_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2
@cleanup
def test_1_char(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1
@cleanup
def test_1025_chars_1_row(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1
@cleanup
def test_1024_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1023):
            f.write('a')
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2
@cleanup
def test_1025_chars_1026_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1026
@cleanup
def test_2048_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2
@cleanup
def test_2049_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2
if __name__ == '__main__':
    for chunksize in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
        test_empty(chunksize)
        test_1_char_2_rows(chunksize)
        test_1_char(chunksize)
        test_1025_chars_1_row(chunksize)
        test_1024_chars_2_rows(chunksize)
        test_1025_chars_1026_rows(chunksize)
        test_2048_chars_2_rows(chunksize)
        test_2049_chars_2_rows(chunksize)
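
The rows() snippet above advances by one character past each separator, so it assumes sep is a single character. As a hedged sketch (not the answer author's code), here is one way to support longer separators, including separators that straddle a chunk boundary. The name rows_multi is hypothetical, and io.StringIO stands in for a real file.

```python
import io

def rows_multi(f, chunksize=1024, sep="||"):
    """Lazily yield rows from f, split on a separator of any length."""
    buffer = ""
    while chunk := f.read(chunksize):
        buffer += chunk
        # Everything before the last separator is a run of complete rows;
        # the tail may hold a partial row or a partial separator, so keep it.
        *complete, buffer = buffer.split(sep)
        yield from complete
    yield buffer  # the final row (an empty string for an empty file)

data = "a||b|c||||d"
print(list(rows_multi(io.StringIO(data), chunksize=3)))  # ['a', 'b|c', '', 'd']
```

The trade-off is that the pending tail is rescanned on every chunk, which is fine for ordinary rows but slow if a single row is much larger than chunksize.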


        
         
          
           
            
             
              
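If your computer, OS and Python are 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation: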
             

import mmap

# assumes hello.txt already exists and contains b"Hello Python!\n"
with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()

If your computer, OS or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
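
For read-only access, a mapping can be searched without loading the whole file into a Python object. The following is a rough sketch under my own assumptions: the helper find_in_file is hypothetical, and a small temporary file is created so the example runs anywhere.

```python
import mmap
import os
import tempfile

def find_in_file(path, needle):
    """Return the byte offset of needle in the file, or -1 if it is absent."""
    with open(path, "rb") as f:
        # ACCESS_READ maps the file read-only; length 0 maps the whole file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)

# Build a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"header\n" + b"x" * 10_000 + b"NEEDLE" + b"y" * 10_000)
offset = find_in_file(tmp.name, b"NEEDLE")
os.unlink(tmp.name)
print(offset)  # 10007
```

The operating system pages data in on demand, which is why the file can be larger than physical memory on a 64-bit build.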


         
          
           
            
             
              
               
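Savino Sguera:

How is this supposed to work? What if I have a 32GB file? What if I am on a VM with only 256MB of RAM? Mmapping such a huge file is really never a good thing.

Phyo Arkar Lwin:

This answer deserves -12 votes. This will kill anyone using it for big files.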


         
          
           
            
             
              
               
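This can work on a 64-bit Python even for big files. Even though the file is memory-mapped, it is not read into memory, so the amount of physical memory can be much smaller than the file size.

@SavinoSguera Does the size of physical memory matter when mmap-ing a file?

@V3ss0n: I have tried to mmap a 32GB file on 64-bit Python. It works (I have less than 32GB of RAM): I can access the start, the middle, and the end of the file using both the sequence and the file interfaces.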


        
         
          
           
            
             
f = ...  # file-like object, i.e. supporting a read(size) method and
         # returning the empty string '' when there is nothing left to read

def chunked(file, chunk_size):
    return iter(lambda: file.read(chunk_size), '')

for data in chunked(f, 65536):
    process(data)  # process the data (process is a placeholder)

Update: the approach is best explained in https://stackoverflow.com/a/4566523/38592


         
          
           
            
             
              
               
                
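This works well for blobs, but may not be suitable for line-separated content (CSV, HTML, etc., where processing needs to be handled line by line).

Excuse me, what is the value of f?

@user1, it can be open('filename').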


        
         
          
           
            
             
              
               
                
                
Boris Verkhovskiy

Posted on 2015-06-18


        
         
          
           
            
             
              
               
In Python 3.8+ you can use .read() in a while loop:

with open("somefile.txt") as f:
    while chunk := f.read(8192):
        do_something(chunk)

Of course, you can use any chunk size you want; you don't have to use 8192 (2**13) characters (or bytes, if the file is opened in binary mode). Unless your file's size happens to be a multiple of your chunk size, the last chunk will be smaller than your chunk size.
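
A typical use of this loop is computing a checksum of a file that does not fit in memory. A minimal sketch, assuming a binary file object; the helper name sha256_of is hypothetical, and io.BytesIO stands in for open(path, 'rb'):

```python
import hashlib
import io

def sha256_of(file_object, chunk_size=8192):
    """Return the hex SHA-256 digest of a binary file, read chunk by chunk."""
    digest = hashlib.sha256()
    while chunk := file_object.read(chunk_size):
        digest.update(chunk)  # only one chunk is held in memory at a time
    return digest.hexdigest()

payload = b"abc" * 100_000
chunked_digest = sha256_of(io.BytesIO(payload))
print(chunked_digest == hashlib.sha256(payload).hexdigest())  # True
```

The digest does not depend on the chunk size, so the size can be chosen purely for throughput.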


        
         
          
           
            
             
              
               
                
Refer to Python's official documentation: https://docs.python.org/3/library/functions.html#iter

Maybe this method is more Pythonic:

"""A file object returned by open() is an iterator with a
read method that can specify the block size of each read."""
from functools import partial

with open('mydata.db', 'r') as f_in:
    block_read = partial(f_in.read, 1024 * 1024)
    block_iterator = iter(block_read, '')
    for index, block in enumerate(block_iterator, start=1):
        block = process_block(block)  # process your block data
        with open(f'{index}.txt', 'w') as f_out:
            f_out.write(block)


         
          
           
            
             
              
               
                
                 
                  
Leroy Scandal:

Bruce is correct. I use functools.partial to parse video streams. With py;py3, I can parse over 1GB a second. `for pkt in iter(partial(vid.read, PACKET_SIZE), b""):`


        
         
          
           
            
             
              
               
                
                 
                  
                  
TonyCoolZhu

Posted on 2015-06-18


        
         
          
           
            
             
              
               
                
                 
I think we can write it like this:

def read_file(path, block_size=1024):
    with open(path, 'rb') as f:
        while True:
            piece = f.read(block_size)
            if piece:
                yield piece
            else:
                return

for piece in read_file(path):
    process_piece(piece)


        
         
          
           
            
             
              
               
                
                 
                  
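I am not allowed to comment because of my low reputation, but SilentGhost's solution should be much easier with file.readlines([sizehint]).

python file methods

Edit: SilentGhost is right, but this should be better than:

# Python 2 code; in Python 3 use range() and next(file)
s = ""
for i in xrange(100):
    s += file.next()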


         
          
           
            
             
              
               
                
                 
                  
                   
                    
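OK, sorry, you are absolutely right. But maybe this solution will make you happier ;) : s = "" for i in xrange(100): s += file.next()

-1: Terrible solution. This would mean creating a new string in memory for each line and copying all the file data read so far into the new string. The worst performance and memory usage.

Why would it copy the entire file data into a new string? From the Python documentation: "In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer."

@sinzi: "s +=", or concatenating strings, makes a new copy of the string each time. Since strings are immutable, you are creating a new string.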


         
          
           
            
             
              
               
                
                 
                  
                   
                    
SilentGhost:

@nosklo: these are implementation details; a list comprehension can be used in its place.


        
         
          
           
            
             
              
               
                
                 
                  
                   
                    
                    
SilentGhost

Posted on 2015-06-18


        
         
          
           
            
             
              
               
                
                 
                  
                   
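I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) required is known:

def get_line():
    with open('4gb_file') as file:
        for i in file:
            yield i

lines_required = 100
gen = get_line()
chunk = [i for i, j in zip(gen, range(lines_required))]

Update: thanks, nosklo. Here's what I meant. It almost works, except that it loses a line "between" chunks.

chunk = [next(gen) for i in range(lines_required)]

does the trick without losing any lines, but it doesn't look very nice.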
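
The zip() version drops a line because zip pulls the next item from gen before it discovers that range() is exhausted, and that item is then thrown away. One alternative (my suggestion, not part of the answer) is itertools.islice, which takes exactly the requested number of items. The name line_batches is hypothetical, and io.StringIO stands in for the 4 GB file.

```python
import io
from itertools import islice

def line_batches(file_object, lines_required=100):
    """Yield lists of up to lines_required lines; the last list may be shorter."""
    while True:
        batch = list(islice(file_object, lines_required))
        if not batch:  # nothing left to read
            break
        yield batch

sample = io.StringIO("".join(f"line {i}\n" for i in range(250)))
all_batches = list(line_batches(sample))
print([len(batch) for batch in all_batches])  # [100, 100, 50]
```

Unlike the next(gen) list comprehension, this does not raise StopIteration when the file holds fewer lines than requested.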


         
          
           
            
             
              
               
                
                 
                  
                   
                    
                     
                      
Is this pseudocode? It is also unnecessarily confusing; you should make the number of lines an optional parameter of the get_line function.


        
         
          
           
            
             
              
               
                
                 
                  
                   
                    
                     
You can use the following code.

file_obj = open('big_file')

open() returns a file object.

Then use os.stat to get the size.
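
The answer stops before showing the os.stat call, so here is a minimal sketch of what it might look like. The helper name file_size is hypothetical, and a temporary file is created so the example is self-contained.

```python
import os
import tempfile

def file_size(path):
    """Return the size of the file at path in bytes, without reading it."""
    return os.stat(path).st_size

# Build a small throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 100)
size = file_size(tmp.name)
os.unlink(tmp.name)
print(size)  # 1000
```

Knowing the size up front is handy for choosing a chunk size or reporting progress while reading in chunks.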