已经有很多很好的答案了,但是如果你的整个文件都在一个行上,而且你还想处理 "行"(相对于固定大小的块),这些答案就帮不了你。
99%的情况下,可以逐行处理文件。然后,正如本文所建议的
答案
你可以使用文件对象本身作为懒惰发生器。
with open('big.csv') as f:
for line in f:
process(line)
然而,人们可能会遇到非常大的文件,其中的行分隔符不是'\n'
(一个常见的情况是'|'
)。
Converting '|'
to '\n'
before processing may not be an option because it can mess up fields which may legitimately contain '\n'
(e.g. free text user input).
Using the csv library is also ruled out because the fact that, at least in early versions of the lib, it is hardcoded to read the input line by line.
对于这种情况,我创建了以下代码段 [2021年5月更新,适用于Python 3.8+]。
def rows(f, chunksize=1024, sep='|'):
Read a file where the row separator is '|' lazily.
Usage:
>>> with open('big.csv') as f:
>>> for r in rows(f):
>>> process(r)
row = ''
while (chunk := f.read(chunksize)) != '': # End of file
while (i := chunk.find(sep)) != -1: # No separator found
yield row + chunk[:i]
chunk = chunk[i+1:]
row = ''
row += chunk
yield row
[For older versions of python]:
def rows(f, chunksize=1024, sep='|'):
Read a file where the row separator is '|' lazily.
Usage:
>>> with open('big.csv') as f:
>>> for r in rows(f):
>>> process(r)
curr_row = ''
while True:
chunk = f.read(chunksize)
if chunk == '': # End of file
yield curr_row
break
while True:
i = chunk.find(sep)
if i == -1:
break
yield curr_row + chunk[:i]
curr_row = ''
chunk = chunk[i+1:]
curr_row += chunk
我能够成功地使用它来解决各种问题。它已经被广泛地测试过了,有各种块状大小。下面是我使用的测试套件,供那些需要说服自己的人使用。
test_file = 'test_file'
def cleanup(func):
def wrapper(*args, **kwargs):
func(*args, **kwargs)
os.unlink(test_file)
return wrapper
@cleanup
def test_empty(chunksize=1024):
with open(test_file, 'w') as f:
f.write('')
with open(test_file) as f:
assert len(list(rows(f, chunksize=chunksize))) == 1
@cleanup
def test_1_char_2_rows(chunksize=1024):
with open(test_file, 'w') as f:
f.write('|')
with open(test_file) as f:
assert len(list(rows(f, chunksize=chunksize))) == 2
@cleanup
def test_1_char(chunksize=1024):
with open(test_file, 'w') as f:
f.write('a')
with open(test_file) as f:
assert len(list(rows(f, chunksize=chunksize))) == 1
@cleanup
def test_1025_chars_1_row(chunksize=1024):
with open(test_file, 'w') as f:
for i in range(1025):
f.write('a')
with open(test_file) as f:
assert len(list(rows(f, chunksize=chunksize))) == 1
@cleanup
def test_1024_chars_2_rows(chunksize=1024):
with open(test_file, 'w') as f:
for i in range(1023):
f.write('a')
f.write('|')
with open(test_file) as f:
assert len(list(rows(f, chunksize=chunksize))) == 2
@cleanup
def test_1025_chars_1026_rows(chunksize=1024):
with open(test_file, 'w') as f:
for i in range(1025):
f.write('|')
with open(test_file) as f:
assert len(list(rows(f, chunksize=chunksize))) == 1026
@cleanup
def test_2048_chars_2_rows(chunksize=1024):
with open(test_file, 'w') as f:
for i in range(1022):
f.write('a')
f.write('|')
f.write('a')
# -- end of 1st chunk --
for i in range(1024):
f.write('a')
# -- end of 2nd chunk
with open(test_file) as f:
assert len(list(rows(f, chunksize=chunksize))) == 2
@cleanup
def test_2049_chars_2_rows(chunksize=1024):
with open(test_file, 'w') as f:
for i in range(1022):
f.write('a')
f.write('|')
f.write('a')
# -- end of 1st chunk --
for i in range(1024):
f.write('a')
# -- end of 2nd chunk
f.write('a')
with open(test_file) as f:
assert len(list(rows(f, chunksize=chunksize))) == 2
if __name__ == '__main__':
for chunksize in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
test_empty(chunksize)
test_1_char_2_rows(chunksize)
test_1_char(chunksize)
test_1025_chars_1_row(chunksize)
test_1024_chars_2_rows(chunksize)
test_1025_chars_1026_rows(chunksize)
test_2048_chars_2_rows(chunksize)
test_2049_chars_2_rows(chunksize)