Thanks for the interesting task! I've implemented a pure numpy + pandas solution. It uses sorted arrays to keep the names and counts, hence the complexity of the algorithm should be around O(n * log n).
There is no hash table in numpy, and a hash table would certainly be faster (O(n)), so instead I used numpy's existing sorting/insertion routines.
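To make that concrete, here is a minimal sketch of the sorted-array counting technique on toy data (the sample names and variable names here are purely illustrative, they are not part of the script below):

import numpy as np

# Sorted table of names plus parallel counts. chr(0x10FFFF) is the largest
# Unicode code point and serves as a sentinel, so np.searchsorted can never
# return an out-of-bounds index.
vals = np.full([1], chr(0x10FFFF), dtype = np.str_)
cnts = np.zeros([1], dtype = np.int64)

batch, batch_cnts = np.unique(np.array(['Bob', 'Alice', 'Bob']), return_counts = True)
if vals.dtype.itemsize < batch.dtype.itemsize:
    vals = vals.astype(batch.dtype)               # widen fixed-width string dtype
poss = np.searchsorted(vals, batch)               # binary search, O(m * log n)
exist = vals[poss] == batch
cnts[poss[exist]] += batch_cnts[exist]            # known names: just add counts
new = np.flatnonzero(~exist)
vals = np.insert(vals, poss[new], batch[new])     # new names: insert, keeping order
cnts = np.insert(cnts, poss[new], batch_cnts[new])
print(vals[:-1], cnts[:-1])                       # ['Alice' 'Bob'] [1 2]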
I also used pandas' .read_csv() with the arguments iterator = True, chunksize = 1 << 24, which reads the file in chunks and yields a fixed-size pandas DataFrame for each chunk.
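In isolation the chunked reading looks like this minimal sketch (the file name 'huge.csv' is just a placeholder):

import pandas as pd

# Each iteration yields an ordinary DataFrame with at most chunksize rows,
# so the whole file never has to fit into memory at once.
for i, df in enumerate(pd.read_csv('huge.csv', iterator = True, chunksize = 1 << 24)):
    print(i, len(df))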
Note! On your first runs (until the program is fully debugged), set limit_chunks (the number of chunks to process) in the code to a small value (e.g. 5). This checks that the whole program runs correctly on a part of the data.
If you don't have the two packages, install them once with the command python -m pip install pandas numpy.
Progress is printed from time to time: the total number of megabytes done plus the speed.
The results are printed to the console and also saved to the file named by fname_res. All the constants configuring the script are placed at the beginning of the script. The topk constant controls how many of the top names are output to the file/console.
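The top-k selection itself is just an argsort over the counts array; a tiny sketch with made-up counts:

import numpy as np

cnts = np.array([5, 1, 9, 3])
topk = 2
order = np.flip(np.argsort(cnts))[:topk]  # indices of the topk largest counts, descending
print(order, cnts[order])                 # [2 0] [9 5]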
It would be interesting to hear how fast my solution is for you. If it is too slow, maybe I'll spend some time writing a nice pure-numpy HashTable class.
You can also try running the code below online here.
import os, math, time, sys
# Needs: python -m pip install pandas numpy
import pandas as pd, numpy as np
fname = 'test.csv'
fname_res = 'test.res'
chunk_size = 1 << 24
limit_chunks = None # Number of chunks to process; set to None to process the whole file
all_genders = ['Male', 'Female']
topk = 1000 # How many top names to output
progress_step = 1 << 23 # in bytes
fsize = os.path.getsize(fname)
# Per-gender tables: 'vals' is a sorted array of unique names, 'cnts' the parallel
# counts. chr(0x10FFFF) is the largest Unicode code point and acts as a sentinel,
# so np.searchsorted never returns an out-of-bounds index.
tables = {g : {
    'vals': np.full([1], chr(0x10FFFF), dtype = np.str_),
    'cnts': np.zeros([1], dtype = np.int64),
} for g in all_genders}
tb = time.time()
def Progress(
    done, total = min([fsize] + ([chunk_size * limit_chunks] if limit_chunks is not None else [])),
    cfg = {'progressed': 0, 'done': False}, # Mutable default keeps state between calls
):
    # Report at most once per progress_step bytes, plus a final line at the end
    if not cfg['done'] and (done - cfg['progressed'] >= progress_step or done >= total):
        if done < total:
            while cfg['progressed'] + progress_step <= done:
                cfg['progressed'] += progress_step
        else:
            cfg['progressed'] = total
        sys.stdout.write(
            f'{str(round(cfg["progressed"] / 2 ** 20)).rjust(5)} MiB of ' +
            f'{str(round(total / 2 ** 20)).rjust(5)} MiB ' +
            f'speed {round(cfg["progressed"] / 2 ** 20 / (time.time() - tb), 4)} MiB/sec\n'
        )
        sys.stdout.flush()
        if done >= total:
            cfg['done'] = True
with open(fname, 'rb', buffering = 1 << 26) as f:
    for i, df in enumerate(pd.read_csv(f, iterator = True, chunksize = chunk_size)):
        if limit_chunks is not None and i >= limit_chunks:
            break
        if i == 0:
            # Resolve column positions once, from the first chunk's header
            name_col = df.columns.get_loc('First Name')
            gender_col = df.columns.get_loc('Gender')
        names = np.array(df.iloc[:, name_col]).astype('str')
        genders = np.array(df.iloc[:, gender_col]).astype('str')
        for g in all_genders:
            ctab = tables[g]
            gnames = names[genders == g]
            # Unique names within this chunk and how often each occurs
            vals, cnts = np.unique(gnames, return_counts = True)
            if vals.size == 0:
                continue
            # Widen the stored fixed-width string dtype if this chunk has longer names
            if ctab['vals'].dtype.itemsize < names.dtype.itemsize:
                ctab['vals'] = ctab['vals'].astype(names.dtype)
            # Find each name's position in the sorted table
            poss = np.searchsorted(ctab['vals'], vals)
            exist = ctab['vals'][poss] == vals
            # Names already in the table: just add the counts
            ctab['cnts'][poss[exist]] += cnts[exist]
            # New names: insert them, keeping the table sorted
            nexist = np.flatnonzero(~exist)
            ctab['vals'] = np.insert(ctab['vals'], poss[nexist], vals[nexist])
            ctab['cnts'] = np.insert(ctab['cnts'], poss[nexist], cnts[nexist])
        Progress(f.tell())

Progress(fsize)
with open(fname_res, 'w', encoding = 'utf-8') as f:
    for g in all_genders:
        f.write(f'{g}:\n\n')
        print(g, '\n')
        # Indices of the topk largest counts, in descending order
        order = np.flip(np.argsort(tables[g]['cnts']))[:topk]
        snames, scnts = tables[g]['vals'][order], tables[g]['cnts'][order]
        if snames.size > 0:
            for n, c in zip(np.nditer(snames), np.nditer(scnts)):
                n, c = str(n), int(c)
                if c == 0:
                    continue
                f.write(f'{c} {n}\n')
                # The console may not handle all Unicode, hence replace non-ASCII chars
                print(c, n.encode('ascii', 'replace').decode('ascii'))
        f.write(f'\n')
        print()