Very large pandas DataFrame: keeping a count

Suppose this is a sample of my data (DataFrame).

The entire DataFrame is stored in a CSV file (dataframe.csv) that is 40 GB, so I can't open all of it at once. I am hoping to find the 25 most dominant names for all genders. My instinct is to create a for loop that runs through the file (because I can't open it all at once) and keep a Python dictionary that holds a counter for each name, which I increment as I go through the data. To be honest, I'm confused about where to even start with this (how to create the dictionary, since to_dict() does not appear to do what I'm looking for). Also, is this even a good solution? Is there a more efficient way someone can think of?

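Summary: sorry if the question is a bit long. The CSV file that stores the data is very large and I can't open it all at once, but I would like to find the top 25 most dominant names in the data. Any ideas on what to do and how to do it? I'd appreciate any help I can get! :)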

2 comments

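FatihAkici:

Welcome to SO. Two notes: 1. Please don't add pictures of anything. Instead, share reproducible code that regenerates your dataset. 2. Please share your effort and the code you have so far, no matter how bad it may be; don't worry.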

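segfaultshurtme:

I apologize. This is a sample I found online, because the data I'm working with is confidential, which unfortunately means I can't share it. So far my code has just been organizing things until I got to this point. I really don't have any code for this problem; I only have an idea, and I was hoping to get some input or help on it.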

python

pandas

dataframe

dictionary

segfaultshurtme

Posted on 2020-09-21

3 answers

Arty

Posted on 2020-09-21

Accepted

0 upvotes

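Thanks for the interesting task! I've implemented a pure numpy + pandas solution. It uses sorted arrays to hold the names and counts, so the algorithm's complexity should be around O(n * log n).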

numpy doesn't provide a hash table, which would certainly be faster (O(n)). So I used numpy's existing sorting/insertion routines instead.
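For comparison, the hash-table idea mentioned above can be sketched with Python's built-in `collections.Counter`, which is dictionary based. This is only a minimal sketch, not the approach used in the script below: the helper name `count_names` and the tiny in-memory sample are hypothetical, and the column names `First Name` and `Gender` are taken from the script. Note that `groupby` drops rows whose gender or name is missing by default.

```python
import io
from collections import Counter, defaultdict

import pandas as pd


def count_names(csv_source, chunksize=1_000_000):
    """Count first names per gender, one chunk at a time (hypothetical helper)."""
    counts = defaultdict(Counter)
    reader = pd.read_csv(
        csv_source, usecols=["First Name", "Gender"], chunksize=chunksize
    )
    for chunk in reader:
        # size() gives one count per (gender, name) pair found in this chunk
        per_pair = chunk.groupby(["Gender", "First Name"]).size()
        for (gender, name), n in per_pair.items():
            counts[gender][name] += int(n)
    return counts


# Tiny in-memory stand-in for the real 40 GB file (made-up data)
sample = io.StringIO(
    "First Name,Gender,Salary\n"
    "Linda,Female,100\n"
    "Linda,Female,200\n"
    "Mary,Female,50\n"
    "John,Male,70\n"
)
counts = count_names(sample, chunksize=2)
top_female = counts["Female"].most_common(25)
print(top_female)
```

Only the per-name counters stay in memory, so the memory use depends on the number of distinct names rather than on the file size.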

I also used pandas' .read_csv() with the iterator = True, chunksize = 1 << 24 parameters, which reads the file in chunks and yields a fixed-size pandas DataFrame for each chunk.
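A small illustration of how chunked reading behaves, using a made-up ten-row CSV held in memory. `chunksize` is measured in rows, not bytes, so `1 << 24` means roughly 16.8 million rows per chunk; passing `chunksize` alone should already return an iterator of DataFrames.

```python
import io

import pandas as pd

# Made-up data: a header line plus ten rows
csv_text = "First Name,Gender\n" + "".join(f"Name{i},Male\n" for i in range(10))

chunk_lengths = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Each chunk is an ordinary DataFrame with at most `chunksize` rows
    chunk_lengths.append(len(chunk))

print(chunk_lengths)
```

The last chunk is simply shorter when the row count is not a multiple of `chunksize`.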

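Note! On the first run (until the program is debugged), set limit_chunks (the number of chunks to process) in the code to a small value (such as 5). This lets you check that the whole program runs correctly on part of the data.

If you don't have these two packages, run the command python -m pip install pandas numpy once to install them.

Progress is printed once in a while: the total number of mebibytes processed plus the speed.

The result is printed to the console and saved to the file named in fname_res. All constants that configure the script are placed at the beginning of the script. The topk constant controls how many top names are output to the file/console.

It would be interesting to know how fast my solution is. If it is too slow, maybe I'll spend some time writing a nice HashTable class in pure numpy.

You can also try running the code below here online.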

import os, math, time, sys
# Needs: python -m pip install pandas numpy
import pandas as pd, numpy as np
fname = 'test.csv'
fname_res = 'test.res'
chunk_size = 1 << 24
limit_chunks = None # Number of chunks to process, set to None if to process whole file
all_genders = ['Male', 'Female']
topk = 1000 # How many top names to output
progress_step = 1 << 23 # in bytes
fsize = os.path.getsize(fname)
#el_man = enlighten.get_manager() as el_man
#el_ctr = el_man.counter(color = 'green', total = math.ceil(fsize / 2 ** 20), unit = 'MiB', leave = False)
tables = {g : {
    'vals': np.full([1], chr(0x10FFFF), dtype = np.str_),
    'cnts': np.zeros([1], dtype = np.int64),
} for g in all_genders}
tb = time.time()
def Progress(
    done, total = min([fsize] + ([chunk_size * limit_chunks] if limit_chunks is not None else [])),
    # Mutable default on purpose: it keeps the progress state between calls
    cfg = {'progressed': 0, 'done': False},
):
    if not cfg['done'] and (done - cfg['progressed'] >= progress_step or done >= total):
        if done < total:
            while cfg['progressed'] + progress_step <= done:
                cfg['progressed'] += progress_step
        else:
            cfg['progressed'] = total
        sys.stdout.write(
            f'{str(round(cfg["progressed"] / 2 ** 20)).rjust(5)} MiB of ' +
            f'{str(round(total / 2 ** 20)).rjust(5)} MiB ' +
            f'speed {round(cfg["progressed"] / 2 ** 20 / (time.time() - tb), 4)} MiB/sec\n'
        )
        sys.stdout.flush()
        if done >= total:
            cfg['done'] = True
with open(fname, 'rb', buffering = 1 << 26) as f:
    for i, df in enumerate(pd.read_csv(f, iterator = True, chunksize = chunk_size)):
        if limit_chunks is not None and i >= limit_chunks:
            break
        if i == 0:
            name_col = df.columns.get_loc('First Name')
            gender_col = df.columns.get_loc('Gender')
        names = np.array(df.iloc[:, name_col]).astype('str')
        genders = np.array(df.iloc[:, gender_col]).astype('str')
        for g in all_genders:
            ctab = tables[g]
            gnames = names[genders == g]
            vals, cnts = np.unique(gnames, return_counts = True)
            if vals.size == 0:
                continue
            if ctab['vals'].dtype.itemsize < names.dtype.itemsize:
                ctab['vals'] = ctab['vals'].astype(names.dtype)
            poss = np.searchsorted(ctab['vals'], vals)
            exist = ctab['vals'][poss] == vals
            ctab['cnts'][poss[exist]] += cnts[exist]
            nexist = np.flatnonzero(~exist)
            ctab['vals'] = np.insert(ctab['vals'], poss[nexist], vals[nexist])
            ctab['cnts'] = np.insert(ctab['cnts'], poss[nexist], cnts[nexist])
        Progress(f.tell())
    Progress(fsize)
with open(fname_res, 'w', encoding = 'utf-8') as f:
    for g in all_genders:
        f.write(f'{g}:\n\n')
        print(g, '\n')
        order = np.flip(np.argsort(tables[g]['cnts']))[:topk]
        snames, scnts = tables[g]['vals'][order], tables[g]['cnts'][order]
        if snames.size > 0:
            for n, c in zip(np.nditer(snames), np.nditer(scnts)):
                n, c = str(n), int(c)
                if c == 0:
                    continue
                f.write(f'{c} {n}\n')
                print(c, n.encode('ascii', 'replace').decode('ascii'))
        f.write(f'\n')
        print()


         
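I forgot to handle NaN values in the table/CSV. To do that, I need to know how they are stored in the CSV, which string represents them, and so on. I also assumed there are only two genders, Male and Female. You can add other genders (including the string that represents a NaN gender) to the gender list at the beginning of the script. If NaN is represented in the CSV by the string "NaN", everything should work; just add a "NaN" gender. A "NaN" name will then also appear in the output.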
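To see how missing values arrive from the CSV, a small check like the following may help. It is only a sketch on made-up data; the label "Unknown" is an arbitrary choice. With pandas' default parsing, both the string "NaN" and an empty field are read as missing values, so in the script above `.astype('str')` would likely turn them into the lowercase string 'nan', which may be the label to add to all_genders.

```python
import io

import pandas as pd

# Made-up sample: one normal row, one "NaN" gender, one empty gender
sample = io.StringIO(
    "First Name,Gender\n"
    "Linda,Female\n"
    "Alex,NaN\n"
    "Sam,\n"
)
df = pd.read_csv(sample)

# With default settings, "NaN" and the empty field are both parsed as missing
missing = int(df["Gender"].isna().sum())

# Give missing genders an explicit label so they are counted like any other group
labels = df["Gender"].fillna("Unknown")
per_gender = labels.value_counts().to_dict()
print(missing, per_gender)
```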


         
segfaultshurtme:


         
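Thank you for your help! If I wanted to get a summary of all salaries by name (how much do all the Lindas earn? etc.) to see which name earns the most, how would I do that? I'd like it to be somewhat automated, so that I can find the highest-earning names without looking up each name's income.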
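One possible way to approach the salary question, sketched under assumptions: the column name `Salary` is a guess (the sample data is not shown here), the helper `total_salary_by_name` is hypothetical, and "highest earning" is taken to mean the largest total per name. An average per name would also need a running row count.

```python
import io

import pandas as pd


def total_salary_by_name(csv_source, chunksize=1_000_000):
    """Sum salaries per first name across chunks (hypothetical helper)."""
    totals = None
    reader = pd.read_csv(
        csv_source, usecols=["First Name", "Salary"], chunksize=chunksize
    )
    for chunk in reader:
        part = chunk.groupby("First Name")["Salary"].sum()
        # add() aligns on the name index; fill_value=0 covers names seen in only one side
        totals = part if totals is None else totals.add(part, fill_value=0)
    return totals.sort_values(ascending=False)


# Made-up data standing in for the real file
sample = io.StringIO(
    "First Name,Gender,Salary\n"
    "Linda,Female,100\n"
    "Linda,Female,200\n"
    "Mary,Female,50\n"
    "John,Male,70\n"
)
totals = total_salary_by_name(sample, chunksize=2)
print(totals.head(25))
```

The first entries of the sorted result are the names with the largest combined salary.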


0 upvotes


import pandas as pd
df = pd.read_csv("sample_data.csv")
print(df['First Name'].value_counts())
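The snippet above loads the whole file at once, which may not be possible for a 40 GB CSV. A chunked variant of the same `value_counts()` idea could look like this sketch; the helper name `chunked_value_counts` and the in-memory sample are hypothetical, and the input is assumed to contain at least one data row.

```python
import io

import pandas as pd


def chunked_value_counts(csv_source, column, chunksize=1_000_000):
    """Accumulate value_counts() over chunks of a CSV (hypothetical helper)."""
    total = None
    for chunk in pd.read_csv(csv_source, usecols=[column], chunksize=chunksize):
        part = chunk[column].value_counts()
        # Aligning on the index introduces floats, so convert back to int at the end
        total = part if total is None else total.add(part, fill_value=0)
    return total.astype("int64").sort_values(ascending=False)


# Made-up data standing in for sample_data.csv
sample = io.StringIO(
    "First Name,Gender\n"
    "Linda,Female\n"
    "Linda,Female\n"
    "Mary,Female\n"
    "John,Male\n"
)
name_counts = chunked_value_counts(sample, "First Name", chunksize=2)
print(name_counts.head(25))
```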