相关文章推荐
帅气的葡萄  ·  Python ...·  3 周前    · 
温暖的香烟  ·  Python reportlab表格 ...·  1 周前    · 
无邪的丝瓜  ·  Html5 ...·  11 月前    · 
性感的鸵鸟  ·  PHP ...·  1 年前    · 

如何在一列中填充空值?

1 人不认可

我需要读取一个csv文件,并根据关系类型和LastName在 "Phone,Email,address,city,state,Zip "列中填充空值/空值,然后写入一个新的csv文件。 例子。如果一个人的关系类型是 "雇主",他的家属有相同的姓氏,如果家属没有 "电话,电子邮件,地址,城市,州,邮编",我应该用雇主的 "电话,电子邮件,地址,城市,州,邮编 "填写空值。我面临两个主要问题:1)不同的人/家庭的姓氏可能是相同的(即第5-10行),所以只要 "关系 "变为 "雇主",循环就应该中断并继续进行;2)在某些情况下,家属的姓氏不会与雇主的姓氏相同,但如果它介于相同姓氏之间,我们仍然应该填写空值(即第8行)。

我将无法使用python pandas,因为其余的代码/程序纯粹是基于python 2.7的。

输入格式/表格看起来像下面的空单元格(csv文件)。

[FirstName  LastName    DoB Relationship    Phone   Email   Address City    State   Zip
Hannah  Kahnw   9/12/1972   Employer    1457871452  hann@gmail.com  Han Ave hannas  UT  563425
Michel  Kahnw   2/9/1993    Dependent                       
Jonaas  Kahnw   2/22/1997   Dependent                       
Mikkel  Nielsen 1/25/1976   Employer    4509213887  mik@yah.com 887 Street  neil    NY  72356
Magnus  Nielsen 9/20/1990   Dependent                       
Ulrich  Nielsen 9/12/1983   Employer    7901234516  ulric@mail.com  Ulric Build mavric  KS  421256
kathari Nielsen 10/2/2003   Dependent                       
kathy   storm   12/12/1999  Dependent                       
kiiten  Nielsen 6/21/1999   Dependent                       
Elisab  Doppler 2/22/1987   Employer    5439001211  elop@amaz.com   Elisa apart Elis    AR  758475
Peterp  Doppler 1/25/1977   Employer    6847523758  per@dail.com    park Ave    Pete    PT  415253
bartos  Doppler 9/21/1990   Dependent][1]                       

输出格式应如下。

FirstName   LastName    DoB Relationship    Phone   Email   Address City    State   Zip
Hannah  Kahnw   9/12/1972   Employer    1457871452  hann@gmail.com  Han Ave hannas  UT  563425
Michel  Kahnw   2/9/1993    Dependent   1457871453  hann@gmail.com  Han Ave hannas  UT  563426
Jonaas  Kahnw   2/22/1997   Dependent   1457871454  hann@gmail.com  Han Ave hannas  UT  563427
Mikkel  Nielsen 1/25/1976   Employer    4509213887  mik@yah.com 887 Street  neil    NY  72356
Magnus  Nielsen 9/20/1990   Dependent   4509213888  mik@yah.com 888 Street  neil    NY  72357
Ulrich  Nielsen 9/12/1983   Employer    7901234516  ulric@mail.com  Ulric Build mavric  KS  421256
kathari Nielsen 10/2/2003   Dependent   7901234517  ulric@mail.com  Ulric Build mavric  KS  421257
kathy   storm   12/12/1999  Dependent   7901234518  ulric@mail.com  Ulric Build mavric  KS  421258
kiiten  Nielsen 6/21/1999   Dependent   7901234519  ulric@mail.com  Ulric Build mavric  KS  421259
Elisab  Doppler 2/22/1987   Employer    5439001211  elop@amaz.com   Elisa apart Elis    AR  758475
Peterp  Doppler 1/25/1977   Employer    6847523758  per@dail.com    park Ave    Pete    PT  415253
bartos  Doppler 9/21/1990   Dependent   6847523759  per@dail.com    park Ave    Pete    PT  415254
    # Read data from file and convert to list of namedtuples
    # dictionary to use to fill in missing information from others.
    with open(file_path, 'rb') as fin:
        csv_reader =  csv.reader(fin, skipinitialspace=True)
        header = next(csv_reader)
        Record = namedtuple('Record', header)
        addr_dict = {}
        data = [header]
        for rec in (Record._make(row) for row in csv_reader):
            if rec.Email or rec.Phone or rec.Address or rec.City or rec.State or rec.Zip:
                addr_dict.setdefault(rec.LastName, []).append(rec)  # Remember it.
    # Try to fill in missing data from any other records with same Address.
    for i, row in enumerate(data[1:], 1):
        if not (row.Phone and row.Email and rec.Address and rec.City and rec.State and rec.Zip):  # Info missing?
            # Try to copy it from others at same address.
            updated = False
            for other in addr_dict.get(row.LastName, []):
                if not row.Phone and other.Phone:
                    row = row._replace(Phone=other.Phone)
                    updated = True
                if not row.Email and other.Email:
                    row = row._replace(Email=other.Email)
                    updated = True
                if not row.Address and other.Address:
                    row = row._replace(Address=other.Address)
                    updated = True
                if not row.City and other.City:
                    row = row._replace(City=other.City)
                    updated = True
                if not row.Zip and other.Zip:
                    row = row._replace(Zip=other.Zip)
                    updated = True
                if row.Phone and row.Email and rec.Address and rec.City and rec.State and rec.Zip:  # Info now filled in?
                    break
            if updated:
                data[i] = row
    return data
INPUT_FILE = 'null_cols.csv'
OUTPUT_FILE = 'fill_cols.csv'
data = get_info(INPUT_FILE)
with open(OUTPUT_FILE, 'wb') as fout:
    writer = csv.DictWriter(fout, data[0])  # First elem has column names.
    writer.writeheader()
    for row in data[1:]:
        writer.writerow(row._asdict())
#(i got this code from earlier question which i asked in S.O This script doesn't include relationshiptype logic and also it doesn't consider the Duplicate LastName issue)

谢谢你的帮助!!

2 个评论
Mit
你能不能添加一个格式化的数据框架样本,我真的不知道哪个记录属于哪个列。从那里我可以帮你写一个代码来做。
Roy
@mit 你能告诉我如何添加格式化的样本吗?我总是遇到这样的麻烦,现在我只是从csv文件中复制粘贴输入数据。如果你是指pandas数据框架?我没有使用pandas/Jupyter,我使用的是Pycharm,我是这方面的新手。
python
python-2.7
csv
parsing
Roy
Roy
发布于 2020-05-05
1 个回答
Mit
Mit
发布于 2020-05-05
已采纳
0 人赞同

好的,有一个简单的方法,用fflf做这个,如下。 然而,这只有在受抚养人总是按行顺序跟随其雇主的情况下才会起作用。 这就是你提供的数据样本的情况,因此下面的代码可以工作。

import pandas as pd
#read in your csv file
df = pd.read_csv('fileName.csv')
#loop over columns and replace nans with the most recent value available
for c in df.columns: