pandas与numpy之间的数据对应关系。

重点介绍object,int64,float64,datetime64,bool等几种类型,category与timedelta两种类型这里不做介绍。

import numpy as np
import pandas as pd
# 从csv文件读取数据,数据表格中只有5行,里面包含了float,string,int三种数据python类型,也就是分别对应的pandas的float64,object,int64
df = pd.read_csv("sales_data_types.csv", index_col=0)
print(df)
   Customer Number     Customer Name         2016          2017  \
0          10002.0  Quest Industries  $125,000.00    $162500.00   
1         552278.0    Smith Plumbing  $920,000.00  $101,2000.00   
2          23477.0   ACME Industrial   $50,000.00     $62500.00   
3          24900.0        Brekke LTD  $350,000.00    $490000.00   
4         651029.0         Harbor Co   $15,000.00     $12750.00   
  Percent Growth Jan Units  Month  Day  Year Active  
0         30.00%       500      1   10  2015      Y  
1         10.00%       700      6   15  2014      Y  
2         25.00%       125      3   29  2016      Y  
3          4.00%        75     10   27  2015      Y  
4        -15.00%    Closed      2    2  2014      N  
dtype: object
# 想得到2016年与2017年的数据总和,直接相加不是我们需要的答案,因为这两列中的数据类型是object,执行该操作之后,得到是一个更加长的字符串,
# 通过df.info() 来获得关于数据框的更多的详细信息
df['2016']+df['2017']
0      $125,000.00$162500.00
1    $920,000.00$101,2000.00
2        $50,000.00$62500.00
3      $350,000.00$490000.00
4        $15,000.00$12750.00
dtype: object
df.info()
# Customer Number 列是float64,然而应该是int64
# 2016 2017两列的数据是object,并不是float64或者int64格式
# Percent以及Jan Units 也是objects而不是数字格式
# Month,Day以及Year应该转化为datetime64[ns]格式
# Active 列应该是布尔值
# 如果不做数据清洗,很难进行下一步的数据分析,为了进行数据格式的转化,pandas里面有三种比较常用的方法
# 1. astype()强制转化数据类型
# 2. 通过创建自定义的函数进行数据转化
# 3. pandas提供的to_nueric()以及to_datetime()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 10 columns):
Customer Number    5 non-null float64
Customer Name      5 non-null object
2016               5 non-null object
2017               5 non-null object
Percent Growth     5 non-null object
Jan Units          5 non-null object
Month              5 non-null int64
Day                5 non-null int64
Year               5 non-null int64
Active             5 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 440.0+ bytes
# 想要真正的改变数据框,通常需要通过赋值来进行,比如
df["Customer Number"] = df["Customer Number"].astype("int")
print(df)
print ('{:-^70}'.format('转换后的类型:'))
print(df.dtypes)
   Customer Number     Customer Name         2016          2017  \
0            10002  Quest Industries  $125,000.00    $162500.00   
1           552278    Smith Plumbing  $920,000.00  $101,2000.00   
2            23477   ACME Industrial   $50,000.00     $62500.00   
3            24900        Brekke LTD  $350,000.00    $490000.00   
4           651029         Harbor Co   $15,000.00     $12750.00   
  Percent Growth Jan Units  Month  Day  Year Active  
0         30.00%       500      1   10  2015      Y  
1         10.00%       700      6   15  2014      Y  
2         25.00%       125      3   29  2016      Y  
3          4.00%        75     10   27  2015      Y  
4        -15.00%    Closed      2    2  2014      N  
-------------------------------转换后的类型:--------------------------------
Customer Number     int32
Customer Name      object
2016               object
2017               object
Percent Growth     object
Jan Units          object
Month               int64
Day                 int64
Year                int64
Active             object
dtype: object
# 然后像2016,2017 Percent Growth,Jan Units 这几列带有特殊符号的object是不能直接通过astype("flaot)方法进行转化的,
# 这与python中的字符串转化为浮点数,都要求原始的字符都只能含有数字本身,不能含有其他的特殊字符
# 我们可以试着将将Active列转化为布尔值,看一下到底会发生什么,五个结果全是True,说明并没有起到什么作用
df["Active"].astype("bool")
0    True
1    True
2    True
3    True
4    True
Name: Active, dtype: bool
df['2016'].astype('float')
ValueError: could not convert string to float: '$15,000.00'

以上说明:

  • 如果数据是纯净的数据,可以转化为数字。
  • astype 两种作用,数字转化为单纯字符串,单纯数字的字符串转化为数字,含非数字的字符串不能通过 astype 转化。
  • 3. 自定义函数清理数据

    通过下面函数将货币进行转化:

    def convert_currency(var):
        convert the string number to a float
        _ 去除$
        - 去除逗号,
        - 转化为浮点数类型
        new_value = var.replace(",","").replace("$","")
        return float(new_value)
    # 通过replace函数将$以及逗号去掉,然后字符串转化为浮点数,让pandas选择pandas认为合适的特定类型,float或者int,该例子中将数据转化为了float64
    # 通过pandas中的apply函数将2016列中的数据全部转化
    df["2016"].apply(convert_currency)
    0    125000.0
    1    920000.0
    2     50000.0
    3    350000.0
    4     15000.0
    Name: 2016, dtype: float64
    # 当然可以通过lambda 函数将这个比较简单的函数一行带过
    df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
    0    125000.0
    1    920000.0
    2     50000.0
    3    350000.0
    4     15000.0
    Name: 2016, dtype: float64
    #同样可以利用lambda表达式将PercentGrowth进行数据清理
    df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100
    0    0.30
    1    0.10
    2    0.25
    3    0.04
    4   -0.15
    Name: Percent Growth, dtype: float64
    # 同样可以通过自定义函数进行解决,结果同上
    # 最后一个自定义函数是利用np.where() function 将Active 列转化为布尔值。
    df["Active"] = np.where(df["Active"] == "Y", True, False)
    df["Active"]
    0     True
    1     True
    2     True
    3     True
    4    False
    Name: Active, dtype: bool
    # 此时可查看一下数据格式
    df["2016"]=df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
    df["2017"]=df["2017"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
    df["Percent Growth"]=df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100
    df.dtypes
    Customer Number      int32
    Customer Name       object
    2016               float64
    2017               float64
    Percent Growth     float64
    Jan Units           object
    Month                int64
    Day                  int64
    Year                 int64
    Active                bool
    dtype: object
    # 再次查看DataFrame
    # 此时只有Jan Units中格式需要转化,以及年月日的合并,可以利用pandas中自带的几个函数进行处理
    print(df)
       Customer Number     Customer Name      2016       2017  Percent Growth  \
    0            10002  Quest Industries  125000.0   162500.0            0.30   
    1           552278    Smith Plumbing  920000.0  1012000.0            0.10   
    2            23477   ACME Industrial   50000.0    62500.0            0.25   
    3            24900        Brekke LTD  350000.0   490000.0            0.04   
    4           651029         Harbor Co   15000.0    12750.0           -0.15   
      Jan Units  Month  Day  Year Active  
    0       500      1   10  2015   True  
    1       700      6   15  2014   True  
    2       125      3   29  2016   True  
    3        75     10   27  2015   True  
    4    Closed      2    2  2014  False  
    
    # pandas中pd.to_numeric()处理Jan Units中的数据
    pd.to_numeric(df["Jan Units"],errors='coerce').fillna(0)
    0    500.0
    1    700.0
    2    125.0
    3     75.0
    4      0.0
    Name: Jan Units, dtype: float64
    # 最后利用pd.to_datatime()将年月日进行合并
    pd.to_datetime(df[['Month', 'Day', 'Year']])
    0   2015-01-10
    1   2014-06-15
    2   2016-03-29
    3   2015-10-27
    4   2014-02-02
    dtype: datetime64[ns]
    # 做到这里不要忘记重新赋值,否则原始数据并没有变化
    df["Jan Units"] = pd.to_numeric(df["Jan Units"],errors='coerce')
    df["Start_date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
    df
     Customer NumberCustomer Name20162017Percent GrowthJan UnitsMonthDayYearActiveStart_date 10002 Quest Industries 125000.0 162500.0 500.0 2015-01-10 552278 Smith Plumbing 920000.0 1012000.0 700.0 2014-06-15 23477 ACME Industrial 50000.0 62500.0 125.0 2016-03-29 24900 Brekke LTD 350000.0 490000.0 2015-10-27 651029 Harbor Co 15000.0 12750.0 -0.15 False 2014-02-02
    Customer Number             int32
    Customer Name              object
    2016                      float64
    2017                      float64
    Percent Growth            float64
    Jan Units                 float64
    Month                       int64
    Day                         int64
    Year                        int64
    Active                       bool
    Start_date         datetime64[ns]
    dtype: object
    # 将这些转化整合在一起
    def convert_percent(val):
        Convert the percentage string to an actual floating point percent
        - Remove %
        - Divide by 100 to make decimal
        new_val = val.replace('%', '')
        return float(new_val) / 100
    df_2 = pd.read_csv("sales_data_types.csv",dtype={"Customer_Number":"int"},index_col=0,converters={
        "2016":convert_currency,
        "2017":convert_currency,
        "Percent Growth":convert_percent,
        "Jan Units":lambda x:pd.to_numeric(x,errors="coerce"),
        "Active":lambda x: np.where(x=="Y",True,False)
    
    df_2.dtypes
    Customer Number    float64
    Customer Name       object
    2016               float64
    2017               float64
    Percent Growth     float64
    Jan Units          float64
    Month                int64
    Day                  int64
    Year                 int64
    Active              object
    dtype: object
     Customer NumberCustomer Name20162017Percent GrowthJan UnitsMonthDayYearActive 10002.0 Quest Industries 125000.0 162500.0 500.0 552278.0 Smith Plumbing 920000.0 1012000.0 700.0 23477.0 ACME Industrial 50000.0 62500.0 125.0 24900.0 Brekke LTD 350000.0 490000.0 651029.0 Harbor Co 15000.0 12750.0 -0.15 False