pandas与numpy之间的数据对应关系。
重点介绍object,int64,float64,datetime64,bool等几种类型,category与timedelta两种类型这里不做介绍。
import numpy as np
import pandas as pd
# 从csv文件读取数据,数据表格中只有5行,里面包含了float,string,int三种数据python类型,也就是分别对应的pandas的float64,object,int64
df = pd.read_csv("sales_data_types.csv", index_col=0)
print(df)
Customer Number Customer Name 2016 2017 \
0 10002.0 Quest Industries $125,000.00 $162500.00
1 552278.0 Smith Plumbing $920,000.00 $101,2000.00
2 23477.0 ACME Industrial $50,000.00 $62500.00
3 24900.0 Brekke LTD $350,000.00 $490000.00
4 651029.0 Harbor Co $15,000.00 $12750.00
Percent Growth Jan Units Month Day Year Active
0 30.00% 500 1 10 2015 Y
1 10.00% 700 6 15 2014 Y
2 25.00% 125 3 29 2016 Y
3 4.00% 75 10 27 2015 Y
4 -15.00% Closed 2 2 2014 N
dtype: object
# 想得到2016年与2017年的数据总和,直接相加不是我们需要的答案,因为这两列中的数据类型是object,执行该操作之后,得到是一个更加长的字符串,
# 通过df.info() 来获得关于数据框的更多的详细信息
df['2016']+df['2017']
0 $125,000.00$162500.00
1 $920,000.00$101,2000.00
2 $50,000.00$62500.00
3 $350,000.00$490000.00
4 $15,000.00$12750.00
dtype: object
df.info()
# Customer Number 列是float64,然而应该是int64
# 2016 2017两列的数据是object,并不是float64或者int64格式
# Percent以及Jan Units 也是objects而不是数字格式
# Month,Day以及Year应该转化为datetime64[ns]格式
# Active 列应该是布尔值
# 如果不做数据清洗,很难进行下一步的数据分析,为了进行数据格式的转化,pandas里面有三种比较常用的方法
# 1. astype()强制转化数据类型
# 2. 通过创建自定义的函数进行数据转化
# 3. pandas提供的to_nueric()以及to_datetime()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 10 columns):
Customer Number 5 non-null float64
Customer Name 5 non-null object
2016 5 non-null object
2017 5 non-null object
Percent Growth 5 non-null object
Jan Units 5 non-null object
Month 5 non-null int64
Day 5 non-null int64
Year 5 non-null int64
Active 5 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 440.0+ bytes
# 想要真正的改变数据框,通常需要通过赋值来进行,比如
df["Customer Number"] = df["Customer Number"].astype("int")
print(df)
print ('{:-^70}'.format('转换后的类型:'))
print(df.dtypes)
Customer Number Customer Name 2016 2017 \
0 10002 Quest Industries $125,000.00 $162500.00
1 552278 Smith Plumbing $920,000.00 $101,2000.00
2 23477 ACME Industrial $50,000.00 $62500.00
3 24900 Brekke LTD $350,000.00 $490000.00
4 651029 Harbor Co $15,000.00 $12750.00
Percent Growth Jan Units Month Day Year Active
0 30.00% 500 1 10 2015 Y
1 10.00% 700 6 15 2014 Y
2 25.00% 125 3 29 2016 Y
3 4.00% 75 10 27 2015 Y
4 -15.00% Closed 2 2 2014 N
-------------------------------转换后的类型:--------------------------------
Customer Number int32
Customer Name object
2016 object
2017 object
Percent Growth object
Jan Units object
Month int64
Day int64
Year int64
Active object
dtype: object
# 然后像2016,2017 Percent Growth,Jan Units 这几列带有特殊符号的object是不能直接通过astype("flaot)方法进行转化的,
# 这与python中的字符串转化为浮点数,都要求原始的字符都只能含有数字本身,不能含有其他的特殊字符
# 我们可以试着将将Active列转化为布尔值,看一下到底会发生什么,五个结果全是True,说明并没有起到什么作用
df["Active"].astype("bool")
0 True
1 True
2 True
3 True
4 True
Name: Active, dtype: bool
df['2016'].astype('float')
ValueError: could not convert string to float: '$15,000.00'
以上说明:
如果数据是纯净的数据,可以转化为数字。
astype 两种作用,数字转化为单纯字符串,单纯数字的字符串转化为数字,含非数字的字符串不能通过 astype 转化。
3. 自定义函数清理数据
通过下面函数将货币进行转化:
def convert_currency(var):
convert the string number to a float
_ 去除$
- 去除逗号,
- 转化为浮点数类型
new_value = var.replace(",","").replace("$","")
return float(new_value)
# 通过replace函数将$以及逗号去掉,然后字符串转化为浮点数,让pandas选择pandas认为合适的特定类型,float或者int,该例子中将数据转化为了float64
# 通过pandas中的apply函数将2016列中的数据全部转化
df["2016"].apply(convert_currency)
0 125000.0
1 920000.0
2 50000.0
3 350000.0
4 15000.0
Name: 2016, dtype: float64
# 当然可以通过lambda 函数将这个比较简单的函数一行带过
df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
0 125000.0
1 920000.0
2 50000.0
3 350000.0
4 15000.0
Name: 2016, dtype: float64
#同样可以利用lambda表达式将PercentGrowth进行数据清理
df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100
0 0.30
1 0.10
2 0.25
3 0.04
4 -0.15
Name: Percent Growth, dtype: float64
# 同样可以通过自定义函数进行解决,结果同上
# 最后一个自定义函数是利用np.where() function 将Active 列转化为布尔值。
df["Active"] = np.where(df["Active"] == "Y", True, False)
df["Active"]
0 True
1 True
2 True
3 True
4 False
Name: Active, dtype: bool
# 此时可查看一下数据格式
df["2016"]=df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
df["2017"]=df["2017"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
df["Percent Growth"]=df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100
df.dtypes
Customer Number int32
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units object
Month int64
Day int64
Year int64
Active bool
dtype: object
# 再次查看DataFrame
# 此时只有Jan Units中格式需要转化,以及年月日的合并,可以利用pandas中自带的几个函数进行处理
print(df)
Customer Number Customer Name 2016 2017 Percent Growth \
0 10002 Quest Industries 125000.0 162500.0 0.30
1 552278 Smith Plumbing 920000.0 1012000.0 0.10
2 23477 ACME Industrial 50000.0 62500.0 0.25
3 24900 Brekke LTD 350000.0 490000.0 0.04
4 651029 Harbor Co 15000.0 12750.0 -0.15
Jan Units Month Day Year Active
0 500 1 10 2015 True
1 700 6 15 2014 True
2 125 3 29 2016 True
3 75 10 27 2015 True
4 Closed 2 2 2014 False
# pandas中pd.to_numeric()处理Jan Units中的数据
pd.to_numeric(df["Jan Units"],errors='coerce').fillna(0)
0 500.0
1 700.0
2 125.0
3 75.0
4 0.0
Name: Jan Units, dtype: float64
# 最后利用pd.to_datatime()将年月日进行合并
pd.to_datetime(df[['Month', 'Day', 'Year']])
0 2015-01-10
1 2014-06-15
2 2016-03-29
3 2015-10-27
4 2014-02-02
dtype: datetime64[ns]
# 做到这里不要忘记重新赋值,否则原始数据并没有变化
df["Jan Units"] = pd.to_numeric(df["Jan Units"],errors='coerce')
df["Start_date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
df
Customer NumberCustomer Name20162017Percent GrowthJan UnitsMonthDayYearActiveStart_date
10002
Quest Industries
125000.0
162500.0
500.0
2015-01-10
552278
Smith Plumbing
920000.0
1012000.0
700.0
2014-06-15
23477
ACME Industrial
50000.0
62500.0
125.0
2016-03-29
24900
Brekke LTD
350000.0
490000.0
2015-10-27
651029
Harbor Co
15000.0
12750.0
-0.15
False
2014-02-02
Customer Number int32
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units float64
Month int64
Day int64
Year int64
Active bool
Start_date datetime64[ns]
dtype: object
# 将这些转化整合在一起
def convert_percent(val):
Convert the percentage string to an actual floating point percent
- Remove %
- Divide by 100 to make decimal
new_val = val.replace('%', '')
return float(new_val) / 100
df_2 = pd.read_csv("sales_data_types.csv",dtype={"Customer_Number":"int"},index_col=0,converters={
"2016":convert_currency,
"2017":convert_currency,
"Percent Growth":convert_percent,
"Jan Units":lambda x:pd.to_numeric(x,errors="coerce"),
"Active":lambda x: np.where(x=="Y",True,False)
df_2.dtypes
Customer Number float64
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units float64
Month int64
Day int64
Year int64
Active object
dtype: object
Customer NumberCustomer Name20162017Percent GrowthJan UnitsMonthDayYearActive
10002.0
Quest Industries
125000.0
162500.0
500.0
552278.0
Smith Plumbing
920000.0
1012000.0
700.0
23477.0
ACME Industrial
50000.0
62500.0
125.0
24900.0
Brekke LTD
350000.0
490000.0
651029.0
Harbor Co
15000.0
12750.0
-0.15
False