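Original title: R in the Left Hand, Python in the Right (IX): Merging and Splitting Strings
During text processing and data cleaning, splitting, extracting, or merging strings and character variables may not be the most frequent task, but it is often an important one.
Below is a rough survey of the commonly used string splitting and merging functions in R and Python.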
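For character vectors:
strsplit     # splits a character vector
str_split    # splits a character vector (function from the stringr package)
paste        # merges vectors
For data frames:
unite        # merges several columns of a data frame into one
separate     # splits one column of a data frame into several columns by a pattern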
R:
library(dplyr)
library(stringr)
library(tidyr)
# note: 0:12 and 0:31 can produce "00", which is not a real month or day; that is harmless for this string demo
myyear <- sprintf("20%02d", sample(0:17, 10))
mymonth <- sprintf("%02d", sample(0:12, 10))
myday <- sprintf("%02d", sample(0:31, 10))
myyear; mymonth; myday
 [1] "2000" "2010" "2002" "2012" "2015" "2006" "2001" "2017" "2005" "2013"
 [1] "10" "03" "01" "09" "04" "02" "05" "07" "00" "12"
 [1] "18" "15" "28" "00" "11" "20" "31" "19" "04" "12"
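First, merge with the paste function:
full <- paste(myyear, mymonth, myday, sep = "-"); full
# when the vectors have the same length, their elements are merged pairwise:
 [1] "2000-10-18" "2010-03-15" "2002-01-28" "2012-09-00" "2015-04-11" "2006-02-20" "2001-05-31" "2017-07-19" "2005-00-04" "2013-12-12"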
Splitting with the strsplit function:
myyear1 = mymonth1 = myday1 = NULL
for (i in 1:length(full)) {
  # strsplit returns a list; [[1]] takes the pieces of the i-th string
  myyear1[i]  <- strsplit(full[i], "-")[[1]][1]
  mymonth1[i] <- strsplit(full[i], "-")[[1]][2]
  myday1[i]   <- strsplit(full[i], "-")[[1]][3]
}
myyear1; mymonth1; myday1
 [1] "2000" "2010" "2002" "2012" "2015" "2006" "2001" "2017" "2005" "2013"
 [1] "10" "03" "01" "09" "04" "02" "05" "07" "00" "12"
 [1] "18" "15" "28" "00" "11" "20" "31" "19" "04" "12"
The str_split function is used in much the same way as strsplit:
myyear1 = mymonth1 = myday1 = NULL
for (i in 1:length(full)) {
  myyear1[i]  <- str_split(full[i], "-")[[1]][1]
  mymonth1[i] <- str_split(full[i], "-")[[1]][2]
  myday1[i]   <- str_split(full[i], "-")[[1]][3]
}
myyear1; mymonth1; myday1
 [1] "2000" "2010" "2002" "2012" "2015" "2006" "2001" "2017" "2005" "2013"
 [1] "10" "03" "01" "09" "04" "02" "05" "07" "00" "12"
 [1] "18" "15" "28" "00" "11" "20" "31" "19" "04" "12"
Next, let's look at how to merge and split columns directly on a data frame:
mydata <- data.frame(myyear, mymonth, myday); mydata
   myyear mymonth myday
1    2000      10    18
2    2010      03    15
3    2002      01    28
4    2012      09    00
5    2015      04    11
6    2006      02    20
7    2001      05    31
8    2017      07    19
9    2005      00    04
10   2013      12    12
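# sep is written as "-" to match this example; tidyr's own defaults are "_" for unite and a non-alphanumeric regex for separate
unite(data, col, ..., sep = "-", remove = TRUE)
separate(data, col, into, sep = "-", remove = TRUE)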
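unite and separate are paired functions whose arguments mirror each other exactly. The first argument is the data frame to operate on. The second is the name of the new merged column (or, for separate, the column to be split). The third is the set of columns to merge (or the names of the new columns produced by the split). sep is the separator to merge with (or split on), and remove controls whether the original columns are dropped from the output (TRUE drops them, FALSE keeps them); this applies to the source columns of a merge and to the column being split.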
mydata1 <- unite(mydata, col = "datetime", c("myyear", "mymonth", "myday"), sep = "-", remove = FALSE); mydata1
     datetime myyear mymonth myday
1  2000-10-18   2000      10    18
2  2010-03-15   2010      03    15
3  2002-01-28   2002      01    28
4  2012-09-00   2012      09    00
5  2015-04-11   2015      04    11
6  2006-02-20   2006      02    20
7  2001-05-31   2001      05    31
8  2017-07-19   2017      07    19
9  2005-00-04   2005      00    04
10 2013-12-12   2013      12    12
mydata2 <- unite(mydata1, col = "datetime1", c("myyear", "mymonth", "myday"), sep = "-", remove = FALSE); mydata2
     datetime  datetime1 myyear mymonth myday
1  2000-10-18 2000-10-18   2000      10    18
2  2010-03-15 2010-03-15   2010      03    15
3  2002-01-28 2002-01-28   2002      01    28
4  2012-09-00 2012-09-00   2012      09    00
5  2015-04-11 2015-04-11   2015      04    11
6  2006-02-20 2006-02-20   2006      02    20
7  2001-05-31 2001-05-31   2001      05    31
8  2017-07-19 2017-07-19   2017      07    19
9  2005-00-04 2005-00-04   2005      00    04
10 2013-12-12 2013-12-12   2013      12    12
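To make the unite/separate symmetry and the effect of remove concrete, here is a plain-Python sketch of the same idea on a list of dicts. The helpers named unite and separate below are hypothetical stand-ins written only for this illustration; they are not tidyr's implementation and not a pandas API. The two sample rows reuse the first two rows of the data frame shown earlier.

```python
def unite(rows, col, cols, sep="-", remove=True):
    """Merge the columns in `cols` into a new column `col` (illustrative helper)."""
    out = []
    for row in rows:
        new = {col: sep.join(str(row[c]) for c in cols)}
        # the source columns survive only when remove is False
        new.update({k: v for k, v in row.items() if not (remove and k in cols)})
        out.append(new)
    return out


def separate(rows, col, into, sep="-", remove=True):
    """Split column `col` into the new columns named in `into` (illustrative helper)."""
    out = []
    for row in rows:
        # the column being split survives only when remove is False
        new = {k: v for k, v in row.items() if not (remove and k == col)}
        new.update(zip(into, str(row[col]).split(sep)))
        out.append(new)
    return out


mydata = [
    {"myyear": "2000", "mymonth": "10", "myday": "18"},
    {"myyear": "2010", "mymonth": "03", "myday": "15"},
]
cols = ["myyear", "mymonth", "myday"]

merged = unite(mydata, "datetime", cols)               # only the datetime column remains
restored = separate(merged, "datetime", cols)          # round trip back to three columns
kept = unite(mydata, "datetime", cols, remove=False)   # datetime plus the original columns

print(merged)
print(restored == mydata)
print(list(kept[0]))
```

The round trip (unite followed by separate with the same separator and column names) returning the original rows is what the mirrored arguments amount to in practice.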
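String merging and splitting in Python:
My command of Python string operations is limited, and Python's string handling is extremely flexible: comprehensions and anonymous functions get most jobs done easily. This section therefore only shows the approaches I use most often and does not cover every method:
String merging:
String concatenation operator: "+"
String merging method: join
String splitting: split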
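As a quick illustration of the three tools just listed, here is a minimal sketch on a single date string. The variable names and values are made up for this example.

```python
# Three basic tools for merging and splitting strings in Python.
year, month, day = "2017", "05", "21"   # hypothetical sample values

# "+" concatenates strings one pair at a time
with_plus = year + "-" + month + "-" + day

# str.join glues any iterable of strings together with a separator
with_join = "-".join([year, month, day])

# str.split reverses the merge and returns a list of the pieces
parts = with_join.split("-")

print(with_plus)   # 2017-05-21
print(with_join)   # 2017-05-21
print(parts)       # ['2017', '05', '21']
```

Both "+" and join only accept strings, so numeric values need str() first.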
import random
import pandas as pd

# range's upper bound is exclusive, so 2018, 13 and 32 are needed to include 2017, December and the 31st
myyear = random.sample(range(2000, 2018), 10)
mymonth = ['%02d' % i for i in random.sample(range(1, 13), 10)]
myday = ['%02d' % i for i in random.sample(range(1, 32), 10)]
print(myyear); print(mymonth); print(myday)
[2006, 2000, 2007, 2001, 2015, 2016, 2002, 2012, 2010, 2004]
['04', '11', '06', '10', '07', '08', '05', '02', '03', '01']
['13', '28', '21', '06', '08', '03', '17', '16', '04', '20']
String merging:
# (the years and days in the output below differ from the lists printed earlier; the random samples appear to have been redrawn)
mydate = [str(i) + "-" + j + "-" + k for i, j, k in zip(myyear, mymonth, myday)]
['2011-04-25', '2008-11-30', '2003-06-02', '2007-10-22', '2009-07-13', '2005-08-27', '2014-05-28', '2012-02-10', '2016-03-14', '2015-01-21']
mydate = ["-".join([str(i), j, k]) for i, j, k in zip(myyear, mymonth, myday)]
['2011-04-25', '2008-11-30', '2003-06-02', '2007-10-22', '2009-07-13', '2005-08-27', '2014-05-28', '2012-02-10', '2016-03-14', '2015-01-21']
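If the year, month and day start out as integers, the zero-padding and the merge can be done in one step with an f-string. This is a sketch of an alternative, not the approach used above; the sample lists are invented for illustration.

```python
# Hypothetical integer inputs (not the random lists used above)
years = [2011, 2008, 2003]
months = [4, 11, 6]
days = [25, 30, 2]

# {m:02d} pads to two digits, so no separate '%02d' % i step is needed
dates = [f"{y}-{m:02d}-{d:02d}" for y, m, d in zip(years, months, days)]
print(dates)  # ['2011-04-25', '2008-11-30', '2003-06-02']

# zip stops at the shortest input instead of recycling values
short = [f"{y}-{m:02d}" for y, m in zip(years, [1])]
print(short)  # ['2011-01']
```

One difference from R's paste is worth keeping in mind: paste recycles a shorter vector, whereas zip silently truncates to the shortest input, so it can be worth checking that the lists have equal length before merging.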
String splitting:
Method 1 (list comprehensions):
myyear1 = [i.split("-")[0] for i in mydate]; print(myyear1)
mymonth1 = [i.split("-")[1] for i in mydate]; print(mymonth1)
myday1 = [i.split("-")[2] for i in mydate]; print(myday1)
['2011', '2008', '2003', '2007', '2009', '2005', '2014', '2012', '2016', '2015']
['04', '11', '06', '10', '07', '08', '05', '02', '03', '01']
['25', '30', '02', '22', '13', '27', '28', '10', '14', '21']
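The three comprehensions above each call split on every string. A sketch of a single-pass alternative, assuming every string has exactly three "-"-separated parts and the list is not empty, is to split once and transpose the result with zip. The short list below stands in for mydate.

```python
# Stand-in for the mydate list built above (first three values only)
mydate = ['2011-04-25', '2008-11-30', '2003-06-02']

# Split each string once, then transpose rows into columns with zip(*...)
rows = [s.split("-") for s in mydate]
myyear1, mymonth1, myday1 = (list(col) for col in zip(*rows))

print(myyear1)   # ['2011', '2008', '2003']
print(mymonth1)  # ['04', '11', '06']
print(myday1)    # ['25', '30', '02']
```

With an empty input list the three-way unpacking would fail, so this form suits data that is known to be non-empty and regular.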
Method 2 (a pandas DataFrame built from a dict):
mydata = pd.DataFrame({"date": mydate})
mydata["date"].str.split("-", expand=True)
      0   1   2
0  2011  04  25
1  2008  11  30
2  2003  06  02
3  2007  10  22
4  2009  07  13
5  2005  08  27
6  2014  05  28
7  2012  02  10
8  2016  03  14
9  2015  01  21
myyear2 = mydata["date"].str.split("-", expand=True)[0]; print(myyear2)
mymonth2 = mydata["date"].str.split("-", expand=True)[1]; print(mymonth2)
myday2 = mydata["date"].str.split("-", expand=True)[2]; print(myday2)
0    2011
1    2008
2    2003
3    2007
4    2009
5    2005
6    2014
7    2012
8    2016
9    2015
Name: 0, dtype: object
0    04
1    11
2    06
3    10
4    07
5    08
6    05
7    02
8    03
9    01
Name: 1, dtype: object
0    25
1    30
2    02
3    22
4    13
5    27
6    28
7    10
8    14
9    21
Name: 2, dtype: object
Summary: string splitting and merging
R:
strsplit
str_split
paste
tidyr::unite
tidyr::separate
Python:
+
.join
.split
To republish this article, please contact the EasyCharts team.
Just reply "转载" (reprint) in the WeChat official account backend.