Original title: R in the Left Hand, Python in the Right (IX): Merging and Splitting Strings

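During text processing and data cleaning, splitting, extracting from, or merging strings and character variables may not be the most frequent task, but it is often an important one.

What follows is a rough survey of the functions commonly used to split and merge strings in R and Python.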

R functions for character data:

For vectors:

strsplit    # splits a character vector
str_split   # splits a character vector (from the stringr package)
paste       # merges vectors

For data frames:

unite       # merges several columns of a data frame into one
separate    # splits one column of a data frame into several columns by a pattern

R:

library(dplyr)
library(stringr)
library(tidyr)

# Note: sampling from 0:12 and 0:31 can yield "00", which is not a valid month or day.
myyear  <- sprintf("20%02d", sample(0:17, 10))
mymonth <- sprintf("%02d", sample(0:12, 10))
myday   <- sprintf("%02d", sample(0:31, 10))
myyear; mymonth; myday
[1] "2000" "2010" "2002" "2012" "2015" "2006" "2001" "2017" "2005" "2013"
[1] "10" "03" "01" "09" "04" "02" "05" "07" "00" "12"
[1] "18" "15" "28" "00" "11" "20" "31" "19" "04" "12"

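First, merge with paste:

# When the vectors have equal length, paste merges them element by element.
full <- paste(myyear, mymonth, myday, sep = "-"); full
[1] "2000-10-18" "2010-03-15" "2002-01-28" "2012-09-00" "2015-04-11"
[6] "2006-02-20" "2001-05-31" "2017-07-19" "2005-00-04" "2013-12-12"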

Splitting with strsplit:

myyear1 = mymonth1 = myday1 = NULL
for (i in seq_along(full)) {
  # strsplit returns a list; [[1]] takes the pieces of the single string passed in
  myyear1[i]  <- strsplit(full[i], "-")[[1]][1]
  mymonth1[i] <- strsplit(full[i], "-")[[1]][2]
  myday1[i]   <- strsplit(full[i], "-")[[1]][3]
}
myyear1; mymonth1; myday1
[1] "2000" "2010" "2002" "2012" "2015" "2006" "2001" "2017" "2005" "2013"
[1] "10" "03" "01" "09" "04" "02" "05" "07" "00" "12"
[1] "18" "15" "28" "00" "11" "20" "31" "19" "04" "12"

str_split is used in much the same way as strsplit:

myyear1 = mymonth1 = myday1 = NULL
for (i in seq_along(full)) {
  myyear1[i]  <- str_split(full[i], "-")[[1]][1]
  mymonth1[i] <- str_split(full[i], "-")[[1]][2]
  myday1[i]   <- str_split(full[i], "-")[[1]][3]
}
myyear1; mymonth1; myday1
[1] "2000" "2010" "2002" "2012" "2015" "2006" "2001" "2017" "2005" "2013"
[1] "10" "03" "01" "09" "04" "02" "05" "07" "00" "12"
[1] "18" "15" "28" "00" "11" "20" "31" "19" "04" "12"

Next, how to merge and split columns directly on a data frame:

mydata <- data.frame(myyear, mymonth, myday); mydata
   myyear mymonth myday
1    2000      10    18
2    2010      03    15
3    2002      01    28
4    2012      09    00
5    2015      04    11
6    2006      02    20
7    2001      05    31
8    2017      07    19
9    2005      00    04
10   2013      12    12

# Call pattern used in this article (sep is set to "-" here; tidyr's own default for unite is "_").
unite(data, col, ..., sep = "-", remove = TRUE)
separate(data, col, into, sep = "-", remove = TRUE)

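unite and separate are paired functions whose arguments mirror each other exactly. The first argument is the data frame to operate on. The second is the name of the new merged column (for separate, the column to be split). The third is the vector of columns to merge (for separate, the names of the new columns). sep is the separator to merge with or split on, and remove controls whether the input columns are dropped from the output: the columns that were merged, or the column that was split.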

mydata1 <- unite(mydata, col = "datetime", c("myyear", "mymonth", "myday"), sep = "-", remove = FALSE); mydata1
     datetime myyear mymonth myday
1  2000-10-18   2000      10    18
2  2010-03-15   2010      03    15
3  2002-01-28   2002      01    28
4  2012-09-00   2012      09    00
5  2015-04-11   2015      04    11
6  2006-02-20   2006      02    20
7  2001-05-31   2001      05    31
8  2017-07-19   2017      07    19
9  2005-00-04   2005      00    04
10 2013-12-12   2013      12    12

mydata2 <- unite(mydata1, col = "datetime1", c("myyear", "mymonth", "myday"), sep = "-", remove = FALSE); mydata2
     datetime  datetime1 myyear mymonth myday
1  2000-10-18 2000-10-18   2000      10    18
2  2010-03-15 2010-03-15   2010      03    15
3  2002-01-28 2002-01-28   2002      01    28
4  2012-09-00 2012-09-00   2012      09    00
5  2015-04-11 2015-04-11   2015      04    11
6  2006-02-20 2006-02-20   2006      02    20
7  2001-05-31 2001-05-31   2001      05    31
8  2017-07-19 2017-07-19   2017      07    19
9  2005-00-04 2005-00-04   2005      00    04
10 2013-12-12 2013-12-12   2013      12    12
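For readers coming from the Python side, the unite/separate pairing can be approximated in pandas. This is only a sketch under stated assumptions: `unite` and `separate` below are hypothetical helpers written for this illustration (pandas has no functions by these names), and the two-row data frame reuses the first two rows of the R example.

```python
import pandas as pd


def unite(df, col, cols, sep="-", remove=True):
    """Hypothetical helper: merge the columns in `cols` into a new column `col`."""
    merged = df[cols].astype(str).apply(sep.join, axis=1)
    # Put the new column where the first merged column was, as the R output does.
    pos = min(df.columns.get_loc(c) for c in cols)
    out = df.drop(columns=cols) if remove else df.copy()
    out.insert(pos, col, merged)
    return out


def separate(df, col, into, sep="-", remove=True):
    """Hypothetical helper: split column `col` on `sep` into the columns named in `into`."""
    parts = df[col].str.split(sep, expand=True)
    parts.columns = into  # assumes every value splits into len(into) pieces
    out = df.drop(columns=[col]) if remove else df.copy()
    return pd.concat([out, parts], axis=1)


mydata = pd.DataFrame({
    "myyear": ["2000", "2010"],
    "mymonth": ["10", "03"],
    "myday": ["18", "15"],
})

united = unite(mydata, "datetime", ["myyear", "mymonth", "myday"], remove=False)
restored = separate(united[["datetime"]], "datetime", ["year", "month", "day"])
print(united)
print(restored)
```

As in tidyr, `remove=False` keeps the input columns next to the result, while the default drops them.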

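Merging and splitting strings in Python:

My command of Python string handling is limited, and Python string operations are extremely flexible: comprehensions and anonymous functions can express them in many convenient ways. This section therefore shows only the approaches I use most often as examples, not every available method:

Merging strings:

Concatenation operator: "+"

Merge method: join

Splitting strings: split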

import random
import pandas as pd

# range() excludes its upper bound: these draws cover years 2000-2016, months 01-11, days 01-30.
myyear = random.sample(list(range(2000, 2017)), 10); print(myyear)
mymonth = ['%02d' % i for i in random.sample(list(range(1, 12)), 10)]; print(mymonth)
myday = ['%02d' % i for i in random.sample(list(range(1, 31)), 10)]; print(myday)
[2006, 2000, 2007, 2001, 2015, 2016, 2002, 2012, 2010, 2004]
['04', '11', '06', '10', '07', '08', '05', '02', '03', '01']
['13', '28', '21', '06', '08', '03', '17', '16', '04', '20']
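Because the lists are drawn at random, every rerun prints different values. A minimal sketch of a reproducible variant follows. The seed value 2017 is an arbitrary assumption made for this illustration, and the ranges are widened so that 2017 and December can appear, with days capped at 28 so every combination is a real date.

```python
import random

random.seed(2017)  # arbitrary seed, chosen only so that reruns give the same draw

# range() excludes its upper bound, so 2018 / 13 / 29 are needed to include 2017 / 12 / 28.
myyear = random.sample(range(2000, 2018), 10)
mymonth = ['%02d' % m for m in random.sample(range(1, 13), 10)]
# Days stop at 28 so that every year-month-day combination is a valid date.
myday = ['%02d' % d for d in random.sample(range(1, 29), 10)]

print(myyear)
print(mymonth)
print(myday)
```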

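Merging strings:

# The outputs below come from a different random draw than the lists printed earlier.
# With "+", the integer year must be converted with str() first.
mydate = [str(i) + "-" + j + "-" + k for i, j, k in zip(myyear, mymonth, myday)]
['2011-04-25', '2008-11-30', '2003-06-02', '2007-10-22', '2009-07-13', '2005-08-27', '2014-05-28', '2012-02-10', '2016-03-14', '2015-01-21']

# join takes a list of strings and places the separator between them.
mydate = ["-".join([str(i), j, k]) for i, j, k in zip(myyear, mymonth, myday)]
['2011-04-25', '2008-11-30', '2003-06-02', '2007-10-22', '2009-07-13', '2005-08-27', '2014-05-28', '2012-02-10', '2016-03-14', '2015-01-21']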
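The reason for the str(i) call can be shown directly. The sketch below uses the first three dates from the output above, confirms that "+" refuses to mix an integer with a string, and shows two alternative spellings (an f-string and map(str, ...)) that are not in the original article.

```python
myyear = [2011, 2008, 2003]
mymonth = ['04', '11', '06']
myday = ['25', '30', '02']

# "+" only concatenates str with str, so an unconverted int raises TypeError.
try:
    myyear[0] + "-" + mymonth[0]
except TypeError as err:
    print("TypeError:", err)

# An f-string formats each piece in place and accepts the int directly.
with_fstring = [f"{y}-{m}-{d}" for y, m, d in zip(myyear, mymonth, myday)]

# join also needs strings; map(str, ...) converts every piece in one pass.
with_join = ["-".join(map(str, parts)) for parts in zip(myyear, mymonth, myday)]

print(with_fstring)
print(with_join)
```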

Splitting strings:

Method 1 (list comprehensions):

myyear1 = [i.split("-")[0] for i in mydate]; print(myyear1)
mymonth1 = [i.split("-")[1] for i in mydate]; print(mymonth1)
myday1 = [i.split("-")[2] for i in mydate]; print(myday1)
['2011', '2008', '2003', '2007', '2009', '2005', '2014', '2012', '2016', '2015']
['04', '11', '06', '10', '07', '08', '05', '02', '03', '01']
['25', '30', '02', '22', '13', '27', '28', '10', '14', '21']
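The three comprehensions above split every date three times. As a possible alternative (not the article's method), each date can be split once and the pieces transposed with zip. The sketch uses the first three dates from the output above and assumes every date has exactly three parts.

```python
mydate = ['2011-04-25', '2008-11-30', '2003-06-02']

# Split each date once: a list of [year, month, day] rows.
parts = [d.split("-") for d in mydate]

# zip(*parts) transposes the rows into three columns.
myyear1, mymonth1, myday1 = (list(col) for col in zip(*parts))

print(myyear1)
print(mymonth1)
print(myday1)
```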

Method 2 (a pandas DataFrame built from a dict, split with str.split):

mydata = pd.DataFrame({"date": mydate})
mydata["date"].str.split("-", expand=True)
      0   1   2
0  2011  04  25
1  2008  11  30
2  2003  06  02
3  2007  10  22
4  2009  07  13
5  2005  08  27
6  2014  05  28
7  2012  02  10
8  2016  03  14
9  2015  01  21

# expand=True returns a DataFrame; [0], [1], [2] select its columns.
myyear2 = mydata["date"].str.split("-", expand=True)[0]; print(myyear2)
mymonth2 = mydata["date"].str.split("-", expand=True)[1]; print(mymonth2)
myday2 = mydata["date"].str.split("-", expand=True)[2]; print(myday2)
0    2011
1    2008
2    2003
3    2007
4    2009
5    2005
6    2014
7    2012
8    2016
9    2015
Name: 0, dtype: object
0    04
1    11
2    06
3    10
4    07
5    08
6    05
7    02
8    03
9    01
Name: 1, dtype: object
0    25
1    30
2    02
3    22
4    13
5    27
6    28
7    10
8    14
9    21
Name: 2, dtype: object
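The split columns come back as strings, as the dtype: object lines show. One possible follow-up is sketched below: split once, give the columns names, and cast a column when numbers are needed. The column names year, month and day are assumptions chosen for this illustration, and the data reuses the first three dates from the output above.

```python
import pandas as pd

mydata = pd.DataFrame({"date": ['2011-04-25', '2008-11-30', '2003-06-02']})

# Split once and reuse the result instead of calling str.split for each column.
parts = mydata["date"].str.split("-", expand=True)
parts.columns = ["year", "month", "day"]  # names chosen for this sketch

# The pieces are strings; cast explicitly where integers are wanted.
parts["year"] = parts["year"].astype(int)

# Attach the new columns next to the original date column.
result = pd.concat([mydata, parts], axis=1)
print(result)
```

Month and day are left as strings here so that the leading zeros survive.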

Summary of string splitting and merging:

R:

strsplit

str_split

paste

tidyr::unite

tidyr::separate

Python:

"+"

join

.split

To reprint this article, please contact the EasyCharts team by replying "转载" (reprint) in the WeChat official account backend.
