67_Pandas: Applying slices to strings to extract substrings of any position and length

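To apply Python string (built-in type str) methods to a pandas.DataFrame column (= pandas.Series), use .str (the str accessor).

For example, str.extract() extracts part of each string using a regular expression, and str.match() tests whether each string matches a regular expression pattern.

This article shows how to use slices to extract a substring of any length (number of characters) from any position, such as the beginning or the end, and generate new strings.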

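Applying slices to a string column in pandas
- Extracting the first n characters
- Extracting the last n characters
- Extracting with a specified increment (step)
- Extracting a single character at any position
- Adding the result as a column of a pandas.DataFrame
Converting numbers to strings and applying slices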

The following pandas.DataFrame is used as an example.

import pandas as pd
df = pd.DataFrame({'a': ['abcde', 'fghij', 'klmno'],
                   'b': [123, 456, 789]})
print(df)
#        a    b
# 0  abcde  123
# 1  fghij  456
# 2  klmno  789
print(df.dtypes)
# a    object
# b     int64
# dtype: object
Applying slices to a string column in pandas
 
A slice can be applied directly to a string column with .str[].
Extracting the first n characters
 
print(df['a'].str[:2])
# 0    ab
# 1    fg
# 2    kl
# Name: a, dtype: object
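The bracket form is shorthand for the str.slice() method, which takes start, stop, and step as arguments. The following is a minimal sketch on a standalone Series (the variable names are only illustrative) showing that both forms select the same characters.

```python
import pandas as pd

s = pd.Series(['abcde', 'fghij', 'klmno'], name='a')

# .str[:2] and .str.slice(stop=2) select the same characters
head_bracket = s.str[:2]
head_method = s.str.slice(stop=2)

print(head_method.tolist())
# ['ab', 'fg', 'kl']
print(head_bracket.equals(head_method))
# True

# start and stop can be passed by keyword as well
middle = s.str.slice(start=1, stop=4)
print(middle.tolist())
# ['bcd', 'ghi', 'lmn']
```

The method form can be handy when the slice bounds are held in variables or when chaining several string methods.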
Extracting the last n characters
 
Use negative values to specify positions counted from the end.
print(df['a'].str[-2:])
# 0    de
# 1    ij
# 2    no
# Name: a, dtype: object
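Negative values also work as the stop position, which makes it possible to drop characters from the end instead of keeping them. A short sketch, assuming the same sample strings as above:

```python
import pandas as pd

s = pd.Series(['abcde', 'fghij', 'klmno'])

# Drop the last two characters of each string
trimmed = s.str[:-2]
print(trimmed.tolist())
# ['abc', 'fgh', 'klm']

# Drop the first and the last character
inner = s.str[1:-1]
print(inner.tolist())
# ['bcd', 'ghi', 'lmn']
```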
Extracting with a specified increment (step)
 
Although it is probably not used often, an increment (step) can also be specified.
print(df['a'].str[::2])
# 0    ace
# 1    fhj
# 2    kmo
# Name: a, dtype: object
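As with ordinary Python strings, the step can be combined with a start position, and a negative step reverses each string. A brief sketch using the same sample data:

```python
import pandas as pd

s = pd.Series(['abcde', 'fghij', 'klmno'])

# Every second character, starting from index 1
odd_positions = s.str[1::2]
print(odd_positions.tolist())
# ['bd', 'gi', 'ln']

# A step of -1 reverses each string
reversed_s = s.str[::-1]
print(reversed_s.tolist())
# ['edcba', 'jihgf', 'onmlk']
```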
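Extracting a single character at any position
 
Besides slices, a single character can be extracted by index (a position starting from 0). Use -1 to specify the last character.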
print(df['a'].str[2])
# 0    c
# 1    h
# 2    m
# Name: a, dtype: object
print(df['a'].str[0])
# 0    a
# 1    f
# 2    k
# Name: a, dtype: object
print(df['a'].str[-1])
# 0    e
# 1    j
# 2    o
# Name: a, dtype: object
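One difference from plain Python strings is worth noting when the strings have different lengths: an index beyond the end of a string does not raise IndexError but yields a missing value, while an out-of-range slice yields a shorter or empty string. The sketch below uses a hypothetical Series with one short string to illustrate this.

```python
import pandas as pd

s = pd.Series(['abcde', 'fg'])

# Index 2 does not exist in 'fg', so that element becomes a missing value
third = s.str[2]
print(third.isna().tolist())
# [False, True]

# An out-of-range slice returns an empty string rather than a missing value
part = s.str[2:4]
print(part.tolist())
# ['cd', '']
```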
Adding the result as a column of a pandas.DataFrame
 
The extracted strings can be added as a new column.
df['a_head'] = df['a'].str[:2]
print(df)
#        a    b a_head
# 0  abcde  123     ab
# 1  fghij  456     fg
# 2  klmno  789     kl
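If the original DataFrame should stay unchanged, assign() is an alternative that returns a new DataFrame with the extra columns. This is only a sketch; the column names a_head and a_tail are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['abcde', 'fghij', 'klmno'],
                   'b': [123, 456, 789]})

# assign() returns a copy with the new columns; df itself is not modified
df_new = df.assign(a_head=df['a'].str[:2],
                   a_tail=df['a'].str[-2:])

print(df_new.columns.tolist())
# ['a', 'b', 'a_head', 'a_tail']
print(df_new['a_tail'].tolist())
# ['de', 'ij', 'no']
print(df.columns.tolist())
# ['a', 'b']
```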
Converting numbers to strings and applying slices
 
Using a string method through the str accessor on a column that is not of string type raises an AttributeError.
# print(df['b'].str[:2])
# AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Converting the column to strings (str) with the astype() method first solves this.
print(df['b'].astype(str).str[:2])
# 0    12
# 1    45
# 2    78
# Name: b, dtype: object
To treat the result as numbers again, apply astype() once more.
print(df['b'].astype(str).str[:2].astype(int))
# 0    12
# 1    45
# 2    78
# Name: b, dtype: int64
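In this particular example, the same result can also be computed arithmetically: divide by 10 and convert to the integer type int, which truncates the fractional part.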
print((df['b'] / 10).astype(int))
# 0    12
# 1    45
# 2    78
# Name: b, dtype: int64
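Floor division with // is a related option that stays in integer arithmetic. It gives the same result for the non-negative values in this example, but it rounds toward negative infinity, so it differs from truncation for negative numbers. The sketch below uses a hypothetical negative value to show the difference.

```python
import pandas as pd

df = pd.DataFrame({'b': [123, 456, 789]})

# Floor division keeps the integer dtype and matches truncation for non-negative values
floor_div = df['b'] // 10
print(floor_div.tolist())
# [12, 45, 78]

# For negative values the two approaches differ
neg = pd.Series([-123])
neg_floor = neg // 10
neg_trunc = (neg / 10).astype(int)
print(neg_floor.tolist())
# [-13]
print(neg_trunc.tolist())
# [-12]
```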