Regex in a Pandas DataFrame: finding the minimum length between characters

Edit: updated for reproducibility.

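I am currently working with a Pandas DataFrame in which each row of one column [column A] holds a list of strings. I am trying to extract the minimum distance between the keywords of any sublist in a list of keyword pairs (ListB):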

ListB = [['abc','def'],['ghi','jkl'],['mno','pqr']]
Each row of the DataFrame column contains a list of strings:
import pandas as pd
import numpy as np
# Built from a plain nested list: wrapping these ragged rows in np.array()
# raises a ValueError on recent NumPy versions.
data = pd.DataFrame([['1', '2', ['random string to be searched abc def ghi jkl','random string to be searched abc','abc random string to be searched def']],
['4', '5', ['random string to be searched ghi jkl','random string to be searched',' mno random string to be searched pqr']],
['7', '8', ['abc random string to be searched def','random string to be searched mno pqr','random string to be searched']]],
columns=['a', 'b', 'list_of_strings_to_search'])
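At a high level, I am trying to search every string in the lists held in data['list_of_strings_to_search'] for any ListB sublist whose two elements both occur in that string (both conditions must be met), and to return the ListB sublists that satisfy this. From those I can compute the distance (in words) between the elements of each sublist pair.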
import pandas as pd
import numpy as np
import re
def find_distance_between_words(text, word_list):
  '''This function does not work as intended yet.'''
  keyword_list = [] 
  # iterates through all sublists in ListB:
  for i in word_list:
    # iterates through all strings within list in dataframe column:
    for strings in text:
      # determines the two words to search (iterates through word_list)
      word1, word2 = i[0], i[1]
      # use regex to find both words:
      p = re.compile('.*?'.join((word1, word2)))
      iterator = p.finditer(strings)
      # for each match, append the string:
      for match in iterator:
        keyword_list.append(match.group())
    return keyword_list
data['try'] = data['list_of_strings_to_search'].apply(find_distance_between_words, word_list = ListB)
expected output:
0    [abc def, ghi jkl, abc random string to be searched def]
1     [ghi jkl, mno random string to be searched pqr]
2    [abc random string to be searched def, mno pqr]
current output:
0    [abc def, abc random string to be searched def]
1                                                 []
2             [abc random string to be searched def]
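However, from manually inspecting the strings and the output, most of the regex keyword combinations are not returned by the statement below. I need every combination in each string to be kept.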
for match in iterator:
  keyword_list.append(match.group())
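I intend to return every sublist combination present in each string (hence the iteration over the list of candidate sublists) in order to evaluate the minimum distance between the elements.
Any help is much appreciated!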
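A note on the current output above: it is consistent with the `return keyword_list` line sitting inside the outer `for` loop, so the function returns after the first ListB sublist (`['abc', 'def']`) has been checked. The sketch below is one possible correction, not the accepted answer's method. It moves the `return` out of the loops and iterates over strings first so the matches come back in the order of the expected output. The names `find_keyword_spans` and `words_between` are hypothetical, `re.escape` is added only as a precaution, and `words_between` assumes each keyword is a single whitespace-delimited word.

```python
import re

ListB = [['abc', 'def'], ['ghi', 'jkl'], ['mno', 'pqr']]

def find_keyword_spans(strings, word_list):
    """Collect the lazy 'word1 ... word2' match for every pair, string by string."""
    spans = []
    for s in strings:                    # each string in the row's list
        for word1, word2 in word_list:   # each keyword pair in ListB
            # re.escape guards against keywords containing regex metacharacters
            pattern = re.compile(re.escape(word1) + '.*?' + re.escape(word2))
            spans.extend(m.group() for m in pattern.finditer(s))
    return spans                         # reached only after every loop has finished

def words_between(span):
    """Count the words strictly between the first and last word of a matched span."""
    return max(len(span.split()) - 2, 0)

row = ['random string to be searched abc def ghi jkl',
       'random string to be searched abc',
       'abc random string to be searched def']
spans = find_keyword_spans(row, ListB)
print(spans)                              # ['abc def', 'ghi jkl', 'abc random string to be searched def']
print([words_between(s) for s in spans])  # [0, 0, 5]
```

On the DataFrame, this could be applied the same way as the original function: data['list_of_strings_to_search'].apply(find_keyword_spans, word_list=ListB).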
    2 comments
Shubham Sharma:
Can you explain how you got "def random string to be searched lmn" in the expected output?
DJW001:
I have updated the expected output based on the sample ListB - thanks.
python
regex
pandas
string
substring
Asked by DJW001 on 2020-11-29
1 answer
Answered by Shubham Sharma on 2020-11-29
Accepted
0 upvotes

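Let us iterate over each list in the list_of_strings_to_search column inside a list comprehension, then use re.findall with a regex pattern on every string in the list to find the shortest substring between the specified keywords.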
import re
pat = '|'.join(fr'{x}.*?{y}' for x, y in ListB)
data['result'] = [np.hstack([re.findall(pat, s) for s in l]) for l in data['list_of_strings_to_search']]
Result: