[0:00:02] name1: Okay, this is the continued string...

I want a Python regular expression that extracts all of the text starting at "Okay...".

I have already worked out how to extract the timestamp and the speaker's name.

 time_frame = re.search(r'\[(.*?)\]', temp).group(1)
 speaker_id = re.search(r'\] (.*?):', temp).group(1)

However, I am stuck on the last part. Note that the text on the right may itself contain a colon, and I want to capture everything in that text.

1 comment
Is there any particular problem with appending \s*(.*) to the speaker_id pattern?
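
A minimal sketch of what the comment suggests, assuming the sample line is extended with an extra colon in the spoken text (that variation is made up here for illustration). Because the speaker group is lazy, it stops at the first colon after the timestamp, and the appended `(.*)` takes the rest of the line:

```python
import re

# Hypothetical variation of the sample line with a colon inside the spoken text.
temp = "[0:00:02] name1: Okay, this is: the continued string..."

# The speaker pattern from the question with \s*(.*) appended.
match = re.search(r'\] (.*?):\s*(.*)', temp)
speaker_id, text = match.group(1), match.group(2)

print(speaker_id)
print(text)
```

Here `speaker_id` is `name1` and `text` keeps the second colon.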
python
regex
user1357015
Posted on 2020-12-04
4 answers
Ryszard Czech
Posted on 2020-12-04
Accepted
0 upvotes

Following your logic:

re.search(r'\[.*?\]\s*\w+:\s*(.+)', temp).group(1)

Explanation:

--------------------------------------------------------------------------------
  \[                       '['
--------------------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
--------------------------------------------------------------------------------
  \]                       ']'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  :                        ':'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    .+                       any character except \n (1 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
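
As a rough check of the breakdown above, the sketch below wraps the pattern in a small helper. The name `extract_text` and every sample line except the first are assumptions made for illustration. It also shows one consequence of `\w+`: a speaker name containing spaces or punctuation does not match, so the helper returns `None`.

```python
import re

# The accepted pattern, compiled once for reuse.
PATTERN = re.compile(r'\[.*?\]\s*\w+:\s*(.+)')

def extract_text(line):
    """Return the spoken text of a transcript line, or None if it does not match."""
    m = PATTERN.search(line)
    return m.group(1) if m else None

first = extract_text("[0:00:02] name1: Okay, this is the continued string...")
# Hypothetical line: a colon inside the text is kept, since (.+) runs to the end of the line.
with_colon = extract_text("[0:00:05] name2: Note: colons are kept")
# Hypothetical line: "Dr. Smith" is not a run of word characters, so \w+: fails here.
no_match = extract_text("[0:00:09] Dr. Smith: Hello")

print(first)
print(with_colon)
print(no_match)
```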
    
taras
Posted on 2020-12-04
0 upvotes

You can match the timestamp literally with \[\d+:\d+:\d+\] and match everything up to the first following colon with .*?:.

r'\[\d+:\d+:\d+\].*?:(.*)'

In fact, you can match all 3 groups with a single expression.

r'\[(\d+:\d+:\d+)\] (.*?):(.*)'
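
A short sketch of the three-group version applied to the sample line from the question, with the first group closed before the escaped bracket. One detail to be aware of: this pattern has no `\s*` after the colon, so the third group starts with the space that follows the colon; the `strip()` call below is an addition here, not part of the answer.

```python
import re

temp = "[0:00:02] name1: Okay, this is the continued string..."

# Group 1: timestamp, group 2: speaker (lazy, stops at the first colon), group 3: the rest.
m = re.search(r'\[(\d+:\d+:\d+)\] (.*?):(.*)', temp)
time_frame, speaker_id, raw_text = m.groups()

# Group 3 keeps the leading space after the colon, so trim it.
text = raw_text.strip()

print(time_frame)
print(speaker_id)
print(text)
```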
    
Nour-Allah Hussein
Posted on 2020-12-04
0 upvotes

Let's put it all together in a simple way: g = re.findall(r'\[(.*?)\]\s*(.*?):\s*(.*)', text)

import re
text = '[0:00:02] name1: Okay, this is the continued string...'
# (.*?) keeps the speaker group lazy, so it stops at the first colon after the timestamp
g = re.findall(r'\[(.*?)\]\s*(.*?):\s*(.*)', text)
time_frame = g[0][0]
speaker_id = g[0][1]
speech = g[0][2]
print(time_frame)
print(speaker_id)
print(speech)

Output

0:00:02
name1
Okay, this is the continued string...
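
Since the question says the spoken text may itself contain a colon, it is worth checking whether the speaker group in a `findall` pattern like this is greedy or lazy. The sketch below compares both variants on a made-up line (the sample text is an assumption for illustration): a greedy `(.*)` backtracks to the last colon, while a lazy `(.*?)` stops at the first one after the timestamp.

```python
import re

# Hypothetical line with a colon inside the spoken text.
text = "[0:00:02] name1: Okay: this has a colon"

# Greedy speaker group: runs up to the LAST colon on the line.
greedy = re.findall(r'\[(.*?)\]\s*(.*):\s*(.*)', text)

# Lazy speaker group: stops at the FIRST colon after the timestamp.
lazy = re.findall(r'\[(.*?)\]\s*(.*?):\s*(.*)', text)

print(greedy)  # [('0:00:02', 'name1: Okay', 'this has a colon')]
print(lazy)    # [('0:00:02', 'name1', 'Okay: this has a colon')]
```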
    
The fourth bird
Posted on 2020-12-04
0 upvotes

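You can use a negated character class to match everything except :, then match the colon itself and optional whitespace characters. Then capture everything that follows in a capture group.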

^\[[^][]*][^:]*:\s*(.+)


import re
regex = r"^\[[^][]*][^:]*:\s*(.+)"
temp = "[0:00:02] name1: Okay, this is the continued string..."
matches = re.search(regex, temp)
if matches:
    print(matches.group(1))

Output

Okay, this is the continued string...

To match all 3 parts in one pattern:

^\[([^][]*)]\s*([^:]*):\s*(.+)
  • ^\[ Match opening [ at the start of the string
  • ([^][]*) Capture group 1, match any char except [ and ]
  • ]\s* Match closing ] and 0+ whitespace chars
  • ([^:]*) Capture group 2 Match any char except :
  • :\s* Match : and 0+ whitespace chars
  • (.+) Capture group 3, Match the rest of the string

    import re
    regex = r"^\[([^][]*)]\s*([^:]*):\s*(.+)"
    temp = "[0:00:02] name1: Okay, this is the continued string..."
    matches = re.search(regex, temp)
    if matches:
        print(matches.group(1))
        print(matches.group(2))
        print(matches.group(3))
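
The same three-part pattern can also be applied to a whole transcript rather than a single line. The following is only a sketch under a few assumptions: the second transcript line, the group names `time`, `speaker`, and `text`, and the use of `re.MULTILINE` with `finditer` are all additions for illustration and are not part of the answer above.

```python
import re

# Same structure as the three-group pattern, with named groups.
# re.MULTILINE lets ^ match at the start of every line.
LINE_RE = re.compile(
    r"^\[(?P<time>[^][]*)]\s*(?P<speaker>[^:]*):\s*(?P<text>.+)",
    re.MULTILINE,
)

# Hypothetical two-line transcript; only the first line comes from the question.
transcript = """\
[0:00:02] name1: Okay, this is the continued string...
[0:00:07] name2: Note: the text may contain a colon
"""

# One dict per matching line, keyed by group name.
rows = [m.groupdict() for m in LINE_RE.finditer(transcript)]

for row in rows:
    print(row["time"], row["speaker"], row["text"], sep=" | ")
```

Named groups make the result self-describing, which tends to be easier to read than indexing `group(1)` to `group(3)` once more than one line is involved.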