How do I remove everything after a specific piece of text in HTML, using Python and BeautifulSoup4?

1 follower

I am trying to scrape Wikipedia. I want to keep only the data I need and discard everything unnecessary, such as the See also and References sections.

<span class="mw-headline" id="See_also">See also</span>
<li><a href="/wiki/List_of_adaptations_of_works_by_Stephen_King" title="List of adaptations of works by Stephen King">List of adaptations of works by Stephen King</a></li>
<li><a href="/wiki/Castle_Rock_(Stephen_King)" title="Castle Rock (Stephen King)">Castle Rock (Stephen King)</a></li>
<li><a href="/wiki/Charles_Scribner%27s_Sons" title="Charles Scribner's Sons">Charles Scribner's Sons</a> (aka Scribner)</li>
<li><a href="/wiki/Derry_(Stephen_King)" title="Derry (Stephen King)">Derry (Stephen King)</a></li>
<li><a href="/wiki/Dollar_Baby" title="Dollar Baby">Dollar Baby</a></li>
<li><a href="/wiki/Jerusalem%27s_Lot_(Stephen_King)" title="Jerusalem's Lot (Stephen King)">Jerusalem's Lot (Stephen King)</a></li>
<li><i><a href="/wiki/Haven_(TV_series)" title="Haven (TV series)">Haven</a></i></li>

Given the HTML above, if I find "See also" inside an h2 tag, I want to remove everything that comes after it. In this case, that is the unordered list.

4 comments

furas:

It is simpler to treat the HTML as text, find the position of the heading, and slice it there with html = html[:position].
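A minimal sketch of the slicing idea above. The helper name `truncate_before_heading` and the sample markup are assumptions made for illustration; the comment itself only suggests `html = html[:position]`.

```python
def truncate_before_heading(html, heading_id="See_also"):
    """Return html cut off just before the heading that carries heading_id."""
    # Locate the id attribute of the heading (hypothetical marker format).
    anchor = html.find(f'id="{heading_id}"')
    if anchor == -1:
        return html  # heading not present, keep everything

    # Back up to the start of the enclosing <h2>, or of the tag itself if there is no <h2>.
    position = html.rfind("<h2", 0, anchor)
    if position == -1:
        position = html.rfind("<", 0, anchor)
    if position == -1:
        return html
    return html[:position]


sample = (
    '<div>This I want to keep</div>\n'
    '<h2><span class="mw-headline" id="See_also">See also</span></h2>\n'
    '<ul><li>Dollar Baby</li></ul>'
)
print(truncate_before_heading(sample))
```

Note that slicing raw text can leave parent tags unclosed when the heading is nested inside other elements, so the result may need to be re-parsed before further use.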

furas:

With beautifulsoup or lxml you can use extract() to remove elements, but you have to remove them one by one.
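One way the one-by-one removal could look with BeautifulSoup, as a sketch only. The function name `drop_from_heading` and the sample HTML are hypothetical, and the code assumes the unwanted content consists of later siblings of the h2 at the same nesting level.

```python
from bs4 import BeautifulSoup


def drop_from_heading(html, heading_text="See also"):
    """Remove the first matching h2 and every sibling node that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    for h2 in soup.find_all("h2"):
        if heading_text in h2.get_text():
            # Copy the siblings into a list first: extracting nodes while
            # iterating over the live generator would cut the walk short.
            for sibling in list(h2.next_siblings):
                sibling.extract()
            h2.extract()
            break
    return soup


sample = (
    '<div>This I want to keep</div>\n'
    '<h2><span class="mw-headline" id="See_also">See also</span></h2>\n'
    '<ul><li>Dollar Baby</li></ul>\n'
    '<p>Trailing paragraph</p>'
)
print(drop_from_heading(sample))
```

`next_siblings` also yields text nodes, so stray whitespace and text after the heading are removed along with the tags.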

furas:

Maybe you should use a better approach: extract only the data you want instead of deleting everything else.
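A rough sketch of that "keep only what you want" approach, assuming the wanted content is the set of elements that precede the heading at the same level. The name `content_before` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def content_before(html, heading_id="See_also"):
    """Return the text of each top-level element that precedes the heading."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id=heading_id)
    if span is None:
        # No such heading: treat every top-level element as wanted.
        wanted = soup.find_all(True, recursive=False)
    else:
        # Use the enclosing h2 when there is one, otherwise the span itself.
        heading = span.find_parent("h2") or span
        # find_previous_siblings() returns the nearest sibling first,
        # so reverse it to restore document order.
        wanted = reversed(heading.find_previous_siblings())
    return [tag.get_text(" ", strip=True) for tag in wanted]


sample = (
    '<p>Intro</p><p>Plot</p>'
    '<h2><span class="mw-headline" id="See_also">See also</span></h2>'
    '<ul><li>Dollar Baby</li></ul>'
)
print(content_before(sample))
```

This leaves the parsed document untouched and simply never collects the See also list.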

Noob:

The suggestion to use position logic is good, but it is not efficient.

python

html

python-3.x

web-scraping

beautifulsoup

Noob

Posted on 2021-05-13

1 answer

Andrej Kesely

Posted on 2021-05-13

Accepted

0 upvotes

You can use a CSS selector with ~ (the general sibling combinator) to select the elements to extract.

from bs4 import BeautifulSoup

# Sample HTML: everything from the "See also" heading onward should be removed.
txt = '''
<div>This I want to keep</div>
<h2>
     <span class="mw-headline" id="See_also">See also</span>
</h2>
     <li><a href="/wiki/List_of_adaptations_of_works_by_Stephen_King" title="List of adaptations of works by Stephen King">List of adaptations of works by Stephen King</a></li>
     <li><a href="/wiki/Castle_Rock_(Stephen_King)" title="Castle Rock (Stephen King)">Castle Rock (Stephen King)</a></li>
     <li><a href="/wiki/Charles_Scribner%27s_Sons" title="Charles Scribner&#39;s Sons">Charles Scribner's Sons</a> (aka Scribner)</li>
     <li><a href="/wiki/Derry_(Stephen_King)" title="Derry (Stephen King)">Derry (Stephen King)</a></li>
     <li><a href="/wiki/Dollar_Baby" title="Dollar Baby">Dollar Baby</a></li>
     <li><a href="/wiki/Jerusalem%27s_Lot_(Stephen_King)" title="Jerusalem&#39;s Lot (Stephen King)">Jerusalem's Lot (Stephen King)</a></li>
     <li><i><a href="/wiki/Haven_(TV_series)" title="Haven (TV series)">Haven</a></i></li>
'''

soup = BeautifulSoup(txt, 'html.parser')

# 'h2 ~ *' matches every later sibling of the heading; the second selector matches the heading itself.
# Newer soupsieve releases deprecate :contains() in favor of :-soup-contains().
for tag in soup.select('h2:contains("See also") ~ *, h2:contains("See also")'):
    tag.extract()

print(soup)