How to get the content under a specific tag with BeautifulSoup (qianc6350528's blog)

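Below are some notes from my own experience learning BeautifulSoup. My current approach when scraping data is: first use find_all() to locate the tags that hold the content I need; if a single find_all() call is not enough, use two or more. Then iterate over the find_all() results, call get_text() or get('href') to extract the text or the link, and append each to its own list. The drawback is that the text ends up in one list and the links in another: the lists do not reflect how the items relate to each other, and every additional field I want to extract requires yet another empty list.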
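One way around the drawback described above is to store one record per tag instead of one list per field. The following is only a minimal sketch: the sample markup and the names `html` and `records` are assumptions made for illustration, not part of my original workflow.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# One dict per <a> tag keeps the text and the link of the same tag together.
records = []
for tag in soup.find_all("a"):
    records.append({
        "text": tag.get_text(strip=True),
        "href": tag.get("href"),  # None when the attribute is missing
    })

for record in records:
    print(record)
```

Adding another field then only means adding another key to the dict, rather than creating another empty list.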

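Iterating over all tags:

soup.find_all('a')

This finds every <a> tag in the page and returns the results as a list. The matched tags can then be processed further, for example by looping over them with for and extracting just the link or the text from each one.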

# Create empty lists
links = []
txts = []
tags = soup.find_all('a')
for tag in tags:
    links.append(tag.get('href'))
    txts.append(tag.text)  # or txts.append(tag.get_text())
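Getting the value of an HTML attribute:
atr = []
tags = soup.find_all('a')
for tag in tags:
    if tag.p is not None:  # skip <a> tags that have no <p> child
        atr.append(tag.p.get('class'))  # class of the <p> child of the <a> tag (a list, since class can hold several values)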
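The different ways of reading attributes behave slightly differently, so here is a small sketch that compares them. The markup is a made-up example, and "html.parser" is chosen only so that the snippet needs no extra parser library.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an <a> tag with several attributes and a <p> child.
html = ('<a class="sister big" href="http://example.com/elsie" id="link1">'
        '<p class="name">Elsie</p></a>')

soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")

href = a.get("href")      # a plain string
classes = a.get("class")  # a list, because class is a multi-valued attribute
missing = a.get("title")  # None, because the attribute does not exist
# a["title"] would raise KeyError instead of returning None.

# Attribute of a child tag; guard against the child being absent.
child_classes = a.p.get("class") if a.p is not None else None

all_attrs = a.attrs       # dict of every attribute on the tag

print(href, classes, missing, child_classes)
```

Using get() rather than square brackets is the safer choice when not every tag on the page is guaranteed to carry the attribute.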
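Examples of find_all() usage:
 These examples come from the Chinese translation of the BeautifulSoup documentation.
 1. Strings
 The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches tags whose name is exactly that string. The following example finds all <b> tags in the document:
soup.find_all('b')
# [<b>The Dormouse's story</b>]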
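2. Regular expressions
 If you pass in a regular expression object, Beautiful Soup filters tag names with its search() method. The following example finds all tags whose names start with the letter "b", so both <body> and <b> are found:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
The following code finds all tags whose names contain the letter "t":
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title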
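3. Lists
 If you pass in a list, Beautiful Soup returns everything that matches any item in the list. The following code finds all <a> tags and all <b> tags in the document: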
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] 
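4. Functions (a custom function passed to find_all)
 If none of the other filters fit, you can define a function that takes a single element as its argument. It should return True if the element matches and False otherwise.
 The following function returns True if the element has a class attribute but no id attribute:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
Passing this function to find_all() returns only the <p> tags. The <a> tags are not returned because they define an "id" as well as a "class"; <html> and <head> are not returned because they do not define a "class" attribute.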
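To see the function filter run end to end, here is a self-contained sketch. The document string is a shortened, hypothetical variant of the "three sisters" sample, not the exact text from the documentation.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical variant of the sample document.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></p>
</body></html>
"""

def has_class_but_no_id(tag):
    # True only for tags that define class and do not define id.
    return tag.has_attr("class") and not tag.has_attr("id")

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(has_class_but_no_id)

names = [tag.name for tag in matches]
classes = [tag["class"] for tag in matches]
print(names, classes)
```

Only the two <p> tags pass the filter: the <a> tag is rejected because of its id, and the remaining tags have no class at all.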
 The following code finds all tags that are surrounded by string objects:
from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))
for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
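5. Searching by CSS class
 Searching for tags by CSS class name is very useful, but class is a reserved word in Python, so using class as a keyword argument causes a syntax error. Starting with Beautiful Soup 4.1.1, you can search for tags with a given CSS class through the class_ argument: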
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] 
soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] 
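A tag can carry several CSS classes at once, and class_ is matched against each of them individually. The sketch below illustrates this with a made-up two-paragraph document; the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> with two classes, one <p> with a single class.
html = '<p class="body strikeout">old</p><p class="body">new</p>'
soup = BeautifulSoup(html, "html.parser")

# class_ matches when any one of the tag's classes equals the value.
by_strikeout = soup.find_all("p", class_="strikeout")
by_body = soup.find_all("p", class_="body")

# The exact attribute string also matches, but a reordered string does not.
exact = soup.find_all("p", class_="body strikeout")
reordered = soup.find_all("p", class_="strikeout body")

# A CSS selector matches both classes regardless of their order.
both = soup.select("p.strikeout.body")

print(len(by_strikeout), len(by_body), len(exact), len(reordered), len(both))
```

When a tag must have two or more classes in any order, the select() form is usually the less fragile option.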
6. Searching by the text argument
 The text argument searches the strings in the document instead of the tags. Like the name argument, text accepts a string, a regular expression, a list, a function, or True. (In Beautiful Soup 4.4.0 and later the same argument is also available under the name string.) Some examples:
soup.find_all(text="Elsie")
# ['Elsie']
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']
soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)
soup.find_all(text=is_the_only_string_within_a_tag)
# ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']
Although text is meant for searching strings, it can be combined with arguments that filter tags: Beautiful Soup then finds the tags whose .string attribute matches the text value. The following code finds the <a> tags whose .string is "Elsie":
soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
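A string search returns the strings themselves rather than tags, which is easy to overlook. The sketch below uses the newer string keyword (this assumes Beautiful Soup 4.4.0 or later) on a made-up fragment and shows how to get from a matched string back to its tag.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with two links separated by plain text.
html = ('<p><a href="http://example.com/elsie" id="link1">Elsie</a> and '
        '<a href="http://example.com/lacie" id="link2">Lacie</a></p>')
soup = BeautifulSoup(html, "html.parser")

# Searching by string alone yields NavigableString objects, not tags.
strings = soup.find_all(string=re.compile("ie$"))
found_text = [str(s) for s in strings]

# .parent leads from each matched string back to the enclosing tag.
parent_ids = [s.parent["id"] for s in strings]

# Combining a tag name with string= yields tags instead.
tags = soup.find_all("a", string="Elsie")

print(found_text, parent_ids, tags)
```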
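7. Searching only direct children
 When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
A simple document fragment:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
The search results with and without the recursive argument:
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []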
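The effect of recursive=False can be checked with a small runnable sketch. The one-line document here is an assumption chosen to mirror the fragment above.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document: <title> is a grandchild of <html>.
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p>text</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Default search: all descendants of <html> are examined.
deep = soup.html.find_all("title")

# recursive=False: only direct children of <html> are examined,
# and <title> sits one level deeper, inside <head>.
shallow = soup.html.find_all("title", recursive=False)

# Passing True matches every tag, so this lists the direct children.
children = [tag.name for tag in soup.html.find_all(True, recursive=False)]

print(len(deep), len(shallow), children)
```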
A question for the author: how would I find and print only the second <a> tag?
[code=html]
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[/code]
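One possible way to do what the question asks, offered only as a sketch: find_all() returns a list, so the second match can be taken by index. The markup below is a reduced copy of the three links quoted in the question.

```python
from bs4 import BeautifulSoup

# Reduced copy of the three links quoted in the question.
html = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")

# Index 1 is the second match; guard against pages with fewer than two links.
second = links[1] if len(links) > 1 else None

if second is not None:
    print(second)
    print(second.get_text(), second.get("href"))
```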