Parsing XML with multiple namespaces in Python using lxml


<?xml-stylesheet href="/Style Library/st/xslt/rss2.xsl" type="text/xsl" media="screen" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:ta="http://www.smartraveller.gov.au/schema/rss/travel_advisories/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Travel Advisories</title>
    <link>http://smartraveller.gov.au/countries/</link>
    <description>the Australian Department of Foreign Affairs and Trade's Smartraveller advisory service</description>
    <language>en</language>
    <webMaster>webmaster@dfat.gov.au</webMaster>
    <ttl>60</ttl>
    <atom:link href="http://smartraveller.gov.au/countries/Documents/index.rss" rel="self" type="application/rss+xml" />
    <generator>zcms</generator>
    <image>
      <title>Advice</title>
      <link>http://smartraveller.gov.au/countries/</link>
      <url>/Style Library/st/images/dfat_logo_small.gif</url>
    </image>
    <item>
      <title>Czech Republic</title>
      <description>This travel advice has been reviewed. The level of our advice has not changed. Exercise normal safety precautions in the Czech Republic.</description>
      <link>http://smartraveller.gov.au/Countries/europe/eastern/Pages/czech_republic.aspx</link>
      <pubDate>26 Oct 2018 05:25:14 GMT</pubDate>
      <guid isPermaLink="false">cdbcc3d4-3a89-4768-ac1d-0221f8c99227 GMT</guid>
      <ta:warnings>
        <dc:coverage>Czech Republic</dc:coverage>
        <ta:level>2/5</ta:level>
        <dc:description>Exercise normal safety precautions</dc:description>
      </ta:warnings>
  </item>
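I want to extract the value of <ta:level> under <ta:warnings> for each of my items. I have tried the existing solutions I found online, but nothing has worked for me. Basically, my XML contains multiple namespaces.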
req = requests.request('GET', "https://smartraveller.gov.au/countries/documents/index.rss")
a = str(req.text).encode()
tree = etree.fromstring(a)
ns = {'TravelAd': 'https://smartraveller.gov.au/countries/documents/index.rss',
      'ta': 'http://www.smartraveller.gov.au/schema/rss/travel_advisories/'}
e = tree.findall('{0}channel/{0}item/{0}warnings/{0}level'.format(ns))
for i in e:
    print(i.text)


           
            
So where is the error?


         
Tags: python, xml, parsing, namespaces, lxml


        
         
          
          
Ahtsham Manzoor, posted on 2019-03-05


          
           
            
            
Daniel Haley, posted on 2019-03-06


          
Accepted answer


          
           
The XML has multiple namespaces, but the only namespace you need to worry about is http://www.smartraveller.gov.au/schema/rss/travel_advisories/.
           
           
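This is because, on the path to your target, the only elements that belong to a namespace are ta:warnings and ta:level. The other elements on that path (channel and item) are not in any namespace, so they need no prefix.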
           
           
For example...
           
from lxml import etree
import requests

# Fetch the feed; req.content is the raw bytes, which lxml can parse directly.
req = requests.get("https://smartraveller.gov.au/countries/documents/index.rss")
tree = etree.fromstring(req.content)

# Only the ta: namespace appears on the path to the target element.
ns = {'ta': 'http://www.smartraveller.gov.au/schema/rss/travel_advisories/'}
e = tree.findall('channel/item/ta:warnings/ta:level', ns)
for i in e:
    print(i.text)
This prints the text of each ta:level element, one value per line, and so on for every item in the feed.
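As a rough sketch of why the attempt in the question fails: calling format(ns) substitutes the repr of the whole dictionary for every {0}, so the result is not a usable path. If you would rather not pass a prefix map, lxml also accepts brace-qualified tag names (Clark notation), applied only to the namespaced tags. The SAMPLE document below is a trimmed-down stand-in modeled on the feed in the question, and the names SAMPLE, TA, broken_path, and clark_path are illustrative assumptions.

```python
from lxml import etree

# Hypothetical, trimmed-down sample modeled on the feed in the question.
SAMPLE = b"""<rss version="2.0" xmlns:ta="http://www.smartraveller.gov.au/schema/rss/travel_advisories/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Travel Advisories</title>
    <item>
      <title>Czech Republic</title>
      <ta:warnings>
        <dc:coverage>Czech Republic</dc:coverage>
        <ta:level>2/5</ta:level>
      </ta:warnings>
    </item>
  </channel>
</rss>"""

tree = etree.fromstring(SAMPLE)
ns = {'ta': 'http://www.smartraveller.gov.au/schema/rss/travel_advisories/'}

# What a format(ns) call like the one in the question builds: the dict's
# repr is pasted in front of every tag name, including the plain ones.
broken_path = '{0}channel/{0}item'.format(ns)
print(broken_path)

# Clark notation: wrap the namespace URI in braces and prepend it only to
# the tags that are actually namespaced (warnings and level).
TA = '{%s}' % ns['ta']
clark_path = 'channel/item/{0}warnings/{0}level'.format(TA)
levels = [el.text for el in tree.findall(clark_path)]
print(levels)
```

The prefix-map form used in the answer is usually easier to read; the Clark form is mainly handy when you already have a full tag name from el.tag.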
If you want a list of the text values rather than a list of elements, consider switching from findall() to xpath()...
e = tree.xpath('channel/item/ta:warnings/ta:level/text()', namespaces=ns)
print(e)
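If you also need to know which item each level belongs to, one possible approach is to loop over the item elements and look up the level relative to each one. This is a minimal sketch, not part of the original answer: the SAMPLE document is a stand-in modeled on the feed in the question, its second item ("Exampleland", with no ta:warnings) is invented to show the missing-value case, and levels_by_title is a hypothetical helper name.

```python
from lxml import etree

# Hypothetical sample: the first item mirrors the question, the second is
# invented to show what happens when an item has no <ta:warnings>.
SAMPLE = b"""<rss version="2.0" xmlns:ta="http://www.smartraveller.gov.au/schema/rss/travel_advisories/">
  <channel>
    <item>
      <title>Czech Republic</title>
      <ta:warnings>
        <ta:level>2/5</ta:level>
      </ta:warnings>
    </item>
    <item>
      <title>Exampleland</title>
    </item>
  </channel>
</rss>"""

ns = {'ta': 'http://www.smartraveller.gov.au/schema/rss/travel_advisories/'}


def levels_by_title(root):
    """Map each item's <title> text to its <ta:level> text (None if absent)."""
    result = {}
    for item in root.findall('channel/item'):
        title = item.findtext('title')
        # findtext returns None when the path matches nothing.
        level = item.findtext('ta:warnings/ta:level', namespaces=ns)
        result[title] = level
    return result


tree = etree.fromstring(SAMPLE)
print(levels_by_title(tree))
```

Searching relative to each item keeps the title and level paired even when some items lack a warning, which a single flat findall() or xpath() over the whole tree would not show.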