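Extracting the value of a script var from an HTML document with BeautifulSoup and re

I have been writing a test-data generator and needed a module that returns a random administrative division code of China together with the division it names. The official lookup page is online at http://xzqh.mca.gov.cn/defaultQuery?shengji=-1&diji=-1&xianji=-1 , so I decided to scrape the data from it and wrap the whole thing in a single function.

Looking at the page source in Fiddler, I saw that when no query conditions are supplied, a script tag in the returned HTML document contains a var json that holds every province and its division code. I want to extract it and use it later to look up division codes within a province.

The relevant source is: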

   <script type="text/javascript" src="/js/jquery-1.6.2.min.js"></script>
   <script>
   var json = [{"children":[],"diji":"","quHuaDaiMa":"110000","quhao":"","shengji":"北京市(京)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"120000","quhao":"","shengji":"天津市(津)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"130000","quhao":"","shengji":"河北省(冀)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"140000","quhao":"","shengji":"山西省(晋)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"150000","quhao":"","shengji":"内蒙古自治区(内蒙古)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"210000","quhao":"","shengji":"辽宁省(辽)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"220000","quhao":"","shengji":"吉林省(吉)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"230000","quhao":"","shengji":"黑龙江省(黑)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"310000","quhao":"","shengji":"上海市(沪)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"320000","quhao":"","shengji":"江苏省(苏)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"330000","quhao":"","shengji":"浙江省(浙)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"340000","quhao":"","shengji":"安徽省(皖)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"350000","quhao":"","shengji":"福建省(闽)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"360000","quhao":"","shengji":"江西省(赣)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"370000","quhao":"","shengji":"山东省(鲁)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"410000","quhao":"","shengji":"河南省(豫)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"420000","quhao":"","shengji":"湖北省(鄂)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"430000","quhao":"","shengji":"湖南省(湘)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"440000","quhao":"","shengji":"广东省(粤)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"450000","quhao":"","shengji":"广西壮族自治区(桂)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"460000","quhao":"","shengji":"海南省(琼)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"500000","quhao":"","shengji":"重庆市(渝)","xianji"
:""},{"children":[],"diji":"","quHuaDaiMa":"510000","quhao":"","shengji":"四川省(川、蜀)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"520000","quhao":"","shengji":"贵州省(黔、贵)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"530000","quhao":"","shengji":"云南省(滇、云)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"540000","quhao":"","shengji":"西藏自治区(藏)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"610000","quhao":"","shengji":"陕西省(陕、秦)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"620000","quhao":"","shengji":"甘肃省(甘、陇)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"630000","quhao":"","shengji":"青海省(青)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"640000","quhao":"","shengji":"宁夏回族自治区(宁)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"650000","quhao":"","shengji":"新疆维吾尔自治区(新)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"810000","quhao":"0852","shengji":"香港特别行政区(港)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"820000","quhao":"0853","shengji":"澳门特别行政区(澳)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"710000","quhao":"","shengji":"台湾省(台)","xianji":""}];
   </script>

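Step 1: Parse with BeautifulSoup

Since the response is an HTML document, my first thought was to parse it with BeautifulSoup so that the var could be reached through its tag. [What BeautifulSoup does: it parses a document and returns a document object.]

Example code:

  from bs4 import BeautifulSoup

  # res_Name is the HTTP response returned for the query URL above
  # Parse the response body with html.parser
  soup = BeautifulSoup(res_Name.text, 'html.parser')
  # Collect every script element
  temps = soup.find_all("script")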
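
To make this step concrete, here is a minimal sketch of how the script elements can be inspected one by one. SAMPLE_HTML is a shortened, hypothetical stand-in for the real response, and filtering on the text "var json" is my own assumption about how to single out the right tag, not something the page guarantees.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page returned by the query URL.
SAMPLE_HTML = """
<html><head>
<script type="text/javascript" src="/js/jquery-1.6.2.min.js"></script>
<script>
var json = [{"quHuaDaiMa":"110000","shengji":"北京市(京)"}];
</script>
</head><body></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
scripts = soup.find_all("script")

# An external script (src=...) has no inline body, so .string is None for it.
inline_bodies = [tag.string for tag in scripts if tag.string]

# Keep only the script whose body declares the variable we want.
json_script = next(body for body in inline_bodies if "var json" in body)

print(len(scripts))
print(json_script.strip())
```

Working on the single matching script body, rather than on the string form of the whole result list, keeps unrelated scripts out of whatever matching comes next.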

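Step 2: Get the contents of var json

Attempt 1: the third-party parse library

To save effort, I first tried the parse library, matching against the contents of all the script tags to get the value of var json. The code below returned None, so I dropped the idea. The most likely reason is that Parser.parse() has to match the entire input string, and str(temps) contains much more than this one assignment:

  from parse import compile

  pattern = compile("var json = {}")
  print(pattern.parse(str(temps)))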

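Attempt 2: regular expressions

Since the parse shortcut did not work, I went back to grinding through re. After a long search online I finally found a pattern that works: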

  pattern = re.findall(r""".+?quHuaDaiMa":"(.+?)".+?shengji":"(.+?\))""", str(temps), re.MULTILINE | re.DOTALL)

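The regular expression in detail:

1) re.findall(pattern, string, flags=0)
Returns all non-overlapping matches of pattern in string as a list. The string is scanned from left to right and matches are returned in the order found. If the pattern has no groups, the result is a list of strings matching the whole pattern; with exactly one group, it is a list of strings matching that group; with several groups, it is a list of tuples. Empty matches are included in the result.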
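
The three return shapes are easy to confuse, so here is a small sketch of them. The text and the patterns are made up for illustration and are unrelated to the real page.

```python
import re

# Illustrative text only; not taken from the real page.
text = "110000=Beijing 120000=Tianjin"

# No group: a list of whole matches.
no_group = re.findall(r"\d{6}=\w+", text)

# One group: a list of that group's text.
one_group = re.findall(r"(\d{6})=\w+", text)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\d{6})=(\w+)", text)

print(no_group)    # ['110000=Beijing', '120000=Tianjin']
print(one_group)   # ['110000', '120000']
print(two_groups)  # [('110000', 'Beijing'), ('120000', 'Tianjin')]
```

The pattern used in this post has two groups, which is why its result is a list of (code, name) tuples.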

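2) r" ": prefix the pattern with r to avoid escaping
The r in front of a Python regular expression marks a raw string: backslashes inside the quotes are kept literally instead of being processed as escape sequences, which avoids the backslash trouble caused by double escaping.

About the backslash trouble: like most programming languages, regular expressions use "\" as the escape character. To match a literal "\" in the text, a pattern written as a normal string needs four of them. The Python interpreter first turns the first two and the last two into one "\" each, and the regex engine then reads the remaining two as one literal "\".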

   In [132]: str1 = "c:\\a\\b\\c"
   In [133]: str1
   Out[133]: 'c:\\a\\b\\c'
   In [134]: print(str1)
   c:\a\b\c
   # The string itself holds single backslashes: each \\ in the literal is the escape for one \.
   # Out[133] shows the repr, where they are doubled again; print shows the real characters.
   # To match c:\ in the string, the pattern has to be written "c:\\\\":
   # Python turns it into c:\\ and the regex engine reads \\ as one literal \.
   In [135]: re.match("c:\\\\",str1).group()
   Out[135]: 'c:\\'
   # Printing the result shows the single backslash again
   In [136]: ret = re.match("c:\\\\",str1).group()
   In [137]: print(ret)
   c:\
   # Matching c:\a therefore needs the pattern "c:\\\\a", which is tedious. Is there a simpler way?
   In [138]: ret = re.match("c:\\\\a",str1).group()
   In [139]: print(ret)
   c:\a
   # With the r prefix, the pattern r"c:\\a" is enough to match c:\a.
   In [141]: ret = re.match(r"c:\\a",str1).group()
   In [142]: print(ret)
   c:\a
   In [143]: ret = re.match("c:\\\\a\\\\b\\\\c",str1).group()
   In [144]: print(ret)
   c:\a\b\c
   In [145]: ret = re.match(r"c:\\a\\b\\c",str1).group()
   In [146]: print(ret)
   c:\a\b\c

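3) .+? matches any run of characters

"." matches any character except a newline; when the re.DOTALL flag is given, it matches any character including a newline.

"+" repeats the preceding item one or more times.

A trailing "?" switches the repetition from greedy to non-greedy, so it matches as few characters as possible.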
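
The difference between the greedy and the non-greedy form shows up clearly on a tiny input. The sketch below uses a made-up string in the same quoted key/value style as the page; it also shows that "+" insists on at least one character, which matters because several values in var json are empty strings.

```python
import re

# Made-up string in the same quoted style as the page data.
text = '"a":"1","b":"2"'

# Greedy: .+ runs to the last quote it can still close with.
greedy = re.findall(r'"(.+)"', text)

# Non-greedy: .+? stops at the first closing quote.
lazy = re.findall(r'"(.+?)"', text)

# "+" needs at least one character, so an empty value is skipped;
# "*" (zero or more) would capture it as an empty string.
plus_on_empty = re.findall(r'"(.+?)"', '""')
star_on_empty = re.findall(r'"(.*?)"', '""')

print(greedy)         # ['a":"1","b":"2']
print(lazy)           # ['a', '1', 'b', '2']
print(plus_on_empty)  # []
print(star_on_empty)  # ['']
```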

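4) quHuaDaiMa":" and shengji":"
The keys in var json that serve as anchors for the match.

5) (.+?)" and (.+?\))

The parts to extract.

  • The " after the first ()
    marks where the capture ends.

  • The second group ends with \), so the closing ) of the shengji value is part of the captured text.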
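
Putting points 3 to 5 together, the sketch below runs the pattern from this post against a shortened sample. The sample string is a hypothetical stand-in for str(temps), with two entries and one line break inside the data as on the real page.

```python
import re

# Shortened, hypothetical stand-in for str(temps); note the "\n" inside.
sample = (
    'var json = [{"children":[],"diji":"","quHuaDaiMa":"110000","quhao":"",'
    '"shengji":"北京市(京)","xianji"\n:""},'
    '{"children":[],"diji":"","quHuaDaiMa":"810000","quhao":"0852",'
    '"shengji":"香港特别行政区(港)","xianji":""}];'
)

# Same pattern and flags as in the post.
pairs = re.findall(
    r""".+?quHuaDaiMa":"(.+?)".+?shengji":"(.+?\))""",
    sample,
    re.MULTILINE | re.DOTALL,
)

for code, name in pairs:
    print(code, name)
```

Each tuple holds the division code first and the province name, including its closing parenthesis, second.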

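6) str(temps)
From the code above:

  soup = BeautifulSoup(res_Name.text, 'html.parser')
  temps = soup.find_all("script")

temps is a bs4.element.ResultSet, that is, a list of Tag objects. re works on strings, so it is converted with str().

7) re.MULTILINE | re.DOTALL
The | operator combines the two flags. re.DOTALL lets "." match every character, including newlines, which is needed here because the JSON is broken across lines. re.MULTILINE makes ^ and $ match at the start and end of every line; the pattern uses neither, so that flag has no effect in this case.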
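
The two flags are often mixed up, so here is a small sketch of what each one changes. The two-line text is made up for illustration.

```python
import re

# Made-up two-line text.
text = "first line\nsecond line"

# Without flags, ^ only matches at the very start of the string.
start_default = re.findall(r"^\w+", text)
# re.MULTILINE lets ^ match at the start of every line.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# Without flags, "." stops at the newline, so nothing spans both lines.
span_default = re.findall(r"first.+second", text)
# re.DOTALL lets "." cross the newline.
span_dotall = re.findall(r"first.+second", text, re.DOTALL)

print(start_default)    # ['first']
print(start_multiline)  # ['first', 'second']
print(span_default)     # []
print(span_dotall)      # ['first line\nsecond']
```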

Step 3: Put the extracted data into a DataFrame for later use

The code:

  import pandas as pd

  # Each (code, name) tuple returned by re.findall becomes one row.
  # DataFrame.append was removed in pandas 2.0, so the frame is built in one call.
  col = ["quHuaDaiMa", "shengji"]
  df = pd.DataFrame(pattern, columns=col)
  print(df)
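
As an alternative to pulling fields out one by one in step 2: the value assigned to var json is itself valid JSON, so the whole array can be captured and handed to json.loads. This is only a sketch under that assumption, not what I ended up using. SCRIPT_TEXT is a shortened, hypothetical stand-in for the script body, and the name codes is made up for the example.

```python
import json
import re

# Shortened, hypothetical stand-in for the script body (line break kept).
SCRIPT_TEXT = """
var json = [{"children":[],"diji":"","quHuaDaiMa":"110000","quhao":"","shengji":"北京市(京)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"120000","quhao":"","shengji":"天津市(津)","xianji"
:""}];
"""

# Capture everything from the opening "[" to the "]" that is followed by ";".
# re.DOTALL lets "." cross the line break inside the array literal.
match = re.search(r"var\s+json\s*=\s*(\[.*?\])\s*;", SCRIPT_TEXT, re.DOTALL)
if match is None:
    raise ValueError("var json not found")

provinces = json.loads(match.group(1))

# Map each division code to its province name.
codes = {item["quHuaDaiMa"]: item["shengji"] for item in provinces}
print(codes)
```

This keeps every field of each entry (children, quhao and so on) available, at the cost of depending on the page continuing to emit strict JSON.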