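Recently, while writing a test-data generator, I needed a module that randomly picks an administrative division code in China together with its region. I happened to find the official query page online: http://xzqh.mca.gov.cn/defaultQuery?shengji=-1&diji=-1&xianji=-1 , so I figured I would scrape the data with a crawler and wrap the whole thing in a single function.
Inspecting the response with Fiddler, I saw that when no query conditions are set, a script tag in the returned HTML document contains a variable, var json, that stores every province and its division code. I wanted to extract it and use it later to look up the division codes within each province.
The relevant source is as follows: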
<script type="text/javascript" src="/js/jquery-1.6.2.min.js"></script>
<script>
var json = [{"children":[],"diji":"","quHuaDaiMa":"110000","quhao":"","shengji":"北京市(京)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"120000","quhao":"","shengji":"天津市(津)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"130000","quhao":"","shengji":"河北省(冀)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"140000","quhao":"","shengji":"山西省(晋)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"150000","quhao":"","shengji":"内蒙古自治区(内蒙古)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"210000","quhao":"","shengji":"辽宁省(辽)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"220000","quhao":"","shengji":"吉林省(吉)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"230000","quhao":"","shengji":"黑龙江省(黑)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"310000","quhao":"","shengji":"上海市(沪)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"320000","quhao":"","shengji":"江苏省(苏)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"330000","quhao":"","shengji":"浙江省(浙)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"340000","quhao":"","shengji":"安徽省(皖)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"350000","quhao":"","shengji":"福建省(闽)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"360000","quhao":"","shengji":"江西省(赣)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"370000","quhao":"","shengji":"山东省(鲁)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"410000","quhao":"","shengji":"河南省(豫)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"420000","quhao":"","shengji":"湖北省(鄂)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"430000","quhao":"","shengji":"湖南省(湘)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"440000","quhao":"","shengji":"广东省(粤)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"450000","quhao":"","shengji":"广西壮族自治区(桂)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"460000","quhao":"","shengji":"海南省(琼)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"500000","quhao":"","shengji":"重庆市(渝)","xianji":""
},{"children":[],"diji":"","quHuaDaiMa":"510000","quhao":"","shengji":"四川省(川、蜀)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"520000","quhao":"","shengji":"贵州省(黔、贵)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"530000","quhao":"","shengji":"云南省(滇、云)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"540000","quhao":"","shengji":"西藏自治区(藏)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"610000","quhao":"","shengji":"陕西省(陕、秦)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"620000","quhao":"","shengji":"甘肃省(甘、陇)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"630000","quhao":"","shengji":"青海省(青)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"640000","quhao":"","shengji":"宁夏回族自治区(宁)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"650000","quhao":"","shengji":"新疆维吾尔自治区(新)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"810000","quhao":"0852","shengji":"香港特别行政区(港)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"820000","quhao":"0853","shengji":"澳门特别行政区(澳)","xianji":""},{"children":[],"diji":"","quHuaDaiMa":"710000","quhao":"","shengji":"台湾省(台)","xianji":""}];
</script>
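Step 1: Parsing with BeautifulSoup
Since the response is an HTML document, my first instinct was to parse it with BeautifulSoup so that the var could be reached through its tag. [What BeautifulSoup does: it parses a document and returns a document object.]
Sample code: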
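from bs4 import BeautifulSoup

# res_Name is the HTTP response for the query page (for example, one returned by requests.get)
# Parse the response document with html.parser
soup = BeautifulSoup(res_Name.text, 'html.parser')
# Collect all script tags
temps = soup.find_all("script")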
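Step 2: Getting the content of var json
Attempt 1: the third-party parse library
To be lazy, I first tried to match the content of all the script tags with the parse library and pull out the value of var json. The code below returned None, so I dropped the idea. A likely reason is that parse() only succeeds when the pattern matches the entire string, whereas here the target sits inside a much longer string.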
from parse import compile
pattern = compile("var json = {}")
# Prints None: parse() has to match the whole string, not a part of it
print(pattern.parse(str(temps)))
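Attempt 2: regular expressions
Since the parse shortcut did not work, I went back to grinding through re. After a long search online, I finally found code that works: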
pattern = re.findall(r""".+?quHuaDaiMa":"(.+?)".+?shengji":"(.+?\))""", str(temps), re.MULTILINE | re.DOTALL)
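The regular expression in detail:
1) re.findall(pattern, string, flags=0)
Returns all non-overlapping matches of pattern in string as a list. The string is scanned from left to right, and matches are returned in the order found. If the pattern has no groups, the result is a list of strings matching the whole pattern; if it has exactly one group, a list of strings matching that group; if it has several groups, a list of tuples. Empty matches are included in the result.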
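The three return shapes can be seen side by side in a minimal sketch. The sample text and the variable names are made up for illustration and are not taken from the scraped page:

```python
import re

text = "110000=Beijing, 120000=Tianjin"

# No groups: a list of strings matching the whole pattern
no_group = re.findall(r"\d{6}", text)

# One group: a list of strings matching that group
one_group = re.findall(r"(\d{6})=", text)

# Two groups: a list of tuples, one tuple per match
two_groups = re.findall(r"(\d{6})=(\w+)", text)

print(no_group)    # ['110000', '120000']
print(one_group)   # ['110000', '120000']
print(two_groups)  # [('110000', 'Beijing'), ('120000', 'Tianjin')]
```

The pattern used in this post has two groups, which is why its result is a list of (code, name) tuples.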
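2) r" "
Prefixing the expression with r avoids double escaping.
The r in front of a Python regular expression marks a raw string: backslashes inside the quotes are kept literally instead of being processed by Python as escape characters, which avoids the trouble of escaping them several times over.
About the backslash trouble: like most programming languages, regular expressions use "\" as the escape character. To match a literal "\" in the text, a pattern written as an ordinary string literal needs four "\" characters: the Python interpreter first turns each pair into a single "\", and the regex engine then reads the resulting two "\" as one escaped, literal "\".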
In [132]: str1 = "c:\\a\\b\\c"
In [133]: str1
Out[133]: 'c:\\a\\b\\c'
In [134]: print(str1)
c:\a\b\c
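# The two outputs differ because Out[133] is the repr, which shows each backslash in escaped form, while print shows the actual characters. The string itself contains single backslashes: the interpreter already turned each \\ in the literal into one \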
In [135]:
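# To match c:\ in the string (displayed as c:\\ in the repr), the pattern in an ordinary string literal has to be written c:\\\\ : Python turns it into c:\\ , which the regex engine reads as c: followed by one literal backslash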
In [135]: re.match("c:\\\\",str1).group()
Out[135]: 'c:\\'
In [136]:
# Printing the result shows the actual characters, with a single backslash
In [136]: ret = re.match("c:\\\\",str1).group()
In [137]: print(ret)
c:\
In [138]:
# To match c:\a, the pattern in an ordinary string literal has to be written c:\\\\a , which is tedious. Is there a simpler way?
In [138]: ret = re.match("c:\\\\a",str1).group()
In [139]: print(ret)
c:\a
# With the r prefix, the pattern only needs to be written c:\\a to match c:\a
In [141]: ret = re.match(r"c:\\a",str1).group()
In [142]: print(ret)
c:\a
In [143]:
In [143]: ret = re.match("c:\\\\a\\\\b\\\\c",str1).group()
In [144]: print(ret)
c:\a\b\c
In [145]: ret = re.match(r"c:\\a\\b\\c",str1).group()
In [146]: print(ret)
c:\a\b\c
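The same points can be checked in a plain script instead of an interactive session. This is only a small sketch; the variable names are chosen for illustration:

```python
import re

plain = "c:\\\\a"     # ordinary literal: Python collapses each \\ into one \
raw = r"c:\\a"        # raw literal: the backslashes are kept as typed
path = "c:\\a\\b\\c"  # the actual string is c:\a\b\c

same_pattern = plain == raw   # both literals produce the same pattern text
path_length = len(path)       # counts c : \ a \ b \ c
matched = re.match(raw, path).group()

print(same_pattern)   # True
print(path_length)    # 8
print(repr(matched))  # 'c:\\a'
print(matched)        # c:\a
```

The length of 8 confirms that the string holds single backslashes; the doubled form only appears in source code and in the repr.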
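3) .+?
Matches any run of characters, keeping it as short as possible.
. matches any character except a newline; when the re.DOTALL flag is specified, it matches any character, newlines included.
+ repeats the preceding item one or more times.
The trailing ? switches the quantifier from greedy to non-greedy matching, so it matches the shortest possible string.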
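A short sketch of the difference between greedy and non-greedy matching, and of what re.DOTALL changes. The sample strings are invented for illustration:

```python
import re

text = '"a":"1","b":"2"'

# Greedy: .+ runs to the last possible closing quote
greedy = re.findall(r'"(.+)"', text)

# Non-greedy: .+? stops at the first possible closing quote
lazy = re.findall(r'"(.+?)"', text)

two_lines = "start\nend"

# Without re.DOTALL, . cannot cross the newline, so nothing matches
without_dotall = re.findall(r"start.+?end", two_lines)

# With re.DOTALL, . also matches the newline
with_dotall = re.findall(r"start.+?end", two_lines, re.DOTALL)

print(greedy)          # ['a":"1","b":"2']
print(lazy)            # ['a', '1', 'b', '2']
print(without_dotall)  # []
print(with_dotall)     # ['start\nend']
```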
4) quHuaDaiMa":" and shengji":"
These are the literal keys in var json that the pattern uses as landmarks.
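5) (.+?)" and (.+?\))
The parenthesised groups mark the content to extract.
The " after the first group marks where extraction stops.
In the second group, the \) places the closing ) at the end of the shengji value inside the captured text.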
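To see the pieces working together, the pattern can be run on a shortened copy of the data. The sketch below assumes a two-entry excerpt of the var json shown earlier; the name sample is made up for illustration:

```python
import re

# Two entries copied from the page source shown earlier
sample = (
    'var json = [{"children":[],"diji":"","quHuaDaiMa":"110000","quhao":"",'
    '"shengji":"北京市(京)","xianji":""},'
    '{"children":[],"diji":"","quHuaDaiMa":"120000","quhao":"",'
    '"shengji":"天津市(津)","xianji":""}];'
)

# Same pattern as in the post: group 1 is the code, group 2 the province name
pairs = re.findall(
    r'''.+?quHuaDaiMa":"(.+?)".+?shengji":"(.+?\))''',
    sample,
    re.DOTALL,
)

print(pairs)  # [('110000', '北京市(京)'), ('120000', '天津市(津)')]
```

Note that the second group relies on every shengji value ending in a closing parenthesis, which holds for the data shown here.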
6) str(temps)
From the code above:
soup = BeautifulSoup(res_Name.text, 'html.parser')
temps = soup.find_all("script")
temps is a bs4.element.ResultSet (a list of bs4.element.Tag objects). Since re works on strings, it is converted explicitly with str().
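7) re.MULTILINE | re.DOTALL
The two flags are combined with the bitwise OR operator |. re.DOTALL makes . match every character, newlines included, which lets the pattern run across line breaks in the script content. re.MULTILINE makes ^ and $ match at the start and end of each line rather than only at the start and end of the whole string; this pattern contains neither anchor, so that flag has no effect here.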
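What each flag changes can be checked on a two-line string. This is a minimal sketch with invented sample text:

```python
import re

text = "110000\n120000"

# Without flags, ^ and $ only anchor to the whole string, so nothing matches
anchored_default = re.findall(r"^\d+$", text)

# re.MULTILINE: ^ and $ also match at the start and end of each line
anchored_multiline = re.findall(r"^\d+$", text, re.MULTILINE)

# Without re.DOTALL, . does not match the newline between the two codes
dot_default = re.findall(r"110000.120000", text)

# re.DOTALL: . matches the newline as well
dot_dotall = re.findall(r"110000.120000", text, re.DOTALL)

# Both flags combined with |
combined = re.findall(r"^\d+.\d+$", text, re.MULTILINE | re.DOTALL)

print(anchored_default)    # []
print(anchored_multiline)  # ['110000', '120000']
print(dot_default)         # []
print(dot_dotall)          # ['110000\n120000']
print(combined)            # ['110000\n120000']
```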
Step 3: Putting the extracted data into a DataFrame for later use
The code is as follows:
import pandas as pd

# Put the data into a DataFrame
col = ["quHuaDaiMa", "shengji"]
for row in pattern:
    print(row)
# DataFrame.append was removed in pandas 2.0, so build the frame directly from the list of tuples
df = pd.DataFrame(pattern, columns=col)
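For the original goal of picking a random division code, a DataFrame is not strictly required. Because the value assigned to var json is itself valid JSON, a different route from the field-by-field regex above would be to cut the whole array out of the script text and hand it to json.loads. The sketch below assumes the assignment ends with ]; as in the page source shown earlier; extract_provinces, random_province, and sample_html are hypothetical names introduced only for this illustration:

```python
import json
import random
import re


def extract_provinces(html):
    """Return the list of dicts assigned to `var json` in the page source."""
    # Assumption: the array literal is terminated by "];" as in the snippet above
    match = re.search(r"var json = (\[.*?\]);", html, re.DOTALL)
    if match is None:
        raise ValueError("var json not found in the document")
    return json.loads(match.group(1))


def random_province(provinces):
    """Pick one entry at random and return (code, name)."""
    entry = random.choice(provinces)
    return entry["quHuaDaiMa"], entry["shengji"]


# A two-entry excerpt standing in for the real response text
sample_html = (
    '<script>\nvar json = ['
    '{"children":[],"diji":"","quHuaDaiMa":"110000","quhao":"",'
    '"shengji":"北京市(京)","xianji":""},'
    '{"children":[],"diji":"","quHuaDaiMa":"120000","quhao":"",'
    '"shengji":"天津市(津)","xianji":""}'
    '];\n</script>'
)

provinces = extract_provinces(sample_html)
code, name = random_province(provinces)
print(code, name)
```

With the real page, res_Name.text would take the place of sample_html. Parsing the JSON keeps the other fields (quhao, children and so on) available, at the cost of depending on the exact form of the assignment statement.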