Working with Python Dictionaries in Data Analysis

疯猫子

Data analysis / artificial intelligence / investor in and executor of life projects

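Today let's talk about how Python dictionaries are used in data analysis. To stay close to real-world work, simple flat dictionaries are skipped.

The dictionaries discussed here are complex, nested structures like this one: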

{
"id": "2406124091",
"type": "node",
"visible":"true",
"created": {
          "version":"2",
          "changeset":"17206049",
          "timestamp":"2013-08-03T16:43:42Z",
          "user":"linuxUser16",
          "uid":"1219059"
          },
"pos": [41.9757030, -87.6921867],
"address": {
          "housenumber": "5157",
          "postcode": "60625",
          "street": "North Lincoln Ave"
          },
"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"
}

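Structures like this exist so that the data can easily be written out as JSON or stored in MongoDB. To make operations on such nested dictionaries easier to understand and master, let's run a few experiments and see how they behave.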
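Before the experiments, here is a minimal sketch of what "convenient for JSON" means in practice. It uses a shortened version of the record above (the variable names `record`, `street`, `latitude`, `phone`, `text` and `restored` are only for illustration) and relies on nothing beyond the standard `json` module.

```python
import json

# A shortened version of the record shown above.
record = {
    "id": "2406124091",
    "type": "node",
    "created": {"version": "2", "user": "linuxUser16"},
    "pos": [41.9757030, -87.6921867],
    "address": {"housenumber": "5157", "street": "North Lincoln Ave"},
}

# Nested values are reached by chaining keys and list indexes.
street = record["address"]["street"]
latitude = record["pos"][0]

# .get() with a default avoids a KeyError when an optional key is absent.
phone = record.get("phone", "unknown")

# The same structure serializes to JSON text and parses back unchanged.
text = json.dumps(record, indent=2)
restored = json.loads(text)
```

Because the record contains only strings, numbers, lists and dictionaries, the round trip through JSON returns an equal dictionary.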

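1. Can a complex dictionary be split into simple dictionaries?

If we split this complex structure into several small, simple dictionaries or lists, each piece becomes much easier to handle: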

## First small dictionary
{"id": "2406124091",
"type": "node",
"visible":"true"}
## Second small dictionary
{"version":"2",
 "changeset":"17206049",
 "timestamp":"2013-08-03T16:43:42Z",
 "user":"linuxUser16",
 "uid":"1219059"}
## A small list
[41.9757030, -87.6921867]
## Third small dictionary
{"housenumber": "5157",
 "postcode": "60625",
 "street": "North Lincoln Ave"}
## Fourth small dictionary
{"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"}

Next, let's see which approach can merge them back together:

d1 = {"id": "2406124091",
"type": "node",
"visible":"true"}
d2 = {"version":"2",
 "changeset":"17206049",
 "timestamp":"2013-08-03T16:43:42Z",
 "user":"linuxUser16",
 "uid":"1219059"}
l1 = [41.9757030, -87.6921867]
d3 = {"housenumber": "5157",
 "postcode": "60625",
 "street": "North Lincoln Ave"}
d4 = {"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"}
d = {d1,d2,l1,d3,d4}
#Traceback (most recent call last):
#  File "<ipython-input-6-d047a64525d0>", line 1, in <module>
#    d = {d1,d2,l1,d3,d4}
#TypeError: unhashable type: 'dict'
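### A brute-force merge. Unfortunately it cannot work: {d1, d2, ...} is a set literal,
### and dicts and lists are unhashable, so they cannot be set elements.
### Try again, this time giving each nested piece a key (a tag) before merging.
import pprint
d = dict(d1)  # copy d1, so that d1 itself is not modified
d['created'] = d2
d['pos'] = l1
d['address'] = d3
d = dict(d, **d4)
pprint.pprint(d)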
#{'address': {'housenumber': '5157',
#             'postcode': '60625',
#             'street': 'North Lincoln Ave'},
# 'amenity': 'restaurant',
# 'created': {'changeset': '17206049',
#             'timestamp': '2013-08-03T16:43:42Z',
#             'uid': '1219059',
#             'user': 'linuxUser16',
#             'version': '2'},
# 'cuisine': 'mexican',
# 'id': '2406124091',
# 'name': 'La Cabana De Don Luis',
# 'phone': '1 (773)-271-5176',
# 'pos': [41.975703, -87.6921867],
# 'type': 'node',
# 'visible': 'true'}
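### The nested dictionary was merged successfully, but the printed order is not the original one.
### In some applications the layout of the dictionary is strictly specified, so the order has to
### be controlled as well. Note that pprint sorts keys alphabetically by default, and that dicts
### only guarantee insertion order from Python 3.7 onward.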
d = {'created':d2,'pos':l1,'address':d3}
pprint.pprint(d)
#{'address': {'housenumber': '5157',
#             'postcode': '60625',
#             'street': 'North Lincoln Ave'},
# 'created': {'changeset': '17206049',
#             'timestamp': '2013-08-03T16:43:42Z',
#             'uid': '1219059',
#             'user': 'linuxUser16',
#             'version': '2'},
# 'pos': [41.975703, -87.6921867]}
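### The three tagged pieces merge fine, but d1 and d4 cannot be merged in at a controlled position
### this way: merging with dict() always appends the new elements at the end, which brings us
### back to where we started.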
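As a side note on the ordering problem, here is a minimal sketch that assumes Python 3.7 or later, where dictionaries keep insertion order (this is an assumption about the runtime, not something the experiment above relies on). Dictionary unpacking with `**` places the flat pieces exactly where they are written, and `sort_dicts=False` (Python 3.8+) stops `pprint` from re-sorting the keys. The pieces are shortened and the names `merged` and `key_order` are only for illustration.

```python
import pprint

# Shortened versions of the pieces used in the experiment above.
d1 = {"id": "2406124091", "type": "node", "visible": "true"}
d2 = {"version": "2", "changeset": "17206049"}
l1 = [41.9757030, -87.6921867]
d3 = {"housenumber": "5157", "postcode": "60625"}
d4 = {"amenity": "restaurant", "cuisine": "mexican"}

# ** spreads the keys of d1 and d4 at the top level, in the position
# where they appear; the tagged pieces sit between them.
merged = {**d1, "created": d2, "pos": l1, "address": d3, **d4}

# Keys come back in the order they were written (Python 3.7+).
key_order = list(merged)

# pprint sorts keys by default; sort_dicts=False keeps insertion order.
pprint.pprint(merged, sort_dicts=False)
```

Unlike `d = d1` followed by assignments, this builds a new top-level dictionary and leaves `d1` and `d4` untouched. On older interpreters the order is not guaranteed, which is one reason to fix the format up front.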

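2. The previous experiment shows that dictionaries can be merged directly, but the order of the structure was not under control, so some pieces need to be broken down further.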

d1 = {"id": "2406124091"}
d2 = {"type": "node"}
d3 = {"visible":"true"}
d4 = {"version":"2",
 "changeset":"17206049",
 "timestamp":"2013-08-03T16:43:42Z",
 "user":"linuxUser16",
 "uid":"1219059"}
l1 = [41.9757030, -87.6921867]
d5 = {"housenumber": "5157",
 "postcode": "60625",
 "street": "North Lincoln Ave"}
d6 = {"amenity": "restaurant"}
d7 = {"cuisine": "mexican"}
d8 = {"name": "La Cabana De Don Luis"}
d9 = {"phone": "1 (773)-271-5176"}

With the pieces split up like this, the rebuilt dictionary looks as follows:

d = {'id':d1,'type':d2,'visible':d3,'created':d4,'pos':l1,'address':d5,
     'amenity':d6,'cuisine':d7,'name':d8,'phone':d9}
import pprint
pprint.pprint(d)
#{'address': {'housenumber': '5157',
#             'postcode': '60625',
#             'street': 'North Lincoln Ave'},
# 'amenity': {'amenity': 'restaurant'},
# 'created': {'changeset': '17206049',
#             'timestamp': '2013-08-03T16:43:42Z',
#             'uid': '1219059',
#             'user': 'linuxUser16',
#             'version': '2'},
# 'cuisine': {'cuisine': 'mexican'},
# 'id': {'id': '2406124091'},
# 'name': {'name': 'La Cabana De Don Luis'},
# 'phone': {'phone': '1 (773)-271-5176'},
# 'pos': [41.975703, -87.6921867],
# 'type': {'type': 'node'},
# 'visible': {'visible': 'true'}}
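### This is not the result we wanted. Besides the order being scrambled, the rebuilt dictionary's
### structure is plainly wrong: each single-key dict ends up nested under its own key.
### Try another way of building it: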
d = {d1,d2,d3,'created':d4,'pos':l1,'address':d5,d6,d7,d8,d9}
d = {d1,d2,d3,d4,l1,d5,d6,d7,d8,d9}
#Traceback (most recent call last):
#  File "<ipython-input-15-a7e54b47c8d5>", line 1, in <module>
#    d = {d1,d2,d3,d4,l1,d5,d6,d7,d8,d9}
#TypeError: unhashable type: 'dict'
d = {d1,d2,d3,d4,'pos':l1,d5,d6,d7,d8,d9}
# File "<ipython-input-16-d4f78f29ba8d>", line 1
#   d = {d1,d2,d3,d4,'pos':l1,d5,d6,d7,d8,d9}
#                         ^
#SyntaxError: invalid syntax
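### This way of building it was wishful thinking: mixing bare items with key:value pairs
### inside {} is a syntax error.
### I won't try merging with dict() either, because l1 is a list, and a list cannot be merged
### by that function.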
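The idea behind the failed attempts above (some pieces are merged flat, others are stored under a tag) can still be expressed with ordinary dictionary operations. The following is only a sketch: `assemble` is a hypothetical helper written for this illustration, not a built-in, and it again assumes Python 3.7+ for the key order.

```python
def assemble(pieces):
    """Build one dict from (tag, value) pairs, in the order given.

    tag is None  -> value is a dict whose keys are merged at the top level
    tag is a str -> value (dict, list, ...) is stored under that key
    """
    result = {}
    for tag, value in pieces:
        if tag is None:
            result.update(value)  # flat merge of a small dict
        else:
            result[tag] = value   # nested piece kept under its tag
    return result


# Shortened pieces, listed in the order the final record should have.
pieces = [
    (None, {"id": "2406124091"}),
    (None, {"type": "node"}),
    ("created", {"version": "2", "uid": "1219059"}),
    ("pos", [41.9757030, -87.6921867]),
    (None, {"amenity": "restaurant"}),
]
d = assemble(pieces)
```

Here the list may be tagged like any other value, so the limitation of `dict()` with a list does not come up.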

3. Fix the dictionary's format first, then fill in the values or content

d = {
"id": "",
"type": "",
"visible":"",
"created": {
          "version":"",
          "changeset":"",
          "timestamp":"",
          "user":"",
          "uid":""
          },
"pos": [0,0],
"address": {
          "housenumber": "",
          "postcode": "",
          "street": ""
          },
"amenity": "",
"cuisine": "",
"name": "",
"phone": ""
}
# Fill in values by key; nested values are reached by chaining keys.
d['id']='2406124091'
d['address']['housenumber']='123456'
pprint.pprint(d['id'])
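When the same fixed format is filled for many records, each record needs its own copy of the nested parts. The sketch below assumes a shortened template; `TEMPLATE` and `new_record` are hypothetical names chosen for this illustration. It uses `copy.deepcopy` so that filling one record does not leak into the template or into other records.

```python
import copy

# A shortened version of the empty template above.
TEMPLATE = {
    "id": "",
    "created": {"user": "", "uid": ""},
    "pos": [0, 0],
    "address": {"housenumber": "", "street": ""},
    "name": "",
}


def new_record():
    """Return an independent copy of the template, nested parts included."""
    return copy.deepcopy(TEMPLATE)


first = new_record()
first["id"] = "2406124091"
first["address"]["housenumber"] = "5157"
first["pos"][0] = 41.9757030

# A second record starts from the untouched template.
second = new_record()
```

A shallow copy such as `dict(TEMPLATE)` would share the inner `address` dictionary and the `pos` list between records, so a change made through `first` would show up in `second` as well.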