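What is the best way to implement nested dictionaries in Python?

I have a data structure that essentially amounts to a nested dictionary. Let's say it looks like this: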


{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

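Maintaining and creating this is quite painful: every time I have a new state/county/profession, I have to create the lower-layer dictionaries via obnoxious try/catch blocks. Moreover, I have to create annoying nested iterators if I want to go over all the values.

I could also use tuples as keys, like this: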


{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'salesmen'): 62,
 ('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

How can I do this better?



class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

Testing:


a = AutoVivification()

a[1][2][3] = 4
a[1][3][3] = 5
a[1][2]['test'] = 6

print(a)

Output:


{1: {2: {3: 4, 'test': 6}, 3: {3: 5}}}

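Implement __missing__ on a dict subclass to set and return a new instance.

I recently found this more elegant approach. It has been available (and documented) since Python 2.5, and I like that the result prints like a plain dict rather than like an autovivified defaultdict: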


class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

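Explanation: whenever a key is accessed but missing, we provide another nested instance of the class Vividict. (Returning the assigned value is useful because it avoids an additional call to the getter on the dict; unfortunately, we can't return it as it is being set.)

Demonstration of usage

Below is just an example of how this could be used to easily create a nested dict structure on the fly. This can quickly create a hierarchical tree structure as deep as you might want.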
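To make the explanation concrete, here is a minimal sketch of when `__missing__` actually fires. It assumes the same `Vividict` class shown above; the point is that only subscript access (`d[key]`) on an absent key triggers it, while `in` and `get()` never create anything.

```python
class Vividict(dict):
    def __missing__(self, key):
        # Called by dict.__getitem__ only when the key is absent.
        value = self[key] = type(self)()
        return value

d = Vividict()

# Membership tests and .get() bypass __missing__, so nothing is created.
assert 'a' not in d
assert d.get('a') is None
assert len(d) == 0

# Subscript access on a missing key creates and stores a nested Vividict.
child = d['a']
assert isinstance(child, Vividict)
assert d['a'] is child  # a second access returns the stored instance

# Chained access therefore builds every intermediate level on demand.
d['a']['b']['c'] = 1
print(d)
```

A consequence worth keeping in mind: merely reading `d['typo']` leaves an empty entry behind, so use `in` or `get()` when you only want to look.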


import pprint

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

d = Vividict()

d['foo']['bar']
d['foo']['baz']
d['fizz']['buzz']
d['primary']['secondary']['tertiary']['quaternary']
pprint.pprint(d)

Which outputs:


{'fizz': {'buzz': {}},
 'foo': {'bar': {}, 'baz': {}},
 'primary': {'secondary': {'tertiary': {'quaternary': {}}}}}

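And as the last line shows, it pretty-prints nicely and is easy to inspect manually.

Another alternative, for contrast:

dict.setdefault

setdefault works well inside loops when you don't know which keys you are going to get, but repeated use becomes quite burdensome, and I don't think anyone would want to maintain the following: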


d = dict()

d.setdefault('foo', {}).setdefault('bar', {})
d.setdefault('foo', {}).setdefault('baz', {})
d.setdefault('fizz', {}).setdefault('buzz', {})
d.setdefault('primary', {}).setdefault('secondary', {}).setdefault('tertiary', {}).setdefault('quaternary', {})

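An auto-vivified defaultdict

This is a clean-looking implementation, and in a script where you are not inspecting the data it would be as useful as implementing __missing__:


import collections

d = collections.defaultdict(lambda: d)

But if you need to inspect your data, the result of an auto-vivified defaultdict populated in the same way looks like this (the {...} entries are recursive references, because this factory returns d itself):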


>>> d = collections.defaultdict(lambda: d); d['foo']['bar']; d['foo']['baz']; d['fizz']['buzz']; d['primary']['secondary']['tertiary']['quaternary']; import pprint; 
>>> pprint.pprint(d)
defaultdict(<function <lambda> at 0x189D7F30>, {'bar': defaultdict(<function 
<lambda> at 0x189D7F30>, {...}), 'secondary': defaultdict(<function <lambda> at 
0x189D7F30>, {...}), 'baz': defaultdict(<function <lambda> at 0x189D7F30>, {...}), 
'primary': defaultdict(<function <lambda> at 0x189D7F30>, {...}), 'quaternary': 
defaultdict(<function <lambda> at 0x189D7F30>, {...}), 'buzz': defaultdict(<function 
<lambda> at 0x189D7F30>, {...}), 'foo': defaultdict(<function <lambda> at 0x189D7F30>, 
{...}), 'tertiary': defaultdict(<function <lambda> at 0x189D7F30>, {...}), 'fizz': 
defaultdict(<function <lambda> at 0x189D7F30>, {...})})

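This output is quite inelegant, and the result is very hard to read. The solution typically given is to recursively convert back to a plain dict for manual inspection.

Conclusion

Implementing __missing__ to set and return a new instance takes moderately more effort, but it has these benefits:

  • easy instantiation
  • easy data population
  • easy data viewing

This is my recommendation for implementing an autovivified nested dictionary in Python.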
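The recursive conversion mentioned here could look roughly like the sketch below. The names `vivdict` and `to_plain_dict` are hypothetical helpers, not part of the original answer. Note that this sketch uses a recursive factory function rather than the `lambda: d` form, because a dict that contains itself cannot be copied this way without recursing forever.

```python
from collections import defaultdict
import pprint

def vivdict():
    # Hypothetical factory: every missing key gets a fresh nested defaultdict.
    return defaultdict(vivdict)

def to_plain_dict(obj):
    # Recursively copy nested mappings into plain dicts; leave leaves as-is.
    if isinstance(obj, dict):
        return {key: to_plain_dict(value) for key, value in obj.items()}
    return obj

d = vivdict()
d['foo']['bar']
d['foo']['baz']
d['fizz']['buzz']

plain = to_plain_dict(d)
pprint.pprint(plain)  # prints like an ordinary dict, without factory noise
```

The cost is an extra full copy of the structure every time you want to look at it, which is the inconvenience the `__missing__` approach avoids.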


from collections import defaultdict

# yo dawg, i heard you liked dicts
def yodict():
    return defaultdict(yodict)

You could create a YAML file and read it in using PyYaml.

Step 1: Create a YAML file, "employment.yml":


new jersey:
  mercer county:
    plumbers: 3
    programmers: 81
  middlesex county:
    salesmen: 62
    programmers: 81
new york:
  queens county:
    plumbers: 9
    salesmen: 36

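Step 2: Read it in Python


import yaml

# The with block closes the file even if parsing fails.
with open("employment.yml") as file_handle:
    my_shnazzy_dictionary = yaml.safe_load(file_handle)

Now my_shnazzy_dictionary has all your values. If you need to do this on the fly, you can build the YAML as a string and feed that to yaml.safe_load(...).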

Since you have a star-schema design, you might want to structure it more like a relational table and less like a dictionary.


import collections

class Jobs(object):
    def __init__(self, state, county, title, count):
        self.state = state
        self.county = county
        self.title = title
        self.count = count

facts = [
    Jobs('new jersey', 'mercer county', 'plumbers', 3),
    # ...
]

def groupBy(facts, name):
    # Sum the counts of all facts that share the same value of attribute `name`.
    total = collections.defaultdict(int)
    for f in facts:
        key = getattr(f, name)
        total[key] += f.count
    return total
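As a rough usage sketch of the flat "fact table" idea, the block below fills in all six rows from the question and groups them two ways. The `Job` dataclass and the `group_by` name are stand-ins chosen for this illustration; they are assumptions, not the answer's own code.

```python
import collections
from dataclasses import dataclass

@dataclass
class Job:
    # Hypothetical stand-in for the Jobs class: one row per fact.
    state: str
    county: str
    title: str
    count: int

facts = [
    Job('new jersey', 'mercer county', 'plumbers', 3),
    Job('new jersey', 'mercer county', 'programmers', 81),
    Job('new jersey', 'middlesex county', 'programmers', 81),
    Job('new jersey', 'middlesex county', 'salesmen', 62),
    Job('new york', 'queens county', 'plumbers', 9),
    Job('new york', 'queens county', 'salesmen', 36),
]

def group_by(facts, name):
    # Sum `count` over all facts sharing the same value of attribute `name`.
    totals = collections.defaultdict(int)
    for fact in facts:
        totals[getattr(fact, name)] += fact.count
    return dict(totals)

by_title = group_by(facts, 'title')
by_state = group_by(facts, 'state')
print(by_title)
print(by_state)
```

Because every row is flat, grouping by a different dimension is just a different attribute name, with no nested loops.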

If the number of nesting levels is small, I use collections.defaultdict for this:


from collections import defaultdict

def nested_dict_factory():
    return defaultdict(int)

def nested_dict_factory2():
    return defaultdict(nested_dict_factory)
db = defaultdict(nested_dict_factory2)

db['new jersey']['mercer county']['plumbers'] = 3
db['new jersey']['mercer county']['programmers'] = 81

Using defaultdict like this avoids a lot of messy setdefault(), get(), etc.

I find setdefault quite useful; it checks whether a key is present and adds it if it is not:
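One behaviour of this approach is easy to overlook, so here is a small sketch of it (the lambda-based construction is just a compact equivalent chosen for the example): reading a missing key from a defaultdict inserts it, whereas `in` and `get()` do not.

```python
from collections import defaultdict

# Three levels: state -> county -> profession -> count (default 0).
db = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
db['new jersey']['mercer county']['plumbers'] = 3

# A plain read of a missing path creates every level along the way.
n = db['new york']['queens county']['plumbers']
assert n == 0
assert 'new york' in db

# Membership tests and .get() look without inserting anything.
assert 'ohio' not in db
assert db.get('ohio') is None
assert 'ohio' not in db

print(sorted(db))
```

If accidental inserts matter, check with `in` first or convert the structure to plain dicts once it has been populated.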


d = {}
d.setdefault('new jersey', {}).setdefault('mercer county', {})['plumbers'] = 3

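setdefault always returns the value stored under the key, so you are actually updating the nested dictionaries inside 'd' in place.

When it comes to iterating, I'm sure you could write a generator easily enough if one doesn't already exist in Python: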
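Since setdefault returns the stored value, the chaining can be folded into a loop. The helper below, `set_path`, is a hypothetical name introduced for this sketch; it walks a list of keys and creates the intermediate dicts as needed.

```python
def set_path(d, keys, value):
    # Walk all keys but the last, creating empty dicts where missing.
    for key in keys[:-1]:
        d = d.setdefault(key, {})
    # Assign the value under the final key.
    d[keys[-1]] = value

d = {}
set_path(d, ['new jersey', 'mercer county', 'plumbers'], 3)
set_path(d, ['new jersey', 'mercer county', 'programmers'], 81)
set_path(d, ['new york', 'queens county', 'salesmen'], 36)
print(d)
```

This keeps `d` an ordinary dict, so reading a missing key still raises KeyError instead of silently creating it.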


def iterateStates(d):
    # Let's count up the total number of "plumbers" / "dentists" / etc.
    # across all counties and states
    job_totals = {}

    # I guess this is the annoying nested stuff you were talking about?
    for (state, counties) in d.items():
        for (county, jobs) in counties.items():
            for (job, num) in jobs.items():
                # If job isn't already in job_totals, default it to zero
                job_totals[job] = job_totals.get(job, 0) + num

    # Now return an iterator of (job, number) tuples
    return iter(job_totals.items())

# Display all jobs
for (job, num) in iterateStates(d):
    print("There are %d %s in total" % (num, job))

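As others have suggested, a relational database could be more useful to you. You can use an in-memory sqlite3 database as a data structure to create tables and then query them.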


import sqlite3

c = sqlite3.connect(':memory:')
c.execute('CREATE TABLE jobs (state, county, title, count)')

c.executemany('insert into jobs values (?, ?, ?, ?)', [
    ('New Jersey', 'Mercer County', 'Programmers', 81),
    ('New Jersey', 'Mercer County', 'Plumbers', 3),
    ('New Jersey', 'Middlesex County', 'Programmers', 81),
    ('New Jersey', 'Middlesex County', 'Salesmen', 62),
    ('New York', 'Queens County', 'Salesmen', 36),
    ('New York', 'Queens County', 'Plumbers', 9),
])

# some example queries (SQL string literals use single quotes)
print(list(c.execute("SELECT * FROM jobs WHERE county = 'Queens County'")))
print(list(c.execute("SELECT SUM(count) FROM jobs WHERE title = 'Programmers'")))

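This is just a simple example. You could define separate tables for states, counties and job titles.

defaultdict() is your friend!

I didn't come up with this (see "python Multi-dimensional dicts using defaultdict"), but for a two-dimensional dictionary you can do: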
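As a rough illustration of the separate-tables idea, the sketch below normalizes the same data into `states`, `counties`, `titles` and `jobs`. The table layout, column names and the `lookup_id` helper are all assumptions made for this example, not a schema given in the original answer.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE states   (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE counties (id INTEGER PRIMARY KEY,
                           state_id INTEGER REFERENCES states(id),
                           name TEXT,
                           UNIQUE (state_id, name));
    CREATE TABLE titles   (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE jobs     (county_id INTEGER REFERENCES counties(id),
                           title_id INTEGER REFERENCES titles(id),
                           count INTEGER);
""")

def lookup_id(insert_sql, select_sql, params):
    # Hypothetical helper: insert the row if it is new, then return its id.
    conn.execute(insert_sql, params)
    return conn.execute(select_sql, params).fetchone()[0]

rows = [
    ('New Jersey', 'Mercer County', 'Programmers', 81),
    ('New Jersey', 'Mercer County', 'Plumbers', 3),
    ('New Jersey', 'Middlesex County', 'Programmers', 81),
    ('New Jersey', 'Middlesex County', 'Salesmen', 62),
    ('New York', 'Queens County', 'Salesmen', 36),
    ('New York', 'Queens County', 'Plumbers', 9),
]

for state, county, title, count in rows:
    state_id = lookup_id('INSERT OR IGNORE INTO states (name) VALUES (?)',
                         'SELECT id FROM states WHERE name = ?', (state,))
    county_id = lookup_id(
        'INSERT OR IGNORE INTO counties (state_id, name) VALUES (?, ?)',
        'SELECT id FROM counties WHERE state_id = ? AND name = ?',
        (state_id, county))
    title_id = lookup_id('INSERT OR IGNORE INTO titles (name) VALUES (?)',
                         'SELECT id FROM titles WHERE name = ?', (title,))
    conn.execute('INSERT INTO jobs VALUES (?, ?, ?)',
                 (county_id, title_id, count))

# Total headcount per state, joined back through counties.
per_state = list(conn.execute("""
    SELECT s.name, SUM(j.count)
    FROM jobs j
    JOIN counties c ON c.id = j.county_id
    JOIN states s ON s.id = c.state_id
    GROUP BY s.name
    ORDER BY s.name
"""))
print(per_state)
```

The extra tables pay off mainly when names repeat often or carry attributes of their own; for a handful of rows the single-table version is simpler.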


from collections import defaultdict

d = defaultdict(defaultdict)
d[1][2] = 3

For more dimensions you can do:


d = defaultdict(lambda: defaultdict(defaultdict))
d[1][2][3] = 4
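The same pattern can be generalized to a fixed number of dimensions. The function below, `nested_defaultdict`, is a hypothetical helper written for this sketch; it assumes you know the depth up front and want a default value (here `int`) at the leaves.

```python
from collections import defaultdict

def nested_defaultdict(depth, leaf_factory=int):
    # depth == 1 is an ordinary defaultdict producing leaf values.
    if depth == 1:
        return defaultdict(leaf_factory)
    # Otherwise each missing key gets a defaultdict one level shallower.
    return defaultdict(lambda: nested_defaultdict(depth - 1, leaf_factory))

d = nested_defaultdict(3)
d[1][2][3] += 4
d['a']['b']['c'] += 1
print(d[1][2][3], d['a']['b']['c'])
```

Unlike `defaultdict(defaultdict)`, whose innermost level has no factory, the leaf level here also has a default, so `+=` works on keys that do not exist yet.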

collections.defaultdict can be subclassed to make a nested dict; you can then add any useful iteration methods to that class.


>>> from collections import defaultdict
>>> class nesteddict(defaultdict):
        def __init__(self):
            defaultdict.__init__(self, nesteddict)
        def walk(self):
            # Yield one flat tuple (key, ..., value) per leaf.
            for key, value in self.items():
                if isinstance(value, nesteddict):
                    for tup in value.walk():
                        yield (key,) + tup
                else:
                    yield key, value


>>> nd = nesteddict()
>>> nd['new jersey']['mercer county']['plumbers'] = 3
>>> nd['new jersey']['mercer county']['programmers'] = 81
>>> nd['new jersey']['middlesex county']['programmers'] = 81
>>> nd['new jersey']['middlesex county']['salesmen'] = 62
>>> nd['new york']['queens county']['plumbers'] = 9
>>> nd['new york']['queens county']['salesmen'] = 36
>>> for tup in nd.walk():
        print(tup)


('new jersey', 'mercer county', 'plumbers', 3)
('new jersey', 'mercer county', 'programmers', 81)
('new jersey', 'middlesex county', 'programmers', 81)
('new jersey', 'middlesex county', 'salesmen', 62)
('new york', 'queens county', 'plumbers', 9)
('new york', 'queens county', 'salesmen', 36)

...