当前位置：

首页
/
IT
/
程序
/
Python
/
理解 Python 字典和集合的性能差异

理解 Python 字典和集合的性能差异

本文为极客时间 Python 核心技术与实战专栏的学习笔记

字典

在 Python3.7+, 字典被确定为有序 (注意: 在 3.6 中, 字典有序是一个 implementation detail, 在 3.7 才正式成为语言特性, 因此 3.6 中无法 100% 确保其有序性), 而 3.6 之前是无序的, 其长度大小可变, 元素可以任意地删减和改变.

相比列表和元组, 字典性能更优, 可以在常数时间复杂度 O(1) 内完成查找, 添加, 删除操作.

常用创建方法

>>> d1 = {
	'name': 'Json', 'age': 20, 'gender': 'male'	
}
>>> d2 = dict( {
	'name': 'Json', 'age': 20, 'gender': 'male'	
})
>>> d3 = dict([('name', 'Json'),('age', 20),('gender','male')])
>>> d4 = dict(name='Json', age=20, gender='male')
>>> d1 == d2 == d3 == d4
True

索引

使用 dict[key] 格式索引, 如果不存在, 会抛出 KeyError 异常.

>>> d1['name']
'Json'
>>> d1['location']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'location'

使用 get(key, default) 方法不会抛出异常, 此外当该键不存在时, 可以指定返回的默认值.

>>> d1.get('name')
'Json'
>>> d1.get('location')
>>> d1.get('location', None)
None

判断某元素是否是字典的键:

>>> 'name' in d1        # 同 d1.keys() 等价
True
>>> 'name' in d1.keys()
True
>>> 'Json' in d1
False
>>> 'Json' in d1.values()
True

添加

>>> d = {
	'name': 'jason', 'age': 20	
}
>>> d['gender'] = 'male'
>>> d
{
	'name': 'jason', 'age': 20, 'gender': 'male'	
}

删除

>>> d.pop('gender')
'male'
>>> d
{
	'name': 'jason', 'age': 20	
}

排序

根据键排序

>>> d = {
	'b': 1, 'v': 20, 'a': 17	
}
>>> d_sorted_by_key = sorted(d.items(), key=lambda x: x[0])
>>> d_sorted_by_key
[('a', 17), ('b', 1), ('v', 20)]
>>> d
{
	'b': 1, 'v': 20, 'a': 17	
}

根据值排序

>>> d_sorted_by_value = sorted(d.items(), key = lambda x: x[1])
>>> d_sorted_by_value
[('b', 1), ('a', 17), ('v', 20)]

性能分析

举例: 有 1000 万件产品, 产品信息包括: 产品 ID, 价格. 现在需求是: 给定某件产品的 ID, 找出其价格:

1. 用列表来存储数据

存储结构如下:

products = [
    (143121312, 100),
    (432314553, 30),
    (32421912367, 150)
]

那么查找需要遍历整个列表, 时间复杂度为 O(n). 即使先对列表排序, 然后二分查找, 也会需要 O(logn) 的时间复杂度, 并且排序还需要 O(nlogn) 的时间.

2. 用字典存储数据

存储结构如下:

products = {
    '143121312': 100,
    '432314553': 30,
    '32421912367': 150
}

因为字典内部结构是一张哈希表, 所以可以在 O(1) 的时间复杂度内完成查找.

3. 效率对比

import time
import numpy
def find_product_price_list(products, product_id):
    for id, price in products:
        if id == product_id:
            return price
    return None
def find_product_price_dict(products, product_id):
    for id in products.keys():
        if id == product_id:
            return products_dict[id]
    return None
r = numpy.random.randint(0,10000000,10000000)       # 生成 10000000 个随机数
id = [str(x) for x in r]
price = [x for x in range(20000000, 30000000)]
products_list = list(zip(id, price))
products_dict = dict(zip(id, price))
# 添加新元素
products_list.append(('111111111', 300))    # 追加到列表末尾
products_dict['111111111'] = 300
start_using_dict = time.perf_counter()
find_product_price_dict(products_dict, '111111111')
end_using_dict = time.perf_counter()
print('time elapse using dict: {}'.format(end_using_dict - start_using_dict))
start_using_list = time.perf_counter()
find_product_price_list(products_dict, '111111111')
end_using_list = time.perf_counter()
print('time elapse using list: {}'.format(end_using_list - start_using_list))
# =========== 运行结果 ============
time elapse using dict: 0.1983588489999999
time elapse using list: 0.41368435999999953

集合

而集合和字典基本相同, 唯一的区别, 就是集合没有键和值的配对, 是一系列无序的, 唯一的元素组合.

常用创建方法

>>> s1 = {
	1,2,3	
}
>>> s2 = set([1,2,3])
>>> s1 == s2
True

索引

集合并不支持索引操作, 因为集合本质上是一个哈希表, 和列表不一样.

进行如下操作, Python 会抛出 TypeError 异常.

>>> s1[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object is not subscriptable

只能判断某元素是否在集合中:

>>> 1 in s1
True
>>> 10 in s1
False

添加

>>> s = {
	1,2,3	
}
>>> s.add(4)
>>> s
{
	1, 2, 3, 4	
}

删除

>>> s.remove(4)
>>> s
{
	1, 2, 3	
}

注意: 由于集合是无序的, 所以无法确定 pop() 方法会删除哪个元素, 所以谨慎使用. 一般删除操作采用 remove() 即可.

排序

>>> s  = {
	2,4,546,34	
}
>>> sorted(s)
[2, 4, 34, 546]

集合运算

常用集合运算

语法	操作	说明
set(list1) \| set(list2)	union	包含 list1 和 list2 所有数据的新集合
set(list1) & set(list2)	intersection	包含 list1 和 list2 中共同元素的新集合
set(list1) - set(list2)	difference	在 list1 中出现但不在 list2 中出现的元素的集合

性能分析

还是以上面那个例子为例, 现在要求计算出有多少种价格. 为了节省时间, 我们把产品数量降低到 10 万.

查找效率

1. 用列表存储数据

需要两层循环. 那么, 在最差情况下, 需要 O(n^2) 的时间复杂度.

2. 用集合存储数据

由于集合是高度优化的哈希表, 里面元素不能重复, 并且其添加和查找操作只需 O(1) 的复杂度, 那么, 总的时间复杂度就只有 O(n).

3. 效率对比

import time
import numpy
def find_unique_price_set(products):
    unique_price_set = set()
    for _, price in products:
        unique_price_set.add(price)
    return len(unique_price_set)
def find_unique_price_list(products):
    unique_price_list = []
    for _, price in products: # A
        if price not in unique_price_list: #B
            unique_price_list.append(price)
    return len(unique_price_list)
r = numpy.random.randint(0,1000000,100000)       # 生成 100000 个随机数
id = [str(x) for x in r]
price = [x for x in range(200000, 300000)]
products = list(zip(id, price))
start_using_set = time.perf_counter()
find_unique_price_set(products)
end_using_set = time.perf_counter()
print('time elapse using set: {}'.format(end_using_set - start_using_set))
start_using_list = time.perf_counter()
find_unique_price_list(products)
end_using_list = time.perf_counter()
print('time elapse using list: {}'.format(end_using_list - start_using_list))
# =========== 运行结果 ============
time elapse using set: 0.00985934799999999
time elapse using list: 65.528253501

可以看出, 仅 10 万数据, 差距就已经很明显了.

交集, 并集, 差集运算

以求交集为例:

import time
list_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 13, 34, 53, 42, 44]
list_b = [2, 4, 6, 9, 23]
intersection = []
# 列表做交集
start_using_list = time.perf_counter()
for a in list_a:
    for b in list_b:
        if a == b:
            intersection.append(a)
end_using_list = time.perf_counter()
print(intersection)
print('time: {}'.format(end_using_list - start_using_list))
# 集合做交集
start_using_list = time.perf_counter()
intersection = list(set(list_a) & set(list_b))
end_using_list = time.perf_counter()
print(intersection)
print('time: {}'.format(end_using_list - start_using_list))
# =========== 运行结果 ============
[2, 4, 6, 9]
time: 9.622000000000797e-06
[9, 2, 4, 6]
time: 4.169000000001782e-06

来源: http://www.bubuko.com/infodetail-3355788.html

与本文相关文章

暂无,快来抢沙发吧！