Python list append takes a long time?


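I have a function that finds the common and uncommon items, and their rates, between a given list (one list per user, 4,000 users) and other lists (60,000 lists). Running it in a loop takes too long and uses a lot of memory; it crashes with the list only partially constructed. I think the cause is that the returned list is very long and its elements (tuples) are heavy, so I split the work into two functions as shown below, but appending the list items inside tuples still seems to be the problem:

[(user, [items], rate), (user, [items], rate), ...]

I want to create dataframes from the returned values.

What should I change in the algorithm to get around this problem and reduce memory usage?

I am using Python 3.7, Windows 10, 64-bit, 8 GB RAM.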

Common items function:

def common_items(user, list1, list2):
    com_items = list(set(list1).intersection(set(list2)))
    com_items_rate = len(com_items) / len(set(list1).union(set(list2)))
    return user, com_items, com_items_rate

Uncommon items function:

def uncommon_items(user, list1, list2):
    com_items = list(set(list1).intersection(set(list2)))
    com_items_rate = len(com_items) / len(set(list1).union(set(list2)))

    uncom_items = list(set(list2) - set(com_items))  # uncommon items that belong to list2
    uncom_items_rate = len(uncom_items) / len(set(list1).union(set(list2)))

    return user, com_items_rate, uncom_items, uncom_items_rate  # common_items_rate is also needed

Building the list:

common_item_rate_tuple_list = [] 

for usr in users: # users.shape = 4,000
    list1 = get_user_list(usr) # a function to get list1, it takes 0:00:00.015632 or less for a user
#     print(usr, len(list1))            

    for list2 in df["list2"]: # df.shape = 60,000

        common_item_rate_tuple = common_items(usr,list1, list2) 
        common_item_rate_tuple_list.append(common_item_rate_tuple)
        
print(len(common_item_rate_tuple_list))  # 4,000 * 60,000 = 240,000,000 items
# sample of common_item_rate_tuple_list:
# [(1, [2, 5, 8], 0.676), (1, [7, 4], 0.788), ..., (4000, [1, 5, 7, 9], 0.318), (4000, [8, 9, 6], 0.521)]

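I looked at two other questions, but they deal with lists that are already constructed. I could not make use of another suggested answer.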

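With this much data, there are a couple of things to consider for both speed and memory management.

1. You are, or should be, working only with `set`s here, because order has no meaning in your lists and you are doing a lot of set intersections. So can you change your `get_user_list()` function to return a set instead of a list? That will prevent all of the unnecessary conversions you are doing. The same goes for list2: just make it a set right away.
2. When looking for the "uncommon items", use the set operators directly: the symmetric difference operator (`^`), or plain set difference (`set2 - set1`) if, as in your code, you only want the items that belong to list2. Much faster, and far fewer list -> set conversions.
3. At the end of your loop, do you really want to create a list of 240M sub-lists? That is probably what blows up your memory. I would suggest a dictionary keyed by user name, where you only create an entry if there are common items. If the matches are "sparse", you will end up with a much smaller data container.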
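As a minimal sketch of points 1 and 2, assuming every list can be converted to a set once, before the loops start. The names `user_sets`, `other_sets` and `compare` are made up for illustration, and the data is fake:

```python
# Hypothetical data: each list is converted to a set exactly once, up front.
user_sets = {1: set([2, 3, 5, 7]), 2: set([3, 5, 88])}
other_sets = [set([1, 2, 3, 4, 5]), set([88, 100])]

def compare(set1, set2):
    """Return (common, common_rate, uncommon, uncommon_rate) for two sets."""
    common = set1 & set2            # intersection
    uncommon = set2 - set1          # items that belong only to set2
    union_size = len(set1 | set2)   # computed once, reused for both rates
    if union_size == 0:             # both sets empty: avoid dividing by zero
        return common, 0.0, uncommon, 0.0
    return common, len(common) / union_size, uncommon, len(uncommon) / union_size

for user, set1 in user_sets.items():
    for idx, set2 in enumerate(other_sets):
        common, c_rate, uncommon, u_rate = compare(set1, set2)
        print(user, idx, sorted(common), round(c_rate, 3), sorted(uncommon), round(u_rate, 3))
```

No list -> set conversion happens inside the inner loop, which is the part that would run 240M times.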

--- Edit w/ example

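So, I think your hope of keeping all of this in a dataframe is too ambitious. Perhaps you can do what you need to do without storing it in a dataframe. A dictionary makes sense. You could even compute things "on the fly" and not store the data at all. Anyhow, here is a toy example that shows the memory problem using 4K users and 10K "other lists". Of course the size of the intersected sets will change the numbers, but it is informative: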

import sys
import pandas as pd

# create list of users by index
users = list(range(4000))

match_data = list()

size_list2 = 10_000

for user in users:
    for t in range(size_list2):
        match_data.append(( user, (1,5,6,9), 0.55))   # 4 dummy matches and fake percentage


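print(match_data[:4])
# note: sys.getsizeof() is shallow here; it measures the list's own pointer array, not the tuples it refers to
print(f'size of match: {sys.getsizeof(match_data)/1_000_000} MB')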

df = pd.DataFrame(match_data)

print(df.head())

print(f'size of dataframe {sys.getsizeof(df)/1_000_000} MB')

This yields the following:

[(0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55)]
size of match: 335.072536 MB
   0             1     2
0  0  (1, 5, 6, 9)  0.55
1  0  (1, 5, 6, 9)  0.55
2  0  (1, 5, 6, 9)  0.55
3  0  (1, 5, 6, 9)  0.55
4  0  (1, 5, 6, 9)  0.55
size of dataframe 3200.00016 MB

You can see that even a small slice of your idea, with only 10K other lists, takes 3.2 GB as a dataframe. That is going to be unmanageable.
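To put a rough number on the full 4,000 x 60,000 case, here is a back-of-the-envelope sketch. It assumes rows shaped like the sample in the question, `(user, [items], rate)`, with three items each; the helper name `row_bytes` is made up, and the exact byte counts vary by Python version and platform. The integer objects themselves are ignored, so treat the result as a lower bound:

```python
import sys

def row_bytes(row):
    """Approximate bytes held by one (user, items, rate) row, ignoring the ints."""
    user, items, rate = row
    return sys.getsizeof(row) + sys.getsizeof(items) + sys.getsizeof(rate)

sample_row = (1, [2, 5, 8], 0.676)      # shape taken from the question's sample
per_row = row_bytes(sample_row) + 8     # + 8 bytes for the outer list's pointer (64-bit)
n_rows = 4_000 * 60_000                 # 240,000,000 rows

estimate_gb = per_row * n_rows / 1e9
print(f'about {per_row} bytes per row, about {estimate_gb:.0f} GB for {n_rows:,} rows')
```

Whatever the exact per-row figure turns out to be on a given machine, the total lands well above 8 GB of RAM.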

Here is an idea for a data structure that just uses a dictionary all the way through:

del df

# just keep it in a dictionary
data = {}   # intended format:  key= (usr, other_list) : value= [common elements]

# some fake data
user_items = {  1: {2,3,5,7,99},
                2: {3,5,88,790},
                3: {2,4,100} }

# some fake "list 2 data"
list2 = [   {1,2,3,4,5},
            {88, 100},
            {11, 13, 200}]

for user in user_items.keys():
    for idx, other_set in enumerate(list2):     # using enumerate to get the index of the other list
        common_elements = user_items.get(user) & other_set   # set intersection
        if common_elements:  # only put it into the dictionary if it is not empty
            data[(user, idx)] = common_elements

# try a couple data pulls
print(f'for user 1 and other list 0: {data.get((1, 0))}')
print(f'for user 2 and other list 2: {data.get((2, 2))}')  # use .get() to be safe.  It will return None if no entry

The output here is:

for user 1 and other list 0: {2, 3, 5}
for user 2 and other list 2: None
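The "on the fly" idea mentioned earlier could look something like the sketch below: a generator hands out one match at a time, and only a small running summary is kept. The function name `iter_matches` and the "best match per user" summary are illustrative assumptions, not something the question asked for:

```python
def iter_matches(user_items, other_sets):
    """Lazily yield (user, index, common_items, rate), skipping empty intersections."""
    for user, items in user_items.items():
        for idx, other in enumerate(other_sets):
            common = items & other
            if common:
                yield user, idx, common, len(common) / len(items | other)

# same fake data as in the dictionary example
user_items = {1: {2, 3, 5, 7, 99}, 2: {3, 5, 88, 790}, 3: {2, 4, 100}}
list2 = [{1, 2, 3, 4, 5}, {88, 100}, {11, 13, 200}]

# keep only a running summary: the best-matching "other list" for each user
best = {}
for user, idx, common, rate in iter_matches(user_items, list2):
    if user not in best or rate > best[user][1]:
        best[user] = (idx, rate)

print(best)
```

Nothing but `best` grows with the data, so memory stays proportional to the number of users rather than to the number of (user, list) pairs.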

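Your other option, if you are going to work with this data a lot, is to put these tables into a database such as `sqlite`, which is built in and won't blow up your memory.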
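A minimal sketch of the database route using the standard-library `sqlite3` module. The table name `matches`, its columns, and the choice of one row per common item are assumptions made for illustration; an in-memory database is used here so the example is self-contained, and a file path would be used to keep the results on disk:

```python
import sqlite3

# same fake data as in the dictionary example
user_items = {1: {2, 3, 5, 7, 99}, 2: {3, 5, 88, 790}, 3: {2, 4, 100}}
list2 = [{1, 2, 3, 4, 5}, {88, 100}, {11, 13, 200}]

conn = sqlite3.connect(":memory:")  # pass a file name instead to store on disk
conn.execute(
    "CREATE TABLE matches ("
    " user INTEGER, other_idx INTEGER, item INTEGER,"
    " PRIMARY KEY (user, other_idx, item))"
)

def match_rows():
    """Yield one (user, other_idx, item) row per common item, lazily."""
    for user, items in user_items.items():
        for idx, other in enumerate(list2):
            for item in items & other:
                yield user, idx, item

with conn:  # commits the transaction on success
    conn.executemany("INSERT INTO matches VALUES (?, ?, ?)", match_rows())

rows = conn.execute(
    "SELECT item FROM matches WHERE user = ? AND other_idx = ? ORDER BY item",
    (1, 0),
).fetchall()
common_1_0 = [r[0] for r in rows]
total_rows = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
print(common_1_0, total_rows)
```

Because `executemany` consumes the generator row by row, the full set of matches never has to sit in a Python list.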

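Just looking at `common_items`, I see two problems. One, you call `set(list1)` twice (each call takes the same amount of time, proportional to the size of the list), even though the second call produces the same `set` value because `list1` has not changed. Two, `intersection` can take any iterable; there is no particular benefit to making a set from `list2` first. `x = set(list1); y = x.intersection(list2); return user, y, len(y)/len(x.union(list2))`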
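The comment's suggestion, written out as a runnable function. This is a sketch that keeps the question's return shape `(user, items, rate)`; the sample call uses made-up data:

```python
def common_items(user, list1, list2):
    set1 = set(list1)                    # build the set once and reuse it
    common = set1.intersection(list2)    # intersection() accepts any iterable
    union_size = len(set1.union(list2))  # union() accepts any iterable as well
    # like the original, this raises ZeroDivisionError if both lists are empty
    return user, list(common), len(common) / union_size

user, items, rate = common_items(1, [2, 3, 5, 7], [1, 2, 3, 4, 5, 5])
print(user, sorted(items), rate)
```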

@chepner Thanks for your comment, I will change it.

(Point 3) This is the main problem, and that is the reason