Python: get the elements that appear in 3 or more lists

Suppose I have 5 lists in total:

# Sample data

a1 = [1,2,3,4,5,6,7]

a2= [1,21,35,45,58]
a3= [1,2,15,27,36]
a4=[2,3,1,45,85,51,105,147,201]
a5=[3,458,665]

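I need to find the elements of a1 that also occur in a2, a3, a4 and a5, 3 or more times in total, counting the occurrence in a1 itself.

Or

I need all elements across all the lists (a1-a5) whose frequency is greater than or equal to 3, together with their frequencies.

For the sample above, the expected output is

1, frequency 4

2, frequency 3

3, frequency 3

For my real problem, both the number of lists and their lengths are very large. Can anyone suggest a simple and fast way to do this?

Thanks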

Prithivi

As Patrick wrote in the comments, chain and Counter are your friends here:

import itertools
import collections

targets = [1,2,3,4,5,6,7]

lists = [
    [1,21,35,45,58],
    [1,2,15,27,36],
    [2,3,1,45,85,51,105,147,201],
    [3,458,665]
    ]

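# Count every occurrence across the other lists (a2..a5).
chained = itertools.chain.from_iterable(lists)
counter = collections.Counter(chained)
# Each target occurs once in a1, so a count >= 2 here means >= 3 overall.
result = [(t, counter[t]) for t in targets if counter[t] >= 2]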

which gives

>>> result
[(1, 3), (2, 2), (3, 2)]

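You said you have many lists and that each one is long. Try this solution and see how long it takes. If it needs to be faster, that is a separate question. It may turn out that collections.Counter is too slow for your application.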
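The answer above counts only a2..a5 and relies on each target appearing once in a1. As a minimal sketch of the second reading of the question (total frequency across all five lists), the following counts everything in one pass. The helper name frequent_elements and its min_count parameter are made up for illustration and are not part of the original answer.

```python
from collections import Counter
from itertools import chain


def frequent_elements(lists, min_count=3):
    """Return sorted (element, count) pairs whose total count is >= min_count."""
    # One pass over every element of every list.
    counts = Counter(chain.from_iterable(lists))
    return sorted((value, n) for value, n in counts.items() if n >= min_count)


a1 = [1, 2, 3, 4, 5, 6, 7]
a2 = [1, 21, 35, 45, 58]
a3 = [1, 2, 15, 27, 36]
a4 = [2, 3, 1, 45, 85, 51, 105, 147, 201]
a5 = [3, 458, 665]

result = frequent_elements([a1, a2, a3, a4, a5])
for value, n in result:
    print(value, "frequency", n)
```

On the sample data this should reproduce the frequencies the question asks for (4, 3 and 3 for the elements 1, 2 and 3).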

a1= [1,2,3,4,5,6,7]
a2= [1,21,35,45,58]
a3= [1,2,15,27,36]
a4= [2,3,1,45,85,51,105,147,201]
a5= [3,458,665]

b = a1 + a2 + a3 + a4 + a5                      # concatenate all lists into b

for x in set(b):                                # iterate through the distinct values in b
    print(x, 'with a frequency of', b.count(x)) # print how often x occurs in b

This gives you (set iteration order is not guaranteed, so the lines may come out in a different order):
1 with a frequency of 4
2 with a frequency of 3
3 with a frequency of 3
4 with a frequency of 1
5 with a frequency of 1
6 with a frequency of 1
7 with a frequency of 1
35 with a frequency of 1
36 with a frequency of 1
...

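Edit:
Using:
import random

for x in range(9000):
    a1.append(random.randint(1,10000))
    a2.append(random.randint(1,10000))
    a3.append(random.randint(1,10000))
    a4.append(random.randint(1,10000))

I made the lists much longer and used time to check the program's running time (storing the results instead of printing them). The program ran in 4.9395 seconds. I hope that is fast enough.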
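Most of that running time likely comes from b.count(x), which rescans the whole of b once per distinct value. As a rough sketch of how one might measure this, the following times the count-in-a-loop approach against a single Counter pass. The list sizes and the fixed seed are assumptions chosen to keep the run short, not the figures used in the answer, and no timings are claimed here.

```python
import random
import time
from collections import Counter

random.seed(0)  # fixed seed so repeated runs use the same data

# Assumed test data: four lists of 2000 random integers each.
lists = [[random.randint(1, 10000) for _ in range(2000)] for _ in range(4)]
b = [x for lst in lists for x in lst]

start = time.perf_counter()
slow = {x: b.count(x) for x in set(b)}  # rescans b for every distinct value
slow_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = dict(Counter(b))                 # counts everything in one pass
fast_seconds = time.perf_counter() - start

print("list.count loop:", slow_seconds, "seconds")
print("Counter:        ", fast_seconds, "seconds")
```

Both dictionaries hold the same counts, so the Counter version can be swapped in without changing the result.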
This solution using pandas is fairly fast:
import pandas as pd

a1=[1,2,3,4,5,6,7]
a2=[1,21,35,45,58]
a3=[1,2,15,27,36]
a4=[2,3,1,45,85,51,105,147,201]
a5=[3,458,665]

# convert each list to a DataFrame with an indicator column
A = [a1, a2, a3, a4, a5]
D = [ pd.DataFrame({'A': a, 'ind{0}'.format(i):[1]*len(a)}) for i,a in enumerate(A)]

# left join each dataframe onto a1
# if you know the integers are distinct then you don't need drop_duplicates
df = pd.merge(D[0], D[1].drop_duplicates(['A']), how='left', on='A')
for d in D[2:]:
    df = pd.merge(df, d.drop_duplicates(['A']), how='left', on='A')

# sum across the indicators (missing matches are NaN and are skipped by sum)
df['freq'] = df[['ind{0}'.format(i) for i,d in enumerate(D)]].sum(axis=1)

# drop frequencies less than 3
print(df[['A','freq']].loc[df['freq'] >= 3])

A test with the following larger inputs ran in under 0.2 seconds on my machine:
import numpy.random as npr
a1 = list(range(10000))  # xrange in the original Python 2 code
a2 = npr.randint(10000, size=100000)
a3 = npr.randint(10000, size=100000)
a4 = npr.randint(10000, size=100000)
a5 = npr.randint(10000, size=100000)
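For comparison, the logic of the merge above (one indicator per list, duplicates within a list dropped, indicators summed for each element of a1) can be sketched in plain Python with sets. This is an illustrative alternative for when pandas is not available, not the answerer's code. The function name list_membership_counts is hypothetical.

```python
def list_membership_counts(targets, lists):
    """For each target, count how many of the given lists contain it."""
    # Converting each list to a set drops duplicates and makes lookups O(1).
    sets = [set(lst) for lst in lists]
    return {t: sum(t in s for s in sets) for t in targets}


a1 = [1, 2, 3, 4, 5, 6, 7]
a2 = [1, 21, 35, 45, 58]
a3 = [1, 2, 15, 27, 36]
a4 = [2, 3, 1, 45, 85, 51, 105, 147, 201]
a5 = [3, 458, 665]

counts = list_membership_counts(a1, [a1, a2, a3, a4, a5])
# Keep only the elements of a1 found in 3 or more lists.
frequent = {t: n for t, n in counts.items() if n >= 3}
print(frequent)
```

Note that this counts list membership rather than total occurrences, so a value repeated inside one list still contributes only 1 for that list, which matches the effect of drop_duplicates in the pandas version.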


Take a look at chain from the itertools module and Counter from the collections module.