How can I use a dictionary to speed up lookup and counting tasks in Python?

python pandas dataframe dictionary

Consider the following code snippet:

import pandas as pd

data = {"col1":["aaa","bbb","ccc","aaa","ddd","bbb"],
       "col2":["fff","aaa","ggg","eee","ccc","ttt"]}
df = pd.DataFrame(data,columns=["col1","col2"]) # my actual dataframe has
                                                # 2,000,000 such rows

list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]

# After doing a combination of 2 elements between the 2 lists in both orders,
# we get a list that resembles something like this:
# new_list = ["ccc-ggg", "ggg-ccc", "aaa-fff", "fff-aaa", ..."ccc-fff", "fff-ccc", ...]

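Given a huge dataframe and two lists, I want to count how many elements of new_list also occur in the dataframe. In the pseudo-example above, the result would be 3, because "aaa-fff", "ccc-ggg", and "ddd-ccc" each appear within a single row of the dataframe.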

Right now I am using a linear search, but it is very slow because I have to scan the entire dataframe for every pair:

df['col3'] = df['col1'] + "-" + df['col2']
c1 = 0  # running total over all (a, b) combinations
for a in list_a:
    for b in list_b:
        str1 = a + "-" + b
        str2 = b + "-" + a
        # each str.contains call scans the whole column
        c2 = df['col3'].str.contains(str1).sum() + df['col3'].str.contains(str2).sum()
        c1 += c2
print(c1)

Can someone help me implement a faster algorithm, preferably one that uses a dictionary data structure?

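Note: I have to iterate over the 7,000 rows of another dataframe, build the two lists dynamically for each row, and obtain an aggregate count per row.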

Try this:

from itertools import product

# all combinations of the two lists as tuples
all_list_combinations = list(product(list_a, list_b)) 

# tuples of the two columns
dftuples = [x for x in df.itertuples(index=False, name=None)] 

# take the length of the intersection of the two sets and print it
print(len(set(dftuples).intersection(set(all_list_combinations))))

which yields

3
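Since the question asks for a dictionary, here is a minimal sketch of the same idea with collections.Counter, which is a dict subclass. It assumes that pairs should match in either order and that every matching row should be counted, duplicates included; the helper name count_matching_rows is purely illustrative. The counter is built once from the dataframe, so each of the 7,000 queries only costs a number of dictionary lookups proportional to len(list_a) * len(list_b).

```python
from collections import Counter
from itertools import product

import pandas as pd

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
df = pd.DataFrame(data, columns=["col1", "col2"])

list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

# Build the dictionary once: (col1, col2) -> number of rows with that pair.
pair_counts = Counter(zip(df["col1"], df["col2"]))


def count_matching_rows(pair_counts, list_a, list_b):
    """Hypothetical helper: count rows whose pair is (a, b) or (b, a)."""
    # A set removes repeated candidates ("ccc" appears twice in list_a),
    # so no row is counted more than once.
    candidates = set()
    for a, b in product(list_a, list_b):
        candidates.add((a, b))
        candidates.add((b, a))
    # Counter returns 0 for missing keys, so no KeyError handling is needed.
    return sum(pair_counts[pair] for pair in candidates)


total = count_matching_rows(pair_counts, list_a, list_b)
print(total)
```

On the sample data this counts 4 rows rather than 3, because (aaa, eee) also matches once all combinations in both orders are considered.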

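First, join the columns once, before any loop. Then, instead of looping, pass str.contains a single regular expression with alternation that covers all possible strings.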

joined = df.col1+ '-' + df.col2
pat = '|'.join([f'({a}-{b})' for a in list_a for b in list_b] +
    [f'({b}-{a})' for a in list_a for b in list_b]) # substitute for itertools.product
ct = joined.str.contains(pat).sum()
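One possible refinement, sketched under the assumption that the real values might contain regex metacharacters or be substrings of longer values (for example "aaa-fff" inside "xaaa-fffy"): escape each candidate and anchor the pattern so that only whole strings match. The variable names are illustrative.

```python
import re

import pandas as pd

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
df = pd.DataFrame(data, columns=["col1", "col2"])

list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

joined = df["col1"] + "-" + df["col2"]

# Unique candidate strings in both orders.
candidates = {f"{a}-{b}" for a in list_a for b in list_b}
candidates |= {f"{b}-{a}" for a in list_a for b in list_b}

# re.escape guards against regex metacharacters in the values, the
# non-capturing group avoids the pandas warning about match groups, and
# ^...$ restricts matches to whole strings instead of substrings.
pat = "^(?:" + "|".join(re.escape(c) for c in sorted(candidates)) + ")$"
ct = int(joined.str.contains(pat).sum())
print(ct)
```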

To use dicts instead of a dataframe, you can use filter(re, joined).
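A rough sketch of what this could look like when working directly on the data dict. It is an assumption that "re" stands for the search method of a compiled pattern and that "joined" is a plain Python list; the original code for this option is not shown here.

```python
import re

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}

list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

# Join the two columns row by row without creating a dataframe.
joined = [f"{c1}-{c2}" for c1, c2 in zip(data["col1"], data["col2"])]

# One compiled alternation covering every pair in both orders.
pat = re.compile("|".join(
    [re.escape(f"{a}-{b}") for a in list_a for b in list_b] +
    [re.escape(f"{b}-{a}") for a in list_a for b in list_b]))

# filter keeps the joined strings for which pat.search finds a match.
matches = list(filter(pat.search, joined))
ct = len(matches)
print(ct, matches)
```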

A third option uses series.isin(), inspired by another answer.
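A minimal sketch of the series.isin() variant, assuming exact whole-string matches in either order are what is wanted. isin does hash-based membership tests on the candidate values rather than regex matching.

```python
import pandas as pd

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
df = pd.DataFrame(data, columns=["col1", "col2"])

list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

joined = df["col1"] + "-" + df["col2"]

# All candidate strings in both orders; a set drops repeated candidates.
candidates = {f"{a}-{b}" for a in list_a for b in list_b}
candidates |= {f"{b}-{a}" for a in list_a for b in list_b}

# Boolean mask of rows whose joined string is one of the candidates.
mask = joined.isin(candidates)
ct = int(mask.sum())
print(ct)
```

Because the mask is computed per row, duplicated rows in the dataframe are all counted.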

Speed tests

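To test scalability, I repeated the data 100,000 times. series.isin() is the fastest. jsmart's answer is also fast, but it does not find all occurrences because it removes duplicates from joined.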

with dicts: 400000 matches, 1.00 s
with pandas: 400000 matches, 1.77 s
with series.isin(): 400000 matches, 0.39 s
with jsmart answer: 4 matches, 0.50 s
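A timing harness along these lines could be used to run a similar comparison. This is only a sketch: the helper name timed and the repetition factor are illustrative, only two of the variants are included, and absolute timings depend on the machine and the pandas version.

```python
import time

import pandas as pd

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
df = pd.DataFrame(data, columns=["col1", "col2"])

list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

n_repeats = 1_000  # the test above used 100,000; smaller here to stay quick
big = pd.concat([df] * n_repeats, ignore_index=True)
joined = big["col1"] + "-" + big["col2"]

candidates = {f"{a}-{b}" for a in list_a for b in list_b}
candidates |= {f"{b}-{a}" for a in list_a for b in list_b}
pat = "|".join(f"(?:{c})" for c in candidates)


def timed(label, func):
    """Hypothetical helper: run func once and report count and wall time."""
    start = time.perf_counter()
    result = int(func())
    elapsed = time.perf_counter() - start
    print(f"{label}: {result} matches, {elapsed:.2f} s")
    return result


n_contains = timed("with pandas", lambda: joined.str.contains(pat).sum())
n_isin = timed("with series.isin()", lambda: joined.isin(candidates).sum())
```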

Here is another approach. First, I used your definitions of df (with two columns), list_a, and list_b.

# combine two columns in the data frame
df['col3'] = df['col1'] + '-' + df['col2']

# create set with list_a and list_b pairs
s = ({ f'{a}-{b}' for a, b in zip(list_a, list_b)} | 
     { f'{b}-{a}' for a, b in zip(list_a, list_b)})

# find intersection
result = set(df['col3']) & s
print(len(result), '\n', result)

3 
 {'ddd-ccc', 'ccc-ggg', 'aaa-fff'}

Update to handle duplicate values:

# build list (not set) from list_a and list_b
idx =  ([ f'{a}-{b}' for a, b in zip(list_a, list_b) ] +
        [ f'{b}-{a}' for a, b in zip(list_a, list_b) ])

# create `col3`, and do `value_counts()` to preserve info about duplicates
df['col3'] = df['col1'] + '-' + df['col2']
tmp = df['col3'].value_counts()

# use idx to sub-select from the value counts:
tmp[ tmp.index.isin(idx) ]

# results:
ddd-ccc    1
aaa-fff    1
ccc-ggg    1
Name: col3, dtype: int64
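To turn the filtered value counts into the single aggregate number the question asks for, the counts can simply be summed. A small sketch, keeping this answer's positional zip pairing of list_a and list_b (not all combinations); deduplicating idx with a set is an added assumption.

```python
import pandas as pd

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
df = pd.DataFrame(data, columns=["col1", "col2"])

list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

# Positional pairs in both orders, as in the answer above.
idx = ({f"{a}-{b}" for a, b in zip(list_a, list_b)} |
       {f"{b}-{a}" for a, b in zip(list_a, list_b)})

# value_counts can be computed once and reused for every query.
counts = (df["col1"] + "-" + df["col2"]).value_counts()

# Sum the per-string counts so that duplicated rows are included.
total = int(counts[counts.index.isin(idx)].sum())
print(total)
```

With zip, "aaa" is only paired with "fff", so "aaa-eee" is not a candidate and the total on the sample data is 3.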

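Can you time the answers? I'm on my phone.

Hi, did any of the answers work?

Since joined is very long, I think ct will take a long time to compute. I have to repeat this operation 7,000 times to get 7,000 different count values. Thanks :)

How long are list_a and list_b?

This approach should be faster than looping, even with many alternative patterns, because we make only one Python call instead of one call per possible pattern... Leveraging C implementations is the easiest way to speed up Python code.

Their lengths vary. The average length is 10. Can we solve this problem with a dictionary?

Upvoted before realizing that this option removes duplicates from the joined column before searching for matches...

Updated to address @RichieV's comment about duplicates.

Converting to a categorical dtype (and then using the category codes) or using pd.factorize might reduce computation time (matching integers rather than strings).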
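The last suggestion could be sketched roughly as follows with pd.factorize. This is only one way to read the comment, under the assumption that both columns share one vocabulary; all names are illustrative, and whether it is actually faster would have to be measured on the real data.

```python
import numpy as np
import pandas as pd

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
df = pd.DataFrame(data, columns=["col1", "col2"])

list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

# Factorize both columns together so a value gets the same integer code
# no matter which column it appears in.
stacked = pd.concat([df["col1"], df["col2"]], ignore_index=True)
codes, uniques = pd.factorize(stacked)
n, k = len(df), len(uniques)

# Encode each row's pair as one integer: code(col1) * k + code(col2).
row_keys = codes[:n].astype(np.int64) * k + codes[n:]

# Dictionary from string value to integer code, built once.
code_of = {value: code for code, value in enumerate(uniques)}

# Encode the query pairs in both orders; values that never occur in the
# dataframe cannot match and are skipped.
query_keys = set()
for a in list_a:
    for b in list_b:
        if a in code_of and b in code_of:
            ca, cb = code_of[a], code_of[b]
            query_keys.add(ca * k + cb)
            query_keys.add(cb * k + ca)

# Integer membership test instead of string comparison.
ct = int(np.isin(row_keys, list(query_keys)).sum())
print(ct)
```

The encoding step is done once per dataframe; only the small query_keys set has to be rebuilt for each of the 7,000 queries.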
