Python Pandas DataFrames: search for integers from one DataFrame in a string column of another DataFrame

python pandas dataframe

I have two DataFrames:

DF1

                cid          dt        tm    id    distance
2      ed032f716995  2021-01-22  16:42:48    43   21.420561
3      16e2fd96f9ca  2021-01-23  23:19:43   539  198.359355
102    cf092e68fa82  2021-01-22  09:03:14     8   39.599627
104    833ccf05433b  2021-01-24  02:53:08    11   33.168314

DF2

        id            cluster  
0        3                      
1        6             7,8,43  
2       20               1817  
3       25   
4       10  11,13,14,15,9,539

I want to search for each id from df1 in the cluster column of df2. The desired output is:

                cid          dt        tm    id    distance     cluster
2      ed032f716995  2021-01-22  16:42:48    43   21.420561     7,8,43
3      16e2fd96f9ca  2021-01-23  23:19:43   539  198.359355     11,13,14,15,9,539
102    cf092e68fa82  2021-01-22  09:03:14     8   39.599627     7,8,43 
104    833ccf05433b  2021-01-24  02:53:08    11   33.168314     11,13,14,15,9,539

In the first row of df1 above, 43 appears in a cluster in df2, so I include the full cluster string for that row.

I tried the following:

for index, rows in df1.iterrows():
    for idx,rws in df2.iterrows():
        if (str(rows['id']) in str(rws['cluster'])):
            print([rows['id'],rws['cluster']])

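This appears to work. However, because df2['cluster'] is a string, it returns a result even on a partial match. For example, if df1['id'] = 34 and df2['cluster'] has 344,432, etc., it still matches on 344 and returns a positive result.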
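
The partial-match behaviour described above can be reproduced in isolation. The snippet below is only a minimal sketch: the cluster string and the target value are made-up illustrations, not rows from the data in the question.

```python
# Hypothetical values chosen to mirror the example in the text.
cluster = "344,432"
target = 34

# Substring test: "34" occurs inside "344", so this is a false positive.
substring_hit = str(target) in cluster

# Token test: split on commas first, then compare against whole tokens only.
exact_hit = str(target) in cluster.split(",")

print(substring_hit, exact_hit)
```

Comparing against the split tokens rather than the raw string is what separates an exact match from a partial one.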

I also tried another option:

d = {k: set(v.split(',')) for k, v in df2.set_index('id')['cluster'].items()}
df1['idc'] = [next(iter([k for k, v in d.items() if set(x).issubset(v)]), '') for x in str(df1['id'])]

However, with the code above I get an error indicating that the lengths differ between the two datasets.
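
One likely cause of that length error, offered as an assumption rather than a confirmed diagnosis, is the expression str(df1['id']): calling str() on a Series returns its printed representation as a single string, so the list comprehension iterates over characters instead of ids. The sketch below uses a cut-down df1 with only the id column.

```python
import pandas as pd

# Cut-down stand-in for df1; only the id column matters here.
df1 = pd.DataFrame({"id": [43, 539, 8, 11]}, index=[2, 3, 102, 104])

# str() on a Series gives one long string (the printed form of the Series),
# so iterating over it yields single characters, not one item per row.
as_text = str(df1["id"])
n_chars = len(as_text)
n_rows = len(df1)

# Converting element-wise keeps exactly one string per row.
per_row = df1["id"].astype(str).tolist()

print(n_chars, n_rows, per_row)
```

A list built by looping over as_text has n_chars elements, which cannot be assigned to a column with n_rows rows.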

How can I map the cluster based on an exact match of the id column in df1?

One approach is to split cluster, explode it, and map:
to_map = (df2.assign(cluster_i=df2.cluster.str.split(','))
    .explode('cluster_i').dropna()
    .set_index('cluster_i')['cluster']
)

df1['cluster'] = df1['id'].astype(str).map(to_map)

Output:
              cid          dt        tm   id    distance            cluster
2    ed032f716995  2021-01-22  16:42:48   43   21.420561             7,8,43
3    16e2fd96f9ca  2021-01-23  23:19:43  539  198.359355  11,13,14,15,9,539
102  cf092e68fa82  2021-01-22  09:03:14    8   39.599627             7,8,43
104  833ccf05433b  2021-01-24  02:53:08   11   33.168314  11,13,14,15,9,539
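
For anyone who wants to run the split, explode and map approach end to end, here is a self-contained sketch. It makes several assumptions not stated in the thread: empty clusters are stored as empty strings, df1 is reduced to the cid and id columns, and when a member appears in more than one cluster the first one wins. The name lookup is introduced only for this example.

```python
import pandas as pd

# Sample data rebuilt from the tables in the question (df1 reduced to two columns).
df1 = pd.DataFrame(
    {"cid": ["ed032f716995", "16e2fd96f9ca", "cf092e68fa82", "833ccf05433b"],
     "id": [43, 539, 8, 11]},
    index=[2, 3, 102, 104],
)
# Assumption: empty clusters are empty strings rather than NaN.
df2 = pd.DataFrame(
    {"id": [3, 6, 20, 25, 10],
     "cluster": ["", "7,8,43", "1817", "", "11,13,14,15,9,539"]}
)

# One row per cluster member, keeping the full cluster string alongside it.
lookup = (
    df2.assign(cluster_i=df2["cluster"].str.split(","))
    .explode("cluster_i")
    .dropna(subset=["cluster_i"])
)
# Drop the empty tokens produced by empty cluster strings.
lookup = lookup[lookup["cluster_i"] != ""]
# Assumption: a member listed in several clusters maps to the first one,
# which also keeps the index unique so that map() can use it.
lookup = lookup.drop_duplicates("cluster_i").set_index("cluster_i")["cluster"]

# Exact match: compare the id, as a string, against whole cluster members.
df1["cluster"] = df1["id"].astype(str).map(lookup)
print(df1)
```

DataFrame.explode was added in pandas 0.25, so older installations will not run this as written.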

Thank you very much. When I use the code, I get this error: TypeError: explode() takes 1 positional argument but 2 were given
@Apricot, that's strange. What is your pandas version?
Sorry, my bad... the code works fine... my production dataset has too many complications.