How do I compare columns in pandas when matching on several different columns?

python pandas numpy dataframe

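I am building machine learning software that splits the pages of large document packets. I am trying to analyze the model by automating the check of its predicted output against the labeled target output. To do this, I created a pandas DataFrame that looks like this: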

page_num    file    predicted    label
--------------------------------------
1           file1       0          0
1           file1       0          0
2           file1       0          0
2           file1       0          0
2           file1       0          0
3           file1       1          1
3           file1       1          1
3           file1       1          1
1           file2       0          0
1           file2       0          0
1           file2       0          0
2           file2       2          2
2           file2       2          2
...
n           filen       0          0

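For brevity I have left out some other columns; there are 13 in total, not counting the index. I am fairly new to pandas, but I basically want the DataFrame to look like this: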

page_num    file    predicted    label
--------------------------------------
1           file1       0          0
2           file1       0          0
3           file1       1          1
1           file2       0          0
2           file2       2          2
...
n           filen       0          0

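That way I can verify that predicted == label for every page in every file.

I have tried a few things:

First I tried df[df.groupby(['file', 'page_num'])], but that raised the error "ValueError: cannot copy sequence with size 489 to array axis with dimension 13".

I inspected df.groupby(['file', 'page_num']).groups and noticed that the groups are exactly what I want: the files and their pages. But I cannot use the DataFrame where function with them, and I do not think apply is what I want either.

I also tried iterating over the groups and indexing the DataFrame, but I got a lot of errors. The output from the Jupyter notebook looks like this: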

for group in df.groupby(['file', 'page_num']).groups:
    df[df.file == group[0], df.page_num == group[1]].reset_index(drop=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-b34f0ce41321> in <module>
      1 for group in df.groupby(['file', 'page_num']).groups:
----> 2     temp_df = df[df.file == group[0], df.page_num == group[1]].reset_index(drop=True)
      3     print(temp_df.label)

~\AppData\Local\Continuum\anaconda3\envs\base\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2925             if self.columns.nlevels > 1:
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):
   2929                 indexer = [indexer]

~\AppData\Local\Continuum\anaconda3\envs\base\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2655                                  'backfill or nearest lookups')
   2656             try:
-> 2657                 return self._engine.get_loc(key)
   2658             except KeyError:
   2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(0           True
1           True
2           True
3           True
4           True
5           True
6           True
7           True
8           True
9           True
10          True
11          True
12          True
13          True
14          True
15          True
16          True
17          True
18          True
19          True
20          True
21          True
22          True
23          True
24          True
25          True
26          True
27          True
28          True
29          True
           ...  
2028635    False
2028636    False
2028637    False
2028638    False
2028639    False
2028640    False
2028641    False
2028642    False
2028643    False
2028644    False
2028645    False
2028646    False
2028647    False
2028648    False
2028649    False
2028650    False
2028651    False
2028652    False
2028653    False
2028654    False
2028655    False
2028656    False
2028657    False
2028658    False
2028659    False
2028660    False
2028661    False
2028662    False
2028663    False
2028664    False
Name: file, Length: 2028665, dtype: bool, 0           True
1           True
2           True
3           True
4           True
5           True
6           True
7           True
8           True
9           True
10          True
11          True
12          True
13          True
14          True
15          True
16          True
17          True
18          True
19          True
20          True
21          True
22          True
23          True
24          True
25          True
26          True
27          True
28          True
29          True
           ...  
2028635    False
2028636    False
2028637    False
2028638    False
2028639    False
2028640    False
2028641    False
2028642    False
2028643    False
2028644    False
2028645    False
2028646    False
2028647    False
2028648    False
2028649    False
2028650    False
2028651    False
2028652    False
2028653    False
2028654    False
2028655    False
2028656    False
2028657    False
2028658    False
2028659    False
2028660    False
2028661    False
2028662    False
2028663    False
2028664    False
Name: page_num, Length: 2028665, dtype: bool)' is an invalid key

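I really do not understand what is going on, because every time I change something I get a different ValueError, TypeError, or something similar. I want to be able to iterate over the groups produced by df.groupby(['file', 'page_num']).groups and check whether my main DataFrame df has matching values in label and predicted where df['file'] == group[0] and df['page_num'] == group[1].

I am very new to pandas, so I am probably missing something small. Any help is appreciated. Thanks, everyone!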
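A note on the TypeError in the traceback: writing df[a, b] passes the tuple (a, b) as a single key, and the traceback shows pandas trying to look that tuple up as a column label via get_loc, which is why it reports an invalid key. Boolean masks are normally combined with &, with each comparison in parentheses. The following is a minimal sketch on made-up data; only the column names come from the question, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "page_num":  [1, 1, 2, 1, 2],
    "file":      ["file1", "file1", "file1", "file2", "file2"],
    "predicted": [0, 0, 0, 0, 2],
    "label":     [0, 0, 0, 0, 2],
})

# Combine the two boolean masks with & (each comparison in parentheses).
mask = (df["file"] == "file1") & (df["page_num"] == 1)
subset = df[mask].reset_index(drop=True)

# Iterating over the groupby object yields (key, sub-frame) pairs directly,
# so no manual filtering is needed inside the loop.
group_sizes = {}
for (file_name, page), group in df.groupby(["file", "page_num"]):
    group_sizes[(file_name, page)] = len(group)
```

Looping over the groupby object itself, rather than over .groups, avoids re-filtering a two-million-row frame once per group.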

Remove the duplicate rows with drop_duplicates, then sort with sort_values, first by file name and then by page number:

df.drop_duplicates().sort_values(['file','page_num'],ascending = True)

It is worth noting that df.drop_duplicates().sort_values(['page_num', 'file'], ascending=True) does not give the same result, because it sorts by page_num first and then by file.
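To make the difference between the two sort orders concrete, here is a small sketch on hypothetical data shaped like the table in the question. One assumption to keep in mind: drop_duplicates compares all columns by default, so with the full 13-column frame it may be necessary to select the relevant columns first or pass subset=.

```python
import pandas as pd

# Hypothetical sample data shaped like the table in the question.
df = pd.DataFrame({
    "page_num":  [1, 1, 2, 2, 3, 1, 1, 2],
    "file":      ["file1"] * 5 + ["file2"] * 3,
    "predicted": [0, 0, 0, 0, 1, 0, 0, 2],
    "label":     [0, 0, 0, 0, 1, 0, 0, 2],
})

# One row per distinct (page_num, file, predicted, label) combination,
# ordered by file first and page number second.
by_file = df.drop_duplicates().sort_values(["file", "page_num"]).reset_index(drop=True)

# Swapping the sort keys orders by page number first instead.
by_page = df.drop_duplicates().sort_values(["page_num", "file"]).reset_index(drop=True)

# Row-wise check that every prediction matches its label.
all_match = (by_file["predicted"] == by_file["label"]).all()
print(by_file)
```

With file as the first key, all pages of file1 come before any page of file2; with page_num first, page 1 of every file comes before any page 2.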

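So you want to sort by file first and then by page? It looks like you also want to remove duplicates?

@Yuca Yes, I want to sort by file first and then by page, then compare two separate columns afterwards and analyze the results.

It is not clear how to handle the fact that file1, page 2 has 3 rows. That is why 阿洛兹 suggested the duplicate. What is the logic here? Is dropping duplicates enough?

This should give you an idea of how to start: df1.assign(correct=df1.predicted == df1.label).groupby(['file', 'page_num']).correct.count()

Thanks! I dropped all the extra columns and then stored the result in a new DataFrame, which worked very well with np.where, which I used for a lot of the analysis. Thanks again!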
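The assign/groupby suggestion in the comments can be extended a little. Note that .count() returns the number of rows in each group, not the number of matches, so the sketch below also aggregates with sum and all. The data is hypothetical, with one deliberate mismatch, and the output column names (rows, n_correct, all_correct, status) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; file1 page 2 contains one deliberate mismatch.
df1 = pd.DataFrame({
    "page_num":  [1, 1, 2, 2, 1],
    "file":      ["file1", "file1", "file1", "file1", "file2"],
    "predicted": [0, 0, 0, 1, 2],
    "label":     [0, 0, 0, 0, 2],
})

# Flag each row, then summarize the flag per (file, page_num) group.
per_page = (
    df1.assign(correct=df1["predicted"] == df1["label"])
       .groupby(["file", "page_num"])["correct"]
       .agg(rows="count", n_correct="sum", all_correct="all")
       .reset_index()
)

# np.where turns the boolean summary into a readable status column.
per_page["status"] = np.where(per_page["all_correct"], "ok", "mismatch")
```

This keeps duplicated pages in the data and reports them per group, which sidesteps the question of what to do when a page has several rows that disagree.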