Python: comparing a column against another list (Python, Pandas)

I have a DataFrame like this:

import pandas as pd

item1 = {'category':'food::cafe::restaurant::business', 'name':'Bob Cafe'}
item2 = {'category':'food::take away::restaurant::business', 'name':'John Take Away'}
item3 = {'category':'cafeteria::business', 'name':'Annie Cafe'}
item4 = {'category':'hotel::business', 'name':'Premier Inn'}
df = pd.DataFrame([item1, item2, item3, item4])
lookup_table = ['cafe', 'cafeteria', 'restaurant']

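I want to create a new column (Yes/No) in df that matches the category column against lookup_table. The category column needs to be split on "::" to get the individual categories, which are then compared against the values in the list. In the example above, everything except item4 should be True.

I don't want to loop over every item in df.category and check whether it exists in the table. I am fairly new to Python, so besides a solution I am also keen on the thought process behind solving this the "Pythonic" way.

Thanks

Option 1: str.contains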

m = df.category.str.contains('|'.join(lookup_table))
df['Yes/No'] = np.where(m, 'Yes', 'No')

df    
                                category            name Yes/No
0       food::cafe::restaurant::business        Bob Cafe    Yes
1  food::take away::restaurant::business  John Take Away    Yes
2                    cafeteria::business      Annie Cafe    Yes
3                        hotel::business     Premier Inn     No

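Simply pass the regex pattern formed by joining each string in lookup_table with pipes ("|") to str.contains. This returns a boolean mask (based on whether any category matched in the row). Use np.where to convert this mask into Yes/No answers.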
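One thing to watch with the regex approach: str.contains matches substrings, so an entry such as 'cafe' is also found inside 'cafeteria', and any regex metacharacters in the list are interpreted as pattern syntax. The following is a minimal sketch, not part of the original answer, that escapes each entry and optionally requires it to be a whole "::"-delimited category. The shortened lookup list and the names loose and strict are assumptions made for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'category': [
        'food::cafe::restaurant::business',
        'cafeteria::business',
        'hotel::business',
    ],
})
# Hypothetical lookup list that deliberately omits 'cafeteria'.
lookup = ['cafe', 'restaurant']

# Escape each entry so regex metacharacters are matched literally.
alternatives = '|'.join(re.escape(s) for s in lookup)

# Substring match: 'cafe' is also found inside 'cafeteria'.
loose = df['category'].str.contains(alternatives)

# Whole-category match: the entry must be bounded by the start or end
# of the string, or by the '::' separator.
strict = df['category'].str.contains(rf'(?:^|::)(?:{alternatives})(?:::|$)')

print(loose.tolist())   # [True, True, False]
print(strict.tolist())  # [True, False, False]
```

With the original lookup_table, which lists 'cafeteria' explicitly, both patterns give the same answer, so the stricter form only matters when partial matches would be wrong for your data.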

Option 2: str.split + isin + any

m = df.category.str.split('::', expand=True).isin(lookup_table).any(axis=1)
df['Yes/No'] = np.where(m, 'Yes', 'No')

df    
                                category            name Yes/No
0       food::cafe::restaurant::business        Bob Cafe    Yes
1  food::take away::restaurant::business  John Take Away    Yes
2                    cafeteria::business      Annie Cafe    Yes
3                        hotel::business     Premier Inn     No

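This is similar to the option above, but it performs plain string matching rather than regex matching. It takes advantage of the structure of your data by splitting on :: (double colon), which produces a DataFrame like this: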

i = df.category.str.split('::', expand=True)
i
           0          1           2         3
0       food       cafe  restaurant  business
1       food  take away  restaurant  business
2  cafeteria   business        None      None
3      hotel   business        None      None

Now call DataFrame.isin, which performs an "is equal to?" check against each string in lookup_table. This results in:

j = i.isin(lookup_table)
j

      0      1      2      3
0  False   True   True  False
1  False  False   True  False
2   True  False  False  False
3  False  False  False  False

The next step is to find which rows have a matching category in any column, so use any:

j.any(axis=1)

0     True
1     True
2     True
3    False
dtype: bool

As before, this mask is converted to Yes/No answers using np.where, but there are other ways too (e.g. replace / str.replace).
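As one illustration of another way to turn the mask into labels, here is a small sketch using Series.map with a dictionary. This is a swapped-in alternative rather than the replace call mentioned above, and the hard-coded mask m simply stands in for the result of either option.

```python
import pandas as pd

# Example mask, standing in for the result of either option above.
m = pd.Series([True, True, True, False])

# Map each boolean to its label; the result is a Series with the same index.
labels = m.map({True: 'Yes', False: 'No'})
print(labels.tolist())  # ['Yes', 'Yes', 'Yes', 'No']
```

Unlike np.where, which returns a NumPy array, map keeps the index of the mask, which can help when the DataFrame does not have a default index.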

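Timings

df = pd.concat([df] * 100000, ignore_index=True)

%%timeit
m = df.category.str.contains('|'.join(lookup_table))
np.where(m, 'Yes', 'No')

1 loop, best of 3: 536 ms per loop

%%timeit
m = df.category.str.split('::', expand=True).isin(lookup_table).any(axis=1)
df['Yes/No'] = np.where(m, 'Yes', 'No')

1 loop, best of 3: 2.31 s per loop

Results may vary depending on your data and the number of items in lookup_table.

Thanks a lot, coldspeed. This makes you appreciate the language even more. I forgot to mention that my dataset is large (>20 million rows). Any comments on how these two options perform on a dataset that size?

@user2097496 Good question. The answer will vary with your data, but I would bet the first option is more efficient, because it does not expand your column the way the second one does. If you are interested, I will create some test data and time it.

@user2097496 The first option is the better one; it is more efficient on larger DataFrames.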
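The %%timeit cells only run inside IPython or Jupyter. To repeat the comparison in a plain script, a rough sketch with the standard timeit module could look like the following. The helper names with_contains and with_split and the reduced size of 1,000 copies are assumptions chosen to keep the run short, and the numbers it prints will differ from the ones quoted in the answer.

```python
import timeit

import numpy as np
import pandas as pd

lookup_table = ['cafe', 'cafeteria', 'restaurant']
base = pd.DataFrame({
    'category': [
        'food::cafe::restaurant::business',
        'food::take away::restaurant::business',
        'cafeteria::business',
        'hotel::business',
    ],
})
# 1,000 copies keeps the sketch quick; the answer used 100,000.
big = pd.concat([base] * 1000, ignore_index=True)


def with_contains(frame):
    """Option 1: a single regex search per row."""
    m = frame['category'].str.contains('|'.join(lookup_table))
    return np.where(m, 'Yes', 'No')


def with_split(frame):
    """Option 2: split into columns, then exact membership test."""
    m = frame['category'].str.split('::', expand=True).isin(lookup_table).any(axis=1)
    return np.where(m, 'Yes', 'No')


# Both options should agree on this data before their speed is compared.
assert (with_contains(big) == with_split(big)).all()

for func in (with_contains, with_split):
    # Best of three single runs, similar in spirit to %%timeit's report.
    seconds = min(timeit.repeat(lambda: func(big), number=1, repeat=3))
    print(f'{func.__name__}: {seconds:.4f} s')
```

For a dataset in the tens of millions of rows, it is worth running this on a representative sample of your own data, since the number of categories per row and the length of the lookup list both affect the result.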