Python: comparing text from multiple files against a list using a function and for loops


python pandas


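My end goal is to create a for loop that runs over multiple files, with an additional for loop that compares an index of terms against a dataframe. To make this more interesting, I have also included a function, because I may have to apply the same principle to another variable in the same dataframe. I have a few questions:

I am not sure whether I should use regex in this case, or whether a simple "in" statement would be enough.

The approach I am using is not efficient (not to mention that it does not work). I was hoping for something similar to an isin statement, but each word in the list needs to be checked against a row of the dataframe. However, I do not know how to apply it when I try to do something like this.

The desired output would be a list of the headlines that match a company: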

targets is making better stars in the bucks
diamond in the rough employees take too many naps

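Edit: my script has been edited, so it now runs and displays the headlines. However, instead of the desired output, it shows all rows of the dataframe, with only the applicable rows filled in.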
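"... whether regex should be used in this case, or whether a simple in statement would be enough"

Using "in" is fine, because you have evidently already normalized the text with .lower() and removed punctuation.

You really should try to use more meaningful identifiers. For example, rather than i, the usual idiom would be: for company in companies: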
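To make the trade-off between "in" and a regular expression concrete, here is a small sketch. The helper name contains_whole_phrase and the probe word "target" are made up for illustration. A plain "in" test matches any substring, while a regex with \b word boundaries only matches whole words, which may or may not matter for your data.

```python
import re

headline = "targets is making better stars in the bucks"

def contains_whole_phrase(phrase, text):
    """Return True if phrase occurs in text on word boundaries (hypothetical helper)."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text) is not None

# Plain substring test: "target" is found inside "targets".
substring_hit = "target" in headline

# Word-boundary regex: "target" on its own is not a whole word in this headline.
whole_word_hit = contains_whole_phrase("target", headline)

# The full company name still matches on word boundaries.
exact_hit = contains_whole_phrase("targets", headline)
```

If partial matches such as "target" inside "targets" are acceptable, the plain "in" test is simpler and needs no escaping.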

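You have already learned how to use .tolist(), which is good. But you really want to create a set rather than a list, so that "in" tests are efficient. That is the difference between an O(1) hash lookup and a nested loop with a linear scan of the list.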
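A minimal sketch of the list-versus-set point, assuming the company names live in a pandas Series (the variable names are illustrative, and the names themselves are taken from the sample data in this thread). Note that the set only speeds up exact-match membership tests; checking whether a company name occurs somewhere inside a longer headline still requires looking at each name.

```python
import pandas as pd

companies = pd.Series([
    "targets",
    "stars in the bucks",
    "wallymarty",
    "velocity global",
    "diamond in the rough",
])

company_list = companies.tolist()  # "in" scans the list element by element
company_set = set(company_list)    # "in" is a hash lookup

is_known = "wallymarty" in company_set
is_unknown = "saturn" in company_set
```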
This makes no sense:
        for i in ccompanies:
            i = [x]

You start iterating, but then i essentially becomes a constant? It is unclear what you are trying to do.
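One way the loop might be written instead is sketched below. This is a guess at the intent (collect the headlines that mention at least one company), not a reconstruction of the original script, and the variable names are illustrative. The loop variable is only read, never reassigned.

```python
headlines = [
    "targets is making better stars in the bucks",
    "more diamonds than rocks in saturn rings",
    "diamond in the rough employees take too many naps",
]
companies = [
    "targets",
    "stars in the bucks",
    "wallymarty",
    "velocity global",
    "diamond in the rough",
]

matching_headlines = []
for headline in headlines:
    # Collect every company whose name occurs in this headline.
    found = [company for company in companies if company in headline]
    if found:
        matching_headlines.append(headline)

print(matching_headlines)
```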
If you take this project a bit further, you might consider matching company names with NLTK or with scikit-learn's TfidfVectorizer.
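In pure pandas, without iterating or converting to lists:

First, join data to df so that each headline is "duplicated" once for every company name it is compared against. The temporary column 'key' is used to facilitate this join.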
In [60]: data_df = data.to_frame()

In [61]: data_df['key'] = 1

In [63]: df['key'] = 1

In [65]: merged = pd.merge(df, data_df, how='outer', on='key').drop('key', axis=1)
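On pandas 1.2 or newer, the same cross join can probably be written without the temporary key column by passing how="cross" to pd.merge. The sketch below rebuilds df and data from the sample values shown in this answer; the series name "company" is an assumption inferred from the column name in the merged output.

```python
import pandas as pd

df = pd.DataFrame({
    "headline": [
        "targets is making better stars in the bucks",
        "more diamonds than rocks in saturn rings",
        "diamond in the rough employees take too many naps",
    ],
    "source": ["target news", "wishful thinking", "refresh sleep"],
})
data = pd.Series(
    ["targets", "stars in the bucks", "wallymarty",
     "velocity global", "diamond in the rough"],
    name="company",  # assumed name; it becomes the column name after to_frame()
)

# how="cross" pairs every row of df with every company (requires pandas >= 1.2).
merged = pd.merge(df, data.to_frame(), how="cross")
```

With 3 headlines and 5 companies this yields 15 rows, so the size concern about the key-based join applies here as well.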

merged will look like the following. As you can see, depending on the size of your data, this method can produce a huge dataframe.
In [66]: merged
Out[66]:
                                             headline            source               company
0         targets is making better stars in the bucks       target news               targets
1         targets is making better stars in the bucks       target news    stars in the bucks
2         targets is making better stars in the bucks       target news            wallymarty
3         targets is making better stars in the bucks       target news       velocity global
4         targets is making better stars in the bucks       target news  diamond in the rough
5            more diamonds than rocks in saturn rings  wishful thinking               targets
6            more diamonds than rocks in saturn rings  wishful thinking    stars in the bucks
7            more diamonds than rocks in saturn rings  wishful thinking            wallymarty
8            more diamonds than rocks in saturn rings  wishful thinking       velocity global
9            more diamonds than rocks in saturn rings  wishful thinking  diamond in the rough
10  diamond in the rough employees take too many naps     refresh sleep               targets
11  diamond in the rough employees take too many naps     refresh sleep    stars in the bucks
12  diamond in the rough employees take too many naps     refresh sleep            wallymarty
13  diamond in the rough employees take too many naps     refresh sleep       velocity global
14  diamond in the rough employees take too many naps     refresh sleep  diamond in the rough

Then look for the company text in the headline. If it is found, put True in a new 'found' column, otherwise False.
In [67]: merged['found'] = merged.apply(lambda x: x['company'] in x['headline'], axis=1)

Then drop the headlines for which no match was found:
In [68]: found_df = merged.drop(merged[merged['found']==False].index)

In [69]: found_df
Out[69]:
                                             headline         source               company  found
0         targets is making better stars in the bucks    target news               targets   True
1         targets is making better stars in the bucks    target news    stars in the bucks   True
14  diamond in the rough employees take too many naps  refresh sleep  diamond in the rough   True
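The drop-by-index step can also be expressed as a boolean mask, which some readers find easier to follow. The sketch below uses a three-row stand-in for merged (so the index labels differ from the output above) and is only meant to illustrate the filtering idiom.

```python
import pandas as pd

# Small stand-in for the merged frame, using rows from the sample data.
merged = pd.DataFrame({
    "headline": [
        "targets is making better stars in the bucks",
        "targets is making better stars in the bucks",
        "diamond in the rough employees take too many naps",
    ],
    "company": ["targets", "wallymarty", "diamond in the rough"],
})

# True where the company name occurs in the headline of the same row.
mask = merged.apply(lambda row: row["company"] in row["headline"], axis=1)

# Keep only the matching rows; the original index labels are preserved.
found_df = merged[mask]
```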

If necessary, summarize with only the headline and company columns:
In [70]: found_df[['headline', 'company']]
Out[70]:
                                             headline               company
0         targets is making better stars in the bucks               targets
1         targets is making better stars in the bucks    stars in the bucks
14  diamond in the rough employees take too many naps  diamond in the rough

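Shortcut: the steps from 67 onward can be condensed into a single command (this one selects the headline and source columns):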
merged.drop(merged[merged.apply(lambda x: x['company'] in x['headline'], axis=1) == False].index)[['headline', 'source']]
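Since the question mentions wanting a function that can be reused on another column of the same dataframe, here is one possible shape for it. The function name find_matches, the "matches" column, and the "target" search term in the second call are all hypothetical. It avoids the cross join by checking each row's text against the list of terms and returns only the rows with at least one match.

```python
import pandas as pd

def find_matches(frame, column, terms):
    """Return the rows of frame whose text in `column` contains any of `terms`."""
    def matched_terms(text):
        return [term for term in terms if term in text]

    result = frame.copy()
    result["matches"] = result[column].apply(matched_terms)
    # An empty list is falsy, so this keeps only rows with at least one match.
    return result[result["matches"].apply(bool)]

df = pd.DataFrame({
    "headline": [
        "targets is making better stars in the bucks",
        "more diamonds than rocks in saturn rings",
        "diamond in the rough employees take too many naps",
    ],
    "source": ["target news", "wishful thinking", "refresh sleep"],
})
companies = ["targets", "stars in the bucks", "wallymarty",
             "velocity global", "diamond in the rough"]

by_headline = find_matches(df, "headline", companies)
by_source = find_matches(df, "source", ["target"])
```

Because non-matching rows are filtered out at the end, the result contains only the applicable headlines rather than every row of the dataframe.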

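Those are lists, not dataframes? It would be better to show the actual dataframes. Are you trying to compare the headline dataframe against a list of companies?

OK, I will edit... but yes, that is exactly what I am trying to do. The headlines in the dataframe should be checked against the entire list.

Thanks for the answer, it made the for loop run faster. How would you suggest I fix the for loop so that it works? For some reason, even without that line, it does not work... Python brings back the full list of all companies for every row in the df dataset.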