Python: How to speed up the sum of presence of keys in a series of documents? - Pandas

python string list pandas

I have a dataframe column containing documents like these:

38909    Hotel is an old style Red Roof and has not bee...
38913    I will never ever stay at this Hotel again. I ...
38914    After being on a bus for -- hours and finally ...
38918    We were excited about our stay at the Blu Aqua...
38922    This hotel has a great location if you want to...
Name: Description, dtype: object

For the sample keys, my current approach gives the following output:

38909    2
38913    2
38914    3
38918    0
38922    1
Name: Description, dtype: int64

I have 50000 rows. Is there a faster way to do this in nltk or pandas?

Edit: in case you are looking for the array of documents:

array([ 'Hotel is an old style Red Roof and has not been renovated up to the new standard, but the price was also not up to the level of the newer style Red Roofs. So, in overview it was an OK stay, and a safe',
   'I will never ever stay at this Hotel again. I stayed there a few weeks ago, and I had my doubts the second I checked in. The guy that checked me in, I think his name was Julio, and his name tag read F',
   "After being on a bus for -- hours and finally arriving at the Hotel Lawerence at - am, I bawled my eyes out when we got to the room. I realize it's suppose to be a boutique hotel but, there was nothin",
   "We were excited about our stay at the Blu Aqua. A new hotel in downtown Chicago. It's architecturally stunning and has generally good reviews on TripAdvisor. The look and feel of the place is great, t",
   'This hotel has a great location if you want to be right at Times Square and the theaters. It was an easy couple of blocks for us to go to theater, eat, shop, and see all of the activity day and night '], dtype=object)

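If you only need to check whether each value of the list is present:

import numpy as np

# Broadcast every document against every key; find returns -1 when a key is absent.
v = df['col'].values.astype(str)
a = (np.char.find(v[:, None], keys) >= 0).sum(axis=1)
print(a)
[2 1 1 0 0]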
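To make the broadcasting step concrete, here is a minimal self-contained sketch. The toy documents and the keys are made up for illustration, and the column name col simply follows the snippet above. Note that the intermediate array has one entry per (document, key) pair, so with roughly 50000 rows and 45000 keys it would likely be too large to build in one go.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column name 'col' comes from the snippet above.
df = pd.DataFrame({'col': ['Hotel is an old style Red Roof',
                           'We were excited about our stay',
                           'This hotel has a great location']})
keys = ['Hotel', 'old', 'great']  # made-up keys for illustration

v = df['col'].values.astype(str)            # shape (3,)
# v[:, None] has shape (3, 1); broadcasting against 3 keys gives shape (3, 3).
positions = np.char.find(v[:, None], keys)  # -1 where a key does not occur
hits = positions >= 0                       # boolean presence matrix
counts = hits.sum(axis=1)                   # number of keys present per document
print(counts)                               # [2 0 1]
```

The match is a case-sensitive substring test, which is why 'Hotel' is not found in the third document.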


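The following code is not exactly equivalent to your (slow) version, but it demonstrates the idea:

# A set lookup replaces scanning every key for every document.
keyset = frozenset(keys)
# Here df is the column of documents (a Series), so apply runs once per document.
df.apply(lambda x: len(keyset.intersection(x.split())))

Differences/limitations:

In your version, a word is counted even if it only appears as a substring of a word in the document. For example, if your keys contained the word tyl, it would be counted because "style" occurs in your first document.

My solution does not account for punctuation in the documents. For example, the word again in the second document comes out of split() with the full stop attached. This can be fixed by preprocessing the document (or postprocessing the result of split()) with a function that removes punctuation.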
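The punctuation fix described above could be sketched as follows. This is only an illustration under some assumptions: the regular expression, the helper name count_keys, the toy documents and the keys are all made up here, and lowercasing is added because the comments indicate that case-insensitive matching is acceptable.

```python
import re
import pandas as pd

# Hypothetical sample documents and keys, for illustration only.
docs = pd.Series([
    'I will never ever stay at this Hotel again. I stayed there',
    'Hotel is an old style Red Roof',
])
keys = ['again', 'hotel', 'tyl']

# Lowercase the keys once so that matching is case-insensitive (an assumption).
keyset = frozenset(k.lower() for k in keys)
# Runs of letters, digits and apostrophes count as words; punctuation is dropped.
word_re = re.compile(r"[A-Za-z0-9']+")

def count_keys(text):
    """Return how many distinct keys occur as whole words in text."""
    words = set(word_re.findall(text.lower()))
    return len(keyset & words)

counts = docs.apply(count_keys)
print(counts.tolist())  # [2, 1]
```

With this version "again." matches the key again, while tyl is no longer counted for "style", since only whole words are compared.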

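It looks like you can use np.char.count for this. It may be better to feed a boolean array into the count, where arr is the array of documents:

import numpy as np
[np.count_nonzero(np.char.count(i, keys) != 0) for i in arr]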
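As a rough illustration of what this one-liner computes, the sketch below runs it on two made-up documents with made-up keys. np.char.count returns the number of occurrences of each key, and comparing with zero turns that into a presence flag before counting.

```python
import numpy as np

# Hypothetical toy documents and keys, for illustration only.
arr = np.array(['Hotel is an old style Red Roof and the Hotel is OK',
                'A new hotel in downtown Chicago'])
keys = ['Hotel', 'tyl', 'Chicago']

# Occurrences of each key in the first document (substring, case-sensitive).
per_key = np.char.count(arr[0], keys)
print(per_key)  # [2 1 0]

# Number of keys present at least once in each document.
result = [np.count_nonzero(np.char.count(i, keys) != 0) for i in arr]
print(result)   # [2, 1]
```

Here 'Hotel' occurs twice in the first document but contributes only once to the result, and 'tyl' is counted because it is a substring of "style".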

My kernel crashed when I tried my method. Restarting it. A speed comparison will follow soon.

I don't understand what you mean. Keys as a list of lists?

No, all keys are single words, separated by spaces. There are around 45000 of them.

Hmmm, in the question none of the words are repeated, as in the first row "Hotel is an old style Red Roof and has not bee..."?

I was using set. Let me try the frozenset version.

@Bharathshetty I don't think set vs frozenset would make much of a difference, if any.

Forgot to mention that my set method failed. Your approach is very fast, yes.

If you just iterate over the elements of the set, as the code in the question does, then using a set for keys is pointless.

This solution works, TBH.

How many elements are there in keys in your actual case?

I added that there are about 45000 keys.

So for the edited sample, the output should be [2,1,2,0,0], right?

Are you looking for case-sensitive output?

Yes. I'm not too worried about case sensitivity; case-insensitive is fine too.

What are the keys? Something like a vocabulary?

I feel numpy is slow with strings. I learned something new, though.

@Bharathshetty Yes, that is the case.