Python: Using gensim Phrases on a pandas column with the apply method


I am trying to use gensim Phrases on one column of a df. A sample df is given below.

col1   col2
1      "this is test1 and is used for test1"
2      "this is content of row which is second row"
3      "this is the third row"

I wrote a method for bigrams:

from gensim.models.phrases import Phrases, Phraser

def bigrams(text):
    bigram = Phrases(text, min_count=1)
    bigram_mod = Phraser(bigram)
    return [bigram_mod[doc] for doc in text]

I tried:

df['col2'].apply(bigrams)
df['col2'].apply(lambda x: bigrams([x]))  # so that the text is enclosed in a list

But I get characters as output instead of bigrams. What am I missing here?

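Phrases needs an already-tokenized corpus.

Your question doesn't currently show what you are providing as the value of text to your bigrams() function, but you can't pass those row values as plain strings: you must first break them into the desired words, somehow.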
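The character-level output in the question follows from how Python iterates over strings. The sketch below does not use gensim at all; it only mimics the two nested loops that a corpus consumer such as `Phrases` performs (over sentences, then over the tokens in each sentence). The helper name `tokens_seen` is hypothetical and exists only for illustration.

```python
def tokens_seen(corpus):
    """Return the items a sentences-then-tokens loop would encounter."""
    seen = []
    for sentence in corpus:      # outer loop: one "sentence" per element
        for token in sentence:   # inner loop: one "token" per element
            seen.append(token)
    return seen

row = "this is the third row"

# A list holding one plain string: the string is the "sentence",
# so its "tokens" are individual characters.
as_string = tokens_seen([row])

# A list holding one list of words: the tokens are whole words.
as_words = tokens_seen([row.split()])

print(as_string[:4])  # ['t', 'h', 'i', 's']
print(as_words)       # ['this', 'is', 'the', 'third', 'row']
```

Passing the bare string (without the enclosing list) behaves the same way, since each character is then treated as a one-character sentence.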

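Separately: don't expect any meaningful results from a tiny toy-sized example, because Phrases needs a lot of data for its statistics-based word pairing to become useful. Note that even when it is useful, the pairings often won't match human ideas of what the meaningful groupings or entities are. There will be missing pairings that we would want as well as pairings that we would not want, and even careful parameter tuning will leave some of these "unnatural" choices. Still, such Phrases-processed text can be very useful for downstream classification or information retrieval.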

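So the gensim phraser needs a list of tokens. My solution was to convert the text into tokens and then wrap each row's tokens in a list, giving a list of lists: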

df['tokens']=df['text'].apply(tokenization_function)
df['tokens']=df['tokens'].apply(lambda x:[x])
df['bigrams']=df['tokens'].apply(bigrams)

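I tried using apply on a column that contains the tokens in a list. In that case, each word in the output is split into characters. The text in the question is sample text; I can't post the actual text here.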