Python: passing column values as an argument to the NLTK Snowball stemmer

Passing df['language'] works for stopwords, but not for the Snowball stemmer. Is there any way I can get around this?

So far I haven't found any clues.

import nltk
from nltk.corpus import stopwords
import pandas as pd
import re

df = pd.DataFrame([['A sentence in English', 'english'], ['En mening på svenska', 'swedish']], columns = ['text', 'language'])

def tokenize(text):
    tokens = re.split(r'\W+', text)
    return tokens

def remove_stopwords(tokenized_list, language):
    stopword = nltk.corpus.stopwords.words(language)
    text = [word for word in tokenized_list if word not in stopword]
    return text

def stemming(tokenized_text, l):
    ss = nltk.stem.SnowballStemmer(l)
    text = [ss.stem(word) for word in tokenized_text]
    return text

df['text_tokenized'] = df['text'].apply(lambda x: tokenize(x.lower()))
df['text_nostop'] = df['text_tokenized'].apply(lambda x: remove_stopwords(x, df['language']))
df['text_stemmed'] = df['text_nostop'].apply(lambda x: stemming(x, df['language']))

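I expected it to apply Snowball stemming for English and Swedish, just as it does when removing stopwords. Instead I get the following error message:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Try this: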
df['text_stemmed']=df.apply(lambda x: stemming(x['text_nostop'], x['language']), axis=1)

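Edit: When you use apply on a specific column, as in df['text_tokenized'].apply(lambda x: ...), the lambda receives x, which is the value of the text_tokenized column for each row in turn. df['language'], by contrast, does not refer to a specific row; it refers to the whole Series.
That is, when you try lambda x: remove_stopwords(x, df['language']), df['language'] does not return the 'language' value of the corresponding row. It returns a pandas Series containing both 'english' and 'swedish':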
0    english
1    swedish
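
To make the error message concrete, here is a minimal sketch of how a whole Series can trigger it. The SUPPORTED tuple and the is_supported helper are hypothetical stand-ins; the assumption is that the stemmer performs some check of this kind on the language name, which is not confirmed here against the NLTK source.

```python
import pandas as pd

df = pd.DataFrame(
    [["A sentence in English", "english"], ["En mening på svenska", "swedish"]],
    columns=["text", "language"],
)

# Hypothetical stand-in for a "is this language supported?" check.
SUPPORTED = ("english", "swedish")


def is_supported(language):
    return language in SUPPORTED


# One row's value is a plain string, so the membership test is unambiguous.
row_ok = is_supported(df.loc[0, "language"])

# The whole column is a Series: comparing it with each tuple element yields a
# boolean Series, which cannot be reduced to a single True/False.
try:
    is_supported(df["language"])
    series_error = None
except ValueError as exc:
    series_error = str(exc)

print(row_ok)
print(series_error)
```

The second call fails for the same reason as in the question: a function that expects one string receives the entire column.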

So the second line of code that uses apply should be changed as well:
df['text_nostop'] = df.apply(lambda x: remove_stopwords(x['text_tokenized'], x['language']), axis=1)
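
The row-wise pattern can be tried without NLTK. In the sketch below, tag_tokens is a hypothetical stand-in for stemming() or remove_stopwords(); the only point is that it receives one token list and one scalar language per row. Pairing the columns with zip is shown as an alternative that should give the same result.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "text_nostop": [["sentence", "in", "english"], ["mening", "på", "svenska"]],
        "language": ["english", "swedish"],
    }
)


def tag_tokens(tokens, language):
    # Hypothetical stand-in for stemming(): needs a single language string.
    return [f"{language}:{token}" for token in tokens]


# Row-wise apply: x is one row, so x["language"] is a single string.
df["tagged_apply"] = df.apply(
    lambda x: tag_tokens(x["text_nostop"], x["language"]), axis=1
)

# Alternative: pair the two columns element by element without apply.
df["tagged_zip"] = [
    tag_tokens(tokens, language)
    for tokens, language in zip(df["text_nostop"], df["language"])
]

print(df[["tagged_apply", "tagged_zip"]])
```

Either form keeps each row's text together with its own language, which is what the column-wise apply in the question could not do.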

Could you add some explanation for future readers? Why does your code solve the problem?