How to lemmatize a dataframe column in Python
How do I lemmatize a dataframe column? The CSV file "train.csv" looks like this:
id tweet
1 retweet if you agree
2 happy birthday your majesty
3 essential oils are not made of chemicals
I did the following:
import pandas as pd
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
train_data = pd.read_csv('train.csv', error_bad_lines=False)  # note: error_bad_lines is deprecated in newer pandas; on_bad_lines="skip" is the modern equivalent
print(train_data)
# Removing stop words
stop = stopwords.words('english')
test = pd.DataFrame(train_data['tweet'])
test.columns = ['tweet']
test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
print(test['tweet_without_stopwords'])
# TOKENIZATION
tt = TweetTokenizer()
test['tokenised_tweet'] = test['tweet_without_stopwords'].apply(tt.tokenize)
print(test)
Output:
0 retweet if you agree ... [retweet, agree]
1 happy birthday your majesty ... [happy, birthday, majesty]
2 essential oils are not made of chemicals ... [essential, oils, made, chemicals]
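One caveat with the stop-word step above: the membership test is case-sensitive, so a capitalized "If" would slip through. A minimal sketch of the same filtering with a hardcoded stand-in stopword set (hypothetical values, so it runs without `nltk.download('stopwords')`), lowercasing each word before comparing:

```python
import pandas as pd

# Hardcoded stand-in for stopwords.words('english'), illustration only.
stop = {"if", "you", "your", "are", "not", "of"}

df = pd.DataFrame({"tweet": ["Retweet If you agree"]})

# Lowercasing each word before the membership test catches "If" as well.
df["tweet_without_stopwords"] = df["tweet"].apply(
    lambda x: " ".join(w for w in x.split() if w.lower() not in stop)
)
print(df["tweet_without_stopwords"][0])  # Retweet agree
```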
I tried the following to lemmatize, but got this error: TypeError: unhashable type: 'list'
I would do the calculation on the DataFrame itself. Change:
lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in test['tokenised_tweet']]]
print(lemmatized)
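The TypeError comes from what that comprehension iterates over: looping over `test['tokenised_tweet']` yields one list per row, so `lmtzr.lemmatize()` receives a whole list, which the WordNet lookup then tries to use as a dictionary key. A pandas-only sketch of the iteration (made-up rows, just to show the types):

```python
import pandas as pd

# A column of token lists, shaped like test['tokenised_tweet'] above.
df = pd.DataFrame(
    {"tokenised_tweet": [["retweet", "agree"], ["happy", "birthday", "majesty"]]}
)

# Iterating the Series yields the row values, i.e. whole lists, not words:
items = [item for item in df["tokenised_tweet"]]
print([type(item).__name__ for item in items])  # ['list', 'list']
```

So `lmtzr.lemmatize(word)` is called with a list as `word`, hence "unhashable type: 'list'".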
Full code:
from io import StringIO
import pandas as pd
data = StringIO(
"""id;tweet
1;retweet if you agree
2;happy birthday your majesty
3;essential oils are not made of chemicals"""
)
test = pd.read_csv(data, sep=";")
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
# Removing stop words
stop = stopwords.words('english')
test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
print(test['tweet_without_stopwords'])
# TOKENIZATION
tt = TweetTokenizer()
test['tokenised_tweet'] = test['tweet_without_stopwords'].apply(tt.tokenize)
print(test)
lmtzr = WordNetLemmatizer()
test['lemmatize'] = test['tokenised_tweet'].apply(
lambda lst:[lmtzr.lemmatize(word) for word in lst])
print(test['lemmatize'])
Output:
0 [retweet, agree]
1 [happy, birthday, majesty]
2 [essential, oil, made, chemical]
Name: lemmatize, dtype: object
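The reason the `apply` version works: `apply` hands each row's list to the lambda, and the inner comprehension then runs word by word. The same pattern, sketched with a hypothetical stand-in for the lemmatizer (it just strips a trailing "s"), so it runs without the WordNet data:

```python
import pandas as pd

def fake_lemmatize(word):
    # Stand-in for WordNetLemmatizer.lemmatize, illustration only.
    return word[:-1] if word.endswith("s") else word

test = pd.DataFrame(
    {"tokenised_tweet": [["essential", "oils", "made", "chemicals"]]}
)
# apply() passes each row's list to the lambda; the comprehension
# then calls the lemmatizer on individual words, not on the list.
test["lemmatize"] = test["tokenised_tweet"].apply(
    lambda lst: [fake_lemmatize(word) for word in lst]
)
print(test["lemmatize"][0])  # ['essential', 'oil', 'made', 'chemical']
```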
It seems test is a list, so you can't call test['tokenised_tweet']. I think you need to provide more details of your code... What is test? Is it test = pd.DataFrame(train_data['tweet'])?
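To see why that diagnosis matters: string-keyed indexing only works on a DataFrame; on a plain list it raises exactly this kind of TypeError. A quick check (hypothetical values, just to contrast the two behaviors):

```python
import pandas as pd

# A plain list rejects string indices:
test_list = ["retweet if you agree"]
try:
    test_list["tokenised_tweet"]
    err = None
except TypeError as e:
    err = e
print(err)  # list indices must be integers or slices, not str

# A DataFrame accepts column names:
test_df = pd.DataFrame({"tweet": ["retweet if you agree"]})
print(test_df["tweet"][0])  # retweet if you agree
```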