Preprocessing string data in a Python 3.x DataFrame
I have a dataset of user reviews. I have loaded the dataset, and now I want to preprocess the review text (i.e. remove stopwords and punctuation, convert to lowercase, strip greetings, etc.) before fitting it to a classifier, but I am running into an error. Here is my code:
import pandas as pd
import numpy as np
df=pd.read_json("C:/Users/ABC/Downloads/Compressed/reviews_Musical_Instruments_5.json/Musical_Instruments_5.json",lines=True)
dataset=df.filter(['overall','reviewText'],axis=1)
def cleanText(text):
    """
    removes punctuation, stopwords and returns lowercase text in a list
    of single words
    """
    text = (text.lower() for text in text)

    from bs4 import BeautifulSoup
    text = BeautifulSoup(text).get_text()

    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)

    from nltk.corpus import stopwords
    clean = [word for word in text if word not in
             stopwords.words('english')]
    return clean
dataset['reviewText']=dataset['reviewText'].apply(cleanText)
dataset['reviewText']
I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-68-f42f70ec46e5> in <module>()
----> 1 dataset['reviewText']=dataset['reviewText'].apply(cleanText)
2 dataset['reviewText']
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-64-5c6792de405c> in cleanText(text)
10 from nltk.tokenize import RegexpTokenizer
11 tokenizer = RegexpTokenizer(r'\w+')
---> 12 text = tokenizer.tokenize(text)
13
14 from nltk.corpus import stopwords
~\Anaconda3\lib\site-packages\nltk\tokenize\regexp.py in tokenize(self, text)
127 # If our regexp matches tokens, use re.findall:
128 else:
--> 129 return self._regexp.findall(text)
130
131 def span_tokenize(self, text):
TypeError: expected string or bytes-like object
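The traceback points at tokenizer.tokenize(text): the generator expression (text.lower() for text in text) rebinds text to a generator, so by the time it reaches the tokenizer it is no longer a plain string, and RegexpTokenizer expects a string. A minimal corrected sketch of the cleaner, using re and a small illustrative stopword set so it runs without NLTK data (in the real code the set would come from nltk.corpus.stopwords.words('english')):

```python
import re

# Illustrative stopword set; the original code uses
# nltk.corpus.stopwords.words('english') instead.
STOPWORDS = {'a', 'an', 'the', 'is', 'and', 'i', 'it'}

def clean_text(text):
    """Lowercase one review string, tokenize on word characters, drop stopwords."""
    text = text.lower()                # text is a single string, not an iterable of strings
    tokens = re.findall(r'\w+', text)  # same effect as RegexpTokenizer(r'\w+')
    return [w for w in tokens if w not in STOPWORDS]

print(clean_text("The guitar IS great and it sounds amazing"))
```

The key change is calling .lower() on the string itself rather than looping over it; everything downstream then receives the type it expects.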
print(df)

To convert to lowercase:
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x : str.lower(x))
To remove punctuation and numbers:

import re
# note: '[\w]+' would keep digits, so match alphabetic runs only
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x : " ".join(re.findall('[a-zA-Z]+', x)))
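The lowercase and punctuation steps can also be done with pandas' vectorized string methods instead of apply. A sketch on a toy DataFrame (the column name matches the question; the review strings are made up):

```python
import pandas as pd

df = pd.DataFrame({'reviewText': ['Great Guitar!!', 'Bad strings, 2 stars...']})

# Vectorized: lowercase, replace every run of non-letter characters
# (punctuation, digits, extra spaces) with a single space, then trim.
df['reviewText'] = (df['reviewText']
                    .str.lower()
                    .str.replace(r'[^a-z]+', ' ', regex=True)
                    .str.strip())

print(df['reviewText'].tolist())
```

Vectorized string methods keep the cleaning step as one readable pipeline and avoid writing a Python-level lambda per step.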
To remove stopwords, you can install the stop-words package, or build your own list of stopwords and use it with a function:
from stop_words import get_stop_words
stop_words = get_stop_words('en')
def remove_stopWords(s):
    '''For removing stop words
    '''
    s = ' '.join(word for word in s.split() if word not in stop_words)
    return s
df.loc[:,"reviewText"] = df.reviewText.apply(lambda x: remove_stopWords(x))
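For stemming (asked about in the comments), one common option is NLTK's PorterStemmer, which needs no extra corpus download. A sketch along the same lines as the stopword function above; the helper name stem_words is my own:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_words(s):
    """Stem each whitespace-separated word in a review string."""
    return ' '.join(stemmer.stem(word) for word in s.split())

# Applied to the DataFrame the same way as the other steps:
# df.loc[:, "reviewText"] = df.reviewText.apply(stem_words)
print(stem_words('playing guitars'))
```

Stemming should run after lowercasing and punctuation removal, since the Porter rules assume lowercase word tokens.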
Comments:

- cleanText is very poorly defined; there are too many errors in it. But first, please indent your code properly.
- @Abdou I have indented my code now. I am new to this and trying to find a way to preprocess my data.
- Please don't do this... you loop over the same text repeatedly in several separate steps; they can be consolidated into a single pass.
- Thanks Sahil, your answer was really helpful. Could you also suggest a function for stemming?
- @Shubhamsing take a look at this function for stemming.
- I get AttributeError: 'Series' object has no attribute 'split'.
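As one comment points out, the steps above each loop over the text separately; they can be consolidated into a single pass per review. A sketch with an illustrative inline stopword set (in practice, swap in get_stop_words('en'); the data here is made up):

```python
import re
import pandas as pd

# Illustrative stopword set; replace with get_stop_words('en') in practice.
STOP_WORDS = {'the', 'is', 'and', 'a', 'it', 'i'}

def clean_review(text):
    """Single pass: lowercase, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return ' '.join(w for w in tokens if w not in STOP_WORDS)

df = pd.DataFrame({'reviewText': ['The guitar IS great!', 'I love it... 5 stars']})
df['reviewText'] = df['reviewText'].apply(clean_review)

print(df['reviewText'].tolist())
```

Note on the AttributeError: 'Series' object has no attribute 'split' mentioned in the comments: it occurs when the cleaning function is called on the whole column at once (e.g. remove_stopWords(df.reviewText)) instead of via apply, which passes one string at a time.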