ValueError: Iterable over raw text documents expected, string object received (TFIDF vectorization)


machine-learning nlp data-science python-3.9


I have tried many ways to solve this problem, but I still get the error, and I don't know why.

#Libraries imported
import pandas as pd
import numpy as np
import nltk
import sklearn
from sklearn import svm
from sklearn.model_selection import train_test_split
from nltk.probability import FreqDist
from collections import Counter
from nltk import word_tokenize,sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as pt
import string
from wordcloud import WordCloud
from PIL import Image
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.corpus import wordnet
from nltk.tag import pos_tag

TR_DS=pd.read_csv("D:\DataSet\Training_Dataset.csv")
TR_DS.drop(["title","date","subject"],axis=1,inplace=True)#Delete unnecessary columns
TR_DS.isnull().any()
TR_DS.head()

#text column with small letters
news_text=TR_DS['text'].str.lower()

#Remove Punctuations
news_text=news_text.str.replace(r'[^\w\d\s]',' ')

#Remove Digits
news_text=news_text.str.replace('\d+', '')

#Tokenization (Text as Words)
def tokenize(text):
    tokens= re.split('\W+',text)
    return tokens
news_text=news_text.apply(lambda x : word_tokenize(x.lower()))

#Remove StopWords
stop_words=stopwords.words("english")
def remove_stopwords(text):
    text = [word for word in text if word not in stop_words]
    return text
news_text=news_text.apply(lambda x: remove_stopwords(x))

#Lemmatization Words in Stem Form
WordNetLemmatizer = nltk.WordNetLemmatizer()
def lemmer(text):
    text = [WordNetLemmatizer.lemmatize(word, pos='v') for word in text]
    return text
news_text=news_text.apply(lambda x: lemmer(x))
print(news_text)

Output

0      [reuters, man, accuse, tackle, u, senator, ran...
1      [washington, reuters, republican, u, senators,...
2      [reuters, proposals, u, republicans, repeal, r...
3      [reuters, tax, plan, u, senate, republicans, r...
4      [washington, reuters, americans, likely, belie...
                             ...
295    [donald, trump, kick, hispanic, heritage, mont...
296    [us, even, conservatives, know, donald, trump,...
297    [take, office, donald, trump, repeatedly, take...
298    [sunday, august, th, salt, lake, city, police,...
299    [u, rep, tim, murphy, staunch, pro, life, repu...
Name: text, Length: 300, dtype: object

tfidf_vect = TfidfVectorizer(max_features=5000, ngram_range = (1,2))
def TFIDF(text):
    text = [tfidf_vect.fit(word) for word in text]
    return text
news_text=news_text.apply(lambda x: TFIDF(x))

Error

ValueError                                Traceback (most recent call last)
<ipython-input-232-886e227564fb> in <module>
      4     text = [tfidf_vect.fit(word) for word in text]
      5     return text
----> 6 news_text=news_text.apply(lambda x: TFIDF(x))

c:\users\mahad maqsood\appdata\local\programs\python\python39\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4210             else:
   4211                 values = self.astype(object)._values
-> 4212                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4213
   4214         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-232-886e227564fb> in <lambda>(x)
      4     text = [tfidf_vect.fit(word) for word in text]
      5     return text
----> 6 news_text=news_text.apply(lambda x: TFIDF(x))

<ipython-input-232-886e227564fb> in TFIDF(text)
      2 tfidf_vect = TfidfVectorizer()
      3 def TFIDF(text):
----> 4     text = [tfidf_vect.fit(word) for word in text]
      5     return text
      6 news_text=news_text.apply(lambda x: TFIDF(x))

<ipython-input-232-886e227564fb> in <listcomp>(.0)
      2 tfidf_vect = TfidfVectorizer()
      3 def TFIDF(text):
----> 4     text = [tfidf_vect.fit(word) for word in text]
      5     return text
      6 news_text=news_text.apply(lambda x: TFIDF(x))

c:\users\mahad maqsood\appdata\local\programs\python\python39\lib\site-packages\sklearn\feature_extraction\text.py in fit(self, raw_documents, y)
   1816         self._check_params()
   1817         self._warn_for_unused_params()
-> 1818         X = super().fit_transform(raw_documents)
   1819         self._tfidf.fit(X)
   1820         return self

c:\users\mahad maqsood\appdata\local\programs\python\python39\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1186         # TfidfVectorizer.
   1187         if isinstance(raw_documents, str):
-> 1188             raise ValueError(
   1189                 "Iterable over raw text documents expected, "
   1190                 "string object received.")

ValueError: Iterable over raw text documents expected, string object received.

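The traceback points at `tfidf_vect.fit(word)`: inside the list comprehension each `word` is a single string, while `fit` expects an iterable of documents, which is exactly what the `isinstance(raw_documents, str)` check rejects. One possible way around this, sketched below, is to join each token list back into one string per row and fit the vectorizer once on the whole column. The three-row `news_text` Series here is a hypothetical stand-in for the real data, and the names `documents`, `tfidf_matrix` and `error_message` are made up for illustration; this is a sketch under those assumptions, not necessarily the only fix.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the tokenized `news_text` Series from the question.
news_text = pd.Series([
    ["reuters", "man", "accuse", "tackle", "senator"],
    ["washington", "reuters", "republican", "senators"],
    ["donald", "trump", "kick", "hispanic", "heritage", "month"],
])

tfidf_vect = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

# Reproduce the problem: fit() is given one string instead of a collection of documents.
try:
    tfidf_vect.fit("reuters")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Join every token list into a single string so each row is one document,
# then fit and transform once on the whole corpus rather than word by word.
documents = news_text.apply(" ".join)
tfidf_matrix = tfidf_vect.fit_transform(documents)

print(error_message)
print(tfidf_matrix.shape)  # (number of documents, number of learned features)
```

Fitting once on all rows also matters for the scores themselves: the IDF part of TF-IDF is computed across documents, so fitting on one word or one row at a time would not give meaningful weights even if it ran.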