Python: How do I remove numbers from a DataFrame and implement CountVectorizer?


python pandas dataframe nlp


I have data in the following format:

    author  text
0   garyvee     A lot of people misunderstand Gary’s message o...
1   jasonfried  "I can’t remember having a goal. An actual goa...
2   biz         "Tools that can create media that looks and so...

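I tried the following to clean the text:

import re

# lowercase everything, then keep only runs of word characters
text_data.loc[:,"text"] = text_data.text.apply(lambda x : str.lower(x))
text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.findall(r'[\w]+',x)))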

I get the output below, but the text still contains numbers, which I don't want in my text analysis:

0    a lot of people misunderstand gary s message o...
1    i can t remember having a goal an actual goal ...
2    tools that can create media that looks and sou...
Name: text, dtype: object

But when I try to remove the numbers from the text strings:

text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.sub('^[0-9\.]*$','',x)))

I get this output:

0    a l o t o f p e o p l e m i s u n d e r s t a ...
1    i c a n t r e m e m b e r h a v i n g a g o a ...
2    t o o l s t h a t c a n c r e a t e m e d i a ...
Name: text, dtype: object

How can I avoid this? And how do I implement CountVectorizer?

I did make a mistake at this step:

text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.sub('^[0-9\.]*$','',x)))

It should be:

text_data.loc[:,"text"] = text_data.text.apply(lambda x : re.sub('^[0-9\.]*$','',x))
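The two effects discussed here can be reproduced on a plain string. The sketch below is only an illustration: the sample sentence is made up, and the `\w*\d\w*` pattern is an assumed alternative for dropping tokens that contain a digit, not something taken from the original post.

```python
import re

# Made-up sample text, not taken from the original data.
sample = "tools from the 20s cost 5000 a12zracs8z"

# " ".join iterates over its argument; a str yields single characters,
# which is why every letter ended up separated by a space.
spaced = " ".join("goal")

# '^[0-9\.]*$' is anchored at both ends, so it only matches when the
# whole string consists of digits and dots; mixed text is left unchanged.
unchanged = re.sub(r'^[0-9\.]*$', '', sample)

# Assumed alternative: replace every token that contains a digit with a
# space, then collapse the leftover whitespace.
no_digits = " ".join(re.sub(r'\w*\d\w*', ' ', sample).split())

print(spaced)
print(unchanged)
print(no_digits)
```

On the DataFrame, the same pattern could presumably be applied with text_data["text"].str.replace(r'\w*\d\w*', ' ', regex=True), assuming the column holds strings.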

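Why are you using " ".join?

I removed it, but the text data still contains numbers, although now all the words are separate.

Is your regex correct? Check by hand whether it matches what you expect.

'000', '100', '12', '16', '1st', '20', '200', '20s', '2nd', '30s', '3rd', '50', '5000', '503c', '52', '57', 'a12zracs8z': how can I remove these words?

Oh, figured it out, np.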
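The thread does not show how the remaining tokens were finally removed, so the following is only one possible approach. It relies on CountVectorizer's token_pattern parameter to keep purely alphabetic tokens; the docs list is a hypothetical stand-in for text_data["text"].

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for text_data["text"]; not the original data.
docs = [
    "a lot of people misunderstand the message in 2019",
    "i had a goal in the 20s and a 2nd goal",
    "tools a12zracs8z that create media 5000 tools",
]

# Tokens are runs of two or more letters bounded by word boundaries, so
# anything containing a digit or underscore ("20s", "2nd") never matches.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b[^\W\d_]{2,}\b")
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Learned terms in alphabetical order.
vocabulary = sorted(vectorizer.vocabulary_)

# Per-document counts for a single term.
goal_column = vectorizer.vocabulary_["goal"]
goal_counts = counts[:, goal_column].toarray().ravel().tolist()

print(vocabulary)
print(goal_counts)
```

Filtering inside the vectorizer leaves the text column untouched. If the numbers should disappear from the DataFrame itself, they would need to be stripped with a regex before calling fit_transform.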