Python: removing numbers from a DataFrame and implementing CountVectorizer
I have data in the following format:
author text
0 garyvee A lot of people misunderstand Gary’s message o...
1 jasonfried "I can’t remember having a goal. An actual goa...
2 biz "Tools that can create media that looks and so...
I tried the following to clean up the text:
import re
text_data.loc[:,"text"] = text_data.text.apply(lambda x : str.lower(x))
text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.findall(r'\w+',x)))
I got the output below, but it still contains numbers, which I do not want in my text analysis:
0 a lot of people misunderstand gary s message o...
1 i can t remember having a goal an actual goal ...
2 tools that can create media that looks and sou...
Name: text, dtype: object
But when I tried to remove the numbers from the text strings:
text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.sub('^[0-9\.]*$','',x)))
I got this output:
0 a l o t o f p e o p l e m i s u n d e r s t a ...
1 i c a n t r e m e m b e r h a v i n g a g o a ...
2 t o o l s t h a t c a n c r e a t e m e d i a ...
Name: text, dtype: object
How can I avoid this? And how do I implement CountVectorizer?

The mistake is in this step:
text_data.loc[:,"text"] = text_data.text.apply(lambda x : " ".join(re.sub('^[0-9\.]*$','',x)))
It should be:
text_data.loc[:,"text"] = text_data.text.apply(lambda x : re.sub('^[0-9\.]*$','',x))
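The two behaviours discussed here can be reproduced on a single string. The following is a minimal sketch, not the original poster's code: the sample sentence is made up, and the unanchored `\d+` pattern is one possible way to strip digits, assuming every run of digits should go. It shows that `" ".join` applied to a string iterates over its characters, and that the anchored pattern `^[0-9\.]*$` only matches when the entire string consists of digits and dots.

```python
import re

# Made-up sample sentence (assumption, not taken from the original data)
text = "i can t remember 2 goals in 1999"

# str.join iterates over its argument; a string is iterated character by
# character, so every character ends up separated by a space.
spaced = " ".join(re.sub(r'^[0-9\.]*$', '', text))

# The ^ and $ anchors require the WHOLE string to be digits and dots,
# so a sentence that merely contains digits is returned unchanged.
unchanged = re.sub(r'^[0-9\.]*$', '', text)
only_digits = re.sub(r'^[0-9\.]*$', '', "3.14")

# Unanchored alternative: replace every run of digits with a space,
# then collapse the leftover whitespace.
no_digits = " ".join(re.sub(r'\d+', ' ', text).split())

print(repr(spaced[:9]))
print(unchanged == text)
print(repr(only_digits))
print(no_digits)
```

Note that in the last line `" ".join` receives the list produced by `split()`, so it joins whole words rather than characters.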
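Why are you using " ".join?
I removed it, but there are still numbers in the text data, although the words are now separate.
Is your regex correct? Check manually whether your regex is correct.
'000','100','12','16','1st','20','200','20s','2nd','30s','3rd','50','5000','503c','52','57','a12zracs8z', how do I remove these words?
Oh, figured it out, np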
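The thread does not record how the tokens containing digits were finally removed. One possible approach, sketched here under stated assumptions, is to drop every token that contains a digit before vectorizing, and to give `CountVectorizer` a `token_pattern` that only accepts letters. The sample rows below are invented for illustration, and the two regular expressions are suggestions rather than the poster's solution.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Invented sample rows (assumption); the real data has the same two columns
text_data = pd.DataFrame({
    "author": ["a", "b"],
    "text": ["My 1st goal for 2020 was 5000 users",
             "The a12zracs8z token and 20s music"],
})

# Lowercase, drop any token that contains at least one digit
# (e.g. '2020', '1st', 'a12zracs8z'), then collapse extra whitespace.
cleaned = (
    text_data["text"]
    .str.lower()
    .str.replace(r"\b\w*\d\w*\b", " ", regex=True)
    .str.split()
    .str.join(" ")
)

# token_pattern keeps only tokens of two or more letters;
# [^\W\d_] means "a word character that is neither a digit nor an underscore".
vectorizer = CountVectorizer(token_pattern=r"(?u)\b[^\W\d_]{2,}\b")
counts = vectorizer.fit_transform(cleaned)   # sparse document-term matrix
vocab = sorted(vectorizer.vocabulary_)       # learned terms, alphabetical

print(cleaned.tolist())
print(vocab)
print(counts.toarray())
```

Because the custom `token_pattern` already rejects tokens with digits, the `str.replace` step is optional for the vectorizer itself; it is kept here so the cleaned column is also usable elsewhere.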