How do I preprocess NLP text (lowercase, remove special characters, remove numbers, remove emails, etc.) in one pass with Python?


Here are all the things I want to do to a Pandas dataframe in one pass in Python:
1. Lowercase text
2. Remove whitespace
3. Remove numbers
4. Remove special characters
5. Remove emails
6. Remove stop words
7. Remove NaN
8. Remove weblinks
9. Expand contractions (if possible, but not necessary)
10. Tokenize
Here is my own approach:

def preprocess(self, dataframe):
    self.log.info("In preprocess function.")

    dataframe1 = self.remove_nan(dataframe)
    dataframe2 = self.lowercase(dataframe1)
    dataframe3 = self.remove_whitespace(dataframe2)

    # Remove emails and websites before removing special characters
    dataframe4 = self.remove_emails(dataframe3)
    dataframe5 = self.remove_website_links(dataframe4)

    dataframe6 = self.remove_special_characters(dataframe5)
    dataframe7 = self.remove_numbers(dataframe6)
    self.remove_stop_words(dataframe7)  # Doesn't return anything for now
    dataframe8 = self.tokenize(dataframe7)

    self.log.info(f"Sample of preprocessed data: {dataframe8.head()}")

    return dataframe8

def remove_nan(self, dataframe):
    """Pass in a dataframe to remove NAN from those columns."""
    return dataframe.dropna()

def lowercase(self, dataframe):
    self.log.info("Converting dataframe to lowercase")
    lowercase_dataframe = dataframe.str.lower()
    return lowercase_dataframe


def remove_special_characters(self, dataframe):
    self.log.info("Removing special characters from dataframe")
    no_special_characters = dataframe.replace(r'[^A-Za-z0-9 ]+', '', regex=True)
    return no_special_characters

def remove_numbers(self, dataframe):
    self.log.info("Removing numbers from dataframe")
    removed_numbers = dataframe.str.replace(r'\d+', '', regex=True)
    return removed_numbers

def remove_whitespace(self, dataframe):
    self.log.info("Removing whitespace from dataframe")
    # replace more than 1 space with 1 space
    merged_spaces = dataframe.str.replace(r"\s\s+", ' ', regex=True)
    # delete beginning and trailing spaces
    trimmed_spaces = merged_spaces.str.strip()
    return trimmed_spaces

def remove_stop_words(self, dataframe):
    # TODO: An option to pass in a custom list of stopwords would be cool.
    set(stopwords.words('english'))
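    # NOTE: still a stub; one possible implementation is sketched after this code block.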

def remove_website_links(self, dataframe):
    self.log.info("Removing website links from dataframe")
    no_website_links = dataframe.str.replace(r"http\S+", "", regex=True)
    return no_website_links

def tokenize(self, dataframe):
    tokenized_dataframe = dataframe.apply(lambda row: word_tokenize(row))
    return tokenized_dataframe

def remove_emails(self, dataframe):
    no_emails = dataframe.str.replace(r"\S*@\S*\s?", "", regex=True)
    return no_emails

def expand_contractions(self, dataframe):
    # TODO: Not a priority right now. Come back to it later.
    return dataframe
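
Two of the methods above are still stubs. For remove_stop_words, here is a minimal sketch of one way to fill it in, assuming NLTK's English stopword list and that the Series still holds plain strings at that point in the pipeline (preprocess would then also need to keep the return value instead of discarding it):

def remove_stop_words(self, dataframe):
    self.log.info("Removing stop words from dataframe")
    # Build the stopword set once instead of re-reading it for every row
    stop_words = set(stopwords.words('english'))
    no_stop_words = dataframe.apply(
        lambda text: " ".join(w for w in text.split() if w not in stop_words)
    )
    return no_stop_words

For expand_contractions, one low-effort option (an assumption, not something the question already uses) is the third-party contractions package: contractions.fix("don't") returns "do not", so dataframe.apply(contractions.fix) would cover item 9.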

Without a sample dataframe I can't give exact code, but as mentioned in the comments, apply is the best option in my view. Something like

def preprocess_text(s):
    s = s.str.lower()
    s = s.fillna(fill_value)  # fill_value is whatever placeholder you want for NaN
    # ...remaining cleaning steps...
    return s

You can then use

# make sure only the string columns are objects; numbers can be numeric, datetimes can be datetimes, etc.
str_columns = df.select_dtypes(include='object').columns
df[str_columns] = df[str_columns].apply(preprocess_text)

Again, without a sample dataframe it's hard to be more specific, but this approach will work.
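
To make the idea concrete, here is a minimal sketch of what preprocess_text could look like end to end; the exact regex patterns, their order, and the empty-string fill value are illustrative assumptions rather than the answerer's code:

import pandas as pd

def preprocess_text(s):
    """Clean a Series of strings with chained, vectorised operations."""
    s = s.fillna("")                                         # remove NaN
    s = s.str.lower()                                        # lowercase
    s = s.str.replace(r"\S+@\S+", " ", regex=True)           # remove emails
    s = s.str.replace(r"http\S+", " ", regex=True)           # remove weblinks
    s = s.str.replace(r"\d+", " ", regex=True)               # remove numbers
    s = s.str.replace(r"[^a-z\s]", " ", regex=True)          # remove special characters
    s = s.str.replace(r"\s+", " ", regex=True).str.strip()   # collapse whitespace
    return s

df = pd.DataFrame({"Text": ["Email me at a@b.com or visit http://example.com, room 123!"]})
str_columns = df.select_dtypes(include='object').columns
df[str_columns] = df[str_columns].apply(preprocess_text)

One detail worth keeping: as the question's own code notes, the email and weblink patterns should run before special characters are stripped, because once punctuation such as "@" and "/" is gone those patterns no longer match.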

The following function does all the operations you mentioned:

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
import re
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer() 

def preprocess(sentence):
    sentence = str(sentence)
    sentence = sentence.lower()
    sentence = sentence.replace('{html}', "")
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', sentence)       # strip HTML tags
    rem_url = re.sub(r'http\S+', '', cleantext)    # strip weblinks
    rem_num = re.sub('[0-9]+', '', rem_url)        # strip numbers
    tokenizer = RegexpTokenizer(r'\w+')            # tokenize, dropping punctuation
    tokens = tokenizer.tokenize(rem_num)
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stopwords.words('english')]
    stem_words = [stemmer.stem(w) for w in filtered_words]
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]
    return " ".join(lemma_words)


df['cleanText'] = df['Text'].map(preprocess)
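
One caveat about the function above: stopwords.words('english') is re-read from the NLTK corpus for every token inside the list comprehension, which becomes slow on large frames. Below is a hedged variant of the same steps that builds the stopword set once; preprocess_fast is a name introduced here for illustration, not part of the original answer:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import RegexpTokenizer

stop_words = set(stopwords.words('english'))   # built once, reused for every row
tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_fast(sentence):
    # Same cleaning steps as preprocess above, with the stopword set cached
    sentence = str(sentence).lower()
    sentence = re.sub(r'<.*?>', '', sentence)      # strip HTML tags
    sentence = re.sub(r'http\S+', '', sentence)    # strip weblinks
    sentence = re.sub(r'[0-9]+', '', sentence)     # strip numbers
    tokens = tokenizer.tokenize(sentence)          # tokenize, dropping punctuation
    words = [w for w in tokens if len(w) > 2 and w not in stop_words]
    return " ".join(lemmatizer.lemmatize(stemmer.stem(w)) for w in words)

df['cleanText'] = df['Text'].map(preprocess_fast)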

I decided to use Dask, which lets you parallelize Python tasks on your local machine and plays well with Pandas, NumPy, and scikit-learn:

using

df.apply(preprocess)
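
A minimal sketch of what that can look like with Dask's dataframe API; the partition count, the 'Text' column name, and reusing the preprocess function defined earlier are assumptions:

import dask.dataframe as dd

# split the pandas dataframe into partitions that Dask can process in parallel
ddf = dd.from_pandas(df, npartitions=4)

# apply the row-wise preprocess function to the text column of each partition
ddf['cleanText'] = ddf['Text'].map(preprocess, meta=('Text', 'object'))

# trigger the parallel computation and collect the result back into pandas
df = ddf.compute()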