Python: Text preprocessing with NLTK


python twitter


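I am practicing using NLTK to strip certain features from raw tweets, and afterwards I want to drop the tweets that are (to me) irrelevant, such as empty tweets or single-word tweets. However, it seems that some single-word tweets are not removed. I am also unable to remove any stop word that sits at the beginning or the end of a sentence.

Any advice? For now, I would like to get a sentence back as output rather than a list of tokenized words.

Any other comments on improving the code (processing time, elegance) are welcome.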

import string
import numpy as np
import nltk
from nltk.corpus import stopwords

cache_english_stopwords=stopwords.words('english')
cache_en_tweet_stopwords=stopwords.words('english_tweet')

# For clarity, df is a pandas dataframe with a column['text'] together with other headers.

def tweet_clean(df):
    temp_df = df.copy()
    # Remove hyperlinks
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('https?:\/\/.*\/\w*', '', regex=True)
    # Remove hashtags
    # temp_df.loc[:,"text"]=temp_df.loc[:,"text"].replace('#\w*', '', regex=True)
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('#', ' ', regex=True)
    # Remove citations
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\@\w*', '', regex=True)
    # Remove tickers
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\$\w*', '', regex=True)
    # Remove punctuation
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[' + string.punctuation + ']+', '', regex=True)
    # Remove stopwords
    for tweet in temp_df.loc[:,"text"]:
        tweet_tokenized=nltk.word_tokenize(tweet)
        for w in tweet_tokenized:
            if (w.lower() in cache_english_stopwords) | (w.lower() in cache_en_tweet_stopwords):
                temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[\W*\s?\n?]'+w+'[\W*\s?]', ' ', regex=True)
                #print("w in stopword")
    # Remove quotes
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\&*[amp]*\;|gt+', '', regex=True)
    # Remove RT
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\s+rt\s+', '', regex=True)
    # Remove linebreak, tab, return
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[\n\t\r]+', ' ', regex=True)
    # Remove via with blank
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('via+\s', '', regex=True)
    # Remove multiple whitespace
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\s+\s+', ' ', regex=True)
    # Remove single word sentence
    for tweet_sw in temp_df.loc[:, "text"]:
        tweet_sw_tokenized = nltk.word_tokenize(tweet_sw)
        if len(tweet_sw_tokenized) <= 1:
            temp_df.loc["text"] = np.nan
    # Remove empty rows
    temp_df.loc[(temp_df["text"] == '') | (temp_df['text'] == ' ')] = np.nan
    temp_df = temp_df.dropna()
    return temp_df

What is df? A list of tweets?
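You should probably clean the tweets one at a time rather than as a list of tweets. Using a function tweet_cleaner(single_tweet) would be easier.
NLTK provides tools for cleaning tweets.
The re package provides good solutions for working with regular expressions.
I also suggest creating a variable so that temp_df.loc[:, "text"] is more convenient to work with.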
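
The one-tweet-at-a-time suggestion could look roughly like the sketch below. The name clean_one_tweet and the exact patterns are illustrative assumptions, not the answerer's code, and only the standard re and string modules are used. With a pandas DataFrame, the same function could presumably be applied through df["text"].apply(clean_one_tweet).

```python
import re
import string

# Patterns are compiled once and reused for every tweet.
URL_RE = re.compile(r"https?://\S+")
MENTION_OR_TICKER_RE = re.compile(r"[@$]\w+")
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_one_tweet(tweet):
    """Return a cleaned sentence for a single raw tweet (hypothetical helper)."""
    text = URL_RE.sub(" ", tweet)               # drop hyperlinks first
    text = MENTION_OR_TICKER_RE.sub(" ", text)  # drop @handles and $TICKERS
    text = PUNCT_RE.sub(" ", text)              # drop punctuation, including '#'
    return WHITESPACE_RE.sub(" ", text).strip()


tweets = ["Check this out https://t.co/abc123 @someone #nltk", "$TSLA is up!"]
cleaned = [clean_one_tweet(t) for t in tweets]
print(cleaned)
```

Working on one string at a time keeps each step easy to test, and it avoids rewriting the whole column once per stop word as the nested loop in the question does.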

Removing stop words from a sentence can be done with a list comprehension:
clean_wordlist = [i for i in sentence.lower().split() if i not in stopwords]
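
As a small runnable sketch of that list comprehension, the version below joins the surviving words back into a sentence, since the question asks for a sentence rather than a token list. STOP_WORDS is a tiny made-up set used only for illustration; in practice it could be built from nltk.corpus.stopwords.words('english'). Storing it as a set makes each membership test cheap, which should help with processing time.

```python
# A tiny illustrative stop list (assumption); a real one could come from
# nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "a", "is", "to", "of", "and"}


def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Drop stop words case-insensitively and return a sentence."""
    kept = [word for word in sentence.split() if word.lower() not in stop_words]
    return " ".join(kept)


print(remove_stop_words("The cat is happy to sleep"))
```

Because the sentence is split into words before filtering, a stop word at the very beginning or end is treated like any other word.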

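If you want to use regex (with the re package), you can
create a regex pattern made up of all the stop words (for the tweet_clean function):
stop_pattern = re.compile(r"(?siu)\b(?:" + "|".join(map(re.escape, stoplist)) + r")\b")

(?siu) sets the flags s (dot matches newlines), i (ignore case) and u (Unicode); the \b anchors keep the pattern from matching inside longer words.
Then use this pattern to clean any string:
clean_string = stop_pattern.sub("", input_string)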
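
The sketch below compares two ways of building such a pattern, to illustrate why stop words at the start or end of a sentence can survive. A pattern that demands a non-word character on both sides of the stop word (similar in spirit to the one in the question) cannot match at the string edges, whereas the zero-width \b boundary can. The stop list and the helper name strip_stop_words are assumptions made for this example.

```python
import re

stoplist = ["the", "is", "a", "to"]  # illustrative stop list (assumption)
alternation = "|".join(map(re.escape, stoplist))

# Requires one non-word character on BOTH sides, so it cannot match a
# stop word that starts or ends the string.
surrounded = re.compile(r"\W(?:" + alternation + r")\W", re.IGNORECASE)

# \b is a zero-width word boundary, which also matches at the string edges.
bounded = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def strip_stop_words(text, pattern):
    """Replace pattern matches with a space, then collapse whitespace."""
    return re.sub(r"\s+", " ", pattern.sub(" ", text)).strip()


sample = "The market is open to the"
print(strip_stop_words(sample, surrounded))  # leading and trailing "the" remain
print(strip_stop_words(sample, bounded))     # all stop words removed
```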

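(You can concatenate the two stop lists if you do not need to keep them separate.)
To remove one-word tweets, keep only the tweets that are longer than one word:

if len(tweet_sw_tokenized) > 1:
    keep.append(tweet_sw)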
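
A minimal sketch of that filter is shown below, using plain str.split() instead of an NLTK tokenizer so that it runs without extra downloads; keep_multiword_tweets is a hypothetical name. As a side note, in the question's code the assignment temp_df.loc["text"] = np.nan addresses a row labelled "text" rather than the tweet currently being inspected, which would explain why single-word tweets are not dropped there.

```python
def keep_multiword_tweets(tweets):
    """Keep only tweets that still contain more than one word."""
    return [tweet for tweet in tweets if len(tweet.split()) > 1]


cleaned_tweets = ["market open", "hello", "", "  ", "two words here"]
print(keep_multiword_tweets(cleaned_tweets))
```

Empty and whitespace-only strings split into zero words, so the same test also removes empty tweets.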
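Following mquantin's suggestion, I modified my code to clean each tweet individually as a sentence. Below is my amateur attempt, which I believe covers most cases (let me know if you come across any other case worth cleaning up):

Copied it from my messy code. df is a pandas.DataFrame with a "text" column. I prefer to keep separate stop lists for different kinds of tweets so that I do not mess up the original NLTK stop list.

Please edit your question to explain that df is a pandas DataFrame (as I gathered from your comment). Ideally, you should add a few lines of code so that anyone who wants to give you a better answer can run the complete snippet.

If you think your own answer solves your problem, you should eventually mark it as accepted. (But first I would fix your question and wait for better answers.)

Change cache_english_stopwords to a set.

How do I read a file containing multiple lines of tweets, such as project3.txt or a .json file?

I was working with this code and noticed that part of the URL was kept in the last tweet. I made some changes to fix that and to improve the structure: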
import string
import re
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer


# A set gives fast membership tests when filtering tokens
cache_english_stopwords = set(stopwords.words('english'))



def tweet_clean(tweet):
    # Remove tickers
    sent_no_tickers=re.sub(r'\$\w*','',tweet)
    print('No tickers:')
    print(sent_no_tickers)
    tw_tknzr=TweetTokenizer(strip_handles=True, reduce_len=True)
    temp_tw_list = tw_tknzr.tokenize(sent_no_tickers)
    print('Temp_list:')
    print(temp_tw_list)
    # Remove stopwords
    list_no_stopwords=[i for i in temp_tw_list if i.lower() not in cache_english_stopwords]
    print('No Stopwords:')
    print(list_no_stopwords)
    # Remove hyperlinks
    list_no_hyperlinks=[re.sub(r'https?:\/\/.*\/\w*','',i) for i in list_no_stopwords]
    print('No hyperlinks:')
    print(list_no_hyperlinks)
    # Remove hashtags
    list_no_hashtags=[re.sub(r'#', '', i) for i in list_no_hyperlinks]
    print('No hashtags:')
    print(list_no_hashtags)
    # Remove Punctuation and split 's, 't, 've with a space for filter
    list_no_punctuation=[re.sub(r'['+string.punctuation+']+', ' ', i) for i in list_no_hashtags]
    print('No punctuation:')
    print(list_no_punctuation)
    # Remove multiple whitespace
    new_sent = ' '.join(list_no_punctuation)
    # Remove any words with 2 or fewer letters
    filtered_list = tw_tknzr.tokenize(new_sent)
    list_filtered = [re.sub(r'^\w\w?$', '', i) for i in filtered_list]
    print('Clean list of words:')
    print(list_filtered)
    filtered_sent =' '.join(list_filtered)
    clean_sent=re.sub(r'\s\s+', ' ', filtered_sent)
    #Remove any whitespace at the front of the sentence
    clean_sent=clean_sent.lstrip(' ')
    print('Clean sentence:')
    print(clean_sent)
    # Hand the cleaned sentence back to the caller as well
    return clean_sent

s0='    RT @Amila #Test\nTom\'s newly listed Co. &amp; Mary\'s unlisted     Group to supply tech for nlTK.\nh.. $TSLA $AAPL https:// t.co/x34afsfQsh'
tweet_clean(s0)