Preprocessing tweets in Python: removing @ and #, removing stop words, and removing users from a list (Python, NLP, NLTK, spaCy)


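I wrote the code below, and now I want to preprocess the tweets. I converted the tokens to lowercase and wrote something to remove stop words, but it doesn't work. I also want to remove @ and #, and remove the user mentions. Can you help me?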




! pip install wget
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/tweets_en.txt'
wget.download(url, 'tweets_en.txt')
tweets = [line.strip() for line in open('tweets_en.txt', encoding='utf8')]

import spacy
from collections import Counter
# your code here
import itertools
nlp = spacy.load('en')

#Creates a list of lists of tokens
tokens = [[token.text for token in nlp(sentence)] for sentence in tweets[:200]]
print(tokens)


#to lower
token_l=[[w.lower() for w in line] for line in tokens]
token_l[:1]

#remove #

#remove stop word

#remove user

#remove @

from nltk.corpus import stopwords

# filtered_words = [[w for w in line] for line in tokens if w not in # stopwords.words('english')]

Always try to organize your code into functions: they are reusable, readable, and easy to call in a loop.

from nltk.corpus import stopwords
import spacy, re

# 'en' is the spaCy 2 shortcut; spaCy 3+ needs the full model name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')

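# English stop words only (stopwords.words() with no argument returns every language);
# a set makes the membership test in sanitize() fast
stop_words = set(stopwords.words('english'))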

def sanitize(input_string):
    """ Sanitize one string """

    # normalize to lowercase 
    string = input_string.lower()

    # spacy tokenizer 
    string_split = [token.text for token in nlp(string)]

    # in case the string is empty 
    if not string_split:
        return '' 

    # remove user 
    # assuming user is the first word and contains an @
    if '@' in string_split[0]:
        del string_split[0]

    # join back to string 
    string = ' '.join(string_split)

    # remove # and @
    for punc in '@#':
        string = string.replace(punc, '')

    # remove 't.co/' links
    # the dot is escaped so it only matches a literal '.'
    string = re.sub(r't\.co/\S+', '', string)

    # removing stop words 
    string = ' '.join([w for w in string.split() if w not in stop_words])

    return string 


# named 'examples' rather than 'list' to avoid shadowing the built-in list type
examples = ['@Jeff_Atwood Thank you for #stackoverflow', 'All hail @Joel_Spolsky t.co/Gsb7V1oVLU #stackoverflow']

list_sanitized = [sanitize(string) for string in examples]

Output:

['thank stackoverflow', 'hail joel_spolsky stackoverflow']
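
The function above assumes the user mention is the first token, which is why joel_spolsky survives in the second result. As a variation, and only as a sketch, a regular expression can drop @mentions wherever they appear, remove t.co links, and strip the # sign while keeping the hashtag text. The helper name `strip_twitter_markup` and both patterns are illustrative assumptions, not part of the original answer, and stop-word removal is left out here.

```python
import re

# Hypothetical patterns, chosen for illustration only.
MENTION_RE = re.compile(r'@\w+')                   # an @user handle anywhere in the text
LINK_RE = re.compile(r'(?:https?://)?t\.co/\S+')   # t.co short links, with or without scheme


def strip_twitter_markup(text):
    """Lowercase a tweet and remove mentions, t.co links and '#' signs."""
    text = text.lower()
    text = MENTION_RE.sub('', text)   # drop the whole handle, not just the '@'
    text = LINK_RE.sub('', text)      # drop short links
    text = text.replace('#', '')      # keep the hashtag word, drop the symbol
    return ' '.join(text.split())     # collapse the leftover whitespace


examples = ['@Jeff_Atwood Thank you for #stackoverflow',
            'All hail @Joel_Spolsky t.co/Gsb7V1oVLU #stackoverflow']
print([strip_twitter_markup(t) for t in examples])
```

Note that `@\w+` would also eat the domain part of an e-mail address, so this pattern is only reasonable for tweet-like text.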

"#string".replace("#", "")

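Sorry, I'm new to Python. How should I use this line of code?

Does this answer your question?

Thanks, I have searched a lot, but the problem is that I have a list of lists, and I don't know how to apply this code to a list of lists.

Thanks a lot, but I get NameError: name 'string' is not defined.

Yes, your code is correct, but I want to apply it to a list (because I have a list of lists), and I don't know how to iterate over it. Can you help me write the code?

Put the code in a function so you can loop over all the strings in the list.

No, you created a list; your data is a list of sentences (or tweets). I would clean the sentences first and then split the result into a list of tokens. Updated to use spaCy tokenization.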
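
For the list-of-lists case raised in the comments, one option is to write a function that cleans a single token list and call it once per inner list, so the filter condition sits inside the per-list logic rather than on the outer loop. This is a minimal sketch under stated assumptions: `STOP_WORDS` is a tiny hand-written stand-in so the example runs without downloading the NLTK corpus, `clean_tokens` is a hypothetical helper name, and the token lists are typed by hand rather than produced by spaCy.

```python
# Tiny stand-in stop-word set (an assumption for illustration);
# with NLTK installed, set(stopwords.words('english')) could be used instead.
STOP_WORDS = {'you', 'for', 'all', 'the', 'a', 'is'}


def clean_tokens(tokens):
    """Lowercase one list of tokens and drop mentions, '#' signs and stop words."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower()
        if tok.startswith('@'):    # skip user mentions entirely
            continue
        tok = tok.lstrip('#')      # '#word' becomes 'word', a lone '#' becomes ''
        if tok and tok not in STOP_WORDS:
            cleaned.append(tok)
    return cleaned


# Hand-written token lists standing in for tokenizer output.
token_lists = [
    ['@Jeff_Atwood', 'Thank', 'you', 'for', '#', 'stackoverflow'],
    ['All', 'hail', '@Joel_Spolsky', '#stackoverflow'],
]

# The outer comprehension walks the tweets; clean_tokens handles each inner list.
filtered = [clean_tokens(line) for line in token_lists]
print(filtered)
```

The commented-out attempt in the question fails because its `if w not in ...` clause is attached to the outer loop, where `w` is not defined yet; moving the test into the per-list step avoids that.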