Removing emojis and @users from a list in Python, plus punctuation (NLP problem): my emoji function does not work

python nlp


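I wrote the code below. My sentences come from Twitter. I want to remove all emoji from the list, but my emoji function doesn't work. Why?

I also want to remove the users. The user comes at the start of the sentence, but sometimes the user is kept and sometimes it is removed. My punctuation removal doesn't work either, so I commented it out. How can I fix this?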

import spacy, re
from nltk.corpus import stopwords  # provides stopwords.words() used below

nlp = spacy.load('en')

stop_words = [w.lower() for w in stopwords.words()]

def sanitize(input_string):
    """ Sanitize one string """

    # Remove emoji
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
    
    string = emoji_pattern.sub(r'', input_string) # No emoji

    # Normalize to lowercase 
    string = input_string.lower()

    # Spacy tokenizer 
    string_split = [token.text for token in nlp(string)]
    

    # In case the string is empty 
    if not string_split:
        return '' 

    # Remove user

    # Assuming user is the first word and contains an @
    if '@' in string_split[0]:
        del string_split[0]

    # Join back to string 
    string = ' '.join(string_split)

    # Remove # and @
    for punc in '":!@#':
        string = string.replace(punc, '')

    # Remove 't.co/' links
    string = re.sub(r'http//t.co\/[^\s]+', '', string, flags=re.MULTILINE)

    # Removing stop words 
    string = ' '.join([w for w in string.split() if w not in stop_words])

    # Punctuation
    # string = [''.join(w for w in string.split() if w not in string.punctuation) for w in string]

    # return string

#list = ['@cosmetic_candy I think a lot of people just enjoy being a pain in the ass on there',
 'Best get ready sunbed and dinner with nana today :)',
 '@hardlyin70 thats awesome!',
 'Loving this weather',
 '“@danny_boy_37: Just seen an absolute idiot in shorts! Be serious!” Desperado gentleman',
 '@SamanthaOrmerod trying to resist a hardcore rave haha! Resisting towns a doddle! Posh dance floor should wear them in quite easy xx',
 '59 days until @Beyonce!!! Wooo @jfracassini #cannotwait',
 'That was the dumbest tweet I ever seen',
 'Oh what to do on this fine sunny day?',
 '@Brooke_C_X hows the fish ? Hope they r ok. Xx',
 '@Jbowe_

I'm drawing on some other SO answers here:


removing textual emojis: https://stackoverflow.com/a/61758471/42346
removing graphical emojis: https://stackoverflow.com/a/50602709/42346


This will also remove any Twitter username wherever it appears in the string.
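To illustrate that last point, here is a minimal standard-library sketch of the username step on its own. The helper name `strip_users` is hypothetical, and the pattern assumes a username is an `@` followed by letters, digits or underscores, so it would also eat the domain part of an e-mail address.

```python
import re

# Assumption: a Twitter username is '@' followed by word characters.
USER_RE = re.compile(r'@\w+')

def strip_users(text):
    """Remove every @username, wherever it appears, and tidy the spacing."""
    without_users = USER_RE.sub('', text)
    # split()/join() collapses the double spaces left by removed names
    return ' '.join(without_users.split())

print(strip_users('59 days until @Beyonce!!! Wooo @jfracassini #cannotwait'))
# 59 days until !!! Wooo #cannotwait
```

The full function follows.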

import re

import emoji
import spacy
import stop_words

nlp = spacy.load('en_core_web_sm')

stopwords = [w.lower() for w in stop_words.get_stop_words('en')]

# Textual emoticons such as :) or ;-P.
# The pattern contains whitespace and comments, so it only works with re.VERBOSE.
emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
    )"""

def give_emoji_free_text(text):
    # emoji.get_emoji_regexp() was removed in emoji 2.0;
    # emoji.replace_emoji() does the same job in current versions
    return emoji.replace_emoji(text, replace='')

def sanitize(string):
    """ Sanitize one string """

    # remove 't.co/' links first, while the URL is still intact
    # (the emoticon pattern would otherwise match the ':/' in 'http://')
    string = re.sub(r'https?://t\.co/\S+', '', string)

    # remove graphical emoji
    string = give_emoji_free_text(string)

    # remove textual emoji
    string = re.sub(emoticon_string, '', string, flags=re.VERBOSE)

    # normalize to lowercase
    string = string.lower()

    # spacy tokenizer
    string_split = [token.text for token in nlp(string)]

    # in case the string is empty
    if not string_split:
        return ''

    # join back to string
    string = ' '.join(string_split)

    # remove user
    # assuming user has @ in front
    string = re.sub(r'@\w+', '', string)

    # remove # and @
    for punc in '":!@#':
        string = string.replace(punc, '')

    # removing stop words
    string = ' '.join([w for w in string.split() if w not in stopwords])

    return string
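
As for why the emoji step in the question seemed to do nothing: the stripped text is stored in `string`, but the very next statement computes `input_string.lower()`, so the lowercase step starts again from the untouched input and the emoji come back. Here is a minimal sketch of the difference, using only the four character ranges from the question (the function names `sanitize_buggy` and `sanitize_fixed` are hypothetical, and those ranges do not cover every emoji):

```python
import re

# The same four ranges as in the question.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+"
)

def sanitize_buggy(input_string):
    string = emoji_pattern.sub('', input_string)  # emoji removed here...
    string = input_string.lower()                 # ...but the result is discarded here
    return string

def sanitize_fixed(input_string):
    string = emoji_pattern.sub('', input_string)
    string = string.lower()                       # keep working on the cleaned text
    return string

tweet = 'Loving this weather \U0001F600'
print(repr(sanitize_buggy(tweet)))  # the emoji is still there
print(repr(sanitize_fixed(tweet)))  # 'loving this weather '
```

Note also that `return string` is commented out in the question, so as posted the function returns `None` whatever it does internally.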

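
On the punctuation step that the question had commented out: inside `sanitize` the name `string` is a local `str`, which hides the standard `string` module, so `string.punctuation` cannot be looked up there. One possible approach, sketched with the standard library only, is to drop every character whose Unicode category starts with `P`, which also covers curly quotes like the ones in the sample tweets. The name `strip_punctuation` is hypothetical, and treating `@`, `#` and `_` as removable is an assumption about what you want, so run it after the username step.

```python
import unicodedata

def strip_punctuation(text):
    """Drop characters in any Unicode punctuation category (Pc, Pd, Po, ...)."""
    kept = [ch for ch in text if not unicodedata.category(ch).startswith('P')]
    # Collapse the gaps left behind by removed characters.
    return ' '.join(''.join(kept).split())

print(strip_punctuation('Oh what to do on this fine sunny day?'))
# Oh what to do on this fine sunny day
print(strip_punctuation('“Be serious!” Desperado gentleman'))
# Be serious Desperado gentleman
```

Symbols such as `+`, `$` or `<` belong to the `S` categories rather than `P`, so this sketch leaves them in place, unlike a filter built on `string.punctuation`.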