Remove punctuation in Python, but keep emoticons
Tags: python, sentiment-analysis


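I am doing research on sentiment analysis. In a list of data, I want to remove all punctuation so that I get the words in their plain form. But I want to keep emoticons, such as :) and :/.

Is there a way in Python to say that I want to remove all punctuation unless it appears in a combination such as :) or :/?

Your best bet is probably to simply declare a list of emoticons as a variable, then compare your punctuation against that list. If it is not in the list, remove it from the string.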

Edit: instead of repeating a whole block of str.replace calls, you could try the following:

# Characters to strip; the closing quote of the string literal was missing
to_remove = ".,;:!()\""
for char in to_remove:
    message = message.replace(char, "")

Edit 2:

Technique-wise, the simplest approach is probably:

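from string import punctuation

emoticons = [":)", ":D", ":("]
# Translation table that deletes every ASCII punctuation character (Python 3)
strip_punct = str.maketrans("", "", punctuation)
word_list = message.split(" ")
# Keep emoticon tokens as they are; strip punctuation from every other token
cleaned = [w if w in emoticons else w.translate(strip_punct) for w in word_list]
output = " ".join(cleaned)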

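Again, this only works for emoticons that are separated from other characters, i.e. it works for "sure :D" but not for an emoticon glued to a word, as in "sorry:(".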
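One possible way around the separated-emoticon limitation, offered as a sketch rather than as part of the original answer, is to split the text on a regex alternation of the known emoticons and strip punctuation only from the pieces in between. The names `EMOTICONS` and `strip_keep_emoticons` are hypothetical, and the emoticon list is an assumption you would adapt to your data.

```python
import re
from string import punctuation

# Hypothetical emoticon list; extend it to match your data.
EMOTICONS = [":)", ":D", ":(", ":/", "<3"]

# Longest emoticons first, so a longer one is never shadowed by a shorter prefix.
_pattern = re.compile(
    "(" + "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True)) + ")"
)
# Translation table that deletes every ASCII punctuation character.
_strip = str.maketrans("", "", punctuation)


def strip_keep_emoticons(text):
    """Remove ASCII punctuation from text, leaving the listed emoticons intact."""
    parts = _pattern.split(text)
    # With one capturing group, re.split places the matched emoticons at odd indices.
    return "".join(
        part if i % 2 == 1 else part.translate(_strip)
        for i, part in enumerate(parts)
    )


print(strip_keep_emoticons("Sorry:( I was late, but sure :D"))
# Sorry:( I was late but sure :D
```

Note that any text that happens to contain one of the listed sequences, for example the :/ inside a URL, is kept as well, so the list should be chosen with the data in mind.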

You can try this regex:

(?<=\w)[^\s\w](?![^\s\w])

Usage:

import re
print(re.sub(r'(?<=\w)[^\s\w](?![^\s\w])', '', your_data))


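The idea is to match a special character (anything that is neither whitespace nor a word character) only if it is preceded by a word character, i.e. a letter, digit, or underscore, and is not followed by another special character.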
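To see what the pattern does and does not remove, here is a small sketch that runs it over a few made-up strings. The `samples` list and the name `PATTERN` are illustrative assumptions, not part of the original answer.

```python
import re

# The pattern from the answer: one non-word, non-space character that follows
# a word character and is not followed by another such character.
PATTERN = re.compile(r"(?<=\w)[^\s\w](?![^\s\w])")

# Made-up sample strings for illustration only.
samples = ["Sentence 1.", "Sentence 2 :)", "Hello:) there, friend!", "Hello:D"]
for s in samples:
    print(repr(s), "->", repr(PATTERN.sub("", s)))

# 'Sentence 1.' -> 'Sentence 1'
# 'Sentence 2 :)' -> 'Sentence 2 :)'
# 'Hello:) there, friend!' -> 'Hello:) there friend'
# 'Hello:D' -> 'HelloD'
```

The last sample shows a limitation: because D is a word character, the colon in "Hello:D" counts as ordinary punctuation and is removed, so emoticons that contain letters survive only when they are separated from the preceding word.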

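If the regex does not work the way you expect, you can customize it a bit. For example, if you do not want it to match commas, you can exclude them from the character class (for a negated class such as [^\s\w], that means adding the comma inside the brackets).

Building on what has already been done with str.replace, you can do the following: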

lines = [
    "Sentence 1.",
    "Sentence 2 :)",
    "Sentence <3 ?"
]

emoticons = {
    ":)": "000smile",
    "<3": "000heart"
}

emoticons_inverse = {v: k for k, v in emoticons.items()}

punctuation = ",./<>?;':\"[]\\{}|`~!@#$%^&*()_+-="

lines_clean = []
for line in lines:
    # Replace each emoticon with a punctuation-free placeholder
    for emote, rpl in emoticons.items():
        line = line.replace(emote, rpl)

    # Remove every punctuation character
    for char in punctuation:
        line = line.replace(char, "")

    # Turn the placeholders back into emoticons
    for rpl, emote in emoticons_inverse.items():
        line = line.replace(rpl, emote)

    lines_clean.append(line)

print(lines_clean)

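Sounds like you want to do string manipulation? There are several ways to do this, from str.replace to regex. What have you tried so far?

I did it with str.replace. I edited my original post to show my code. Can you elaborate on how exactly I would do that?

Well, you can try doing it a few different ways. Technique-wise, the simplest way I can think of is to use message.split to split your message into words. "Hello, friend :)" would end up as ["Hello,", "friend", ":)"]. Then iterate through the list, comparing each word against the list of emoticons you are looking for. If it is not in the list, use word.replace to remove all punctuation. Keep in mind that this only works for separated emoticons, i.e. it works for "Hi :)" but not for "Hello:D".

I completed the first step of splitting, but I don't understand how to write the for loop?

I added a simple piece of code to my post so that it is easier to read than in a comment. Let me know what you think.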


The str.replace script above prints:

['Sentence 1', 'Sentence 2 :)', 'Sentence <3 ']