Stripping punctuation with regex - Python

python regex

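I need to use a regex to strip punctuation from the start and end of a word. A regex seems like the best option for this. I don't want punctuation removed from words like "you're", which is why I'm not using .replace().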

You don't need a regular expression for this task. Use str.strip with string.punctuation:

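>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> '!Hello..'.strip(string.punctuation)
'Hello'
>>> ' '.join(word.strip(string.punctuation) for word in "Hello, world. I'm a boy, you're a girl.".split())
"Hello world I'm a boy you're a girl"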
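The same idea can be wrapped in a small function. This is only a sketch: the name strip_edge_punct and the choice to drop tokens that consist entirely of punctuation are assumptions for illustration, not part of the answer. Because it relies on string.punctuation, it handles ASCII punctuation only.

```python
import string


def strip_edge_punct(text, chars=string.punctuation):
    """Strip punctuation from both ends of each whitespace-separated word.

    Hypothetical helper: punctuation inside a word (the apostrophe in
    "you're") is left alone because str.strip only touches the edges.
    """
    stripped = (word.strip(chars) for word in text.split())
    # Drop tokens that were nothing but punctuation, e.g. a lone "-".
    return " ".join(word for word in stripped if word)


print(strip_edge_punct("\"Hello!\" - she said, you're late..."))
# Hello she said you're late
```

Passing a different chars string (for example one that also contains “ and ”) extends the set of characters that get stripped.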

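You can remove punctuation from the words of a text file or of a given string as follows (this version uses str.strip rather than a regular expression):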

new_data = []
# You can add or remove punctuation characters as you see fit.
# Raw string, so the backslash is taken literally.
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

with open('/home/rahul/align.txt', 'r') as f:
    all_words = f.read().split()

# To process a string instead of a file, split it the same way:
# all_words = "My name$#@ is . rahul -~".split()

for word in all_words:
    # Strip punctuation from both ends of the word; a token made up
    # only of punctuation becomes empty and is skipped.
    stripped = word.strip(punctuations)
    if stripped:
        new_data.append(stripped)

print(new_data)

P.S. Fix the indentation as needed.

Hope this helps.

I think this function is a short and clear way to remove punctuation:

import re
def remove_punct(text):
    new_words = []
    for word in text:
        w = re.sub(r'[^\w\s]', '', word)  # remove everything except word characters and whitespace
        w = re.sub(r'_', '', w)  # \w matches the underscore, so remove it separately
        new_words.append(w)
    return new_words
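A quick check of what this approach does to a list of words may help. The sketch below restates the function with the two substitutions merged into one pattern, which is an assumption about what an equivalent form looks like; the sample words are made up. Note that it also removes punctuation inside a word, so "you're" becomes "youre", which is not what the question asks for.

```python
import re


def remove_punct(words):
    # Drop anything that is not a word character or whitespace,
    # and drop underscores too (they count as word characters).
    return [re.sub(r"[^\w\s]|_", "", word) for word in words]


print(remove_punct(["Hello,", "you're", "snake_case", "..."]))
# ['Hello', 'youre', 'snakecase', '']
```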

If you insist on using a regex, I suggest this solution:

import re
import string
p = re.compile("[" + re.escape(string.punctuation) + "]")
print(p.sub("", "\"hello world!\", he's told me."))
### hello world hes told me
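The pattern above removes every punctuation character, including the apostrophe in "he's". If only punctuation at the edges of a word should go, as the question asks, one possible variation is to anchor the character class with lookarounds. This is a sketch, not part of the answer; the name edge_punct is made up for illustration.

```python
import re
import string

punct = "[" + re.escape(string.punctuation) + "]+"
# A run of punctuation is removed when it is not preceded by a
# non-space character (start of a word) or not followed by one
# (end of a word). Punctuation between two non-space characters stays.
edge_punct = re.compile(r"(?<!\S)" + punct + "|" + punct + r"(?!\S)")

print(edge_punct.sub("", "\"hello world!\", he's told me."))
# hello world he's told me
```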

Note also that you can pass your own set of punctuation characters:

my_punct = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '.',
           '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', 
           '`', '{', '|', '}', '~', '»', '«', '“', '”']

punct_pattern = re.compile("[" + re.escape("".join(my_punct)) + "]")
re.sub(punct_pattern, "", "I've been vaccinated against *covid-19*!") # the "-" symbol should remain
### Ive been vaccinated against covid-19

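For those who come here looking for a way to distinguish Unicode alphanumeric characters from everything else: in Python 3.x you can use \w and \W in a regular expression. This helped me write Control-Shift-Left/Right functionality for a Tkinter Text widget (skipping over things like the punctuation before a word). I found your post before I found my solution, so I thought it might help people in a similar situation.

Just out of curiosity, what would the regex way be?

re.sub(r'\S+', lambda m: re.sub(r'^\W+|\W+$', '', m.group()), '...')

Note: if you treat _ as punctuation, you need to replace \W with a more precise pattern, because \W does not match _.

Great! Thanks a lot! By the way, is this equivalent to yours?

re.sub(r'\S+', lambda m: re.match(r'^\W*(.*\w)\W*$', m.group()).group(1), text)

If so, which one is faster (or better)?

@AnmolSinghJaggi, you can measure them with timeit.

OK, but are they equivalent?

Note that you can simplify your condition further with

w = re.sub(r'([^\w\s]|_)', '', word)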
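The question of whether the two regex one-liners from the comments are equivalent can be checked directly. The sketch below wraps each in a function (the names strip_with_sub and strip_with_match and the sample sentence are assumptions for illustration). On tokens that contain at least one word character they appear to agree; on a token made up only of punctuation, the re.match version finds no match and raises AttributeError, while the re.sub version returns an empty token.

```python
import re
import timeit


def strip_with_sub(s):
    # Remove runs of non-word characters at either end of each token.
    return re.sub(r"\S+", lambda m: re.sub(r"^\W+|\W+$", "", m.group()), s)


def strip_with_match(s):
    # Capture up to the last word character of each token.
    # Assumes every token has a word character; otherwise re.match
    # returns None and .group raises AttributeError.
    return re.sub(
        r"\S+",
        lambda m: re.match(r"^\W*(.*\w)\W*$", m.group()).group(1),
        s,
    )


sample = "\"Hello, world!\" I'm a boy, you're a girl."
assert strip_with_sub(sample) == strip_with_match(sample)
print(strip_with_sub(sample))  # Hello world I'm a boy you're a girl

# The two differ on a token with no word characters.
print(repr(strip_with_sub("a - b")))  # 'a  b'
try:
    strip_with_match("a - b")
except AttributeError:
    print("match-based version fails on a punctuation-only token")

# Rough timing; the numbers depend on the machine and the input.
for fn in (strip_with_sub, strip_with_match):
    seconds = timeit.timeit(lambda: fn(sample), number=10_000)
    print(f"{fn.__name__}: {seconds:.3f}s")
```

Both versions leave underscores at the edges alone, since \W does not match _.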
