Python: split a string on punctuation (except the hashtag)

python regex string


How can I split a string on any punctuation and whitespace, except for the # character?

tweet="I went on #Russia to see the world cup. We lost!"

I would like to split the string above like this:

["I", "went", "on", "#Russia", "to", "see", "the", "world", "cup", "We", "lost"]

My attempt:

p = re.compile(r"\w+|[^\w\s]", re.UNICODE)

This does not work because it yields "Russia" (with "#" as a separate token) instead of "#Russia".

Use the re.findall function:

import re
tweet = "I went on #Russia to see the world cup. We lost!"
# Match runs of word characters and "#", so a hashtag stays in one piece
words = re.findall(r'[\w#]+', tweet)
print(words)

Output:

['I', 'went', 'on', '#Russia', 'to', 'see', 'the', 'world', 'cup', 'We', 'lost']
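To see why the pattern from the question splits the hashtag, it may help to run both patterns side by side. This is only a small sketch; the variable names attempt and fixed are made up for illustration.

```python
import re

tweet = "I went on #Russia to see the world cup. We lost!"

# The pattern from the question: words, or any single punctuation character.
# "#" is punctuation here, so it comes out as its own token.
attempt = re.findall(r"\w+|[^\w\s]", tweet)

# The pattern from the answer: "#" is part of the character class,
# so it stays attached to the word that follows it.
fixed = re.findall(r"[\w#]+", tweet)

print(attempt)
print(fixed)
```

The first list contains "#", "Russia", "." and "!" as separate items, while the second contains "#Russia" and no punctuation tokens.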


Using re.sub:

Example:

import re
tweet = "I went on #Russia to see the world cup. We lost!"
# Split on whitespace, then strip every character that is neither a word character nor "#"
res = [re.sub(r"[^\w#]", "", word) for word in tweet.split()]
print(res)

Output:

['I', 'went', 'on', '#Russia', 'to', 'see', 'the', 'world', 'cup', 'We', 'lost']
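One thing to watch with the split-then-strip approach: a token made only of punctuation (a lone "-", for instance) is reduced to an empty string. A possible way around that, as a sketch only, is to drop the empty results. The helper name clean_tokens and the sample sentence are assumptions for illustration, not part of the answer.

```python
import re


def clean_tokens(text):
    """Split on whitespace, strip punctuation except '#', drop empty tokens.

    Hypothetical helper, named here only for this example.
    """
    stripped = (re.sub(r"[^\w#]", "", word) for word in text.split())
    return [word for word in stripped if word]


sample = "We lost - badly! #Russia"

# Without the filter, the lone "-" would leave an empty string in the list.
unfiltered = [re.sub(r"[^\w#]", "", word) for word in sample.split()]
filtered = clean_tokens(sample)

print(unfiltered)
print(filtered)
```

Note that this approach only splits on whitespace, so something like "world-cup" would become "worldcup" rather than two words.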


Just include "#" in the character class.

What about Python's split function?

But then I would need to repeat the split call for every punctuation mark... Wouldn't a regex be faster?

It does not look like you are tokenizing the string: the regex you used tokenizes it, but your expected output is not a tokenization. Are you extracting letter-only words that may be preceded by "#"? Just try

re.findall(r'#?\b[^\W\d_]+\b', s)

If you also need to match digits and "_", you can simply use

re.findall(r'#?\b\w+\b', s)

By the way, are you using Python 2.x?

Note that this may be too "greedy", for example with strings such as

#rusia#football

Also, re.UNICODE is only needed in Python 2.x; in Python 3.x it is the default behavior.
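The points raised in the comments can be checked quickly in Python 3. The sketch below assumes the "greedy" remark is about a character class containing "#", which is a guess on my part; the sample strings other than #rusia#football are invented for the example.

```python
import re

tags = "#rusia#football"

# With "#" inside the character class, adjacent hashtags stay glued together.
glued = re.findall(r"[\w#]+", tags)

# An optional leading "#" followed by a whole word separates them.
separated = re.findall(r"#?\b\w+\b", tags)

text = "#Russia 2018 cup"

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# so this variant keeps letter-only words.
letters_only = re.findall(r"#?\b[^\W\d_]+\b", text)

# \w also accepts digits and underscores.
with_digits = re.findall(r"#?\b\w+\b", text)

# In Python 3, \w is Unicode-aware by default; re.ASCII switches that off.
unicode_default = re.findall(r"[\w#]+", "#Россия wins")
ascii_only = re.findall(r"[\w#]+", "#Россия wins", flags=re.ASCII)

for result in (glued, separated, letters_only, with_digits,
               unicode_default, ascii_only):
    print(result)
```

So if adjacent hashtags can occur in the input, the #?\b\w+\b form is probably the safer choice, and no re.UNICODE flag is needed on Python 3 for non-ASCII hashtags.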