Creating a regex in Python to remove the whitespace after a newline
I would like to know how to create a regular expression that removes the whitespace after a newline. For example, if my text looks like this:
So she refused to ex-
change the feather and the rock be-
cause she was afraid.
How can I produce something that gives me:
["so", "she", "refused", "to", "exchange", "the", "feather", "and", "the", "rock", "because", "she", "was", "afraid"]
I tried using replace("-\n", "") to join the halves back together, but I still end up with results like:
["be", "cause"] and ["ex", "change"]
Any suggestions? Thanks.
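(A minimal reproduction of the failure, sketched as a guess: the exact call order of the attempt isn't shown, but if the text is split into words before the replace, no single token contains "-\n", so the replace has nothing to join.)

```python
s = '''So she refused to ex-
change the feather and the rock be-
cause she was afraid.'''.lower()

# Splitting first separates the hyphenated halves, so the per-token
# replace never sees a '-\n' to remove and the fragments stay apart.
words = [w.replace('-\n', '') for w in s.split()]
print('ex-' in words, 'change' in words)
# True True
```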
import re
s = '''So she refused to ex-
change the feather and the rock be-
cause she was afraid.'''.lower()
s = re.sub(r'-\n\s*', '', s) # join hyphens
s = re.sub(r'[^\w\s]', '', s) # remove punctuation
print(s.split())
\s* means zero or more whitespace characters.
As far as I can tell, Alex Hall's answer addresses your question more fully (explicitly with the regex, and implicitly by lowercasing and removing the punctuation), but this stood out as a good candidate for a generator.
Here, a generator joins tokens popped from a stack-like list:
s = '''So she refused to ex-
change the feather and the rock be-
cause she was afraid.'''
def condense(lst):
    while lst:
        tok = lst.pop(0)
        if tok.endswith('-'):
            yield tok[:-1] + lst.pop(0)
        else:
            yield tok
print(list(condense(s.split())))
# Result:
# ['So', 'she', 'refused', 'to', 'exchange', 'the', 'feather',
# 'and', 'the', 'rock', 'because', 'she', 'was', 'afraid.']
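One caveat: condense assumes that a token ending in a hyphen is always followed by another token; on input whose last token ends with '-', the inner lst.pop(0) raises IndexError. A defensive sketch (the name condense_safe is mine):

```python
def condense_safe(lst):
    # Join a trailing-hyphen token with its successor, but leave the
    # hyphen alone when there is no following token left to pop.
    while lst:
        tok = lst.pop(0)
        if tok.endswith('-') and lst:
            yield tok[:-1] + lst.pop(0)
        else:
            yield tok

print(list(condense_safe('So she refused to ex- change'.split())))
# ['So', 'she', 'refused', 'to', 'exchange']
```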
You can use a pattern with an optional hyphen before the newline:
-?\n\s*
No replacement beyond the empty string is needed; see the demo. For the second part, I'd suggest nltk, so that you can do:
import re
from nltk import word_tokenize
string = """
So she refused to ex-
change the feather and the rock be-
cause she was afraid.
"""
rx = re.compile(r'-?\n\s*')
words = word_tokenize(rx.sub('', string))
print(words)
# ['So', 'she', 'refused', 'to', 'exchange', 'the', 'feather', 'and', 'the', 'rock', 'because', 'she', 'was', 'afraid', '.']
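The word_tokenize output keeps '.' as its own token and preserves the original case; to get exactly the lowercase, punctuation-free list the question asks for, a small post-processing sketch (run here on the token list literally, so it needs no NLTK data):

```python
tokens = ['So', 'she', 'refused', 'to', 'exchange', 'the', 'feather',
          'and', 'the', 'rock', 'because', 'she', 'was', 'afraid', '.']

# Lowercase every token and drop anything that is not purely alphabetic,
# which removes the trailing '.' token.
words = [t.lower() for t in tokens if t.isalpha()]
print(words)
# ['so', 'she', 'refused', 'to', 'exchange', 'the', 'feather',
#  'and', 'the', 'rock', 'because', 'she', 'was', 'afraid']
```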
What happens if the hyphen comes in a different encoding? I tried to use your code with the text I was given, and when I look at the printed output it shows a '\xad'.
@john Replace the - with [-\xad].
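Applied to the pattern above, that fix looks like this (a sketch: U+00AD is the Unicode soft hyphen, which often survives PDF or clipboard extraction as '\xad'):

```python
import re

# Treat both the ASCII hyphen and the Unicode soft hyphen (U+00AD)
# as joiners before the newline.
rx = re.compile(r'[-\xad]?\n\s*')
text = 'So she refused to ex\xad\nchange the feather.'
print(rx.sub('', text))
# So she refused to exchange the feather.
```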