Removing characters from a Python string

python regex

I have several Python strings from which I want to remove unwanted characters.

Examples:

"This is '-' a test" 
     should be "This is a test"
"This is a test L)[_U_O-Y OH : l’J1.l'}/"
     should be "This is a test"
"> FOO < BAR" 
     should be "FOO BAR"
"I<<W5§!‘1“¢!°\" I" 
     should be "" 
     (because if only words are extracted then it returns I W I and none of them form words)
"l‘?£§l%nbia  ;‘\\~siI.ve_rswinq m"
     should be ""
"2|'J]B"
     should be ""

Attempt with re.sub:

>>> import re
>>> line = re.sub(r"\W+", "", "This is '-' a test")
>>> line
'Thisisatest'
>>> line = re.sub(r"\W+", "", "This is a test L)[_U_O-Y OH : l’J1.l'}/")
>>> line
'ThisisatestL_U_OYOHlJ1l'
# Ideally this would be "This is a test"; if that is not possible,
# I would settle for something close to it.
>>> line = re.sub(r"\W+", "", "> FOO < BAR")
>>> line
'FOOBAR'
>>> line = re.sub(r"\W+", "", "I<<W5§!‘1“¢!°\" I")
>>> line
'IW51I'
>>> line = re.sub(r"\W+", "", "l‘?£§l%nbia  ;‘\\~siI.ve_rswinq m")
>>> line
'llnbiasiIve_rswinqm'
>>> line = re.sub(r"\W+", "", "2|'J]B")
>>> line
'2JB'

Later, I will filter the regex-cleaned words against a predefined list of words.

I will use split and filter, as follows:

' '.join(word for word in line.split() if word.isalpha() and word.lower() in list)

This removes every non-alphabetic word, as well as every alphabetic word that is not in the list.

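def myfilter(line):
    # Predefined words to keep, stored in a set for O(1) membership tests.
    words = {'this', 'test', 'i', 'a', 'foo', 'bar'}
    # Keep only purely alphabetic tokens whose lowercase form is in the set.
    return ' '.join(word for word in line.split() if word.isalpha() and word.lower() in words)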

>>> myfilter("This is '-' a test")
'This a test'
>>> myfilter("This is a test L)[_U_O-Y OH : l’J1.l'}/")
'This a test'
>>> myfilter("> FOO < BAR")
'FOO BAR'
>>> myfilter("I<<W5§!‘1“¢!°\" I")
'I'
>>> myfilter("l‘?£§l%nbia  ;‘\\~siI.ve_rswinq m")
''
>>> myfilter("2|'J]B")
''

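This option removes any group of non-space characters that contains at least one non-letter character. It does, however, leave some unwanted groups of letters behind: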

re.sub(r"\w*[^a-zA-Z ]+\w*","","This is a test L)[_U_O-Y OH : l’J1.l'}/")

gives:

'This is a test  OH  '

It also leaves runs of multiple spaces:

re.sub(r"[^a-zA-Z ]+\w*","","This is '-' a test")
'This is  a test'  # two spaces
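The leftover runs of spaces can be collapsed in a second step. The following is a minimal sketch, assuming the first pattern from this answer is used unchanged; the names `NOISY_TOKEN` and `strip_noisy_tokens` are introduced here for illustration only and do not come from the thread. It does not solve the leftover letter groups such as "OH", which still need the word-list filter.

```python
import re

# Pattern from the answer above: an optional run of word characters,
# at least one character that is neither an ASCII letter nor a space,
# then another optional run of word characters.
NOISY_TOKEN = re.compile(r"\w*[^a-zA-Z ]+\w*")


def strip_noisy_tokens(line):
    """Remove tokens that contain a non-letter, then normalise whitespace."""
    cleaned = NOISY_TOKEN.sub("", line)
    # str.split() without arguments splits on any run of whitespace and
    # drops empty strings, so re-joining collapses repeated spaces.
    return " ".join(cleaned.split())


print(strip_noisy_tokens("This is '-' a test"))   # This is a test
print(strip_noisy_tokens("> FOO < BAR"))          # FOO BAR
print(strip_noisy_tokens("This is a test L)[_U_O-Y OH : l’J1.l'}/"))  # This is a test OH
```

Splitting and re-joining also trims leading and trailing spaces, which a substitution such as `re.sub(r" +", " ", ...)` would not do on its own.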

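There are one-letter words, though: "I" and "a"/"A".

Updated. In this case I am not matching against dictionary words but against a predefined list of words. So yes, if "I" is in the predefined list, then it is fine.

I deleted my answer because I had not followed the word-extraction requirement carefully. Still, for the string "l‘?£§l%nbia  ;‘\\~siI.ve_rswinq m", should any words be extracted at all?

r'[^\w\s]+' will match every run of non-word, non-whitespace characters.

Is it correct to describe your filter as "split the string on whitespace, remove every element that contains a non-letter character, then join the rest with spaces"?

This is a good answer because it scales well, but it needs two tweaks: 1) list should be a set, for O(1) lookups; 2) do not shadow a built-in type (list) with a local variable.

@roippi: Thanks for the suggestions, I will incorporate them into my answer.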
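The suggestions in these comments can be combined into one sketch: strip punctuation with the `r'[^\w\s]+'` pattern first, then filter the remaining tokens against a set. This is only an illustration under stated assumptions, not the answerer's code. The names `ALLOWED_WORDS`, `PUNCTUATION` and `filter_words` are hypothetical, and the word set below is made up (it deliberately omits "i" so that the fourth example from the question yields an empty string).

```python
import re

# Hypothetical predefined word list. A frozenset gives O(1) membership
# tests and avoids shadowing the built-in name `list`.
ALLOWED_WORDS = frozenset({"this", "is", "a", "test", "foo", "bar"})

# Pattern quoted in the comments: runs of characters that are neither
# word characters nor whitespace.
PUNCTUATION = re.compile(r"[^\w\s]+")


def filter_words(line, allowed=ALLOWED_WORDS):
    """Strip punctuation, then keep alphabetic tokens found in `allowed`."""
    tokens = PUNCTUATION.sub("", line).split()
    return " ".join(
        token for token in tokens
        if token.isalpha() and token.lower() in allowed
    )


print(filter_words("This is '-' a test"))      # This is a test
print(filter_words("> FOO < BAR"))             # FOO BAR
print(filter_words("I<<W5§!‘1“¢!°\" I"))       # prints an empty line
print(filter_words("It was a test."))          # a test
```

Unlike a plain split-and-filter, stripping punctuation first means a token such as "test." is kept as "test". Whether that is desirable depends on how noisy the input is: it also means that a fragment like "fo|o" would be repaired into "foo".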