Python: regex to remove non-alphabetic words (A-Z, a-z) from a list, with exceptions


I'm trying to remove the words that contain non-alphabetic characters from a list of strings, for example:

["The", "sailor", "is", "sick", "."] -> ["The", "sailor", "is", "sick"]
But I can't indiscriminately remove every word that contains a non-alpha character, because cases like this can come up:

["The", "U.S.", "is", "big", "."] -> ["The", "U.S.", "is", "big"] (acronym kept but period is removed)
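I need to come up with a regex or something similar that handles simple cases like these, for all kinds of punctuation.

I'm using a natural-language wrapper class to turn sentences into lists like the ones on the left, but sometimes the list is a lot messier: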

string:   "round up the "blonde bombshells' a all (well almost all)"
list: ["round", "up", "the", "''", "blonde", "bombshell", "\\", 
          "a", "all", "-lrb-", "well", "almost", "all", "-rrb-"]
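As you can see, some characters, such as parentheses and apostrophes, are converted or dropped by the wrapper. I'd like to strip out all of these extraneous substrings to get a cleaner result: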

list: ["round", "up", "the", "blonde", "bombshell",
          "a", "all", "well", "almost", "all"]

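I'm fairly new to Python. My impression is that a regex would be the best approach here, but I don't know how to turn the first list into the cleaned-up second one. Any help would be appreciated.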

Make sure each string contains at least one alphanumeric character:

import re

expr = re.compile(r"\w+")
test = ["And", ",", "there", "she", "is", ".", "U.S."]

filtered = [v for v in test if expr.search(v)]
print(filtered)
This prints:

['And', 'there', 'she', 'is', 'U.S.']
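
A caveat worth seeing in action: `\w` matches letters, digits and the underscore, and `search()` looks anywhere in the string, so this filter keeps any token that has a word character somewhere inside it. The token list below is made up for illustration and mixes items from the question with a few extra ones.

```python
import re

expr = re.compile(r"\w+")

# Illustrative tokens (assumed, not from the answer): digits, an underscore
# and the "-lrb-" placeholder all contain at least one \w character.
tokens = ["U.S.", "42", "_", "-lrb-", "''", "\\", "."]

# search() succeeds if a word character appears anywhere in the token
kept = [t for t in tokens if expr.search(t)]
print(kept)  # ['U.S.', '42', '_', '-lrb-']
```

So this version removes pure-punctuation tokens but lets `-lrb-` and `-rrb-` through.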
An alternative is to exclude digits and make sure the string does not start with a non-alphabetic character:

# only alpha
expr = re.compile(r"[a-zA-Z]+")
test = ["round", "up", "the", "''", "blonde", "bombshell", "\\",
        "a", "all", "-lrb-", "well", "almost", "all", "-rrb-"]
# use match() here
filtered = [v for v in test if expr.match(v)]
print(filtered)
This prints:

['round', 'up', 'the', 'blonde', 'bombshell', 'a', 'all', 'well', 'almost', 'all']
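
The switch from `search()` to `match()` is what removes `-lrb-` and `-rrb-`: `match()` only succeeds at the start of the string, whereas `search()` scans the whole string. A small side-by-side sketch, with variable names chosen here for illustration:

```python
import re

alpha = re.compile(r"[a-zA-Z]+")
tokens = ["well", "-lrb-", "U.S.", "''"]

# search(): a letter anywhere in the token is enough
with_search = [t for t in tokens if alpha.search(t)]

# match(): the token has to begin with a letter
with_match = [t for t in tokens if alpha.match(t)]

print(with_search)  # ['well', '-lrb-', 'U.S.']
print(with_match)   # ['well', 'U.S.']
```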

This seems to match what you describe:

cases=[
    ["The", "sailor", "is", "sick", "."],
    ["The", "U.S.", "is", "big", "."],
    ["round", "up", "the", "''", "blonde", "bombshell", "\\", 
    "a", "all", "-lrb-", "well", "almost", "all", "-rrb-"],
]

import re

for li in cases:
    # print(...) with a single argument runs on both Python 2.7 and Python 3
    print('{}\n\t->{}'.format(li, [w for w in li if re.search(r'^[a-zA-Z]', w)]))
This prints:

['The', 'sailor', 'is', 'sick', '.']
    ->['The', 'sailor', 'is', 'sick']
['The', 'U.S.', 'is', 'big', '.']
    ->['The', 'U.S.', 'is', 'big']
['round', 'up', 'the', "''", 'blonde', 'bombshell', '\\', 'a', 'all', '-lrb-', 'well', 'almost', 'all', '-rrb-']
    ->['round', 'up', 'the', 'blonde', 'bombshell', 'a', 'all', 'well', 'almost', 'all']
If that is correct, you don't need a regex at all:

for li in cases:
    # keep only the words whose first character is a letter
    print('{}\n\t->{}'.format(li, [w for w in li if w[0].isalpha()]))
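
If the list should be returned rather than printed, the same test can be wrapped in a function. This is a minimal sketch: the name `keep_alpha_initial` is made up here, and the slice `t[:1]` is an added guard so that an empty string is dropped instead of raising `IndexError`.

```python
def keep_alpha_initial(tokens):
    """Return only the tokens whose first character is a letter."""
    # t[:1] is "" for an empty token, and "".isalpha() is False
    return [t for t in tokens if t[:1].isalpha()]


messy = ["round", "up", "the", "''", "blonde", "bombshell", "\\",
         "a", "all", "-lrb-", "well", "almost", "all", "-rrb-"]
cleaned = keep_alpha_initial(messy)
print(cleaned)
```

Note that a rule based on the first character also drops tokens such as `'s` that begin with an apostrophe, which may or may not be what you want.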
You can use string.punctuation to do this:

>>> from string import punctuation
>>> lst = ["The", "U.S.", "is", "big", "."]
>>> [i for i in lst if i not in punctuation]
['The', 'U.S.', 'is', 'big']

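Doesn't work for the last list: ["round", "up", "the", "''", "blonde", "bombshell", "\\", "a", "all", "-lrb-", "well", "almost", "all", "-rrb-"]

Works fine for me on Python 3.5.

Close, but this doesn't work for the blonde bombshell case.

I'm using Python 2.7 because the wrapper class is Python 2 specific; unfortunately it can't be done in Python 3.

I don't have to print the list (just return it). Just wondering whether the following would be enough to handle one case. Sorry, I'm very new to Python. I know this is a list comprehension, so this seems right: li = [w for w in li if w[0].isalpha()]

Then you would do def(li): return [w for w in li if w[0].isalpha()] and you're done.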
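
The reason the `string.punctuation` test misses the longer list is that `i not in punctuation` is a substring check against one long string of punctuation characters, so a multi-character token like `''` or `-lrb-` is not found in it and stays in the result. A possible variant, sketched under assumptions: drop a token when every character in it is punctuation, and treat the bracket placeholders as an explicit set. The names `PLACEHOLDERS` and `drop_punctuation_tokens` are hypothetical, and the set lists only the two placeholders that appear in the question.

```python
from string import punctuation

# Assumption: only the two bracket placeholders shown in the question;
# the wrapper may emit others.
PLACEHOLDERS = {"-lrb-", "-rrb-"}


def drop_punctuation_tokens(tokens):
    """Drop tokens made up solely of punctuation, plus known placeholders."""
    return [t for t in tokens
            if t.lower() not in PLACEHOLDERS
            and not all(ch in punctuation for ch in t)]


messy = ["round", "up", "the", "''", "blonde", "bombshell", "\\",
         "a", "all", "-lrb-", "well", "almost", "all", "-rrb-"]
result = drop_punctuation_tokens(messy)
print(result)
```

Unlike the first-character test, this keeps tokens that merely start with punctuation but contain letters, so which rule fits depends on the tokens your wrapper produces.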