Python .split() a string into tokens on whitespace, but do not split certain specific strings

python regex string

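I want to split a sentence into tokens, but leave 2 specific strings intact, and also ignore whitespace.

For example:

GNI per capita ; PPP -LRB- US dollar -RRB- in LOCATION_SLOT was last measured at NUMBER_SLOT in 2011 , according to the World Bank .

should be split into:

[GNI, per, capita, ;, PPP, -, LRB, -, US, dollar, -, RRB, -, in, LOCATION_SLOT, was, last, measured, at, NUMBER_SLOT, in, 2011, ,, according, to, the, World, Bank, .]

I do not want LOCATION_SLOT or NUMBER_SLOT to be split, e.g. the former into [LOCATION, _, SLOT]. But I do want punctuation to be kept as tokens.

My current function keeps only letter-based words, and it removes numbers and characters such as ; , : etc. I do not want it to remove these: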
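import re
from nltk.corpus import stopwords

def sentence_to_words(sentence, remove_stopwords=False):
    # Inside [...] the "|" and the slot names are read as individual characters,
    # so this replaces everything except letters, "|", "_" and spaces.
    letters_only = re.sub("[^a-zA-Z| LOCATION_SLOT | NUMBER_SLOT]", " ", sentence)
    words = letters_only.lower().split()
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    return words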

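This produces the following tokens:
gni per capita ppp lrb us dollar rrb in location_slot was last measured at number_slot in according to the world bank

You can simply use split: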
>>> x = "GNI per capita ; PPP -LRB- US dollar -RRB- in LOCATION_SLOT was last measured at NUMBER_SLOT in 2011 , according to the World Bank ."
>>>
>>> x.split()
['GNI', 'per', 'capita', ';', 'PPP', '-LRB-', 'US', 'dollar', '-RRB-', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']
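
As a small illustration of why plain `split()` is enough here, the sketch below compares it with `split(" ")`. The sample string, with its doubled space and tab, is made up for the demonstration and is not from the question.

```python
# Hypothetical sample with a double space and a tab, to show the difference.
sentence = "GNI per capita ;  PPP\t-LRB- US dollar"

# With no arguments, split() treats any run of whitespace as one separator
# and never produces empty strings.
default_tokens = sentence.split()

# With an explicit " " separator, consecutive spaces yield empty strings
# and the tab is not treated as a separator.
explicit_tokens = sentence.split(" ")

# Underscores are not whitespace, so slot names stay in one piece.
slot_tokens = "in LOCATION_SLOT was".split()

print(default_tokens)
print(explicit_tokens)
print(slot_tokens)
```

Because the input in the question is already space-separated token by token, the no-argument form is usually the safer choice.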

To remove the dashes around -LRB-, do the following:
>>> z = [y.strip('-') for y in x.split()]
>>> z
['GNI', 'per', 'capita', ';', 'PPP', 'LRB', 'US', 'dollar', 'RRB', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']
>>> 

If you want to keep the dashes as separate tokens, do the following:
>>> y = []
>>> for item in x.split():
...   if item.startswith('-') and item.endswith('-'):
...     y.append('-')
...     y.append(item.strip('-'))
...     y.append('-')
...   else:
...     y.append(item)
... 
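
The same idea can be wrapped in a reusable function. This is only a sketch: the name `split_dash_wrapped` and the length check are assumptions added here, the latter so that a lone `-` token is passed through unchanged.

```python
def split_dash_wrapped(tokens):
    """Turn tokens such as '-LRB-' into '-', 'LRB', '-'; leave others as is."""
    result = []
    for token in tokens:
        # Require something between the dashes so a lone '-' or '--' is untouched.
        if len(token) > 2 and token.startswith("-") and token.endswith("-"):
            result.extend(["-", token[1:-1], "-"])
        else:
            result.append(token)
    return result


sentence = "PPP -LRB- US dollar -RRB- in LOCATION_SLOT"
tokens = split_dash_wrapped(sentence.split())
print(tokens)
```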

You can use re.findall and strip whitespace from the start and end of each match (here line holds the sentence):
>>> import re
>>> [tok.strip() for tok in re.findall(r'\s*(\w+|\W+)', line)]
['GNI', 'per', 'capita', ';', 'PPP', '-', 'LRB', '-', 'US', 'dollar', '-', 'RRB', '-', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']
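
One caveat with `\W+` is that it groups adjacent non-word characters, so input like `a;,b` would yield `;,` as a single token. A possible variant, sketched below, matches one punctuation character at a time instead. The names `TOKEN_PATTERN`, `tokenize` and the `stop_words` parameter are hypothetical, and the stop-word set is passed in by the caller rather than taken from any particular library.

```python
import re

# \w+ keeps runs of letters, digits and underscores together;
# [^\w\s] matches exactly one non-word, non-whitespace character at a time.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence, stop_words=None):
    """Split a sentence into word and punctuation tokens, keeping their case."""
    tokens = TOKEN_PATTERN.findall(sentence)
    if stop_words:
        # Stop words are compared case-insensitively; other tokens are kept.
        tokens = [t for t in tokens if t.lower() not in stop_words]
    return tokens


sentence = ("GNI per capita ; PPP -LRB- US dollar -RRB- in LOCATION_SLOT "
            "was last measured at NUMBER_SLOT in 2011 , according to the World Bank .")
tokens = tokenize(sentence)
print(tokens)
```

Since whitespace is never matched by either alternative, no `strip()` call is needed afterwards.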

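Regex explanation:
> \w matches word character [A-Za-z0-9_].
> \W is negation of \w. i.e. it matches anything except word character.

On the one hand you say there is a space before and after ; but you have not accounted for the spaces around -RRB-.
Sorry, I should clarify: the spaces I split on are not really spaces, since they are not between the , separators.
Ah, how silly of me. I didn't know split() would not split on _.
But the problem is that here -RRB- is not split into 3 parts. How can I include that?
The second one will remove the -, which is not what the OP wants.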
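
To make the `\w` and `\W` behaviour concrete, and to show why `LOCATION_SLOT` is never split at its underscore, here is a small check. The sample strings are made up for illustration. Note that in Python 3, `\w` on `str` patterns also matches non-ASCII letters and digits unless the `re.ASCII` flag is given, so `[A-Za-z0-9_]` describes the ASCII case.

```python
import re

sample = "in LOCATION_SLOT , 2011"

# \w matches letters, digits and the underscore, so the slot name is one match.
word_runs = re.findall(r"\w+", sample)

# \W+ matches the runs in between: here whitespace and the comma.
non_word_runs = re.findall(r"\W+", sample)

# By default \w is Unicode-aware; re.ASCII restricts it to [A-Za-z0-9_].
unicode_match = re.findall(r"\w+", "caf\u00e9")
ascii_match = re.findall(r"\w+", "caf\u00e9", flags=re.ASCII)

print(word_runs)
print(non_word_runs)
print(unicode_match, ascii_match)
```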