Python: How to tokenize compound words?

Tags: python, python-3.x, regex, tokenize

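Given an original list element such as ["southnorth"], I would like to insert a space based on the list ["south", "north", "island"]. As long as the list that the tokenization is based on contains ['south', 'north'], the list should change from ['southnorth'] to ['south', 'north'].

However, if the token list is ["south", "island"], the list ["southnorth"] should be kept together as is.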

My idea is as follows:

import re

list1 = ['southnorth']
# list2 = ['south', 'north', 'island']
list2 = ['south', 'island']

str1 = " ".join(list1)
str2 = " ".join(list2)

# Build the alternation of candidate words to apply as a regex
list_compound = sorted(list1 + list2, key=len)
alternators = '|'.join(map(re.escape, list_compound))
regex = re.compile(r'({})'.format(alternators), re.IGNORECASE)

# Append a space after every match
str1_split = regex.sub(r'\1 ', str1)
str2_split = regex.sub(r'\1 ', str2)

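However, the approach above fails because I need to enforce the order of the sequence. For example, to split ["southnorth"], I need to make sure the other list contains ["south", "north"]. Otherwise, the word should be left as is.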
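
To make the failure concrete, here is a minimal sketch of what a plain alternation does when only some of the parts are known. The names `tokens`, `pattern` and `result` are illustrative and not taken from the code above. The regex splits after "south" even though "north" is not in the list, which is exactly the case that should stay intact:

```python
import re

# Only "south" and "island" are known; "north" is deliberately missing.
tokens = ['south', 'island']
pattern = re.compile('({})'.format('|'.join(map(re.escape, tokens))), re.IGNORECASE)

# Every match gets a trailing space, regardless of what follows it.
result = pattern.sub(r'\1 ', 'southnorth').strip()
print(result)  # south north
```

A substitution like this has no way of knowing whether the remainder of the word is also a token, so a separate check on the whole word is needed.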

Not the prettiest solution, and probably not the most performant, but here is a simple brute-force attempt:

def tokenize(word, tokens):
    tokenized_word = word
    for t in tokens:
        tokenized_word = tokenized_word.replace(t, f"{t} ").strip()

    for w in tokenized_word.split(" "):
        if w.strip() not in tokens:
            return word

    return tokenized_word


tokens = ["south", "north", "island"]

assert tokenize("south", tokens) == "south"
assert tokenize("southnorth", tokens) == "south north"
assert tokenize("islandsouthnorth", tokens) == "island south north"
assert tokenize("southwestnorth", tokens) == "southwestnorth"
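
As an alternative to repeated `str.replace`, the same rule (split only when the whole word is a concatenation of known tokens) can be sketched as a memoized segmentation search. This is a hedged sketch and not part of the original answer; `split_compound` and `segment` are hypothetical names, and it returns a list of parts rather than a space-joined string:

```python
from functools import lru_cache


def split_compound(word, tokens):
    """Return the token parts if `word` is exactly a concatenation of
    tokens; otherwise return [word] unchanged. (Hypothetical helper.)"""
    vocab = set(tokens)

    @lru_cache(maxsize=None)
    def segment(start):
        # An empty remainder means everything before it was covered.
        if start == len(word):
            return ()
        # Try every known token that begins at `start`.
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in vocab:
                rest = segment(end)
                if rest is not None:
                    return (piece,) + rest
        # No token sequence covers word[start:].
        return None

    parts = segment(0)
    return list(parts) if parts else [word]


print(split_compound("southnorth", ["south", "north", "island"]))
print(split_compound("southnorth", ["south", "island"]))
```

Because the search backtracks, it does not depend on the order of `tokens`, and a token that is a prefix of another token does not block a valid split.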

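Can there be more than two parts in the combined string?
What if the string is "southwestnorth"? Would you want the output to be split, or left as "southwestnorth"?
I would keep "southwestnorth" in its original form, because the only way to tokenize it is if "south" and "north" are consecutive.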