How do I split a string and match its substrings against a list of substrings? - python
I need to split a string in all possible ways without changing the order of its characters. I understand that this task can be seen as tokenization or lemmatization in NLP, but I am trying to approach it from a simpler, more robust, pure string-search perspective. Given an input string str1 and a dictionary of words:
Task 1: How do I generate all possible substrings, so as to obtain:
all_possible_substrings = [['f','iretrainstation'],
                           ['fi','retrainstation'], ...
                           ['firetrainstatio','n'],
                           ['f','i','retrainstation'], ... , ...
                           ['fire','train','station'], ... , ...
                           ['fire','tr','a','instation'], ... , ...
                           ['fire','tr','a','in','station'], ... , ...
                           ['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']]
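Task 2: Then, from all_possible_substrings, how do I check which list of substrings is the correct output, namely the one whose elements are all in the dictionary? The desired output is the list of dictionary substrings that matches the most characters from left to right. It should satisfy:
"".join(desire_substring_list) == str1 and \
len([i for i in desire_substring_list if i in dictionary]) == len(desire_substring_list)
# (Let's assume the condition above can be true for any input string, since my English
# language dictionary is very big and all my strings are human language,
# just written without spaces.)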
Desired output:
'fire','train','station'
What have I tried?
For task 1, I did the following, but I know it does not give me all possible space insertions:
all_possible_substrings.append(" ".join(str1))
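To see why this falls short: `" ".join(str1)` produces exactly one split, the one made of single characters. A string of length n has n - 1 gaps between characters, and each gap is either cut or not, so there are 2^(n-1) splits in total (counting the unsplit string). A minimal sketch, where `single_split` is just an illustrative name:

```python
str1 = "firetrainstation"

# " ".join(str1) puts a space between every pair of adjacent characters,
# so splitting it back gives only the all-single-characters split.
single_split = " ".join(str1).split(" ")
print(single_split)

# Each of the len(str1) - 1 gaps is either cut or left alone.
total_splits = 2 ** (len(str1) - 1)
print(total_splits)  # 32768
```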
I also wrote the following, but it only handles task 2:
import re

seed = ['train', 'station', 'fire', 'a', 'trainer', 'in']
str1 = "firetrainstation"
all_possible_string = [['f','iretrainstation'],
                       ['fi','retrainstation'],
                       ['firetrainstatio','n'],
                       ['f','i','retrainstation'],
                       ['fire','train','station'],
                       ['fire','tr','a','instation'],
                       ['fire','tr','a','in','station'],
                       ['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']]
# Match only whole space-separated tokens that appear in seed
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
highest_match = ""
for parts in all_possible_string:
    x = pattern.findall(" ".join(parts))
    # Keep the split only if its dictionary tokens rebuild the whole string
    if "".join(x) == str1 and len([w for w in x if w in seed]) == len(x):
        print(" ".join(x))
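The loop above only filters a hand-written list of candidates. One possible alternative, which is not taken from the question and is offered only as a sketch, is to skip enumerating every split and build the dictionary-only segmentations directly with a memoized recursion (often called the "word break" approach). The function name `segmentations` and the helper `split_from` are hypothetical.

```python
from functools import lru_cache

def segmentations(text, words):
    """Return every way to write `text` as a concatenation of entries in `words`."""
    vocab = frozenset(words)
    max_len = max(map(len, vocab), default=0)

    @lru_cache(maxsize=None)
    def split_from(start):
        # All segmentations of text[start:], each as a tuple of words.
        if start == len(text):
            return ((),)  # one way to segment the empty remainder
        results = []
        # No dictionary word is longer than max_len, so stop there.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            piece = text[start:end]
            if piece in vocab:
                for rest in split_from(end):
                    results.append((piece,) + rest)
        return tuple(results)

    return [list(s) for s in split_from(0)]

seed = ['train', 'station', 'fire', 'a', 'trainer', 'in']
print(segmentations("firetrainstation", seed))  # [['fire', 'train', 'station']]
```

Because only prefixes that are dictionary words are ever extended, this avoids generating the full set of splits when just the valid ones are needed.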
For the first part, you can write a recursive generator along these lines:
>>> def all_substr(string):
...     for i in range(len(string)):
...         if i == len(string) - 1:
...             yield string  # the whole remaining string as a single piece
...         first_part = string[0:i+1]
...         second_part = string[i+1:]
...         for j in all_substr(second_part):
...             yield ','.join([first_part, j])
...
>>> for x in all_substr('apple'):
...     print(x)
...
a,p,p,l,e
a,p,p,le
a,p,pl,e
a,p,ple
a,pp,l,e
a,pp,le
a,ppl,e
a,pple
ap,p,l,e
ap,p,le
ap,pl,e
ap,ple
app,l,e
app,le
appl,e
apple
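The generator above yields comma-joined strings. If lists of parts are more convenient, as in the question's all_possible_substrings, one possible variant (an assumption, not part of the answer above) chooses cut positions with `itertools.combinations`. The name `all_splits` is hypothetical.

```python
from itertools import combinations

def all_splits(string):
    """Yield every split of `string` as a list of non-empty parts, whole string included."""
    n = len(string)
    for k in range(n):  # k = number of cuts, from 0 to n - 1
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield [string[a:b] for a, b in zip(bounds, bounds[1:])]

seed = {'train', 'station', 'fire', 'a', 'trainer', 'in'}
# Task 2 as a filter: keep only the splits whose parts are all in the dictionary.
matches = [s for s in all_splits("firetrainstation") if all(p in seed for p in s)]
print(matches)  # [['fire', 'train', 'station']]
```

The number of splits doubles with each extra character, so this brute-force filter is only practical for short strings.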
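Note that your dictionary is actually a list. Also, I'm pretty sure you need to explain more. Why is 'foo', 'bar', 'bar', 'str' the desired output?
Updated according to the desired output. Is it clearer now?
How do you get str1 from the dictionary?
I may be misunderstanding, but wouldn't "the list of dictionary substrings that matches the most characters from left to right" always be the string minus its last letter? (Assuming you don't want the whole string.)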