Python: find the start and end position of a word in a string, given the word's index in a tag list
I have a sentence
str = 'cold weather gives me cold'
and a list
tag = ['O','O','O','O','disease']
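This indicates that the fifth word in the sentence is a disease. Now I need to get the start and end positions of that fifth word.
If I simply search the string for 'cold', I get the start position of the first occurrence of 'cold', not the fifth word.
The following prints the start and end positions of a given word, assuming the words are separated by single spaces: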
str = 'cold weather gives me cold'
word_idx = 4 # index of the word we are looking for
split_str = str.split(' ')
print(split_str[word_idx]) # outputs 'cold'
start_pos = 0
for i in range(word_idx):
    start_pos += len(split_str[i]) + 1 # add one for the space after each word
end_pos = start_pos + len(split_str[word_idx]) - 1
print(start_pos) # prints 22
print(end_pos) # prints 25
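The loop above can be wrapped in a small helper and compared with a plain substring search. This is only a sketch: the name word_span is made up for illustration, and it keeps the same assumption that words are separated by exactly one space.

```python
def word_span(sentence, word_idx):
    """Return (start, end) of the word at word_idx, with end inclusive.

    Hypothetical helper; assumes words are separated by single spaces.
    """
    words = sentence.split(' ')
    # Every earlier word contributes its length plus one separating space.
    start = sum(len(w) + 1 for w in words[:word_idx])
    end = start + len(words[word_idx]) - 1
    return start, end


sentence = 'cold weather gives me cold'
print(sentence.find('cold'))    # 0, the first occurrence rather than the fifth word
print(word_span(sentence, 4))   # (22, 25)
print(sentence[22:25 + 1])      # cold
```

Because the end position is inclusive, slicing needs end + 1.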
You can simply split the string and join it back together, although that is a bit awkward:
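string = 'cold weather gives me cold'
string_list = string.split(" ")
word_start = len(" ".join(string_list[:4])) + 1 # 22
word_end = word_start + len(string_list[4]) # 26, exclusive end, so string[word_start:word_end] is 'cold'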
Using itertools and re:
import re
from itertools import accumulate

def find_index(string, n):
    words = string.split()
    len_word = len(words[n])
    # Split on whitespace but keep the separators, so the running totals are string offsets.
    end_index = list(accumulate(map(len, re.split(r'(\s)', string))))[::2][n]
    return end_index - len_word, end_index - 1
Usage:
find_index('cold weather gives me cold', 4) #5th word means 4 in indexing
Output:
(22, 25)
This should do it:
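def get(str, target_index):
    # Sum each earlier word plus its trailing space; this also gives 0 for the first word.
    start = sum(len(word) + 1 for word in str.split(" ")[:target_index])
    end = start + len(str.replace('.', '').split(' ')[target_index])
    return (start, end)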
str = 'cold weather gives me cold.'
tag = ['O','O','O','O','disease']
start,end = get(str,tag.index('disease'))
print(start,end,str[start:end]) # outputs 22 26 cold
str = 'cold weather gives me cold'
tag = ['O','O','O','O','disease']
start,end = get(str,tag.index('disease'))
print(start,end,str[start:end]) # outputs 22 26 cold
str = 'cold weather gives me cold and cough'
tag = ['O','O','O','O','disease']
start,end = get(str,tag.index('disease'))
print(start,end,str[start:end]) # outputs 22 26 cold
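The get function above only accounts for a full stop. As a rough sketch of how other trailing punctuation might be handled, the hypothetical helper below strips ASCII punctuation from the end of the target word using string.punctuation. It still assumes single-space separators and does nothing about leading punctuation such as an opening quote.

```python
import string


def span_without_trailing_punct(sentence, word_idx):
    """Return (start, end) with end exclusive, ignoring trailing punctuation.

    Hypothetical helper; assumes words are separated by single spaces.
    """
    words = sentence.split(' ')
    start = sum(len(w) + 1 for w in words[:word_idx])
    # Drop characters such as '.', '!' or ',' from the right-hand side only.
    core = words[word_idx].rstrip(string.punctuation)
    return start, start + len(core)


for text in ('cold weather gives me cold.', 'cold weather gives me cold!'):
    start, end = span_without_trailing_punct(text, 4)
    print(start, end, text[start:end])  # 22 26 cold
```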
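Hope that helps.
First look up the index of the disease in the tag list, then get the disease name from the sentence, then compute the start and end indices: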
strData = 'cold weather gives me cold'
tag = ['O','O','O','O','disease']
diseaseIndex = tag.index('disease')
diseaseName = strData.split()[diseaseIndex]
print(diseaseName)
diseaseNameStartIndex = sum(len(word) for (index, word) in enumerate(strData.split()) if index< diseaseIndex ) + diseaseIndex
diseaseNameEndIndex = diseaseNameStartIndex + len(diseaseName) -1
print("diseaseNameStartIndex = ",diseaseNameStartIndex)
print("diseaseNameEndIndex = ",diseaseNameEndIndex)
Try this function:
def find_index(s, n):
    words = s.split()
    start = 0
    for word in words[:n]:
        start += len(word) + 1  # each earlier word plus one separating space
    return start, start + len(words[n]) - 1
print(find_index('cold weather gives me cold', 4))
Output:
(22, 25)
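If you have to do this on a long line, it is better to use an iterator that yields the start and end positions of the words via re.finditer, and then pick the one you need with islice: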
>>> import re
>>> from itertools import islice
>>> string = 'cold weather gives me cold'
>>> word_pos = ((m.group(), (m.start(), m.end() - 1)) for m in re.finditer(r'\S+', string))
>>>
>>> n = 4
>>> next(islice(word_pos, n, n + 1))
('cold', (22, 25))
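One practical difference between the regex approach and the split(' ') arithmetic is how they behave when the spacing is irregular. The sketch below, with a made-up helper name word_spans, collects every span into a list and shows the two approaches disagreeing on a sentence that contains a double space.

```python
import re


def word_spans(text):
    """Return (word, start, end_inclusive) for each whitespace-delimited word.

    Hypothetical helper built on re.finditer.
    """
    return [(m.group(), m.start(), m.end() - 1) for m in re.finditer(r'\S+', text)]


spaced = 'cold  weather gives me cold'  # note the double space after the first word

# split(' ') produces an empty string for the extra space, shifting the indices.
print(spaced.split(' ')[4])   # me
# finditer skips over any run of whitespace, so the fifth word is still found.
print(word_spans(spaced)[4])  # ('cold', 23, 26)
```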
You can use re with a list comprehension:
import re
s = 'cold weather gives me cold'
new_s = re.findall(r'\w+|\s+', s)  # alternating words and runs of whitespace
l = [(a, sum(map(len, new_s[:i]))) for i, a in enumerate(new_s) if not a.isspace()]  # (word, start offset) pairs
Output:
[[22, 26]]
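The snippet above stops once it has the (word, start offset) pairs and does not show how they are combined with the tag list. One possible way to get from those pairs to [[22, 26]] is sketched below. The final filtering step is an assumption, not something taken from the answer, and the names tokens, starts and result are illustrative.

```python
import re

s = 'cold weather gives me cold'
tag = ['O', 'O', 'O', 'O', 'disease']

# Alternating words and whitespace runs; punctuation is dropped by this pattern,
# which would shift the offsets for sentences that contain any.
tokens = re.findall(r'\w+|\s+', s)
starts = [(tok, sum(map(len, tokens[:i])))
          for i, tok in enumerate(tokens) if not tok.isspace()]

# Assumed step: keep [start, exclusive end] for every word whose tag is not 'O'.
result = [[start, start + len(word)]
          for (word, start), t in zip(starts, tag) if t != 'O']
print(result)  # [[22, 26]]
```

Unlike tag.index('disease'), this keeps every tagged word rather than only the first one.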
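So you basically need the fifth word of the string? Or do you need its index? Do you want the first and last characters of 'cold'?
I only want the last 'cold', the fifth word, and its indices in the string.
It looks like you just need
str.split()[4]
By the way, I know this is only an example, but the variable name 'str' shadows the built-in of the same name.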