Python: get a character index from a word index in a text


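Given the index of a word in a text, I need to get the corresponding character index. For example, in the following text:

"The cat called other cats."

the index of the word "cat" is 1. I need the index of the first character of "cat", i.e. "c", which would be 4. I don't know whether it's relevant, but I'm using Python NLTK to get the words. The only approach I can think of right now is:

 - Get the first character, find the number of words in this piece of text
 - Get the first two characters, find the number of words in this piece of text
 - Get the first three characters, find the number of words in this piece of text
 Repeat until we get to the required word.

But this would be very inefficient. Any suggestions would be appreciated.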

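import re

def char_index(sentence, word_index):
    # The capturing group keeps the whitespace runs in the result, so words
    # sit at even positions and separators at odd positions.
    parts = re.split(r'(\s+)', sentence)
    # Everything before the requested word adds up to its start offset.
    return len(''.join(parts[:word_index * 2]))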

Use enumerate().
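That hint comes without code. One possible reading, as a minimal sketch: walk the regex matches with enumerate() and stop at the requested word. It assumes a word is a run of \w characters, and the name word_start is hypothetical, not taken from the answer.

```python
import re

def word_start(text, word_index):
    """Return the character offset of the word at word_index, or None."""
    # Assumption: a "word" is a maximal run of \w characters.
    for i, match in enumerate(re.finditer(r'\w+', text)):
        if i == word_index:
            return match.start()
    return None  # word_index is past the last word

print(word_start("The cat called other cats.", 1))  # prints 4
```

The loop stops at the first matching position, so the text is scanned at most once.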


You can use a dict here:

>>> import re
>>> r = re.compile(r'\w+')
>>> text = "The cat called other cats."
>>> dic = { i :(m.start(0), m.group(0)) for i, m in enumerate(r.finditer(text))}
>>> dic
{0: (0, 'The'), 1: (4, 'cat'), 2: (8, 'called'), 3: (15, 'other'), 4: (21, 'cats')}
>>> def char_index(char, word_ind):
...     start, word = dic[word_ind]
...     ind = word.find(char)
...     if ind != -1:
...         return start + ind
...
>>> char_index('c',1)
4
>>> char_index('c',2)
8
>>> char_index('c',3)
>>> char_index('c',4)
21


What happens when the first character of the word, "b" in this example, is used earlier in the sentence?

It's fine as long as "been" isn't used earlier in the sentence.

Ah, right, you're searching for the word. So what happens when the target word appears earlier in the sentence?

This is wrong: if the first character of the word appears earlier in the sentence, it returns that index.

@JeremyBentham, I noticed. I'm fixing it :)

I think you went a step further than the OP asked. I don't think the OP wants to find the index of a specific character given a word index, but rather the index of the first character each time. Still, a general solution like this is probably better. +1.

Thanks for the idea. But I can't get the words just by splitting on whitespace. I'm using TreebankWordTokenizer.
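When the words come from a tokenizer rather than from splitting on whitespace, one option is to align the token list back to the original text. The following is a minimal sketch, not the tokenizer's own API: token_offsets is a hypothetical helper, and the token list is written by hand to stand in for tokenizer output. It assumes every token appears verbatim in the text.

```python
def token_offsets(text, tokens):
    """Map each token to its start offset in text, scanning left to right."""
    offsets = []
    cursor = 0  # resume after the previous token so repeated words resolve in order
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        offsets.append(start)
        cursor = start + len(token)
    return offsets

text = "The cat called other cats."
# Hand-written stand-in for tokenizer output; a real tokenizer may split differently.
tokens = ["The", "cat", "called", "other", "cats", "."]
print(token_offsets(text, tokens))  # [0, 4, 8, 15, 21, 25]
```

Because the search always resumes after the previous token, a word that occurs earlier in the sentence does not produce the wrong offset. A tokenizer that rewrites characters (for example by normalizing quotation marks) breaks the verbatim assumption, and the ValueError makes that visible. Some NLTK tokenizers also offer a span_tokenize method that yields offsets directly; whether it is available depends on the NLTK version in use.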
