Python: get the index of a word in a sentence

python python-2.7 python-3.x

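I have a parallel text. Each line contains a sentence in the source language (src) and in the target language (tgt). In both src and tgt, some expressions are enclosed in square brackets. The file looks like this: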

parallel(src('he is a [good man]'),tgt('lui è un [buon uomo]')). 

parallel(src('she is a [good woman]'),tgt('lei è una donna buona')). 

parallel(src('he is a beautiful man]'),tgt('lei è una bella donna')).

So some lines have expressions between brackets, and other lines do not.

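Next to each line, I want to print the bracketed expressions together with the position of each expression's first word in the src and tgt sentences. I tried this code: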

import re

with open(file) as fi:  # `file` holds the path to the parallel text
    for line in fi:
        src = line[12:line.index('tgt')]
        tgt = line[line.index('tgt'):]
        srcs = src.split()
        tgts = tgt.split()
        ss = ""
        tt = ""
        match = re.search(r"\[(.*?)\]", src)
        if match:
            ss = match.group(1)
        match = re.search(r"\[(.*?)\]", tgt)
        if match:
            tt = match.group(1)

        # Raises IndexError when ss or tt is empty (no bracketed expression)
        print(line, [[ss, ':', srcs.index('[' + ss.split()[0])], [tt, ':', tgts.index('[' + tt.split()[0])]])

It works for lines that have expressions between brackets, but for lines without any it fails with "IndexError: list index out of range".

The expected output is:

parallel(src('he is a [good man]'),tgt('lui è un [buon uomo]')). [[good man:3][buon uomo:3]

parallel(src('she is a [good woman]'),tgt('lei è una donna buona')).[[good woman:3][]] 

parallel(src('he is a beautiful man]'),tgt('lei è una bella donna')). [[]:[]]

Can anyone help?

The error occurs because ss.split() produces a list of zero words when ss is empty, so taking element [0] fails. The simple fix is:

if not ss or not tt:
    # At least one side has no bracketed expression
    print(line, "[[]:[]]")
else:
    print(line, [[ss, ':', srcs.index('[' + ss.split()[0])], [tt, ':', tgts.index('[' + tt.split()[0])]])
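The failure can be reproduced without the file. This is a minimal sketch: when no bracketed expression is found, ss stays an empty string, splitting it gives an empty list, and indexing [0] raises the IndexError. The helper name first_word is hypothetical and only illustrates one way to guard the lookup.

```python
def first_word(phrase):
    """Return the first whitespace-separated word, or None if there is none."""
    words = phrase.split()
    return words[0] if words else None


# Splitting an empty string yields an empty list ...
empty_words = "".split()

# ... so indexing its first element raises IndexError.
try:
    "".split()[0]
except IndexError as exc:
    error_message = str(exc)

print(empty_words, error_message)
print(first_word("good man"), first_word(""))
```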

The more thorough fix is to do it properly, that is:

source = '[]'
match = re.search(r"\[(.*?)\]", src)
if match:
    source_phrase = match.group(1)
    tmp = src[:match.start()]
    source_position = len(tmp.split())
    source = "[{}:{}]".format(source_phrase, source_position)

target = '[]'
match = re.search(r"\[(.*?)\]", tgt)
if match:
    target_phrase = match.group(1)
    tmp = tgt[:match.start()]
    target_position = len(tmp.split())
    target = "[{}:{}]".format(target_phrase, target_position)

print(line, "[{}: {}]".format(source, target))
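As a self-contained sketch of the same approach, the two near-identical blocks can be folded into one helper that is applied to src and tgt separately. The names annotate and annotate_line, and the in-memory lines list standing in for the file, are assumptions made for illustration; the slicing at offset 12 and at 'tgt' is taken from the question's code.

```python
import re

BRACKETED = re.compile(r"\[(.*?)\]")


def annotate(text):
    """Return '[phrase:position]' for the first bracketed phrase, or '[]'."""
    match = BRACKETED.search(text)
    if not match:
        return "[]"
    # The number of words before the opening bracket is the 0-based position.
    position = len(text[:match.start()].split())
    return "[{}:{}]".format(match.group(1), position)


def annotate_line(line):
    """Annotate the src and tgt parts of one line independently."""
    split_at = line.index("tgt")
    src = line[12:split_at]
    tgt = line[split_at:]
    return "[{}: {}]".format(annotate(src), annotate(tgt))


# Sample lines from the question, used here instead of reading a file.
lines = [
    "parallel(src('he is a [good man]'),tgt('lui è un [buon uomo]')).",
    "parallel(src('she is a [good woman]'),tgt('lei è una donna buona')).",
    "parallel(src('he is a beautiful man]'),tgt('lei è una bella donna')).",
]

for line in lines:
    print(line, annotate_line(line))
```

Because each side is handled on its own, a line with an expression in only src or only tgt still reports that expression, and the other side is shown as empty brackets.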

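Include the full error message in the question.

@Antti Haapala This is the new output: parallel(src('he is a [good man]'),tgt('lui è un [buon uomo]')). parallel(src('she is a [good woman]'),tgt('lei è una donna buona')). parallel(src('he is a beautiful man]'),tgt('lei è una bella donna')). [[]:[] But the expected output is parallel(src('he is a [good man]'),tgt('lui è un [buon uomo]')). [good man:3][buon uomo:3] parallel(src('she is a [good woman]'),tgt('lei è una donna buona')). [good woman:3][] parallel(src('he is a beautiful man]'),tgt('lei è una bella donna')). []:[]

I copied the code above exactly from your example. Your problem is the IndexError, not the output format.

@Antti Haapala Your first code works, but for sentences that have an expression in only src or only tgt it gives empty brackets, whereas in that case it should give the expression.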