How to find a multi-word string in a string and label it with Python?


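For example, given the sentence

"The corporate balance sheets data are available on an annual basis"

I need to label "corporate balance sheets", which is a substring found in that sentence.

So the pattern I need to find is:

"corporate balance sheets"

Given the string:

"The corporate balance sheets data are available on an annual basis".

The output label sequence I want is:

[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

I need to search a large set of sentences (more than 2 GB) for a large set of patterns. I don't know how to implement this efficiently in Python. Can someone suggest a good algorithm?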

Using list comprehensions and split:

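import re

search_word = 'corporate balance sheets'
sentence = "The corporate balance sheets data are available on an annual basis"

# Escape the pattern so that regex metacharacters in it are matched literally.
p = re.compile(re.escape(search_word))

# One label per word of the pattern.
lst = [1] * len(search_word.split())
# Replace each match with a single placeholder token, then label every token.
vect = [lst if item == '__match_word' else 0
        for item in p.sub('__match_word', sentence).split()]
# Wrap the plain 0 labels in lists so that everything can be flattened.
vectlstoflst = [[vec] if isinstance(vec, int) else vec for vec in vect]
flattened = [val for sublist in vectlstoflst for val in sublist]

print(flattened)
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]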


For a second, twelve-word sentence containing the same pattern, the output is:

[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
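The placeholder approach above depends on the replaced text ending up as a token of its own. A possible variation, sketched here under the assumption that tokens are whitespace-separated, maps the character span of each regex match back to token indices instead. The function name `label_with_regex` is hypothetical and not part of the original answer.

```python
import re


def label_with_regex(pattern, sentence):
    """Return one 0/1 label per whitespace-separated token of `sentence`."""
    # Character span (start, end) of every token, in order.
    token_spans = [m.span() for m in re.finditer(r"\S+", sentence)]
    labels = [0] * len(token_spans)

    # Allow any run of whitespace between the pattern words, and require the
    # match to start and end at token edges so partial words are not labeled.
    body = r"\s+".join(re.escape(word) for word in pattern.split())
    regex = re.compile(r"(?<!\S)" + body + r"(?!\S)")

    for match in regex.finditer(sentence):
        for i, (start, end) in enumerate(token_spans):
            # A token is labeled when it lies entirely inside the match.
            if start >= match.start() and end <= match.end():
                labels[i] = 1
    return labels


print(label_with_regex(
    "corporate balance sheets",
    "The corporate balance sheets data are available on an annual basis",
))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

Note that `re.finditer` only returns non-overlapping matches, so overlapping occurrences of a pattern are not all labeled by this sketch.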

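Since all the words of the substring must match, you can check this with all() while iterating over the sentence and update the corresponding indices: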

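def encode(sub, sent):
    subwords, sentwords = sub.split(), sent.split()
    n = len(subwords)
    res = [0] * len(sentwords)
    # Slide a window of n words over the sentence. Iterating over range()
    # avoids the empty slice that sentwords[:-n + 1] yields when n == 1.
    for i in range(len(sentwords) - n + 1):
        if all(x == y for x, y in zip(subwords, sentwords[i:i + n])):
            # Every word in the window matches: label the whole window.
            for j in range(n):
                res[i + j] = 1
    return res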


sub = "corporate balance sheets"
sent = "The corporate balance sheets data are available on an annual basis"
print(encode(sub, sent))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

sent = "The corporate balance data are available on an annual basis sheets"
print(encode(sub, sent))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

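Doesn't this fail when a word from search_word appears without a full match? For example, if "sheets" stands alone at the end of the sentence and is not matched together with "corporate balance": for the sentence "corporate balance data is available on a annual based sheets", the output is [0,1,1,0,0,0,0,0,0,1].

@rokss, can you give an example where it doesn't work?

In the previous example, the trailing "sheets" should not be labeled. The algorithm should only label whole matched patterns, from head to tail.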
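The question also asks about many patterns and more than 2 GB of sentences, which the snippets above do not address directly. One possible approach, sketched here as an assumption rather than as anything proposed in the answers, is to store all patterns in a word-level trie so that each sentence is scanned once regardless of the number of patterns. The names `build_trie`, `label_many` and `label_lines` are hypothetical.

```python
def build_trie(patterns):
    """Build a word-level trie; the key None marks the end of a pattern."""
    root = {}
    for pattern in patterns:
        node = root
        for word in pattern.split():
            node = node.setdefault(word, {})
        node[None] = True
    return root


def label_many(trie, sentence):
    """Label every word that belongs to a complete match of any pattern."""
    words = sentence.split()
    labels = [0] * len(words)
    for i in range(len(words)):
        node = trie
        # Walk the trie for as long as the words starting at i keep matching.
        for j in range(i, len(words)):
            node = node.get(words[j])
            if node is None:
                break
            if None in node:
                # words[i..j] is a complete pattern: label all of it.
                for k in range(i, j + 1):
                    labels[k] = 1
    return labels


def label_lines(lines, patterns):
    """Lazily label an iterable of sentences, e.g. an open text file."""
    trie = build_trie(patterns)
    for line in lines:
        yield label_many(trie, line)


trie = build_trie(["corporate balance sheets", "annual basis"])
print(label_many(
    trie, "The corporate balance sheets data are available on an annual basis"
))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
```

Because `label_lines` is a generator, a large file can be processed one line at a time without loading it into memory. Only complete patterns are labeled, so a stray "sheets" at the end of a sentence stays 0.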