Compare two lists in Python and get partial/full matches


python regex list


I have a CSV and a text file. The CSV has 3 columns. Sample CSV data:

Pack Type    Component Type           Component Material
Blister      Foil                     Aluminium
Blister      Base Web                 PVC/PVDC
Bottle       Cylindrically Bottles
Bottle       Screw Type Cap           Polypropylene

Sample text data:

The tablets are filled into cylindrically shaped bottles made of white coloured
polyethylene. The volumes of the bottles depend on the tablet strength and amount of
tablets, ranging from 20 to 175 ml. The screw type cap is made of white coloured
polypropylene and is equipped with a tamper proof ring.

I have two lists. List 1 comes from the CSV and list 2 comes from the text file.

list 1 = [['Bottle', 'Screw Type Cap', 'Polypropylene'], ['Bottle', 'Safety Ring', ''], ['Blister', 'Base Web', 'PVC'], ['Blister', 'Base Web', 'PVD/PVDC'], ['Bottle', 'Square Shaped Bottle', 'Polyethylene'], ['Bottle', 'Child Resistant (CR) Cap', 'Polypropylene']]

list 2 = [['The', 'tablets', 'are', 'filled', 'into', 'cylindrically', 'shaped', 'bottles', 'made', 'of', 'white', 'coloured', 'polyethylene.', 'The', 'volumes', 'of', 'the', 'bottles', 'depend', 'on', 'the', 'tablet', 'strength', 'and', 'amount', 'of', 'tablets,', 'ranging', 'from', '20', 'to', '175', 'ml.', 'The', 'screw', 'type', 'cap', 'is', 'made', 'of', 'white', 'coloured', 'polypropylene', 'and', 'is', 'equipped', 'with', 'a', 'tamper', 'proof', 'ring.'], ['PVC/PVDC', 'blister', 'pack'], ['Blisters', 'are', 'made', 'in', 'a', 'thermo-forming', 'process', 'from', 'a', 'PVC/PVDC', 'base', 'web.', 'Each', 'tablet', 'is', 'filled', 'into', 'a', 'separate', 'blister', 'and', 'a', 'lidding', 'foil', 'of', 'aluminium', 'is', 'welded', 'on.', 'The', 'blisters', 'are', 'opened', 'by', 'pressing', 'the', 'tablets', 'through', 'the', 'lidding', 'foil.', 'PVDC', 'foil', 'is', 'in', 'contact', 'with', 'the', 'tablets.']]

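I want to search every list in list 2 for each list 1 entry, so I am trying to match the tokens in list 1 against the tokens in list 2. If all tokens of one list 1 row are found in one of the list 2 lists, that should count as a match. For each match I want to return the whole matching list from list 2 together with the list 1 row that matched it.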

Output expected:

paragraph: ['Blisters', 'are', 'made', 'in', 'a', 'thermo-forming', 'process', 'from', 'a', 'PVC/PVDC', 'base', 'web.', 'Each', 'tablet', 'is', 'filled', 'into', 'a', 'separate', 'blister', 'and', 'a', 'lidding', 'foil', 'of', 'aluminium', 'is', 'welded', 'on.', 'The', 'blisters', 'are', 'opened', 'by', 'pressing', 'the', 'tablets', 'through', 'the', 'lidding', 'foil.', 'PVDC', 'foil', 'is', 'in', 'contact', 'with', 'the', 'tablets.'], Pack Type: Blister, Component Type: 'Base Web', Component Material:  'PVD/PVDC'

Questions:

Q1. How do I make 'Base Web' match 'base', 'web.' in list 2?
Q2. In the CSV, the 3rd row has no data in the 3rd column. When that happens, I want to ignore the empty 3rd column and match on the values from the remaining 2 columns.
Q3. I want partial matches to be extracted in the same way.

Code so far:

import csv
import re

filepath = r'C:\Users\0903882.txt'

# Read the text file and split it into paragraphs on blank lines
with open(filepath) as f:
    data = f.read()
    paragraphs = data.split("\n\n")

# Split each paragraph into whitespace-separated words
all_words = []
for paragraph in paragraphs:
    words = paragraph.split()
    all_words.append(words)

print(all_words)


inputfile = r"C:\Users\metadata.csv"
inputm = []

# With a tab delimiter, a comma-separated line comes back as a single field
with open(inputfile, "r") as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        inputm.append(row)

# Split that single field on commas to recover the three columns
final_ref = []
for lists in inputm:
    final_ref.append(str(lists[0]).split(','))

print(final_ref)

This gives me the two lists to compare.

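FYI, you can't give a list a name that contains a space. Also, list1 contains 'Base Web' while list2 contains 'base', 'web.', so if you simply loop over list1 and list2 and compare them as described, you need to split the entries further when you build the lists. If you can split them properly, I'd suggest comparing in lowercase with set(list1[i]).issubset(list2[j]).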
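
As a rough illustration of the lowercase set comparison suggested above, here is a minimal sketch. The helper names `tokens` and `match_rows` and the `threshold` parameter are made up for this example and are not part of the asker's code. It also assumes that splitting on non-alphanumeric characters is acceptable, so 'web.' becomes 'web' and 'PVC/PVDC' becomes 'pvc' and 'pvdc'. Empty CSV cells contribute no tokens, which is one way to handle Q2, and a threshold below 1.0 returns partial matches for Q3. The sample data is a shortened version of the lists in the question.

```python
import re


def tokens(text):
    """Lowercase text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_rows(rows, paragraphs, threshold=1.0):
    """Return (paragraph, row, score) tuples, where score is the fraction
    of the row's tokens that occur in the paragraph."""
    results = []
    for paragraph in paragraphs:
        para_tokens = tokens(" ".join(paragraph))
        for row in rows:
            wanted = tokens(" ".join(row))  # empty cells add no tokens
            if not wanted:
                continue  # skip rows that are entirely empty
            score = len(wanted & para_tokens) / len(wanted)
            if score >= threshold:
                results.append((paragraph, row, score))
    return results


# Shortened sample data based on the question
rows = [
    ['Blister', 'Base Web', 'PVC/PVDC'],
    ['Bottle', 'Safety Ring', ''],
    ['Bottle', 'Screw Type Cap', 'Polypropylene'],
]
paragraphs = [
    ['The', 'screw', 'type', 'cap', 'is', 'made', 'of', 'white',
     'coloured', 'polypropylene'],
    ['PVC/PVDC', 'blister', 'pack'],
    ['Blisters', 'are', 'made', 'from', 'a', 'PVC/PVDC', 'base', 'web.',
     'Each', 'tablet', 'is', 'filled', 'into', 'a', 'separate', 'blister'],
]

full = match_rows(rows, paragraphs)                    # every token must match
partial = match_rows(rows, paragraphs, threshold=0.5)  # at least half the tokens

for paragraph, row, score in partial:
    print(f"{score:.0%} match: {row} -> {' '.join(paragraph)}")
```

With this data only the 'Base Web' row is a full match, against the last paragraph. The 'Screw Type Cap' row is only a partial match because the text never contains the singular 'bottle'. Plurals such as 'bottles' or 'Blisters' are not handled here, so some form of stemming would be needed if they should count.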