
Python: check whether predefined list elements appear in a text/string


I have several text files that I want to compare against a vocabulary made up of multi-word expressions and single words. The desired output is a dictionary with every element of that list as a key and its frequency in the text file as the value. To build the vocabulary, two lists are matched together:

list1 = ['accounting',..., 'yields', 'zero-bond']
list2 = ['accounting', 'actual cost', ..., 'zero-bond']
vocabulary_list = ['accounting', 'actual cost', ..., 'yields', 'zero-bond']

sample_text = "Accounting experts predict an increase in yields for zero-bond and yields for junk-bonds."

desired_output = {'accounting': 1, 'actual cost': 0, ..., 'yields': 2, 'zero-bond': 1}
What I tried:

from collections import Counter

def word_frequency(fileobj, words):
    """Build a Counter of specified words in fileobj."""
    # Initialise the counter to 0 for each word.
    ct = Counter(dict((w, 0) for w in words))
    # Split each line into tokens; iterating a string directly yields characters.
    file_words = (word for line in fileobj for word in line.split())
    # Update the pre-initialised counter so zero counts are preserved.
    ct.update(word for word in file_words if word in words)
    return ct

def print_summary(filepath, ct):
    words = sorted(ct.keys())
    counts = [str(ct[k]) for k in words]
    with open(filepath[:-4] + '_dict.txt', mode='w') as outfile:
        outfile.write('{0}\n{1}\n{2}\n\n'.format(filepath, ', '.join(words), ', '.join(counts)))
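A minimal, self-contained sketch of that single-token counter, with `io.StringIO` standing in for a real text file (an assumption for demonstration; the function is restated so the snippet runs on its own):

```python
import io
from collections import Counter

def word_frequency(fileobj, words):
    """Build a Counter of the specified single words in fileobj."""
    ct = Counter(dict((w, 0) for w in words))
    file_words = (word for line in fileobj for word in line.split())
    ct.update(word for word in file_words if word in words)
    return ct

vocab = ['accounting', 'actual cost', 'yields', 'zero-bond']
text = io.StringIO("accounting experts predict an increase in yields "
                   "for zero-bond and yields for junk-bonds.")
print(word_frequency(text, vocab))
```

This reproduces the limitation described below: 'actual cost' stays at 0, because splitting on whitespace can never produce a two-word token.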

Is there any way to do this in Python? I figured out how to manage it with a vocabulary of single words (one token), but could not find a solution for the multi-word case. To chain all the lists into a single vocabulary dict:

from collections import Counter
from itertools import chain
import re

c = Counter()

# Chain both lists into a single vocabulary dict with every count initialised to 0.
l1, l2 = ['accounting', 'actual cost'], ['yields', 'zero-bond']
vocabulary_dict = {k: 0 for k in chain(l1, l2)}
print(vocabulary_dict)

sample_text = "Accounting experts predict actual costs an increase in yields for zero-bond and yields for junk-bonds.".lower()
splitted = sample_text.split()
c.update(splitted)

for k in vocabulary_dict:
    spl = k.split()
    ln = len(spl)
    if ln > 1:
        # Multi-word expression: match the whole phrase with word boundaries.
        check = re.findall(r'\b{0}\b'.format(k), sample_text)
        if check:
            vocabulary_dict[k] += len(check)
    elif k in splitted:
        # Single word: take its count straight from the Counter.
        vocabulary_dict[k] += c[k]
print(vocabulary_dict)
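The same result can also be sketched more compactly by treating every vocabulary entry, single word or phrase alike, with one word-boundary regex per term. This is an alternative sketch, not the answerer's code; `re.escape` guards against regex metacharacters in vocabulary entries:

```python
import re

vocabulary = ['accounting', 'actual cost', 'yields', 'zero-bond']
sample_text = ("Accounting experts predict actual costs an increase in yields "
               "for zero-bond and yields for junk-bonds.").lower()

# One findall per term; \b ensures 'actual cost' does not match 'actual costs'.
counts = {term: len(re.findall(r'\b{0}\b'.format(re.escape(term)), sample_text))
          for term in vocabulary}
print(counts)
# -> {'accounting': 1, 'actual cost': 0, 'yields': 2, 'zero-bond': 1}
```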

You could also create two dicts, one for the phrases and one for the single words, and iterate over each.
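A rough sketch of that suggestion, assuming the same sample data: count each dict with its natural method and merge at the end, so no splitting or length check is needed in the loop.

```python
import re
from collections import Counter

phrases = ['actual cost']
single_words = ['accounting', 'yields', 'zero-bond']
sample_text = ("Accounting experts predict actual costs an increase in yields "
               "for zero-bond and yields for junk-bonds.").lower()

# Single words: a plain token count.
token_counts = Counter(sample_text.split())
word_dict = {w: token_counts[w] for w in single_words}

# Phrases: a word-boundary regex per phrase.
phrase_dict = {p: len(re.findall(r'\b{0}\b'.format(re.escape(p)), sample_text))
               for p in phrases}

# Merge both dicts into the final vocabulary counts.
vocabulary_dict = {**word_dict, **phrase_dict}
print(vocabulary_dict)
```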

Comments:

- What is your single-word solution? In what way does it not work for expressions?
- My attempt is the word_frequency / print_summary code shown above, called with words = vocabulary_list. Unfortunately, the first function only captures single tokens, so it can only compare those single-token words against the vocabulary.
- Nice solution Padraic, but it doesn't work for an example like this: sample_text = "Accounting experts ... actual costs ... predict an increase in yields and zero-bond yields" -> 'actual cost': 0, 'accounting': 1 ...
- Thanks a lot Padraic. One bit is missing: your script's output is ...'yields': 1. Shouldn't it be ...'yields': 2?
- @DominikScheld, yes, the logic needed to be reversed for a second; now it works fine.
- Do you also happen to know how to combine the two lists to build a unique vocabulary?
- @DominikScheld, added an example of how to chain the lists and create the dict. You could also have two dicts, one for phrases and one for single words, and just merge them at the end, with no need to split and check the length.