Counting word frequencies with a Python dictionary, excluding a set of "stop words" read from a second file


python


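To be honest, I'm new to coding, but this is my last assignment of the semester and I'm completely stuck. Basically, I need to write a Python program that reads a text file containing English words and uses a Python dictionary to count word frequencies, excluding a set of stop words that is read from a second file. It then has to use matplotlib.pyplot to draw a horizontal bar chart showing the 15 most common words in the input file and their counts.

I was able to open the txt files and print out the words, and I've lowercased everything to make it easier to read, removed the punctuation, and split each line into words.

What I really need help with is checking the words I extracted from the usconst file against the stopwords file. Honestly, I have no idea how to do that with a dictionary. Any pointers on the bar chart would be great too.

This is what I have so far: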

def main():
    text = open("usconst.txt", "r")
    texts = open("stopwords.txt", "r")

    # loop through each line of the file for us const
    line_count = 1
    d = dict()

    # us const
    for line in text:
        print("line{} : is {}".format(line_count, line))
        line_count += 1
        line = line.translate(line.maketrans("", "", string.punctuation))
        line = line.lower()
        words = line.split()
        print("words =", words, "\n")

    # stop words
    for line in texts:
        line_count += 1
        line = line.lower()
        line = line.translate(line.maketrans("", "", string.punctuation))
        words = line.split()

    for word in words:
        if word in d:
            print("word--{}-- is already in dictionary, its value is {}".format(word, d[word]))
        else:
            d[word] = 42

Something like this. It's untested, since I don't have your files:

import string

def main():
    # Translation table that deletes punctuation; build it once and reuse it.
    strip_punct = str.maketrans("", "", string.punctuation)

    # us const: accumulate every word from every line into one list
    uswords = []
    with open("usconst.txt", "r") as text:
        for line_count, line in enumerate(text, start=1):
            print("line{} : is {}".format(line_count, line))
            line = line.translate(strip_punct).lower()
            uswords.extend(line.split())
    print("uswords =", uswords, "\n")

    # stop words: a separate list, normalized the same way
    stopwords = []
    with open("stopwords.txt", "r") as texts:
        for line in texts:
            line = line.lower().translate(strip_punct)
            stopwords.extend(line.split())

    # count every word that is not a stop word
    counts = {}
    for word in uswords:
        if word in stopwords:
            continue
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    print(counts)

main()

There are cleverer ways to do this, but this keeps your basic idea.
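As one example of a "cleverer way", here is a rough sketch that swaps the hand-rolled dict for `collections.Counter` and keeps the stop words in a `set`, so each membership test is fast. The helper names `normalize` and `count_words` are made up for illustration; they take any iterable of lines, so an open file works as well as a list of strings.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(line):
    """Lowercase a line and remove punctuation."""
    return line.lower().translate(_STRIP_PUNCT)


def count_words(lines, stop_lines):
    """Count the words in `lines`, skipping any word found in `stop_lines`."""
    # Collect the stop words into a set, normalized like the main text.
    stop_words = set()
    for line in stop_lines:
        stop_words.update(normalize(line).split())

    # Counter is a dict subclass, so this still counts with a dictionary.
    counts = Counter()
    for line in lines:
        counts.update(w for w in normalize(line).split() if w not in stop_words)
    return counts
```

With the two files from the question this could be called as `count_words(open("usconst.txt"), open("stopwords.txt"))`, and `counts.most_common(15)` would then give the 15 most frequent words with their counts.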

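First build a set of stop words, then count the words in the text after normalizing both (removing punctuation, lowercasing, and so on).

Then you can rank the counts of the words that are not in the stop-word set.

I reused parts of your code, but followed the approach described above: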

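import string
from collections import defaultdict

def normalize(line):
    line = line.lower()
    return line.translate(line.maketrans("", "", string.punctuation))

# create the normalized set of stop words
stop_words = set()
with open("stopwords.txt", "r") as f:
    for line in f:
        stop_words.update(normalize(line).split())

# create the normalized word count dictionary
word_count = defaultdict(int)
with open("usconst.txt", "r") as f:
    for line in f:
        for w in normalize(line).split():
            word_count[w] += 1

# list the non-stop words, most frequent first
ranked = sorted([(k, v) for k, v in word_count.items() if k not in stop_words],
                reverse=True, key=lambda x: x[1])
print(ranked)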
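For the horizontal bar chart the question asks about, here is a minimal sketch. It assumes you already have a dict mapping each non-stop word to its count and that matplotlib is installed. The function names `top_words` and `plot_top_words` are hypothetical, chosen only for this example.

```python
def top_words(counts, n=15):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


def plot_top_words(pairs, title="Most common words"):
    """Draw a horizontal bar chart from a list of (word, count) pairs."""
    # Imported here so top_words() stays usable without matplotlib installed.
    import matplotlib.pyplot as plt

    words = [word for word, _ in pairs]
    freqs = [count for _, count in pairs]

    fig, ax = plt.subplots()
    ax.barh(words, freqs)   # one horizontal bar per word
    ax.invert_yaxis()       # put the most frequent word at the top
    ax.set_xlabel("Count")
    ax.set_title(title)
    fig.tight_layout()
    plt.show()
```

Calling `plot_top_words(top_words(counts))` should then show the 15 most common words, where `counts` is whatever dictionary your counting loop produced.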

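OK, but you never put the word lists anywhere. You need to accumulate the words you split off into one big list, and you need a separate list for each of the two loops.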
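To make that point concrete, here is a small illustration using two made-up lines instead of the real files. Assigning to `words` inside the loop throws away the previous line's words on every pass, while extending a list keeps all of them.

```python
lines = ["We the People", "of the United States"]

# Reassigning inside the loop: only the last line's words survive.
for line in lines:
    words = line.lower().split()
last_only = words

# Accumulating: every line's words end up in one list.
all_words = []
for line in lines:
    all_words.extend(line.lower().split())

print(last_only)   # ['of', 'the', 'united', 'states']
print(all_words)   # ['we', 'the', 'people', 'of', 'the', 'united', 'states']
```

The same applies to the stop words: give them their own list (or set), so the second loop does not overwrite what the first loop collected.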