Counting occurrences of specific words in Excel sheets using Python and xlrd

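I'm writing a Python script that finds the Excel files in the same directory as the script (I have about 10) and counts the occurrences of specific words (such as cloud, vmware, python) in those files, then writes the total for each word to a text file. I'm using Python and xlrd for this. Each Excel file has a sheet named "Details" that holds the information. Each file has 2 columns and about 26 rows.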

for filename in os.listdir(path):
    if filename.find('xls') != -1:
        print filename
        workbook = xlrd.open_workbook(filename)
        sheet = workbook.sheet_by_name("Details")
        values = []
        for row in range(sheet.nrows):
            for col in range(sheet.ncols):
                values.append(unicode(sheet.cell(row, col).value))
                print values.count("cloud")

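I use for loops to go through the columns and all the rows of each file and add every value to a list. Then I count using that list, named values. I need some kind of running total that sums the count for each word, because everything happens inside the for loop; otherwise it shows a count per file. Unfortunately, for some reason it doesn't work. I also need to set up something like a dictionary holding all the words I want to count, but I don't know how. Any help would be greatly appreciated.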
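One way to get a single total per word across all files is to keep one Counter outside the file loop. The following is only a sketch in Python 3 (the snippets in this thread are Python 2): the function name count_keywords, the KEYWORDS tuple and the sample data are hypothetical, the cell values are passed in as plain lists instead of being read with xlrd, and punctuation is not handled here.

```python
from collections import Counter

# Hypothetical keyword list; replace with the words you want to total.
KEYWORDS = ("cloud", "vmware", "python")

def count_keywords(cells_by_file, keywords=KEYWORDS):
    """Return one Counter with keyword totals summed over all files."""
    wanted = set(k.lower() for k in keywords)
    # Start every keyword at zero so absent words still show up in the result.
    totals = Counter({k: 0 for k in wanted})
    for cells in cells_by_file.values():
        for cell in cells:
            # A cell may hold a whole sentence, so split it into words first.
            for word in str(cell).lower().split():
                if word in wanted:
                    totals[word] += 1
    return totals

# Made-up stand-in for the cell values read from two workbooks.
sample = {
    "a.xls": ["Moving to the cloud", "VMware and Python"],
    "b.xls": ["cloud cloud", 42.0],
}
totals = count_keywords(sample)
```

Because the Counter lives outside the loop over files, the counts accumulate instead of being reset for each workbook.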

For your new question, it would be very useful if you provided a sample of the input data, as well as the expected output.

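I think you should change

values.append(unicode(sheet.cell(row,col).value))

to

In this case, values holds all the words (including punctuation). You can strip the punctuation and count the words using the string and collections modules, respectively.

print words should print all the words in the Excel file, each with a count next to it.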
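A minimal sketch of this approach in Python 3, assuming each cell is split on whitespace before counting. The helper name count_words and the sample cells are illustrative, not part of the original answer.

```python
import string
from collections import Counter

def count_words(cells):
    """Split every cell into words, then count them case-insensitively."""
    values = []
    for cell in cells:
        # extend() adds the individual words; append() would add the whole cell.
        values.extend(str(cell).split())
    # strip() removes punctuation only at the start and end of each word.
    return Counter(w.lower().strip(string.punctuation) for w in values)

words = count_words(["Hosting complex SAP application,", "to save money."])
```

Note that strip only trims leading and trailing punctuation, so a token such as "action:run" stays a single word with this method.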
A cell may or may not contain several words, so it has to be split after the punctuation has been replaced. Here this is done with a translation map:
import os
import xlrd
from string import punctuation
from collections import Counter

def count_words_trans():
    sheet_no = 1  # sheet is selected by 0-based index here, so 1 is the second sheet
    path = '.'
    # Map every punctuation character to a space so "action:run" splits into two words.
    punctuation_map = dict((ord(c), u' ') for c in punctuation)

    for filename in os.listdir(path):
        if filename.endswith('.xls'):
            print filename
            workbook = xlrd.open_workbook(os.path.join(path, filename))
            sheet = workbook.sheet_by_index(sheet_no)
            values = []
            for row in range(sheet.nrows):
                for col in range(sheet.ncols):
                    c = sheet.cell(row, col)
                    # Only text cells are tokenized; blanks, dates and numbers are skipped.
                    if c.ctype == xlrd.XL_CELL_TEXT:
                        cv = unicode(c.value)
                        wordlist = cv.translate(punctuation_map).split()
                        values.extend(wordlist)

            # Word counts for the current file (Python 2 print statement).
            cnt = Counter(values)
            print sum(cnt.values()), ' words counted,', len(cnt), ' unique'

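Text such as "action:run" is correctly split into two words (unlike when the punctuation is merely deleted). The translate method is unicode-safe. For efficiency, only cells that contain text are read (no empty cells, no dates, no numbers).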
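The difference between replacing punctuation with spaces and deleting it can be seen in a small Python 3 sketch. The names space_map and delete_map are made up for the illustration.

```python
from string import punctuation

# Replace each punctuation character with a space ...
space_map = {ord(c): " " for c in punctuation}
# ... versus deleting it outright (None removes the character).
delete_map = {ord(c): None for c in punctuation}

text = "action:run"
split_words = text.translate(space_map).split()   # ['action', 'run']
glued_words = text.translate(delete_map).split()  # ['actionrun']
```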

You can get a word-frequency list with:
for w in cnt.most_common():
    print '%s %s' % w
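Since the question also asks for the totals to end up in a text file, the same loop can write to a file instead of printing. This is a sketch in Python 3; the function name write_counts, the file name word_counts.txt and the sample Counter are assumptions made for the example.

```python
import os
import tempfile
from collections import Counter

def write_counts(counter, out_path):
    """Write one 'word count' line per word, most frequent first."""
    with open(out_path, "w", encoding="utf-8") as fh:
        for word, count in counter.most_common():
            fh.write("%s %d\n" % (word, count))

# Made-up counts standing in for the Counter built from the workbooks.
cnt = Counter(["cloud", "python", "cloud"])
out_path = os.path.join(tempfile.mkdtemp(), "word_counts.txt")
write_counts(cnt, out_path)
```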

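Can you tell me why? Is there an error?

Oh, OK. So it goes through every row of every file and prints a zero on the console for each row. I think the main problem is that I am looking for a single word while each cell holds at least one sentence. I guess it can't see the individual words in each cell. Thanks for the reply.

When I use the code words = Counter(word.lower().strip(string.punctuation) for word in values) it throws an error: AttributeError: 'list' object has no attribute 'lower'

Try removing the unicode() step, I don't think you need it.

For print type(values) I get:
For print values I get all the text from the Excel files as output. Here is a sample of the text: [u'Hosting', u'complex', u'SAP', u'application', u'to', u'save', u'money']

Yes, I edited my reply. You should remove the unicode() step, because then the type should be
, and the sample text should be ['Hosting', 'complex', 'SAP', 'application', 'to', 'save', 'money']

OK, when I remove unicode, it now shows AttributeError: 'float' object has no attribute 'split'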
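The last error is what you would expect when some cells hold numbers: xlrd returns the value of a numeric cell as a float, and a float has no split method. A hedged Python 3 sketch of one way around it is to split only the values that are strings. The helper name words_from_cells and the sample values are hypothetical, and the values are passed in as a plain list instead of being read with xlrd.

```python
def words_from_cells(cells):
    """Collect words from text values only; skip numbers and other types."""
    words = []
    for value in cells:
        # Numeric cells arrive as float, which cannot be split into words.
        if isinstance(value, str):
            words.extend(value.split())
    return words

words = words_from_cells(["save money", 2013.0, "", "cloud"])
```

Checking the cell type with c.ctype == xlrd.XL_CELL_TEXT, as in the answer's code, achieves the same filtering at the xlrd level.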