Count occurrences of certain words across a whole CSV file and in each row, in Python

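I am processing data from multiple servers and generating one CSV file per server. I have managed to compile the data from all servers into a single file. The data in the merged file looks like this: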

Description,dc1pp1sellv01,dc1pp2sellv01,dc2pp1sellv01
1.1 Database Placement,PASSED,PASSED,PASSED
1.2 Use dedicated least privilaged account,PASSED,PASSED,PASSED
1.3 Diable MySQL history,PASSED,PASSED,FAILED
2.1 Ensure old passwords is set to 1,PASSED,DEPRICATED,NA

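Each server column in the file above can hold any of the following result values:

["PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED"]

From the CSV file above, I want to count the results and create a dataset like the one below: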

Description,dc1pp1sellv01,dc1pp2sellv01,dc2pp1sellv01,PASSED,FAILED,EXCEPTION,NA,DEPRECATED
1.1 Database Placement,PASSED,PASSED,PASSED,3,0,0,0,0
1.2 Use dedicated least privilaged account,PASSED,PASSED,PASSED,3,0,0,0,0
1.3 Diable MySQL history,PASSED,PASSED,FAILED,2,1,0,0,0
2.1 Ensure old passwords is set to 1,PASSED,DEPRICATED,NA,1,0,0,1,1

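Here is a suggestion (fairly detailed, to emphasize what is happening):

I assume your data is in a file named data.csv; you will have to adjust that. I hope it works.

PS: There is a typo in your sample data: DEPRICATED should be DEPRECATED. As written, it leads to unexpected output, because the misspelled value is not counted in the DEPRECATED column.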
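A step-by-step version along these lines might look like the sketch below. It is only an illustration under a few assumptions: the function name `append_counts` is made up here, and the input is an in-memory string (via `io.StringIO`) instead of data.csv so that the snippet runs on its own. The second sample row also shows the effect of the typo: the misspelled DEPRICATED value leaves the DEPRECATED count at 0.

```python
import csv
import io

EVENTS = ["PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED"]


def append_counts(reader, writer, events=EVENTS):
    """Copy rows from reader to writer, appending one count column per event."""
    header = next(reader)
    writer.writerow(header + events)      # extend the header with the result names
    for row in reader:
        results = row[1:]                 # skip the Description column
        counts = []
        for event in events:
            counts.append(results.count(event))
        writer.writerow(row + counts)


# Hypothetical two-row excerpt of the merged file from the question.
sample = (
    "Description,dc1pp1sellv01,dc1pp2sellv01,dc2pp1sellv01\n"
    "1.3 Diable MySQL history,PASSED,PASSED,FAILED\n"
    "2.1 Ensure old passwords is set to 1,PASSED,DEPRICATED,NA\n"
)
out = io.StringIO()
append_counts(csv.reader(io.StringIO(sample)),
              csv.writer(out, lineterminator="\n"))
print(out.getvalue())
```

To run it against real files, pass `csv.reader` and `csv.writer` objects built on files opened with `newline=''` instead of the `io.StringIO` objects.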

A more compact version without unnecessary helper variables looks like this:

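import csv

events = ["PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED"]
# newline='' lets the csv module handle line endings itself
with open('data.csv', 'r', newline='') as fin, open('data_out.csv', 'w', newline='') as fout:
    in_, out = csv.reader(fin), csv.writer(fout)
    # Header row: original columns followed by one column per result value
    out.writerow(next(in_) + events)
    # Data rows: append, for each result value, how often it occurs in the server columns
    out.writerows(line + [sum(1 if event == entry else 0 for entry in line[1:])
                          for event in events]
                  for line in in_)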
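
You can use collections.Counter to count the occurrences of specific words. Assuming you have opened the .csv file and stored its contents in the string input, you can do the following: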

from collections import Counter

res_values = ("PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED")

input = ("Description,dc1pp1sellv01,dc1pp2sellv01,dc2pp1sellv01\n"
         "1.1 Database Placement,PASSED,PASSED,PASSED\n"
         "1.2 Use dedicated least privilaged account,PASSED,PASSED,PASSED\n"
         "1.3 Diable MySQL history,PASSED,PASSED,FAILED\n"
         "2.1 Ensure old passwords is set to 1,PASSED,DEPRICATED,NA")

print('\n'.join(
    [line + ',' + ','.join(
        [str(Counter(line.split(','))[res])
         if i != 0
         else res
         for res in res_values]
    )
     for i, line in enumerate(input.split('\n'))]))
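I used a list comprehension to streamline the process (since the file could be very large), but here is a clearer version that does exactly the same thing: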

split = input.split('\n')                      # Split the input line by line
for i, line in enumerate(split):               # For each line of the input
    if i == 0:                                 # Write full result name (for the first line)
        split[i] += ',' + ','.join(res_values)
    else:                                      # Count and write result occurrences
        counts = Counter(line.split(','))
        for res in res_values:
            split[i] += ',' + str(counts[res])
print('\n'.join(split))                        # Join the full string
This is a runnable solution, but for efficiency, reading the file line by line is of course better than storing it all in a string variable.
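A line-by-line variant could be sketched as a generator, so that only one line is held in memory at a time. The name `with_counts` is hypothetical, `io.StringIO` stands in for an open file so the snippet runs on its own, and the plain `split(',')` assumes that no field contains a quoted comma. Unlike the versions above, this sketch skips the Description column when counting.

```python
from collections import Counter
import io

RES_VALUES = ("PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED")


def with_counts(lines, res_values=RES_VALUES):
    """Yield each input line with the result counts appended, one line at a time."""
    lines = iter(lines)
    header = next(lines).rstrip("\n")
    yield header + "," + ",".join(res_values)   # header gets the result names
    for line in lines:
        line = line.rstrip("\n")
        if not line:                            # ignore blank lines
            continue
        counts = Counter(line.split(",")[1:])   # ignore the Description column
        yield line + "," + ",".join(str(counts[res]) for res in res_values)


# An open file object iterates line by line; io.StringIO stands in for it here.
source = io.StringIO(
    "Description,dc1pp1sellv01,dc1pp2sellv01,dc2pp1sellv01\n"
    "1.1 Database Placement,PASSED,PASSED,PASSED\n"
    "1.3 Diable MySQL history,PASSED,PASSED,FAILED\n"
)
result = list(with_counts(source))
print("\n".join(result))
```

With a real file, the same generator can be fed directly from `open('data.csv')` and its output written to a second file as each line is produced.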
