How can I extract and sum columns from a CSV file using Python?


python linux csv


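I have a particular kind of CSV file whose format resembles the following (I extracted these columns from network logs with the help of AWK on Linux):

I need to take a given source IP address as an argument, for example "10.0.0.1", then sum the total input bytes for each DestIP separately (and print them), and then sum the total output bytes for each DestIP (and print those too). Ideally, the desired output looks like this: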

>file.py log.csv 10.0.0.1
10.0.0.1 connected to 19.0.1.1 with 350151 InputBytes
10.0.0.1 connected to 11.0.1.1 with 45460 InputBytes
10.0.0.1 connected to 11.0.0.1 with 37701 InputBytes
10.0.0.1 connected to 11.0.0.1 with 5700 OutputBytes
10.0.0.1 connected to 19.0.1.1 with 1501 OutputBytes
10.0.0.1 connected to 11.0.1.1 with 1230 OutputBytes

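Some notes:

It is safe to assume that all four original fields will always be present.

Ideally, each group of output (InputBytes and OutputBytes separately) is sorted, because the goal is to identify which DestIP addresses received or sent the most data.

Unfortunately, I have no code to start from (although I have just become familiar with reading files).

Thank you very much for your help.
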
I wrote this implementation:

from collections import defaultdict
import sys

# big_d[source_ip][dest_ip] -> {'in': total input bytes, 'out': total output bytes}
big_d = defaultdict(dict)

with open("tmp.csv") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Each field is assumed to look like key=value, e.g. SourceIP=10.0.0.1
        d = dict(field.strip().split('=', 1) for field in line.split(','))
        totals = big_d[d['SourceIP']].setdefault(d['DestIP'], {'in': 0, 'out': 0})
        totals['in'] += int(d['InputBytes'])
        totals['out'] += int(d['OutputBytes'])

input_ip = sys.argv[1]
for dest_ip, totals in big_d[input_ip].items():
    print(input_ip, "connected to", dest_ip, "with", totals['in'], "InputBytes")
    print(input_ip, "connected to", dest_ip, "with", totals['out'], "OutputBytes")

Output:
~python tmp.py 10.0.0.1
10.0.0.1 connected to 11.0.0.1 with 37701 InputBytes
10.0.0.1 connected to 11.0.0.1 with 5700 OutputBytes
10.0.0.1 connected to 11.0.1.1 with 45460 InputBytes
10.0.0.1 connected to 11.0.1.1 with 1230 OutputBytes
10.0.0.1 connected to 19.0.1.1 with 350151 InputBytes
10.0.0.1 connected to 19.0.1.1 with 1501 OutputBytes

tmp.csv is your input file.

I believe it meets all of your requirements.
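
The question also asks for each group to be sorted so that the busiest destinations come first. Below is a minimal sketch of one way to do that, assuming every line consists of comma-separated key=value fields named SourceIP, DestIP, InputBytes and OutputBytes (the field names come from the code above). The helper names `summarize` and `report` are made up for illustration, and the command-line handling simply follows the `file.py log.csv 10.0.0.1` invocation from the question.

```python
import sys
from collections import defaultdict


def summarize(lines, source_ip):
    """Return (input_totals, output_totals) keyed by DestIP for one SourceIP."""
    in_totals = defaultdict(int)
    out_totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed line format: SourceIP=...,DestIP=...,InputBytes=...,OutputBytes=...
        fields = dict(part.strip().split("=", 1) for part in line.split(","))
        if fields["SourceIP"] != source_ip:
            continue
        in_totals[fields["DestIP"]] += int(fields["InputBytes"])
        out_totals[fields["DestIP"]] += int(fields["OutputBytes"])
    return dict(in_totals), dict(out_totals)


def report(source_ip, totals, label):
    """Format one group of totals, largest byte count first."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [f"{source_ip} connected to {dest} with {count} {label}"
            for dest, count in ranked]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python file.py log.csv 10.0.0.1
    path, ip = sys.argv[1], sys.argv[2]
    with open(path) as f:
        ins, outs = summarize(f, ip)
    print("\n".join(report(ip, ins, "InputBytes") + report(ip, outs, "OutputBytes")))
```

Keeping the two totals in separate dictionaries makes it easy to rank the InputBytes group and the OutputBytes group independently, which is what the desired output in the question shows.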
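
You should probably preprocess the file first. You can parse it as CSV and then parse each value, such as SourceIP=10.0.0.1, to extract only the part after the equals sign (the IP address or byte count). Then build a table from the result (a numpy array would work). Summing the columns for a given source IP should then be fairly straightforward.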
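
That preprocessing step might look roughly like the sketch below. It uses the standard-library csv module to split each row, strips the `key=` prefix from every cell, and keeps the result as a list of tuples rather than a numpy array so that it runs without extra dependencies. The column order (SourceIP, DestIP, InputBytes, OutputBytes), the function names and the sample rows are assumptions for illustration, not something the original post specifies.

```python
import csv
import io


def load_table(handle):
    """Parse rows like 'SourceIP=10.0.0.1,DestIP=...' into plain tuples."""
    table = []
    for row in csv.reader(handle):
        if not row:
            continue  # csv.reader yields [] for blank lines
        # Keep only the part after '=' in each cell (assumed column order).
        src, dst, in_bytes, out_bytes = (cell.split("=", 1)[1].strip() for cell in row)
        table.append((src, dst, int(in_bytes), int(out_bytes)))
    return table


def column_sums(table, source_ip):
    """Sum the InputBytes and OutputBytes columns for one source IP."""
    rows = [r for r in table if r[0] == source_ip]
    return sum(r[2] for r in rows), sum(r[3] for r in rows)


# Hypothetical sample data, only to show the call pattern.
sample = io.StringIO(
    "SourceIP=10.0.0.1,DestIP=11.0.0.1,InputBytes=100,OutputBytes=7\n"
    "SourceIP=10.0.0.1,DestIP=19.0.1.1,InputBytes=300,OutputBytes=2\n"
    "SourceIP=10.0.0.2,DestIP=11.0.0.1,InputBytes=5,OutputBytes=5\n"
)
table = load_table(sample)
print(column_sums(table, "10.0.0.1"))  # (400, 9)
```

Once the rows are plain tuples, grouping by DestIP or converting the numeric columns to a numpy array is a small further step.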