
Python script to search for results and export them to a .csv file


I'm trying to do the following in Python, along with some bash scripting, unless there is an easier way to do it in Python.

I have a log file with data that looks like this:

16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: 311yrsbj - On Venue: ABCD
16:14:59.027010 - WARN - Ack Latency: 25ms - OrderId: 311yrsbl - On Venue: EFGH
16:14:59.027201 - WARN - Ack Latency: 22ms - OrderId: 311yrsbn - On Venue: IJKL
16:14:59.027235 - WARN - Cancel Latency: 137ms - OrderId: 311yrsbp - On Venue: MNOP
16:14:59.027256 - WARN - Cancel Latency: 220ms - OrderId: 311yrsbr - On Venue: QRST
16:14:59.027293 - WARN - Ack Latency: 142ms - OrderId: 311yrsbt - On Venue: UVWX
16:14:59.027329 - WARN - Cancel Latency: 134ms - OrderId: 311yrsbv - On Venue: YZ  
16:14:59.027359 - WARN - Ack Latency: 75ms - OrderId: 311yrsbx - On Venue: ABCD
16:14:59.027401 - WARN - Cancel Latency: 66ms - OrderId: 311yrsbz - On Venue: ABCD
16:14:59.027426 - WARN - Cancel Latency: 212ms - OrderId: 311yrsc1 - On Venue: EFGH
16:14:59.027470 - WARN - Cancel Latency: 89ms - OrderId: 311yrsf7 - On Venue: IJKL  
16:14:59.027495 - WARN - Cancel Latency: 97ms - OrderId: 311yrsay - On Venue: IJKL
I need to extract the last entry from each line, then search every line for each unique entry and export the results to a .csv file.

I used the following bash script to get each unique entry:

cat LogFile_`date +%Y%m%d`.msg.log | awk '{print $14}' | sort | uniq

Based on the data above, the bash script returns the following results:

ABCD
EFGH
IJKL
MNOP
QRST
UVWX
YZ
Now I want to search (or grep) the same log file for each of those results and return the top ten. I have another bash script to do this, but how do I do it with a FOR loop? For x, where x = each entry above:

grep x LogFile_`date +%Y%m%d`.msg.log | awk '{print $7}' | sort -nr | uniq | head -10

Then return the results to a .csv file. The results should look like this (each field in a separate column):


I'm a beginner with Python and haven't done much coding since college (13 years ago). Any help would be greatly appreciated. Thanks.
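For comparison, the two bash pipelines above map fairly directly onto a single pass in Python. A rough, hypothetical sketch (assuming whitespace-separated fields, as the awk field numbers imply, and a log file name passed on the command line):

import sys
from collections import defaultdict

# Collect every latency value (field 7) per venue (field 14), as in the awk commands.
latencies_by_venue = defaultdict(list)
with open(sys.argv[1]) as logfile:
    for line in logfile:
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or malformed lines
        latencies_by_venue[fields[13]].append(fields[6])

# For each venue, print the ten largest distinct latencies,
# mirroring "sort -nr | uniq | head -10".
for venue in sorted(latencies_by_venue):
    top_ten = sorted(set(latencies_by_venue[venue]),
                     key=lambda ms: int(ms.rstrip('ms')), reverse=True)[:10]
    print(venue, top_ten)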

Say you've got the file open. What you want to do is keep a record of how many times each entry appears, i.e. each entry will map to one or more timings:

from collections import defaultdict

entries = defaultdict(list)
for line in your_file:
    # Parse the line and return the 'ABCD' part and time
    column_a, timing = parse(line)
    entries[column_a].append(timing)
Once that's done, you'll have a dictionary like this:

{ 'ABCD': ['30ms', '25ms', '12ms'],
  'EFGH': ['12ms'],
  'IJKL': ['2ms', '14ms'] }
What you want to do now is transform this dictionary into another data structure sorted by the len of its values (i.e. the lists). For example:

In [15]: sorted(((k, v) for k, v in entries.items()), 
                key=lambda i: len(i[1]), reverse=True)
Out[15]: 
[('ABCD', ['30ms', '25ms', '12ms']),
 ('IJKL', ['2ms', '14ms']),
 ('EFGH', ['12ms'])]

Of course, this is just illustrative; you will probably want to collect more data in the original for loop.
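The parse() call in the loop above is just a placeholder. A minimal sketch of one way to write it, assuming every line follows the " - "-separated format shown in the question:

def parse(line):
    # '16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: 311yrsbj - On Venue: ABCD'
    fields = line.strip().split(' - ')
    venue = fields[-1].replace('On Venue: ', '')   # e.g. 'ABCD'
    timing = fields[2].split(': ')[1]              # e.g. '100ms'
    return venue, timing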

Maybe not as concise as you expected... but I think this solves your problem. I've added some try/except handling to deal better with real data:

import re
import os
import csv
import collections

# find all log files in the current directory; of course this pattern could be
# more sophisticated, but that's not our focus here
log_pattern = re.compile(r"LogFile_date[0-9]{8}\.msg\.log")
logfiles = [f for f in os.listdir('./') if log_pattern.match(f)]

# top n
nhead = 10
# used to parse useful fields
extract_pattern = re.compile(
    r'.*Cancel Latency: ([0-9]+ms) - OrderId: ([0-9a-z]+) - On Venue: ([A-Z]+)')
# container for final results
res = collections.defaultdict(list)

# parse out all interesting fields
for logfile in logfiles:
    with open(logfile, 'r') as logf:
        for line in logf:
            try:  # in case of blank line or line with no such fields.
                latency, orderid, venue = extract_pattern.match(line).groups()
            except AttributeError:
                continue
            res[venue].append((orderid, latency))

# write to csv
with open('res.csv', 'w') as resf:
    resc = csv.writer(resf, delimiter=' ')
    for venue in sorted(res):  # sort by Venue (iterkeys() is Python 2 only)
        entries = res[venue]
        entries.sort()  # sort by OrderId
        for i in range(0, nhead):
            try:
                resc.writerow([venue, entries[i][0], 'Cancel ' + entries[i][1]])
            except IndexError:  # nhead can not be satisfied
                break
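As a follow-up to the comments below: if you want each field in its own spreadsheet column with a header row, one option is to drop the delimiter=' ' argument so the writer emits standard comma-separated rows, and to write a header first. A hedged sketch reusing the res and nhead variables built above (not part of the original answer):

# Variation on the CSV-writing step: comma-separated output with a header row,
# so each field lands in its own column when opened in a spreadsheet.
with open('res.csv', 'w') as resf:
    resc = csv.writer(resf)                         # default delimiter is ','
    resc.writerow(['Venue', 'OrderId', 'Latency'])  # header row
    for venue in sorted(res):                       # alphabetical by venue
        for orderid, latency in sorted(res[venue])[:nhead]:
            resc.writerow([venue, orderid, latency])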

How does your output correspond to your input?

Might be something simple, but I'm getting an error: with open(logfile, 'r') as logf:  ^ SyntaxError: invalid syntax

Thanks for the help, Francis Chan. This works. Is there a way to write each field to a separate column in the .csv file, each with its own header? Right now it writes all 4 fields into the same column (column A). Also, I'd like to sort alphabetically by venue and then by the fourth field in descending order (63ms, 64ms, 63ms, 62ms... etc.)? Thanks again for your help. Also, I should have given a better example of my log file. There are two different types of "Latency", but I only showed one type, "Cancel". It is actually either "Cancel" or "Ack". How can I include the correct word that precedes "Latency"?

16:14:59.027003 - WARN - Ack Latency: 22ms - OrderId: 311yrsbj - On Venue: IJKL
16:14:59.027010 - WARN - Cancel Latency: 22ms - OrderId: 311yrsbl - On Venue: EFGH
16:14:59.027201 - WARN - Ack Latency: 22ms - OrderId: 311yrsbn - On Venue: IJKL
16:14:59.027235 - WARN - Cancel Latency: 22ms - OrderId: 311yrsbp - On Venue: MNOP
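Regarding the last comment: one way to handle both "Ack" and "Cancel" lines is to capture the word before "Latency" as an extra group instead of hard-coding "Cancel" in the pattern. A small, hypothetical sketch of that tweak (not from the original answer):

import re

# Capture the latency type ('Ack' or 'Cancel') as an additional field.
extract_pattern = re.compile(
    r'.*(Ack|Cancel) Latency: ([0-9]+ms) - OrderId: ([0-9a-z]+) - On Venue: ([A-Z]+)')

line = '16:14:59.027003 - WARN - Ack Latency: 22ms - OrderId: 311yrsbj - On Venue: IJKL'
ltype, latency, orderid, venue = extract_pattern.match(line).groups()
# ltype == 'Ack', latency == '22ms', orderid == '311yrsbj', venue == 'IJKL'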