
Python: reading a large set of log files line by line and counting hostname occurrences

Tags: python, dataframe, large-files

I have about 300 log files in a directory, and each log file contains roughly 3,300,000 lines. I need to read each file line by line and count the hostnames that appear on each line. I wrote some basic code for this task, but it takes more than an hour to run and uses a lot of memory. How can I improve this code so that it runs faster?

import os
import gzip
import pandas as pd

directory=os.fsdecode("/home/scratch/mdsadmin/publisher_report/2018-07-25") #folder with 300 log files
listi=os.listdir(directory) #listing the log files in the directory

df_final=pd.DataFrame(columns=['hostname']) #accumulator dataframe for all files

for file in listi: #taking each log file in the list
    tt=os.path.join(directory,file) #joining the log file name with the directory path
    with gzip.open(tt,'rt') as f: #opening the gzipped log file as text
        rows=[] #clearing the list for every file
        for line in f: #reading each line in the file
            a=line.split('|')[-3] #third pipe-delimited field from the end
            b=a.split('/')[0] #slicing just the hostname out of each line in the log file
            b=b.split('.')[0] #keep only the short hostname (split returns the whole string if there is no '.')
            rows.append(b) #appending it to a list

    df_temp=pd.DataFrame(columns=['hostname'],data=rows) #list to dataframe after every file is read
    df_final=df_final.append(df_temp,ignore_index=True) #appending to a single dataframe to avoid overwriting
    del df_temp #deleting the temp dataframe to free memory

df_counts=df_final.groupby(["hostname"]).size().reset_index(name="Topic_Count") #doing the count
Sample log lines

tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|**wokpa22**.sx.sx.com/16604/#001b0001|244/5664|2344|455
tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|**wokdd333**.sc.sc.com/16604/#001b0001|7632663/2344|342344|23244
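For reference, here is a minimal parsing sketch against the first sample line above. It assumes the hostname sits in the fourth pipe-delimited field and that the ** markers are only highlighting; the posted code indexes the third field from the end instead, so the index may need adjusting to the real log layout.

line = "tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|wokpa22.sx.sx.com/16604/#001b0001|244/5664|2344|455"
field = line.split('|')[3]            # 'wokpa22.sx.sx.com/16604/#001b0001' (assumed position)
hostname = field.split('/')[0]        # 'wokpa22.sx.sx.com'
short_name = hostname.split('.')[0]   # 'wokpa22'
print(short_name)                     # prints: wokpa22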
Desired output


Comment: If the code works and you just want it optimized, this is probably better asked as a code-review question, thanks.
Reply: Yes, it does work, but it takes more than an hour. I just want to know what kinds of changes I should make to get it to run faster.
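A sketch of one way to make this faster, assuming the same field layout as the posted code (the [-3] index and the column names are carried over and may need adjusting): keep a running collections.Counter of hostnames instead of building a DataFrame per file, and create a single small DataFrame only at the end.

import os
import gzip
from collections import Counter

import pandas as pd

directory = os.fsdecode("/home/scratch/mdsadmin/publisher_report/2018-07-25")
counts = Counter()  # running tally of hostname -> number of matching lines

for name in os.listdir(directory):
    path = os.path.join(directory, name)
    with gzip.open(path, 'rt') as f:
        for line in f:
            field = line.split('|')[-3]                    # same field the original code reads
            hostname = field.split('/')[0].split('.')[0]   # short hostname
            counts[hostname] += 1                          # count immediately, no intermediate lists

# one small DataFrame at the end instead of appending millions of rows
df_counts = pd.DataFrame(sorted(counts.items()), columns=["hostname", "Topic_Count"])
print(df_counts)

Because each file is independent, the per-file counting could also be spread over a multiprocessing.Pool and the resulting Counters summed, but even the Counter change alone removes the repeated DataFrame.append calls, which copy the accumulated frame on every iteration and account for much of the original run time and memory use.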