I/O efficiency in Python

python io

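I am writing a program that:

- Reads the content of each row of an Excel sheet (90,000 rows in total)
- Compares each row against another Excel sheet (600,000 rows in total)
- If a match occurs, writes the match to a new Excel sheet

I have written the script and everything works. However, the run time is huge: in one hour it got through only 200 rows of the first sheet, and so wrote 200 different files.

Is there a way to save the matches differently, since I will be using them later? Could they be stored in a matrix or something similar?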

import xlrd
import xlsxwriter
import os, itertools
from datetime import datetime
# choose the incident excel sheet
book_1 = xlrd.open_workbook('D:/Users/d774911/Desktop/Telstra Internship/Working files/Incidents.xlsx')
# choose the trap excel sheet
book_2 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Traps.xlsx")
# choose the features sheet
book_3 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Features.xlsx")
# select the working sheet, either by name or by index
Traps = book_2.sheet_by_name('Sheet1')
# select the working sheet, either by name or by index
Incidents = book_1.sheet_by_name('Sheet1')
# select the working sheet, either by name or by index
Features_Numbers = book_3.sheet_by_name('Sheet1')
#return the total number of rows for the traps sheet
Total_Number_of_Rows_Traps = Traps.nrows
# return the total number of rows for the incident sheet
Total_Number_of_Rows_Incidents = Incidents.nrows
print(Total_Number_of_Rows_Traps, Total_Number_of_Rows_Incidents)
# open a file to write down the non-matching incident numbers
write_no_matching = open('C:/Users/d774911/PycharmProjects/GlobalData/No_Matching.txt', 'w')

# For loop to iterate for all the row for the incident sheet
for Rows_Incidents in range(Total_Number_of_Rows_Incidents):
    # Store content for the comparable cell for incident sheet
    Incidents_Content_Affected_resources = Incidents.cell_value(Rows_Incidents, 47)
    # Store content for the comparable cell for incident sheet
    Incidents_Content_Product_Type = Incidents.cell_value(Rows_Incidents, 29)
    # Convert Excel date type into python type
    Incidents_Content_Date = xlrd.xldate_as_tuple(Incidents.cell_value(Rows_Incidents, 2), book_1.datemode)
    # extract the year, month and day
    Incidents_Content_Date = str(Incidents_Content_Date[0]) + ' ' + str(Incidents_Content_Date[1]) + ' ' + str(Incidents_Content_Date[2])
    # Parse the date string into a datetime so it can be compared
    Incidents_Content_Date = datetime.strptime(Incidents_Content_Date, '%Y %m %d')
    # extract the incident number
    Incident_Name = Incidents.cell_value(Rows_Incidents, 0)
    # Create a workbook for the selected incident
    Incident_Name_Book = xlsxwriter.Workbook(os.path.join('C:/Users/d774911/PycharmProjects/GlobalData/Test/', Incident_Name + '.xlsx'))
    # Create sheet name for the created workbook
    Incident_Name_Sheet = Incident_Name_Book.add_worksheet('Sheet1')
    # insert the first row that contains the features
    Incident_Name_Sheet.write_row(0, 0, Features_Numbers.row_values(0))
    Insert_Row_to_Incident_Sheet = 0

# For loop to iterate for all the row for the traps sheet
for Rows_Traps in range(Total_Number_of_Rows_Traps):

    # Store content for the comparable cell for traps sheet
    Traps_Content_Node_Name = Traps.cell_value(Rows_Traps, 3)
    # Store content for the comparable cell for traps sheet
    Traps_Content_Event_Type = Traps.cell_value(Rows_Traps, 6)
    # extract the raw date string temporarily
    Traps_Content_Date_temp = Traps.cell_value(Rows_Traps, 10)
    # Store content for the comparable cell for traps sheet
    Traps_Content_Date = datetime.strptime(Traps_Content_Date_temp[0:10], '%Y-%m-%d')

    # If the content matches partially or full
    if len(str(Traps_Content_Node_Name)) * len(str(Incidents_Content_Affected_resources)) != 0 and \
            str(Incidents_Content_Affected_resources).lower().find(str(Traps_Content_Node_Name).lower()) != -1 and \
            len(str(Traps_Content_Event_Type)) * len(str(Incidents_Content_Product_Type)) != 0 and \
            str(Incidents_Content_Product_Type).lower().find(str(Traps_Content_Event_Type).lower()) != -1 and \
            len(str(Traps_Content_Date)) * len(str(Incidents_Content_Date)) != 0 and \
            Traps_Content_Date <= Incidents_Content_Date:
        # counter for writing inside the new incident sheet
        Insert_Row_to_Incident_Sheet = Insert_Row_to_Incident_Sheet + 1
        # Write the Incident information
        Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 0, Incidents.row_values(Rows_Incidents))
        # Write the Traps information
        Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 107, Traps.row_values(Rows_Traps))

Incident_Name_Book.close()

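What you are doing is looking up and reading a small amount of data for every single cell. That is very inefficient.

Try reading all of the information in one go into a Python data structure that is as basic as possible (lists, dicts, etc.), do the comparisons and other operations on that data set in memory, and write all of the results in one go. If the data does not all fit in memory, try dividing the work into subtasks.

Reading the data set 10 times and extracting a tenth of the data on each pass would probably still be much faster than reading each cell individually.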
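
A minimal sketch of the read-once, write-once idea described above. It assumes an xlrd-style sheet object and uses only the nrows attribute and the row_values() method that the question's code already relies on. The helper names load_rows and write_matches are hypothetical, and CSV is a stand-in output format chosen here to stay within the standard library; it is not what the question's script writes.

```python
import csv


def load_rows(sheet):
    """Copy every row of an xlrd-style sheet into a list of plain Python lists.

    Assumes `sheet` exposes `nrows` and `row_values(i)`, as xlrd sheets do.
    """
    return [sheet.row_values(i) for i in range(sheet.nrows)]


def write_matches(path, header, matches):
    """Write all collected matches in one pass instead of one file per incident."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(matches)
```

With this shape, the matching loop only appends rows to an ordinary list (the "matrix" the question asks about), and the output file is opened once at the very end.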
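
I don't see how your code can work: the second loop uses variables that change on every row of the first loop, yet the second loop is not inside the first one.

That said, comparing files this way has O(N*M) complexity, which means the run time explodes quickly. In your case, you are trying to execute 54,000,000,000 (54 billion) loop iterations.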
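
The figure follows directly from the row counts given in the question. The quick check below also extrapolates the reported rate of 200 incident rows per hour, on the assumption (not stated in the question) that the rate stays constant for the whole run.

```python
incident_rows = 90_000   # rows in the incident sheet (from the question)
trap_rows = 600_000      # rows in the trap sheet (from the question)

# Every incident row is compared against every trap row: N * M comparisons.
comparisons = incident_rows * trap_rows

# Observed throughput from the question: about 200 incident rows per hour.
rows_per_hour = 200
hours = incident_rows / rows_per_hour
days = hours / 24

print(f"{comparisons:,} comparisons, roughly {hours:.0f} hours ({days:.2f} days)")
```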
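
When you run into this kind of problem, the solution usually has three steps:

1. Transform the data to make it easier to process
2. Put the data into an efficient structure (sorted list, dict)
3. Search the data using the efficient structure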
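
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the asker's data model: it assumes the incident's free-text cell lists node names separated by whitespace, commas or semicolons, and that a cleaned token equal to a trap's node name counts as a match. That is stricter than the substring search in the question. All function names are hypothetical.

```python
import re
from collections import defaultdict


def clean(value):
    """Step 1: reduce a cell to a comparable key (trimmed, lower-cased text)."""
    return str(value).strip().lower()


def tokens(value):
    """Step 1 (assumed layout): split a free-text cell on whitespace, commas, semicolons."""
    return {t for t in re.split(r"[\s,;]+", clean(value)) if t}


def build_index(trap_rows, node_col):
    """Step 2: map each cleaned node name to the trap rows that carry it."""
    index = defaultdict(list)
    for row in trap_rows:
        key = clean(row[node_col])
        if key:  # skip empty cells, as the question's length checks did
            index[key].append(row)
    return index


def find_matches(incident_rows, resource_col, index):
    """Step 3: one dict lookup per token instead of a scan over every trap row."""
    matches = []
    for incident in incident_rows:
        for token in sorted(tokens(incident[resource_col])):
            for trap in index.get(token, []):
                matches.append((incident, trap))
    return matches
```

In the question's layout, node_col would be 3 and resource_col would be 47. The event-type and date conditions can then be checked only on the few candidate pairs that the lookup returns, rather than on all 54 billion combinations.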
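
You have to find a way to get rid of find(). Try to clean all the junk out of the cells you want to compare so that you can use a plain equality comparison (==). Once you have that, you can put the rows into a dict to find matches. Or you could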