Parsing a blank-line-delimited file in Python is too slow


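For the past few days I have been struggling to parse this text/PGN file of chess games. You can treat it as a plain .txt file. Every 2 blocks represent one game. The file looks like this: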

[Event "15th Czerniak mem"]
[Site "Tel Aviv ISR"]
[Date "1999.04.05"]
[Round "6"]
[White "Karolyi, T"]
[Black "Lutz, C"]
[Result "0-1"]
[WhiteElo "2432"]
[BlackElo "2610"]
[ECO "A87"]

1.d4 f5 2.c4 Nf6 3.g3 g6 4.Bg2 Bg7 5.Nc3 O-O 6.Nf3 d6 7.d5 Qe8 8.Be3 Na6 
9.Qc1 e5 10.dxe6 Bxe6 11.O-O c6 12.b3 Ng4 13.Bf4 Qe7 14.Qd2 Rad8 15.Rad1 
Nc5 16.Nd4 Ne5 17.Bg5 Bf6 18.Bxf6 Qxf6 19.f4 Nf7 20.b4 Bxc4 21.Nxc6 bxc6 
22.bxc5 d5 23.Rfe1 d4 24.Na4 Rfe8 25.Rc1 Bd5 26.Bxd5 Rxd5 27.Nb2 Nh6 28.
Nd3 Ng4 29.Nf2 Ne3 30.Nd1 g5 31.Nxe3 Rxe3 32.Rf1 Qe6 33.Rf2 g4 34.Rb1 Rd7 
35.Qc2 Kg7 36.Rb3 Kf6 37.Rxe3 Qxe3 38.Qb2 Qc3 39.Qb8 Qxc5 40.Qc8 Qd5 41.
Rf1 Qe6 0-1

[Event "Danilo Batricevic Mem Balkan GP"]
[Site "Cetinje MNE"]
[Date "2012.10.24"]
[Round "6.6"]
[White "Nikcevic, N"]
[Black "Blagojevic, Dr"]
[Result "1/2-1/2"]
[WhiteElo "2432"]
[BlackElo "2526"]
[ECO "A14"]
[EventDate "2012.10.20"]
[WhiteTitle "GM"]
[BlackTitle "GM"]
[Opening "English opening"]
[Variation "Agincourt variation"]
[WhiteFideId "901776"]
[BlackFideId "900885"]

1.c4 e6 2.Nf3 d5 3.b3 Nf6 4.g3 Be7 5.Bg2 c5 6.O-O Nc6 7.e3 O-O 8.Bb2 b6 9.
Nc3 dxc4 10.bxc4 Bb7 11.Qe2 Rc8 12.Rac1 Qc7 13.d4 Na5 14.Nb5 Qb8 15.dxc5 
Bxc5 16.Bxf6 gxf6 17.Rfd1 Rfd8 18.Nh4 Bxg2 19.Nxg2 Nc6 20.Qg4+ Kh8 21.
Rxd8+ Nxd8 22.Qh4 Be7 23.Rd1 Nc6 24.Qh5 Kg8 25.Qg4+ Kh8 26.Qh5 Kg8 27.Qg4+
Kh8 1/2-1/2

[Event "FSGM October"]
[Site "Budapest HUN"]
[Date "2003.10.09"]
[Round "6"]
[White "Anka, E"]
[Black "Taylor, T"]
[Result "1-0"]
[WhiteElo "2432"]
[BlackElo "2385"]
[ECO "C41"]

1.e4 e5 2.Nf3 d6 3.d4 exd4 4.Nxd4 g6 5.Nc3 Bg7 6.Be3 Nf6 7.Qd2 O-O 8.O-O-O
Re8 9.f3 Nc6 10.g4 Ne5 11.Be2 a6 12.Bh6 Bh8 13.h4 b5 14.h5 b4 15.Nd5 c5 
16.Nf5 Nxd5 17.Qxd5 Be6 18.Qxd6 Qf6 19.g5 Nd3+ 20.Kd2 Qd8 21.hxg6 fxg6 22.
Bxd3 gxf5 23.Qxd8 Raxd8 24.Kc1 c4 25.exf5 cxd3 26.fxe6 Rxe6 27.Rd2 Rc6 28.
Kb1 Rcd6 29.Rxd3 Rxd3 30.cxd3 Rxd3 31.Rc1 Kf7 32.Rc8 Bd4 33.Kc2 1-0
.......
.......

Above you can see 3 games in total (= 6 blocks). Each game is split into two blocks (one starting with [Event "15th Czer…" and the other starting with 1.d4 f5 2.c4…).
There is a library available for this.

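file_in_memory = f.read()

Don't do that. Read the file line by line: read the headers until you hit a blank line, then read the moves until you hit the next blank line. Rinse and repeat.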
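As a rough sketch of that line-by-line approach (not code from the thread): the generator below yields one game at a time, so only the current game is held in memory. The name `iter_games` is made up for illustration. It assumes every header line has the form `[Key "value"]` with no escaped quotes inside the value, and that movetext lines never start with `[`.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def iter_games(handle: TextIO) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (headers, moves) for one game at a time from an open text file."""
    headers: Dict[str, str] = {}
    moves = []
    for raw in handle:  # iterating a file object reads it lazily, line by line
        line = raw.strip()
        if line.startswith("["):
            # A tag line that follows movetext means a new game has started
            if moves:
                yield headers, " ".join(moves)
                headers, moves = {}, []
            key, _, value = line[1:-1].partition(" ")
            headers[key] = value.strip('"')
        elif line:
            moves.append(line)  # blank lines are simply skipped
    if headers or moves:
        yield headers, " ".join(moves)  # last game in the file


# Small in-memory stand-in for the real file
sample = io.StringIO(
    '[Event "A"]\n[White "Karolyi, T"]\n\n1.d4 f5 2.c4 Nf6\n3.g3 g6 0-1\n\n'
    '[Event "B"]\n[Result "1-0"]\n\n1.e4 e5 1-0\n'
)
games = list(iter_games(sample))
```

With a real file you would pass the handle returned by `open(...)` and loop over `iter_games(f)` directly instead of building a list, so memory use stays flat regardless of file size.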

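row = collections.defaultdict()

This line is overridden by the line right after it. Also, if your class is nothing more than a dict, you can use a dict and a function.

How much RAM do you have? If you hold a 2 GB file in memory, plus a huge list with at least 2 GB of data, plus a 2 GB SQL query, that could explain why the process is so slow. You may want to process the games in batches of 1000 and insert each batch into the database before moving on to the next.

Finally, what does insert_into_database(list_of_games) look like? If you have a list of thousands of games and insert them into the database one at a time, the process will be very slow.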
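A minimal sketch of the batching idea, assuming a sqlite3 database. The real `insert_into_database` is not shown in the thread, so the `games` table, its four columns, and the helper names `batched` and `insert_in_batches` are all hypothetical. Each batch is written with a single `executemany` call inside one transaction rather than one `INSERT` and commit per game.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows batch by batch; returns the number of rows written."""
    # Hypothetical schema: adjust the table and column list to the real one
    sql = "INSERT INTO games (white, black, result, moves) VALUES (?, ?, ?, ?)"
    total = 0
    for batch in batched(rows, batch_size):
        with conn:  # one transaction per batch, committed on success
            conn.executemany(sql, batch)
        total += len(batch)
    return total


# Demo against an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (white TEXT, black TEXT, result TEXT, moves TEXT)")
rows = ((f"w{i}", f"b{i}", "1-0", "1.e4 e5") for i in range(2500))
inserted = insert_in_batches(conn, rows)
```

Because `rows` can be a generator, this combines naturally with reading the file one game at a time: at most one batch of parsed games exists in memory at once.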

with open('data/chessgames.pgn', 'r', encoding='latin-1') as f:
    file_in_memory = f.read() # ~ 2 gb file in memory
    items = file_in_memory.strip().split('\n\n') #split on empty new lines
    list_of_games = []

item = 0
while item + 1 < len(items):  # each game needs a header block plus a movetext block
    summary = items[item].split("\n")  # tag lines: ['[Event "FIDE Candidates"]', '[Site "London ENG"]', ...]
    moves = items[item + 1]  # movetext: '1.e4 e5 2.Nf3 Nc6 3.Bb5 Nf6 4.d3 ...'
    if not summary[0].startswith("["):  # malformed block: resynchronise on the next one
        item += 1
        continue
    item += 2
    parser(summary, moves, list_of_games)  # parse one game and append it to list_of_games

# Batch insert (not one row at a time): all parsed games go into a sqlite3 table, one row per game.
# Happy to share this function if anyone wants to see it.
insert_into_database(list_of_games)

import logging
import shlex

logger = logging.getLogger(__name__)


def parser(summary, moves, list_of_games):
    # Default every expected tag to None so that missing tags become NULL columns
    row = {'Event': None, 'Site': None, 'Date': None, 'Round': None,
           'White': None, 'Black': None, 'Result': None,
           'WhiteElo': None, 'BlackElo': None,
           'ECO': None, 'EventDate': None, 'Opening': None,
           'Variation': None}

    for item in summary:  # each item is one tag line, e.g. [Event "FIDE Candidates"]
        try:
            # Drop the surrounding brackets; shlex.split also removes the double quotes
            key, val = shlex.split(item.strip()[1:-1])
            row[key] = val
        except ValueError as e:
            # Raised for unbalanced quotes or a line that is not exactly one key/value pair
            logger.error("Error parsing tag line %r: %s", item, e)

    chess = ChessSchema(row, moves)  # create an object of ChessSchema
    list_of_games.append(chess.ret_tuple_of_game())

class ChessSchema:
    """
    Holds the 14 fields stored for one game.
    """
    def __init__(self, game_dict, moves):

        self.Event = game_dict['Event']
        self.Site = game_dict['Site']
        self.Date = game_dict['Date']
        self.White = game_dict['White']  # name
        self.Black = game_dict['Black']  # name
        self.Round = game_dict['Round']
        self.Result = game_dict['Result']
        self.ECO = game_dict['ECO']
        self.EventDate = game_dict['EventDate']
        self.Opening = game_dict['Opening']
        self.Variation = game_dict['Variation']
        self.moves = moves
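        # Elo ratings are optional: keep the raw value when it is missing or not numeric
        try:
            self.WhiteElo = int(game_dict['WhiteElo'])  # white rating
            self.BlackElo = int(game_dict['BlackElo'])  # black rating
        except (TypeError, ValueError):
            self.WhiteElo = game_dict['WhiteElo']
            self.BlackElo = game_dict['BlackElo']
        # Stored under a different name than the method so the method is not shadowed
        self.winner = self._compute_winner()

    def _compute_winner(self):
        # Returns None when the Result tag is missing
        if self.Result:
            if self.Result == "0-1":
                return "Black"
            elif self.Result == "1-0":
                return "White"
            else:
                return "Draw"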

    def ret_tuple_of_game(self):
        return (self.Event, self.Site, self.Date, self.White, self.Black,
                self.Result, self.WhiteElo, self.BlackElo, self.ECO, self.EventDate,
                self.Opening, self.Variation, self.moves, self.winner)
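For comparison, here is a sketch of the "dict and a function" suggestion applied to the class above. The names `FIELDS`, `WINNERS` and `game_to_tuple` are invented for this example. It returns the same 14 values in the same order as `ret_tuple_of_game`, with one difference that is an assumption on my part: a non-numeric Elo string is kept as-is instead of raising.

```python
# Column order matches ret_tuple_of_game (moves and winner are appended last)
FIELDS = ("Event", "Site", "Date", "White", "Black", "Result",
          "WhiteElo", "BlackElo", "ECO", "EventDate", "Opening", "Variation")
WINNERS = {"1-0": "White", "0-1": "Black"}


def game_to_tuple(headers, moves):
    """Build the 14-field row for one game from its header dict and movetext."""
    row = dict.fromkeys(FIELDS)  # every column defaults to None
    row.update((k, v) for k, v in headers.items() if k in row)
    for key in ("WhiteElo", "BlackElo"):
        if row[key] is not None and row[key].isdigit():
            row[key] = int(row[key])
    result = row["Result"]
    # Same rule as the class: anything other than 1-0 or 0-1 counts as a draw
    winner = WINNERS.get(result, "Draw") if result else None
    return tuple(row[name] for name in FIELDS) + (moves, winner)


example = game_to_tuple(
    {"Event": "FSGM October", "Result": "0-1", "WhiteElo": "2432", "Round": "6"},
    "1.d4 f5 0-1",
)
```

This avoids creating one object per game only to turn it straight back into a tuple, which matters when the loop runs millions of times.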

import shlex

all_games = []
row = {}

with open('data/chessgames.pgn', 'r', encoding='latin-1') as f:
    line = f.readline()
    while line:  # readline() returns an empty string at EOF
        # Parse the game headers: [Event "Fide inv"], [Site "Fullerham"], ...
        while line.startswith("["):
            key, val = shlex.split(line.strip()[1:-1])  # drop the surrounding brackets
            row[key] = val
            line = f.readline()

        # Skip the blank line(s) between the headers and the movetext
        while line.isspace():
            line = f.readline()

        # Parse the movetext: collect every line up to the next blank line
        if line and not line.startswith("["):
            tmp = ""
            while line and not line.isspace():
                tmp += line  # append before reading on, so the first line is kept
                line = f.readline()
            row['moves'] = tmp

        # The game is complete: store it and start a fresh dict for the next one
        if row:
            all_games.append(row)
            row = {}

        # Skip the blank line(s) before the next game
        while line.isspace():
            line = f.readline()