R: How to read PGN data into a data frame
I have a single large .pgn (Portable Game Notation) file containing many chess games. The games in the file look like this:
[Event "FIDE World Cup 2017"]
[Site "Tbilisi GEO"]
[Date "2017.09.05"]
[Round "1.1"]
[White "Carlsen, Magnus"]
[Black "Balogun, Oluwafemi"]
[Result "1-0"]
[WhiteTitle "GM"]
[BlackTitle "FM"]
[WhiteElo "2822"]
[BlackElo "2255"]
[ECO "B00"]
[Opening "King's pawn opening"]
[WhiteFideId "1503014"]
[BlackFideId "8501246"]
[EventDate "2017.09.03"]
1. e4 d6 2. d4 g6 3. Bc4 Nf6 4. Qe2 Nc6 5. Nf3 Bg7 6. O-O Bg4 7. c3 O-O
8. h3 Bxf3 9. Qxf3 e5 10. Rd1 Qe8 11. d5 Ne7 12. Qe2 Nh5 13. Bb5 Qc8
14. Na3 a6 15. Ba4 f5 16. Bc2 f4 17. Qg4 Qxg4 18. hxg4 Nf6 19. g5 Nd7
20. Nc4 b6 21. b4 h6 22. gxh6 Bxh6 23. g4 Nf6 24. f3 Bg5 25. Kg2 Kg7
26. a4 Bh4 27. Bd2 g5 28. Rh1 Ng6 29. Kf1 Rh8 30. Ke2 Bg3 31. a5 b5 32.
Na3 Ne7 33. c4 c6 34. dxc6 Nxc6 35. Bc3 Rxh1 36. Rxh1 bxc4 37. Nxc4 Rb8
38. Nxd6 Kg6 39. Nf5 1-0
[Event "FIDE World Cup 2017"]
etc...
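I would like to create a data frame from this data in which the column headers are the tag names on the left of each string and the values are the quoted strings, plus a separate column holding the PGN move text.
I made a rough attempt at this and came up with the following: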
pgn <- read.table("~/Desktop/GitHub/Chess/test.pgn", quote="",
stringsAsFactors=FALSE)
# get column names
column_names <- sub("\\[(\\w+).+", "\\1", pgn[1:17,1])
column_names[17] <- "PGN"
#create DF
pgn.df <- data.frame(matrix(sub("\\[\\w+ \\\"(.+)\\\"\\]", "\\1",
pgn[,1]),byrow=TRUE, ncol=17))
names(pgn.df) <- column_names
I would still suggest removing the unwanted line breaks in a preparation step with the (updated) replace regex, like this:
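/(?:[^\[\]\n\S])\s*\n/ /g
You can test it online (using the PGN as the input text). However, like you, I had some trouble escaping the special characters.
So I decided to use Perl instead: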
use strict;
use File::Slurp;
my $text = read_file($ARGV[0]);
$text =~ s/(?:[^\[\]\n\S])\s*\n/ /g;
write_file($ARGV[0], $text);
This can be called from R like this:
system("perl ~/Desktop/regex.pl ~/Desktop/test.pgn")
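For readers who would rather stay inside R, here is a minimal base-R sketch of the same idea. The first part writes the Perl substitution as an R string (every backslash doubled, `perl = TRUE`). Note that the pattern only joins lines that end in whitespace. The second part is a different technique from the script above: it groups lines by game and collects the tag pairs into a data frame. The helper name `pgn_to_df` and the second sample game are invented for illustration, and the sketch assumes every game starts with an `[Event ...]` tag and that tag values contain no embedded double quotes.

```r
# 1) The Perl substitution as an R string: backslashes are doubled and
#    perl = TRUE enables PCRE syntax (\S inside a character class).
join_pattern <- "(?:[^\\[\\]\\n\\S])\\s*\\n"
txt <- "[Result \"1-0\"]\n1. e4 d6 \n2. d4 g6 1-0\n"
joined <- gsub(join_pattern, " ", txt, perl = TRUE)

# 2) Hypothetical helper (not from any package): PGN lines -> data frame.
#    Assumes each game starts with an [Event "..."] tag line.
pgn_to_df <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]                  # drop blank lines
  game_id <- cumsum(grepl("^\\[Event ", lines))  # new game at every [Event ...]
  games <- lapply(split(lines, game_id), function(g) {
    is_tag <- grepl("^\\[\\w+ \".*\"\\]$", g)
    tags <- g[is_tag]
    keys <- sub("^\\[(\\w+) .*$", "\\1", tags)
    vals <- sub("^\\[\\w+ \"(.*)\"\\]$", "\\1", tags)
    # Everything that is not a tag line is move text; join the wrapped lines.
    c(setNames(as.list(vals), keys), PGN = paste(g[!is_tag], collapse = " "))
  })
  # Games may carry different tag sets, so align on the union of tag names.
  cols <- unique(unlist(lapply(games, names)))
  rows <- lapply(games, function(g) {
    g[setdiff(cols, names(g))] <- NA_character_
    as.data.frame(g[cols], stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

# Shortened sample input; the second game is a made-up placeholder.
sample_pgn <- c(
  '[Event "FIDE World Cup 2017"]',
  '[White "Carlsen, Magnus"]',
  '[Result "1-0"]',
  '',
  '1. e4 d6 2. d4 g6 3. Bc4 Nf6',
  '4. Qe2 Nc6 1-0',
  '',
  '[Event "Example Event"]',
  '[White "Player A"]',
  '[Result "0-1"]',
  '',
  '1. d4 d5 0-1'
)
pgn.df <- pgn_to_df(sample_pgn)
str(pgn.df)
```

For a real file, something like `pgn_to_df(readLines("~/Desktop/test.pgn"))` should work under the same assumptions. Comments and variations inside the move text are kept verbatim in the `PGN` column.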
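I have not tested this on Windows or Linux, but the C library the package is based on claims to be quite portable. You will need an R setup that supports compiling from source (i.e., Rtools if you are on Windows).
Installation: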
devtools::install_github("hrbrmstr/pigeon")
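Usage (tidyverse is not really needed for the package, but it prints data frames more cleanly than the built-in base R print function):
library(pigeon)
library(tidyverse)
Here is a small test with a built-in data set, which is probably the same data set you are working with: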
fide <- read_pgn(system.file("extdata", "r7.pgn", package="pigeon"))
fide
## # A tibble: 2 x 12
## Event Site Date Round White Black Result WhiteElo BlackElo ECO
## * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 World Cup 2017 Tbilisi 2017.09.23 44.1 Aronian Levon (ARM) Ding Liren (CHN) 1/2-1/2 2799 2777 A18
## 2 World Cup 2017 Tbilisi 2017.09.24 45.1 Ding Liren (CHN) Aronian Levon (ARM) 1/2-1/2 2777 2799 E06
## # ... with 2 more variables: LiveChessVersion <chr>, Moves <list>
glimpse(fide)
## Observations: 2
## Variables: 12
## $ Event <chr> "World Cup 2017", "World Cup 2017"
## $ Site <chr> "Tbilisi", "Tbilisi"
## $ Date <chr> "2017.09.23", "2017.09.24"
## $ Round <chr> "44.1", "45.1"
## $ White <chr> "Aronian Levon (ARM)", "Ding Liren (CHN)"
## $ Black <chr> "Ding Liren (CHN)", "Aronian Levon (ARM)"
## $ Result <chr> "1/2-1/2", "1/2-1/2"
## $ WhiteElo <chr> "2799", "2777"
## $ BlackElo <chr> "2777", "2799"
## $ ECO <chr> "A18", "E06"
## $ LiveChessVersion <chr> "1.4.8", "1.4.8"
## $ Moves <list> [c("c4", "Nf6", "Nc3", "e6", "e4", "d5", "cxd5", "exd5", "e5", "Ne4", "Nf3", "Bf5", "Be2"...
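Because `Moves` is a list column (one character vector of half-moves per game) and the tag values arrive as character, a little post-processing is usually needed. The following is only a sketch in base R on a hand-built stand-in called `toy`, so it runs without the pigeon package. The first game's moves are the first five shown in the output above, the second game's moves are made-up placeholders, and the column names `n_plies` and `first_move` are invented here.

```r
# Toy stand-in for the tibble above (shortened; second game's moves are placeholders).
toy <- data.frame(
  White = c("Aronian Levon (ARM)", "Ding Liren (CHN)"),
  WhiteElo = c("2799", "2777"),
  stringsAsFactors = FALSE
)
toy$Moves <- list(
  c("c4", "Nf6", "Nc3", "e6", "e4"),
  c("d4", "Nf6", "c4")
)

toy$WhiteElo <- as.integer(toy$WhiteElo)  # tag values are read as character
toy$n_plies <- lengths(toy$Moves)         # number of half-moves per game
toy$first_move <- vapply(toy$Moves, function(m) m[1], character(1))

# Rebuild a numbered move-text string from the half-moves of the first game.
m <- toy$Moves[[1]]
idx <- seq_along(m)
nums <- ifelse(idx %% 2 == 1, paste0((idx + 1) %/% 2, ". "), "")
movetext <- paste0(nums, m, collapse = " ")
print(movetext)
```

The same calls should apply unchanged to the `fide` tibble, assuming its `Moves` column has the shape that `glimpse()` reports.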
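The regex seems to work on the site, but it fails when I put it in my code. I get this error message: pgn
Wow, what a useful library! Thank you very much. Can I use your library with my own data, or do I have to use the data that already ships with pigeon?
For some data sets, such as KingBase, I get errors like the following: Error: lexical error: invalid character inside string: "Bxh6"}]},{"Event":"Mnster open","Site":"Mnste (right here) ----^. This happens even after I run the Perl code from @wp78de. What should I do?
@Parseltongue can you add a link to one or two of them in a GH issue? I may be able to work out a fix for malformed files.
That is incredible. I have been racking my brain over this for two hours. I will create a GitHub issue now.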
tf <- tempfile(fileext = ".zip")
td <- tempdir()
download.file("https://www.pgnmentor.com/players/Adams.zip", tf)
fil <- unzip(tf, exdir = td)
adams <- read_pgn(fil)
adams
## # A tibble: 2,982 x 11
## Event Site Date Round White Black Result WhiteElo BlackElo ECO
## * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Lloyds Bank op London 1984.??.?? 1 Adams, Michael Sedgwick, David 1-0 C05
## 2 Lloyds Bank op London 1984.??.?? 3 Adams, Michael Dickenson, Neil F 1-0 2230 C07
## 3 Lloyds Bank op London 1984.??.?? 4 Hebden, Mark Adams, Michael 1-0 2480 B10
## 4 Lloyds Bank op London 1984.??.?? 5 Pasman, Michael Adams, Michael 0-1 2310 D42
## 5 Lloyds Bank op London 1984.??.?? 6 Adams, Michael Levitt, Jonathan 1/2-1/2 2370 B99
## 6 Lloyds Bank op London 1984.??.?? 9 Adams, Michael Saeed, Saeed Ahmed 1-0 2430 B56
## 7 BCF-ch Edinburgh 1985.??.?? 1 Adams, Michael Singh, Sukh Dave 1/2-1/2 2360 2080 B70
## 8 BCF-ch Edinburgh 1985.??.?? 2 Abayasekera, Roger Adams, Michael 1-0 2200 2360 B13
## 9 BCF-ch Edinburgh 1985.??.?? 3 Adams, Michael Jackson, Sheila 1/2-1/2 2360 2225 C85
## 10 BCF-ch Edinburgh 1985.??.?? 4 Muir, Andrew J Adams, Michael 1/2-1/2 2285 2360 E45
## # ... with 2,972 more rows, and 1 more variables: Moves <list>
glimpse(adams)
## Observations: 2,982
## Variables: 11
## $ Event <chr> "Lloyds Bank op", "Lloyds Bank op", "Lloyds Bank op", "Lloyds Bank op", "Lloyds Bank op", "Lloyds ...
## $ Site <chr> "London", "London", "London", "London", "London", "London", "Edinburgh", "Edinburgh", "Edinburgh",...
## $ Date <chr> "1984.??.??", "1984.??.??", "1984.??.??", "1984.??.??", "1984.??.??", "1984.??.??", "1985.??.??", ...
## $ Round <chr> "1", "3", "4", "5", "6", "9", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "?", "1", "...
## $ White <chr> "Adams, Michael", "Adams, Michael", "Hebden, Mark", "Pasman, Michael", "Adams, Michael", "Adams, M...
## $ Black <chr> "Sedgwick, David", "Dickenson, Neil F", "Adams, Michael", "Adams, Michael", "Levitt, Jonathan", "S...
## $ Result <chr> "1-0", "1-0", "1-0", "0-1", "1/2-1/2", "1-0", "1/2-1/2", "1-0", "1/2-1/2", "1/2-1/2", "1-0", "1/2-...
## $ WhiteElo <chr> "", "", "2480", "2310", "", "", "2360", "2200", "2360", "2285", "2360", "2250", "2360", "2225", "2...
## $ BlackElo <chr> "", "2230", "", "", "2370", "2430", "2080", "2360", "2225", "2360", "2245", "2360", "2260", "2360"...
## $ ECO <chr> "C05", "C07", "B10", "D42", "B99", "B56", "B70", "B13", "C85", "E45", "C84", "B10", "C85", "A22", ...
## $ Moves <list> [c("e4", "e6", "d4", "d5", "Nd2", "Nf6", "e5", "Nfd7", "f4", "c5", "c3", "Nc6", "Ndf3", "cxd4", "...
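The output above shows two things worth cleaning before analysis: missing ratings come through as empty strings, and older games have partial dates such as `1984.??.??`. Here is a hedged base-R sketch on a few hand-typed rows in the same shape. The data frame `adams_like` and the new column names are invented for illustration, and the real tibble may need more care.

```r
# Hand-typed rows mimicking the shape of the output above.
adams_like <- data.frame(
  Date = c("1984.??.??", "1985.??.??", "2017.09.05"),
  WhiteElo = c("", "2360", "2822"),
  Result = c("1-0", "1/2-1/2", "1-0"),
  stringsAsFactors = FALSE
)

# Empty strings become NA before converting the ratings to integer.
elo <- adams_like$WhiteElo
elo[elo == ""] <- NA
adams_like$WhiteElo <- as.integer(elo)

# The year is always present; the full date only parses when month and day are known.
adams_like$Year <- as.integer(substr(adams_like$Date, 1, 4))
adams_like$FullDate <- as.Date(adams_like$Date, format = "%Y.%m.%d")

# Numeric score from White's point of view.
score_map <- c("1-0" = 1, "1/2-1/2" = 0.5, "0-1" = 0)
adams_like$WhiteScore <- unname(score_map[adams_like$Result])

str(adams_like)
```

Results outside the three standard values (for example `*` for an unfinished game) would map to `NA` with this lookup.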