Extracting a data frame or table from text in R


This is a challenging problem, because the amount of variability in the data makes it fairly difficult. Let's start with this example:

example <- list(c("Birth Centenary of K.S.Stanislavsky.Series:Birth CentenariesCatalog codes:Mi:SU 2710, Sn:SU 2695, Yt:SU 2626, Sg:SU 2797, AFA:SU 2698Variants:Click to see variantsThemes:Actors | Anniversaries and Jubilees | Famous People | MenIssued on:1963-01-15Size:30 x 42 mmColors:Blackish grey greenFormat:StampEmission:CommemorativePerforation:line 12½Printing:RecessPaper:hard thick whiteWatermark:UnwmkFace value:4 Russian kopekPrint run:2,000,000Score:29%\tAccuracy: Very HighBuy Now:2 sale offers from US$ 0.16", 
"Birth Centenary of A.S.Serafimovich.Series:Birth CentenariesCatalog codes:Mi:SU 2711, Sn:SU 2696, Yt:SU 2627, Sg:SU 2800, AFA:SU 2699Themes:Anniversaries and Jubilees | Authors | Famous People | Literary People (Poets and Writers) | Literature | MenIssued on:1963-01-19Size:28 x 40 mmFormat:StampEmission:CommemorativePerforation:frame 11½Printing:PhotogravurePaper:ordinaryFace value:4 Russian kopekPrint run:2,500,000Score:26%\tAccuracy: Very HighBuy Now:3 sale offers from US$ 0.11", 
"Children in nurserySeries:Soviet ChildrenCatalog codes:Mi:SU 2712, Sn:SU 2697, Yt:SU 2629, Sg:SU 2806, AFA:SU 2700Themes:ChildrenIssued on:1963-01-31Size:42 x 28 mmColors:MulticolorFormat:StampEmission:CommemorativePerforation:comb 11½Printing:PhotogravureFace value:4 Russian kopekPrint run:3,000,000Score:27%\tAccuracy: Very HighDescription:Designer: A. Shmidshtein. Paper: ordinary.Buy Now:2 sale offers from US$ 0.08", 
"Children with nurseSeries:Soviet ChildrenCatalog codes:Mi:SU 2713, Sn:SU 2698, Yt:SU 2628, Sg:SU 2807, AFA:SU 2701Themes:ChildrenIssued on:1963-01-31Size:42 x 28 mmFormat:StampEmission:CommemorativePerforation:comb 11½Printing:PhotogravureFace value:4 Russian kopekPrint run:3,000,000Score:25%\tAccuracy: Very HighDescription:Designer: A. Shmidshtein. Paper: ordinary.Buy Now:3 sale offers from US$ 0.08", 
"Pioneer campSeries:Soviet ChildrenCatalog codes:Mi:SU 2714, Sn:SU 2699, Yt:SU 2630, Sg:SU 2808, AFA:SU 2702Themes:ChildrenIssued on:1963-01-31Size:42 x 28 mmFormat:StampEmission:CommemorativePerforation:comb 11½Printing:PhotogravureFace value:4 Russian kopekPrint run:3,000,000Score:22%\tAccuracy: Very HighDescription:Designer: A. Shmidshtein. Paper: ordinary.Buy Now:4 sale offers from US$ 0.11", 
"Soviet Children.Series:Soviet ChildrenCatalog codes:Mi:SU 2715, Sn:SU 2700, Yt:SU 2631, Sg:SU 2809, AFA:SU 2703Themes:ChildrenIssued on:1963-01-31Size:40 x 28 mmFormat:StampEmission:CommemorativePerforation:comb 11½Printing:PhotogravurePaper:ordinaryFace value:4 Russian kopekPrint run:3,000,000Score:25%\tAccuracy: Very HighBuy Now:2 sale offers from US$ 0.08", 
"Dymkov's and Zagorsk toysSeries:Decorative ArtsCatalog codes:Mi:SU 2716, Sn:SU 2701, Yt:SU 2632, Sg:SU 2810, AFA:SU 2704Themes:Art | ToysIssued on:1963-01-31Size:30 x 42 mmFormat:StampEmission:CommemorativePerforation:comb 12 x 12½Printing:Offset lithographyFace value:4 Russian kopekPrint run:3,000,000Score:22%\tAccuracy: Very HighDescription:Designer: E. Komarov. Paper: ordinary.Buy Now:2 sale offers from US$ 0.11", 
"Oposhnya potterySeries:Decorative ArtsCatalog codes:Mi:SU 2717, Sn:SU 2702, Yt:SU 2633, Sg:SU 2811, AFA:SU 2705Themes:ArtIssued on:1963-01-31Size:30 x 42 mmFormat:StampEmission:CommemorativePerforation:comb 12 x 12½Printing:Offset lithographyFace value:6 Russian kopekPrint run:3,000,000Score:24%\tAccuracy: Very HighDescription:Designer: E. Komarov. Paper: ordinary.Buy Now:3 sale offers from US$ 0.08", 
"Embossing booksSeries:Decorative ArtsCatalog codes:Mi:SU 2718, Sn:SU 2703, Yt:SU 2634, Sg:SU 2812, AFA:SU 2706Themes:Art | BooksIssued on:1963-01-31Size:30 x 42 mmFormat:StampEmission:CommemorativePerforation:comb 12 x 12½Printing:Offset lithographyFace value:10 Russian kopekPrint run:3,000,000Score:27%\tAccuracy: Very HighDescription:Designer: E. Komarov. Paper: ordinary.Buy Now:2 sale offers from US$ 0.44", 
"Decorative Arts.Series:Decorative ArtsCatalog codes:Mi:SU 2719, Sn:SU 2704, Yt:SU 2635, Sg:SU 2813, AFA:SU 2707Themes:ArtIssued on:1963-01-31Size:30 x 42 mmFormat:StampEmission:CommemorativePerforation:comb 12 x 12½Printing:Offset lithographyPaper:ordinaryFace value:12 Russian kopekPrint run:3,000,000Score:26%\tAccuracy: Very HighBuy Now:3 sale offers from US$ 0.16"
), NULL, NULL, NULL)

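Your data appears to come from web scraping. I suggest you check out rvest::html_table() to try to get a better-formatted result. Otherwise it is going to be very messy (i.e. regular expressions).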
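As a rough illustration of the rvest route, here is a minimal sketch. It assumes the rvest package is installed and that the original page exposes the stamp details as an HTML table; the inline HTML below is a made-up stand-in for that page, not its real markup.

```r
library(rvest)

# Hypothetical HTML fragment standing in for the scraped page
page <- minimal_html('
  <table>
    <tr><th>Field</th><th>Value</th></tr>
    <tr><td>Series</td><td>Birth Centenaries</td></tr>
    <tr><td>Issued on</td><td>1963-01-15</td></tr>
  </table>')

# html_table() turns the <table> node into a data frame (a tibble),
# using the <th> row as column names
tbl <- html_table(html_element(page, "table"))
print(tbl)
```

If the real page is laid out this way, each field arrives in its own cell and no regular expressions are needed.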

Very, very messy sample code:

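# Pull a few fields out of one record with regular expressions.
# The lazy quantifier (.*?) stops at the first label that follows.
untangle <- function(element) {
  Title <- sub("^(.*?)Series:.*$", "\\1", element, perl = TRUE)
  Series <- sub("^.*?Series:(.*?)Catalog codes:.*$", "\\1", element, perl = TRUE)
  # "Variants:" is optional, so stop at whichever of the two labels comes first
  CatalogCodes <- sub("^.*?Catalog codes:(.*?)(Variants:|Themes:).*$", "\\1", element, perl = TRUE)
  data.frame(Title, Series, CatalogCodes, stringsAsFactors = FALSE)
}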

for (e in unlist(example)) {
  print(untangle(e))
}
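
The same idea can be generalized so that every labeled field becomes a column. The following is only a sketch in base R: parse_record() is a hypothetical helper, the label list is an assumption based on the sample records, and the two records are shortened copies of the sample data so that the block runs on its own.

```r
# Shortened copies of two sample records (most fields omitted for brevity)
records <- c(
  "Birth Centenary of K.S.Stanislavsky.Series:Birth CentenariesCatalog codes:Mi:SU 2710, Sn:SU 2695Variants:Click to see variantsThemes:Actors | MenIssued on:1963-01-15",
  "Children in nurserySeries:Soviet ChildrenCatalog codes:Mi:SU 2712, Sn:SU 2697Themes:ChildrenIssued on:1963-01-31"
)

# Assumed label list; extend it to cover the remaining fields
labels <- c("Series", "Catalog codes", "Variants", "Themes", "Issued on")

# Hypothetical helper: split one record into a named character vector
parse_record <- function(element, labels) {
  # Start position of each "Label:" in the string (-1 when the label is absent)
  starts <- vapply(labels, function(l) {
    as.integer(regexpr(paste0(l, ":"), element, fixed = TRUE))
  }, integer(1))
  found <- sort(starts[starts > 0])
  stopifnot(length(found) > 0)

  # Each value runs from just after its label to just before the next label
  value_start <- found + nchar(names(found)) + 1L
  value_end <- c(found[-1] - 1L, nchar(element))
  values <- trimws(substring(element, value_start, value_end))

  # Fixed set of columns, so records with missing fields still line up
  row <- setNames(rep(NA_character_, length(labels)), labels)
  row[names(found)] <- values

  # Everything before the first label is treated as the title
  c(Title = trimws(substr(element, 1, found[[1]] - 1L)), row)
}

stamps <- as.data.frame(
  do.call(rbind, lapply(records, parse_record, labels = labels)),
  stringsAsFactors = FALSE
)
print(stamps)
```

Missing fields (such as Variants in the second record) come out as NA. One caveat: a label that also appears inside free text, for example "Paper:" inside a Description field, will be split at that point, so results for such records need checking.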