Regex: parsing a file with R
I am trying to parse this unstructured file with R: the file contains (among other junk) the lines above. I tried fread and read.table, both of which fail at some point, and I could not find a solution. I need a way to parse these lines and split each one into 4 variables. With a regex I would do it like this:
^\s+(\S+)\s+(\d+)\s+(\S{3,4})\s+(.*)$
Any suggestions on how I should approach this in R?
Also, the first group has to be \S rather than \d, because some values look like .0..002212. The third group is occasionally 10.0, which is why I specified 3 to 4 non-whitespace characters. Everything after the score (e.g. 8.3) is the movie title.

You can use read.fwf instead of parsing the lines by hand: the rows are well structured, and every column except the last has a fixed width, so you can give the last column a width large enough to cover it:
read.fwf("all.txt", widths = c(10, -2, 6, -3, 3, -2, 1000))
# V1 V2 V3 V4
# 1 1322 175300 8.3 The Sting (1973)
# 2 1123 426445 8.3 2001: A Space Odyssey (1968)
# 3 1222 94315 8.3 Ladri di biciclette (1948)
# 4 1222 149759 8.3 Singin' in the Rain (1952)
# 5 1322 622326 8.3 Toy Story (1995)
# 6 1222 599957 8.3 Snatch (2000)
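If you would rather stay with the regex from the question, it can be applied in base R with regexec and regmatches. The following is only a sketch: the sample lines are made up to resemble the rows described in the question, and the column names (distribution, votes, rank, title) are assumptions chosen for illustration.

```r
# Hypothetical sample lines, laid out like the rows described in the question
lines <- c(
  "      0000000125  1686502   9.2  The Shawshank Redemption (1994)",
  "      .0..002212       15  10.0  Some Title (2001)",
  "this is a junk line that should be skipped"
)

# The pattern from the question; backslashes are doubled inside an R string
pattern <- "^\\s+(\\S+)\\s+(\\d+)\\s+(\\S{3,4})\\s+(.*)$"

# regexec returns the full match plus one entry per capture group;
# lines that do not match yield character(0)
m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
m <- m[lengths(m) == 5]

# Bind the matches into a matrix, drop the full-match column, name the rest
parsed <- as.data.frame(do.call(rbind, m)[, -1, drop = FALSE],
                        stringsAsFactors = FALSE)
names(parsed) <- c("distribution", "votes", "rank", "title")

# Captured groups are strings, so convert the numeric columns explicitly
parsed$votes <- as.integer(parsed$votes)
parsed$rank  <- as.numeric(parsed$rank)

print(parsed)
```

Lines that do not match the pattern are silently dropped here, which is convenient for skipping the surrounding prose but also hides rows that are malformed in unexpected ways.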
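First take a look at your data: the first 27 lines are prose, and the first data set runs from line 28 to line 278.
The read_table function from the readr package is smarter than read.table and copes well with missing data: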
df <- readr::read_table('ratings.list.gz', skip = 27, n_max = 250)
df
## # A tibble: 250 x 5
## New Distribution Votes Rank Title
## <chr> <chr> <int> <dbl> <chr>
## 1 0000000125 1686502 9.2 The Shawshank Redemption (1994)
## 2 0000000125 1153698 9.2 The Godfather (1972)
## 3 0000000124 789387 9.0 The Godfather: Part II (1974)
## 4 0000000124 1671708 8.9 The Dark Knight (2008)
## 5 0000000133 863309 8.9 Schindler's List (1993)
## 6 0000000133 446671 8.9 12 Angry Men (1957)
## 7 0000000123 1322033 8.9 Pulp Fiction (1994)
## 8 0000000124 1213467 8.9 The Lord of the Rings: The Return of the King (2003)
## 9 0000000123 502576 8.9 Il buono, il brutto, il cattivo (1966)
## 10 0000000133 1344643 8.8 Fight Club (1999)
## # ... with 240 more rows
Judging from the description in the file, it is probably updated fairly often, so you are better off handling this in a general way:
library(readr)
library(purrr)
library(dplyr)
fil <- "ratings.list"
lines <- read_lines(fil) # could use the gz file instead
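# each data set begins with a header line that starts with "New"
starts <- which(grepl("^New", lines))
# a data set ends at the first later line that begins with a letter
ends <- map_int(starts, ~which(grepl("^[[:alpha:]]", lines[(.+1):length(lines)]))[1]+.)
# parse each block as whitespace-separated text and drop the first (New) column
ratings <- map(seq_along(starts), ~read_table(paste0(lines[starts[.]:(ends[.]-1)], collapse="\n"))[,-1])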
df_names <- c("top_250_movies", tolower(make.names(lines[starts[-1]-2])))
df_names <- gsub("\\.+", "_", df_names)
df_names <- gsub("_$", "", df_names)
df_names
## [1] "top_250_movies"
## [2] "bottom_10_movies_1500_votes"
## [3] "movie_ratings_report"
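To see how the name clean-up arrives at these values, the steps can be run on a single heading in base R. The heading text below is an assumption for illustration; the real section titles come from the lines two above each header row in the file.

```r
# Hypothetical section heading, standing in for lines[starts[-1] - 2]
heading <- "BOTTOM 10 MOVIES (1500+ VOTES)"

# make.names replaces every character that is invalid in an R name with "."
step1 <- tolower(make.names(heading))

# collapse each run of dots into a single underscore
step2 <- gsub("\\.+", "_", step1)

# drop the trailing underscore left behind by the closing parenthesis
clean <- gsub("_$", "", step2)

print(c(step1, step2, clean))
```

Collapsing runs of dots before trimming matters: the space and parenthesis in " (" each become a dot, and without the first gsub the name would contain doubled separators.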
names(ratings) <- df_names
glimpse(ratings[[df_names[1]]])
## Observations: 250
## Variables: 4
## $ Distribution <chr> "0000000125", "0000000125", "0000000124", "000000...
## $ Votes <int> 1686502, 1153698, 789387, 1671708, 863309, 446671...
## $ Rank <dbl> 9.2, 9.2, 9.0, 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8,...
## $ Title <chr> "The Shawshank Redemption (1994)", "The Godfather...
glimpse(ratings[[df_names[2]]])
## Observations: 10
## Variables: 4
## $ Distribution <dbl> 5e+09, 5e+09, 6e+09, 6e+09, 6e+09, 6e+09, 6e+09, ...
## $ Votes <int> 7541, 7735, 14147, 13055, 15329, 30542, 12641, 25...
## $ Rank <dbl> 1.9, 1.8, 1.8, 1.8, 1.7, 1.7, 1.6, 1.6, 1.6, 1.5
## $ Title <chr> "Zombie Nation (2004)", "Titanic - La leggenda co...
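Note that Distribution was guessed as a number for this table, while the other tables keep it as text. Type guessing works per table, so a column that happens to contain only digits is read as numeric and any leading zeros are lost. The sketch below reproduces the effect in base R on made-up data and shows one way to pin the column types; with readr, the col_types argument of read_table serves the same purpose.

```r
# Made-up, whitespace-separated data with a digits-only Distribution column
txt <- "Distribution Votes Rank
0000000125 1686502 9.2
5000000000 7541 1.9"

# With guessing, Distribution becomes numeric and the leading zeros vanish
guessed <- read.table(text = txt, header = TRUE)

# With explicit classes, Distribution stays a character code
typed <- read.table(text = txt, header = TRUE,
                    colClasses = c("character", "integer", "numeric"))

str(guessed)
str(typed)
```

Declaring the types once and reusing them for every table keeps the list of data frames consistent, which makes it easier to combine them later.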
glimpse(ratings[[df_names[3]]])
## Observations: 665,729
## Variables: 4
## $ Distribution <chr> "41...1..2.", "1000000102", "2...0.01.4", "0.0..0...
## $ Votes <int> 7, 61, 12, 13, 10, 51, 15, 15, 9, 8, 5, 20, 23, 7...
## $ Rank <dbl> 4.1, 6.3, 6.8, 7.6, 6.9, 6.6, 5.8, 6.3, 7.6, 6.8,...
## $ Title <chr> "\"!Next?\" (1994)", "\"#1 Single\" (2006)", "\"#...
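The section-detection step (starts and ends) can be tried on a small input without readr or purrr. This is a base R sketch of the same idea, assuming each data set begins with a header line starting with "New" and ends at the next line that begins with a letter; the toy lines are invented for illustration.

```r
# Toy input: prose, two headed data sets, and a closing line
lines <- c(
  "Some prose about the file",
  "",
  "TOP MOVIES",
  "",
  "New  Distribution  Votes  Rank  Title",
  "      0000000125  1686502   9.2  The Shawshank Redemption (1994)",
  "      0000000125  1153698   9.2  The Godfather (1972)",
  "",
  "BOTTOM MOVIES",
  "",
  "New  Distribution  Votes  Rank  Title",
  "      5000000000     7541   1.9  Zombie Nation (2004)",
  "",
  "End of report"
)

# Header rows mark where each data set starts
starts <- grep("^New", lines)

# For each start, find the first later line that begins with a letter
ends <- vapply(starts, function(s) {
  s + which(grepl("^[[:alpha:]]", lines[(s + 1):length(lines)]))[1]
}, integer(1))

# Keep the non-blank rows between each header and its end marker
blocks <- Map(function(s, e) {
  rows <- lines[(s + 1):(e - 1)]
  rows[nzchar(trimws(rows))]
}, starts, ends)

print(starts)
print(ends)
print(lengths(blocks))
```

The approach relies on data rows being indented, so that only headings and prose begin with a letter. If the file ever ends directly after a data set, which(...)[1] returns NA and that case would need separate handling.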