Parsing a file with R using regex


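I'm trying to parse this unstructured file with R:

Among other junk, the file contains lines like the ones above. I tried fread and read.table, but both fail at some point and I couldn't find a fix. I need a way to parse these lines and split each one into four variables. As a regex, I would write it like this:

^\s+(\S+)\s+(\d+)\s+(\S{3,4})\s+(.*)$

Any suggestions on how I should approach this in R?

Also, the first group has to be \S rather than \d, because some values look like .0..002212. The third group is occasionally 10.0, which is why I specified 3-4 non-whitespace characters. Everything after the score (for example 8.3) is the movie title.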
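For reference, the pattern from the question can be applied in base R with regexec and regmatches. The following is only a minimal sketch: the sample lines and the column names (distribution, votes, rank, title) are hypothetical stand-ins, not taken from the real file.

```r
# Hypothetical sample lines in the layout described in the question,
# plus one junk line that should be ignored.
sample_lines <- c(
  "      0000000125  1686502   9.2  The Shawshank Redemption (1994)",
  "      .0..002212       15  10.0  Some Title (2001)",
  "MOVIE RATINGS REPORT"
)

# The regex from the question; backslashes are doubled inside an R string.
pattern <- "^\\s+(\\S+)\\s+(\\d+)\\s+(\\S{3,4})\\s+(.*)$"

# regexec() returns the positions of the full match and of each capture group;
# regmatches() extracts them. Lines that do not match yield character(0).
m <- regmatches(sample_lines, regexec(pattern, sample_lines, perl = TRUE))
m <- m[lengths(m) == 5]  # keep full match + 4 groups, drop junk lines

parsed <- do.call(rbind, m)
df <- data.frame(
  distribution = parsed[, 2],
  votes        = as.integer(parsed[, 3]),
  rank         = as.numeric(parsed[, 4]),
  title        = parsed[, 5],
  stringsAsFactors = FALSE
)
print(df)
```

For the real file, sample_lines would be replaced by the result of readLines() on it; keeping distribution as a character column preserves leading zeros and dots.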

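Instead of parsing the lines by hand, you can use read.fwf: the lines are well structured, and every column except the last has a fixed width. For the last column, just specify a width large enough to cover it: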

read.fwf("all.txt", widths = c(10, -2, 6, -3, 3, -2, 1000))

#     V1     V2  V3                           V4
# 1 1322 175300 8.3             The Sting (1973)
# 2 1123 426445 8.3 2001: A Space Odyssey (1968)
# 3 1222  94315 8.3   Ladri di biciclette (1948)
# 4 1222 149759 8.3   Singin' in the Rain (1952)
# 5 1322 622326 8.3             Toy Story (1995)
# 6 1222 599957 8.3                Snatch (2000)
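To make the role of the widths clearer, here is a small self-contained sketch. In read.fwf a negative width skips that many characters, and colClasses is passed on to read.table. The fixed-width layout, the sample rows, and the temporary file used here are assumptions for illustration; the widths for the real file may differ.

```r
# Build two hypothetical rows in an assumed layout: 6 spaces, a 10-character
# distribution, 2 spaces, 7-character votes, 3 spaces, 3-character rank,
# 2 spaces, then the title.
rows <- sprintf(
  "      %-10s  %7d   %3.1f  %s",
  c("0000000125", ".0..002212"),
  c(1686502L, 15L),
  c(9.2, 4.1),
  c("The Shawshank Redemption (1994)", "Some Title (2001)")
)

tmp <- tempfile(fileext = ".txt")
writeLines(rows, tmp)

# Negative widths are skipped; the last width is deliberately oversized so
# that it covers titles of any length.
fwf <- read.fwf(
  tmp,
  widths = c(-6, 10, -2, 7, -3, 3, -2, 1000),
  colClasses = c("character", "integer", "numeric", "character"),
  strip.white = TRUE
)
unlink(tmp)

print(fwf)
```

Reading the first column as character keeps values such as 0000000125 intact instead of converting them to numbers.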

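Look at your data first: the first 27 lines are prose, and the first dataset runs from line 28 to line 278. The read_table function from the readr package is smarter than read.table and handles missing data well: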

df <- readr::read_table('ratings.list.gz', skip = 27, n_max = 250)

df
## # A tibble: 250 x 5
##      New Distribution   Votes  Rank                                                Title
##    <chr>        <chr>   <int> <dbl>                                                <chr>
## 1          0000000125 1686502   9.2                      The Shawshank Redemption (1994)
## 2          0000000125 1153698   9.2                                 The Godfather (1972)
## 3          0000000124  789387   9.0                        The Godfather: Part II (1974)
## 4          0000000124 1671708   8.9                               The Dark Knight (2008)
## 5          0000000133  863309   8.9                              Schindler's List (1993)
## 6          0000000133  446671   8.9                                  12 Angry Men (1957)
## 7          0000000123 1322033   8.9                                  Pulp Fiction (1994)
## 8          0000000124 1213467   8.9 The Lord of the Rings: The Return of the King (2003)
## 9          0000000123  502576   8.9               Il buono, il brutto, il cattivo (1966)
## 10         0000000133 1344643   8.8                                    Fight Club (1999)
## # ... with 240 more rows

Judging by the description in the file, it looks like it may be updated fairly often. You're better off handling this generically:
library(readr)
library(purrr)
library(dplyr)

fil <- "ratings.list"
lines <- read_lines(fil) # could use the gz file instead


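# Each table begins at a header line that starts with "New"
starts <- which(grepl("^New", lines))

# Each table ends just before the next line that starts with a letter
ends <- map_int(starts, ~which(grepl("^[[:alpha:]]", lines[(.+1):length(lines)]))[1]+.)

# Parse each block with read_table and drop the first ("New") column
ratings <- map(seq_along(starts), ~read_table(paste0(lines[starts[.]:(ends[.]-1)], collapse="\n"))[,-1])

# Name the first table by hand; take the other names from the title line two lines above each header
df_names <- c("top_250_movies", tolower(make.names(lines[starts[-1]-2])))
df_names <- gsub("\\.+", "_", df_names)
df_names <- gsub("_$", "", df_names)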

df_names
## [1] "top_250_movies"              
## [2] "bottom_10_movies_1500_votes"
## [3] "movie_ratings_report"

names(ratings) <- df_names

glimpse(ratings[[df_names[1]]])
## Observations: 250
## Variables: 4
## $ Distribution <chr> "0000000125", "0000000125", "0000000124", "000000...
## $ Votes        <int> 1686502, 1153698, 789387, 1671708, 863309, 446671...
## $ Rank         <dbl> 9.2, 9.2, 9.0, 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8,...
## $ Title        <chr> "The Shawshank Redemption (1994)", "The Godfather...

glimpse(ratings[[df_names[2]]])
## Observations: 10
## Variables: 4
## $ Distribution <dbl> 5e+09, 5e+09, 6e+09, 6e+09, 6e+09, 6e+09, 6e+09, ...
## $ Votes        <int> 7541, 7735, 14147, 13055, 15329, 30542, 12641, 25...
## $ Rank         <dbl> 1.9, 1.8, 1.8, 1.8, 1.7, 1.7, 1.6, 1.6, 1.6, 1.5
## $ Title        <chr> "Zombie Nation (2004)", "Titanic - La leggenda co...

glimpse(ratings[[df_names[3]]])
## Observations: 665,729
## Variables: 4
## $ Distribution <chr> "41...1..2.", "1000000102", "2...0.01.4", "0.0..0...
## $ Votes        <int> 7, 61, 12, 13, 10, 51, 15, 15, 9, 8, 5, 20, 23, 7...
## $ Rank         <dbl> 4.1, 6.3, 6.8, 7.6, 6.9, 6.6, 5.8, 6.3, 7.6, 6.8,...
## $ Title        <chr> "\"!Next?\" (1994)", "\"#1 Single\" (2006)", "\"#...
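The section-splitting idea above can also be sketched in base R on a tiny input. This is an illustration under assumptions, not the answer's code: the miniature file below is hypothetical, every section title is assumed to sit two lines above its header, and a table with no following text line is assumed to run to the end of the input.

```r
# Hypothetical miniature of the file: title lines, header lines starting
# with "New", and data rows indented by whitespace.
lines <- c(
  "TOP 250 MOVIES",
  "",
  "New  Distribution  Votes  Rank  Title",
  "      0000000125  1686502   9.2  The Shawshank Redemption (1994)",
  "      0000000125  1153698   9.2  The Godfather (1972)",
  "",
  "BOTTOM 10 MOVIES",
  "",
  "New  Distribution  Votes  Rank  Title",
  "      5000000000     7541   1.9  Zombie Nation (2004)"
)

starts <- grep("^New", lines)

# For each header, find the next line that begins with a letter; if there is
# none, let the table run to the end of the input.
ends <- vapply(starts, function(s) {
  nxt <- which(grepl("^[[:alpha:]]", lines) & seq_along(lines) > s)
  if (length(nxt) > 0) nxt[1] else length(lines) + 1L
}, integer(1))

# Collect the non-blank data rows of each table.
blocks <- Map(function(s, e) {
  rows <- lines[(s + 1):(e - 1)]
  rows[nzchar(trimws(rows))]
}, starts, ends)

# Derive a name for each table from the title two lines above its header.
nms <- tolower(make.names(lines[starts - 2]))
names(blocks) <- gsub("_$", "", gsub("\\.+", "_", nms))

str(blocks)
```

Each element of blocks is a character vector of raw rows, which can then be handed to whichever parser you prefer.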