Is there a way to read a .dat file from MovieLens into RStudio? (tags: r, notepad++, rstudio)

I am trying to read ratings.dat from MovieLens using Import Dataset in RStudio. Basically, it has this format:

 1::1::5::978824268  
 1::1022::5::978300055
 1::1028::5::978301777 
 1::1029::5::978302205 
 1::1035::5::978301753 
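So I need to replace :: with : or ' or a space, etc. I use Notepad++, which loads the file quite quickly compared with Notepad and makes very large files easy to view. However, when I do the replacement, it shows some strange characters: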

"LF"
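When I did some research here, I found that this is a line feed, i.e. a newline. But I don't know why these don't show when the file is loaded and only appear after I do the replacement. When I load the file into RStudio, it still detects LF rather than a line break, which causes errors when reading the data.

What is the solution to this? Thank you very much. PS: I know there is Python code that can convert it, but I don't want to use it. Is there another way?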
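
As a point of reference for the "LF" marker: LF is the line-feed character (`"\n"`) that ends each line in Unix-style text files, and R's `readLines()` consumes it as a line terminator rather than as data. A minimal sketch, assuming a hypothetical temporary file that holds two of the sample lines above:

```r
# Write two sample lines terminated by LF ("\n") as raw bytes,
# so no platform-specific newline translation takes place.
tmp <- tempfile(fileext = ".dat")
writeBin(charToRaw("1::1::5::978824268\n1::1022::5::978300055\n"), tmp)

# readLines() treats LF as the end-of-line marker and drops it.
lf_lines <- readLines(tmp)
lf_has_newline <- any(grepl("\n", lf_lines, fixed = TRUE))

print(lf_lines)
print(lf_has_newline)
```

If this holds for the real file as well, the LF markers are ordinary line endings, and the `::` delimiter is the part that needs handling.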

Try this:

url <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"

## this part is agonizingly slow
tf <- tempfile()
download.file(url,tf, mode="wb")                          # download archived movielens data
files    <- unzip(tf, exdir=tempdir())                    # unzips and returns a vector of file names
ratings <- readLines(files[grepl("ratings\\.dat$", files)])  # read the ratings.dat file
ratings <- gsub("::", "\t", ratings)

# this part is much faster
library(data.table)
ratings <- fread(paste(ratings, collapse="\n"), sep="\t")
# Read 10000054 rows and 4 (of 4) columns from 0.219 GB file in 00:00:07
head(ratings)
#    V1  V2 V3        V4
# 1:  1 122  5 838985046
# 2:  1 185  5 838983525
# 3:  1 231  5 838983392
# 4:  1 292  5 838983421
# 5:  1 316  5 838983392
# 6:  1 329  5 838983392
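
The replace-then-parse idea can be tried on a few lines without downloading anything. The following is only a sketch: it uses base R's `read.table()` in place of `fread`, takes the five sample lines from the question as in-memory input, and the column names are an assumption based on the MovieLens ratings layout (user, movie, rating, timestamp).

```r
# The five sample lines from the question, held in memory instead of a file.
raw_lines <- c(
  "1::1::5::978824268",
  "1::1022::5::978300055",
  "1::1028::5::978301777",
  "1::1029::5::978302205",
  "1::1035::5::978301753"
)

# Replace the two-character delimiter with a tab; fixed = TRUE treats "::" literally.
tab_lines <- gsub("::", "\t", raw_lines, fixed = TRUE)

# Parse the tab-separated text with base R (stand-in for fread in the answer above).
# The column names are assumed, not taken from the file.
ratings_small <- read.table(
  text = tab_lines, sep = "\t",
  col.names = c("user_id", "movie_id", "rating", "timestamp")
)
str(ratings_small)
```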

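Or use the download code from jlhoward. He also updated his code to stop using the built-in functions and switched to data.table while I was writing this, but mine is faster/more efficient :-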

library(data.table)

# i try not to use variable names that stomp on function names in base
URL <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"

# this will be "ml-10m.zip"
fil <- basename(URL) 

# this will download to getwd() since you prbly want easy access to 
# the files after the machinations. the nice thing about this is
# that it won't re-download the file and waste bandwidth
if (!file.exists(fil)) download.file(URL, fil)

# this will create the "ml-10M100K" dir in getwd(). if using
# R 3.2+ you can do a dir.exists() test to avoid re-doing the unzip
# (which is useful for large archives or archives compressed with a 
# more CPU-intensive algorithm)   
unzip(fil)

# fast read and slicing of the input
# fread will only split on a single delimiter so the initial fread
# will create a few blank columns. the [] expression filters those
# out. the "with=FALSE" is part of the data.table inanity
mov <- fread("ml-10M100K/ratings.dat", sep=":")[, c(1,3,5,7), with=FALSE]

# saner column names, set efficiently via data.table::setnames
# (the third field in ratings.dat is the rating, not a tag)
setnames(mov, c("user_id", "movie_id", "rating", "timestamp"))

mov
##           user_id movie_id rating timestamp
##        1:       1      122      5 838985046
##        2:       1      185      5 838983525
##        3:       1      231      5 838983392
##        4:       1      292      5 838983421
##        5:       1      316      5 838983392
##       ---                                  
## 10000050:   71567     2107      1 912580553
## 10000051:   71567     2126      2 912649143
## 10000052:   71567     2294      5 912577968
## 10000053:   71567     2338      2 912578016
## 10000054:   71567     2384      2 912578173

It is much faster than the built-in functions.
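
To see why the answer keeps columns 1, 3, 5 and 7, it may help to look at what splitting on a single `:` does to one of these lines. This is a small sketch that uses base R's `read.table()` as a stand-in for `fread`, on two sample lines from the question. The variable names are illustrative only.

```r
sample_lines <- c("1::1::5::978824268", "1::1022::5::978300055")

# Splitting on a single ":" leaves an empty field between each pair of colons,
# so every line parses into 7 fields instead of 4.
wide <- read.table(text = sample_lines, sep = ":")
n_fields <- ncol(wide)

# Find the columns that contain only missing values.
empty_cols <- which(vapply(wide, function(col) all(is.na(col)), logical(1)))

# Keeping the odd-numbered columns recovers the real data.
narrow <- wide[, c(1, 3, 5, 7)]

print(n_fields)
print(unname(empty_cols))
print(narrow)
```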

A slight improvement on @hrbrmstr's answer:

mov <- fread("ml-10M100K/ratings.dat", sep=":", select=c(1,3,5,7))
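
The `select` argument asks `fread` to keep only the listed columns, instead of reading all seven and subsetting afterwards. For comparison, base R's `read.table()` can do something similar by marking unwanted columns as `"NULL"` in `colClasses`. The sketch below assumes two sample lines from the question as input. The column names, including the `skip*` placeholders, are hypothetical.

```r
sample_lines <- c("1::1::5::978824268", "1::1022::5::978300055")

# "NULL" in colClasses tells read.table to skip that column while reading.
# col.names must still name all 7 fields; the skipped ones are dropped.
ratings_sel <- read.table(
  text = sample_lines, sep = ":",
  colClasses = c("integer", "NULL", "integer", "NULL", "numeric", "NULL", "integer"),
  col.names  = c("user_id", "skip1", "movie_id", "skip2", "rating", "skip3", "timestamp")
)
names(ratings_sel)
```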

Just realized you wanted ratings, not movies. That is a much bigger table, so it is better to use fread(...) in the data.table package.

Thanks for the very nice code. I upvoted your answer!

fread is not available for my current RStudio version: package 'fread' is not available (for R version 3.2.2). What is the alternative?

What is the point of adding this: [, c(1,3,5,7), with=FALSE]?

I see. Is gsub("::", "\t", ratings) a more general approach? Can you use it in this case?

I accepted your answer. Thank you for your help and time!
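
On the "package 'fread' is not available" comment: `fread` is a function in the data.table package, not a package of its own, so the message suggests `install.packages("fread")` was attempted, and installing data.table is the likely fix. For cases where data.table really cannot be installed, here is a rough sketch of a fallback. `read_ratings` is a hypothetical helper and the column names are assumed. It uses `fread` when data.table is installed and base R's `read.table()` otherwise.

```r
# Hypothetical helper: read a "::"-delimited MovieLens ratings file,
# using data.table::fread when installed and base R otherwise.
read_ratings <- function(path) {
  cols <- c("user_id", "movie_id", "rating", "timestamp")  # assumed names
  if (requireNamespace("data.table", quietly = TRUE)) {
    # Split on ":" and keep only the four non-empty columns.
    out <- as.data.frame(data.table::fread(path, sep = ":", select = c(1, 3, 5, 7)))
  } else {
    # Base R fallback: "NULL" skips the empty columns between the colons.
    out <- read.table(path, sep = ":",
                      colClasses = c("integer", "NULL", "integer", "NULL",
                                     "numeric", "NULL", "integer"))
  }
  names(out) <- cols
  out
}

# Usage on a small temporary file built from the question's sample lines.
tmp <- tempfile(fileext = ".dat")
writeLines(c("1::1::5::978824268", "1::1022::5::978300055"), tmp)
ratings_fb <- read_ratings(tmp)
print(ratings_fb)
```

The base R branch is expected to be noticeably slower on the full 10M-row file, which is the trade-off the answers above point to.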