Operations with ffdfwith in R

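I am using ff with R because I have a huge dataset (about 16 GB) to process. As a test case, I read about 1 million records from the file and wrote them out as an ff database.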

system.time(te3 <- read.csv.ffdf(file="testdata.csv", sep = ",", header=TRUE, first.rows=10000, next.rows=50000, colClasses=c("numeric","numeric","numeric","numeric")))
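The round trip described above (read a CSV into an ffdf, store it as an ff database, load it again) might look roughly like the sketch below. It assumes the database was written with save.ffdf from ffbase, which the thread does not show. The tiny CSV, the column names x and y, and the temporary paths are made up for illustration; only ofips and dfips appear in the thread.

```r
library(ff)
library(ffbase)

# Hypothetical stand-in for testdata.csv: four numeric columns.
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(ofips = c(1001, 1003), dfips = c(1005, 1007),
                     x = c(1, 2), y = c(3, 4)),
          csv, row.names = FALSE)

# Read the file into an ffdf (the data lives on disk, not in RAM).
te3 <- read.csv.ffdf(file = csv, header = TRUE,
                     colClasses = rep("numeric", 4))

# Assumed step: store the ffdf in a directory so it can be reloaded later.
db <- file.path(tempdir(), "ffdb")
save.ffdf(te3, dir = db, overwrite = TRUE)

# Reload into a separate environment to keep the two copies apart.
e <- new.env()
load.ffdf(dir = db, envir = e)
dim(e$te3)
```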

Adding an extra variable to an ffdf is a basic question, but there are several options that reach the same goal. See below.
I downloaded your zip file and unpacked it.

require(ffbase)
load.ffdf(dir="/home/janw/Desktop/stackoverflow/ffdb")

## Using ffdfwith or with will chunkwise execute the expression
te3$odfips <- ffdfwith(te3, ofips*100000 + dfips)
te3$odfips <- with(te3, ofips*100000 + dfips)
## It is better to restrict the ffdf to the columns the expression needs;
## otherwise the other columns are also loaded into RAM unnecessarily.
## Restricting the columns speeds things up.
te3$odfips <- ffdfwith(te3[c("ofips","dfips")], ofips*100000 + dfips)
te3$odfips <- with(te3[c("ofips","dfips")], ofips*100000 + dfips)
## ffdfwith looks at options("ffbatchbytes") and works out how many rows of your ffdf
## fit in one batch without exceeding options("ffbatchbytes"), and hence RAM.
## So this variable is created in chunks.
## If you want to set the chunk size yourself, you can e.g. pass the by argument
## to with, which passes it on to ?chunk. E.g. below, the variable is created
## in chunks of 100000 records.
te3$odfips <- with(te3[c("ofips","dfips")], ofips*100000 + dfips, by = 100000)

## As the Ops * and + are implemented in ffbase for ff vectors you can also do this:
te3$odfips <- te3$ofips * 100000 + te3$dfips
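To make "chunkwise execution" concrete, here is a minimal base R sketch of the idea, with an ordinary data.frame standing in for the ffdf. It is not how ffbase implements ffdfwith internally; the sample values and the name chunk_size are made up for illustration.

```r
# Stand-in data; in the thread this would be the on-disk ffdf te3.
df <- data.frame(ofips = c(1001, 1003, 1005, 1007, 1009),
                 dfips = c(2001, 2003, 2005, 2007, 2009))

chunk_size <- 2                              # plays the role of `by`
starts <- seq(1, nrow(df), by = chunk_size)  # first row of each chunk
odfips <- numeric(nrow(df))

for (from in starts) {
  to <- min(from + chunk_size - 1, nrow(df))
  rows <- from:to
  # Only this slice of the two needed columns is evaluated at a time.
  odfips[rows] <- with(df[rows, c("ofips", "dfips")],
                       ofips * 100000 + dfips)
}

df$odfips <- odfips
```

With an ffdf, each slice would be read from disk, evaluated in RAM, and written back, so memory use is bounded by the chunk size rather than by the number of rows.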
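Thanks for your insights and detailed comments. I have set my memory.limit very high, but did know about ffbatchbytes. I will test with ffbatchbytes and see whether the error persists. Regarding MCMC, my question is more general: can standard R packages be used with ff? From what I have read they should, but I am not sure.
Hmm.. rebooting the machine and running the code again fixed it.
Regarding standard R packages: it depends. Some do, others need minor or larger changes.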
Error in if (by < 1) stop("'by' must be > 0") : missing value where TRUE/FALSE needed
In addition: Warning message: In chunk.default(from = 1L, to = 1000000L, by = 2293760000, maxindex = 1000000L) : NAs introduced by coercion
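One plausible reading of this error, sketched in base R: the by value in the warning is larger than R's largest integer, so coercing it yields NA, and the by < 1 check then has nothing to test. The last line is a hypothetical workaround (the 64 MB figure is an arbitrary assumption, not a recommendation from the thread); whether it removes the error in a given session is untested here.

```r
by <- 2293760000                        # the `by` value shown in the warning

# R integers are 32-bit, so this value does not fit.
exceeds <- by > .Machine$integer.max

# Coercion to integer gives NA (with a warning, suppressed here).
by_int <- suppressWarnings(as.integer(by))

# Comparing NA inside if() raises the "missing value" error.
msg <- tryCatch(if (by_int < 1) "stop" else "ok",
                error = function(e) conditionMessage(e))
print(msg)

# Hypothetical workaround: set an explicit, modest batch size in bytes.
options(ffbatchbytes = 64 * 1024^2)
```

Passing by explicitly to with, as in the answer's code, is another way to avoid relying on the automatically derived chunk size.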