R: How do I create a table by restructuring a MALLET output file?
I'm using MALLET for topic analysis. It writes its results to a text file ("topics.txt") with several thousand rows and about a hundred columns, where each line consists of tab-separated variables like this:
Num1 text1 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc.
Num2 text2 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc.
Num3 text3 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc.
Here is a snippet of the actual data:
> dat[1:5,1:10]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928
3 2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119
4 3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091
5 4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521
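For reference, a data frame like `dat` above could come from reading the tab-separated file directly. The sketch below is only an assumption about how that might look: it writes two sample rows to a temporary file standing in for topics.txt, and it assumes the file has no header row.

```r
# Stand-in for topics.txt: two tab-separated sample rows (hypothetical file)
tmp <- tempfile(fileext = ".txt")
writeLines(c("0\t10.txt\t27\t0.4560785\t23\t0.3040853",
             "1\t1001.txt\t20\t0.2660085\t12\t0.2099153"), tmp)

# Read without a header, so columns are named V1, V2, ...
dat <- read.table(tmp, sep = "\t", header = FALSE, stringsAsFactors = FALSE)
dim(dat)
```

If the real file has a different number of fields per line, `fill = TRUE` may be needed.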
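I'm trying to use R to convert this output into a data table where the topics are the column headers, and each topic column holds the value of the "proportion" variable that sits directly to the right of the matching "topic" variable, for each value of "text". Like this: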
topic1 topic2 topic3
text1 proportion1 proportion2 proportion3
text2 proportion1 proportion2 proportion3
Or, using the data snippet above, like this:
0 2 7 8 10 12 13 16 18 20 21 23 24 27
10.txt 0 0 0 0 0 0 0 0 0 0.1315621 0.03632624 0.3040853 0 0.4560785
1001.txt 0 0 0 0.1699586 0 0.2099153 0.1692292 0 0 0.2660085 0 0 0 0
1002.txt 0 0.1747023 0 0 0.1360454 0.0750711 0 0.3341721 0 0 0 0 0 0
1003.txt 0.0186709 0 0 0.2255179 0 0.5366148 0 0 0.138856 0 0 0 0 0
1005.txt 0.2214441 0 0.1776052 0 0 0 0 0.2363206 0 0 0 0 0.1914769 0
Here is the R code I have for this job, sent to me by a friend, but it doesn't work for me (and I don't know how to fix it myself):
##########################################
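dat
You can convert it to long format, but going any further requires the actual data.
Edit after the data was provided: I'm still not sure about the overall structure of the MALLET output, but this at least demonstrates the R functions. A "feature" of this approach is that if a document has overlapping topics, their proportions are added together. Whether that is an advantage depends on the data layout.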
dat <-read.table(textConnection(" V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928
3 2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119
4 3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091
5 4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521
"),
header=TRUE)
ldat <- reshape(dat, idvar=1:2, varying=list(topics=c("V3", "V5", "V7", "V9"),
props=c("V4", "V6", "V8", "V10")),
direction="long")
####------------------####
> ldat
V1 V2 time V3 V4
0.10.txt.1 0 10.txt 1 27 0.45607850
1.1001.txt.1 1 1001.txt 1 20 0.26600850
2.1002.txt.1 2 1002.txt 1 16 0.33417210
3.1003.txt.1 3 1003.txt 1 12 0.53661480
4.1005.txt.1 4 1005.txt 1 16 0.23632060
0.10.txt.2 0 10.txt 2 23 0.30408530
1.1001.txt.2 1 1001.txt 2 12 0.20991530
2.1002.txt.2 2 1002.txt 2 2 0.17470230
3.1003.txt.2 3 1003.txt 2 8 0.22551790
4.1005.txt.2 4 1005.txt 2 0 0.22144410
0.10.txt.3 0 10.txt 3 20 0.13156210
1.1001.txt.3 1 1001.txt 3 8 0.16995860
2.1002.txt.3 2 1002.txt 3 10 0.13604540
3.1003.txt.3 3 1003.txt 3 18 0.13885610
4.1005.txt.3 4 1005.txt 3 24 0.19147690
0.10.txt.4 0 10.txt 4 21 0.03632624
1.1001.txt.4 1 1001.txt 4 13 0.16922928
2.1002.txt.4 2 1002.txt 4 12 0.07507119
3.1003.txt.4 3 1003.txt 4 0 0.01867091
4.1005.txt.4 4 1005.txt 4 7 0.17760521
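The long table above can then be pivoted to one row per document and one column per topic. A minimal sketch using base R's `xtabs`, which sums the proportions whenever a document lists the same topic more than once (the "feature" mentioned earlier). The object name `wide` is illustrative, and the sample data is repeated here without the row-name column so the block runs on its own:

```r
dat <- read.table(text = "V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928
2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119
3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091
4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521",
  header = TRUE, stringsAsFactors = FALSE)

# Long format: topic ids end up in V3, proportions in V4
ldat <- reshape(dat, idvar = c("V1", "V2"),
                varying = list(topics = c("V3", "V5", "V7", "V9"),
                               props  = c("V4", "V6", "V8", "V10")),
                direction = "long")

# Wide format: documents in rows, topics in columns, absent topics become 0
wide <- xtabs(V4 ~ V2 + V3, data = ldat)
round(wide, 4)
```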
Here is one way to solve the problem:
dat <-read.table(as.is = TRUE, header = FALSE, textConnection(
"Num1 text1 topic1 proportion1 topic2 proportion2 topic3 proportion3
Num2 text2 topic1 proportion1 topic2 proportion2 topic3 proportion3
Num3 text3 topic1 proportion1 topic2 proportion2 topic3 proportion3"))
NTOPICS = 3
nam <- c('num', 'text',
paste(c('topic', 'proportion'), rep(1:NTOPICS, each = 2), sep = ""))
dat_l <- reshape(setNames(dat, nam), varying = 3:length(nam), direction = 'long',
sep = "")
# current reshape2 spells this argument value.var (value_var is not recognised)
reshape2::dcast(dat_l, num + text ~ topic, value.var = 'proportion')
num text topic1 topic2 topic3
1 Num1 text1 proportion1 proportion2 proportion3
2 Num2 text2 proportion1 proportion2 proportion3
3 Num3 text3 proportion1 proportion2 proportion3
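Coming back to this problem, I found that the reshape function was too demanding on memory, so I used a data.table approach instead. It takes a few more steps, but it is much faster and uses far less memory.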
dat <- read.table(text = "V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928
3 2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119
4 3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091
5 4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521")
dat$V11 <- rep(NA, 5) # my real data has this extra unwanted col
library(data.table) # must be loaded before data.table() is called
dat <- data.table(dat)
# get document number
docnum <- dat$V1
# get text number
txt <- dat$V2
# remove doc num and text num so we just have topic and props
dat1 <- dat[ ,c("V1","V2", paste0("V", ncol(dat))) := NULL]
# get topic numbers
n <- ncol(dat1)
tops <- apply(dat1, 1, function(i) i[seq(1, n, 2)])
# get props
props <- apply(dat1, 1, function(i) i[seq(2, n, 2)])
# put topics and props together
tp <- lapply(1:ncol(tops), function(i) data.frame(tops[,i], props[,i]))
names(tp) <- txt
# make into long table
dt <- data.table::rbindlist(tp)
dt$doc <- unlist(lapply(txt, function(i) rep(i, ncol(dat1)/2)))
dt$docnum <- unlist(lapply(docnum, function(i) rep(i, ncol(dat1)/2)))
# reshape to wide
library(data.table)
setkey(dt, tops...i., doc)
out <- dt[CJ(unique(tops...i.), unique(doc))][, as.list(props...i.), by=tops...i.]
setnames(out, c("topic", as.character(txt)))
# transpose to have table of docs (rows) and columns (topics)
tout <- data.table(t(out))
setnames(tout, unname(as.character(tout[1,])))
tout <- tout[-1,]
row.names(tout) <- txt
# replace NA with zero
tout[is.na(tout)] <- 0
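For comparison, recent versions of data.table provide `melt()` and `dcast()` methods that may express the same long-then-wide steps more compactly. This is a sketch rather than the answer's method: it assumes data.table is installed, that no document repeats a topic, and the column names `topic` and `prop` are illustrative.

```r
library(data.table)

dat <- as.data.table(read.table(text = "V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928
2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119
3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091
4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521",
  header = TRUE, stringsAsFactors = FALSE))

# Odd columns from V3 on hold topic ids, even columns hold proportions
n <- ncol(dat)
topic_cols <- names(dat)[seq(3, n, by = 2)]
prop_cols  <- names(dat)[seq(4, n, by = 2)]

# Melt both sets of columns at once into two value columns
long <- melt(dat, id.vars = c("V1", "V2"),
             measure.vars = list(topic_cols, prop_cols),
             value.name = c("topic", "prop"))

# One row per document, one column per topic, 0 where a topic is absent
wide <- dcast(long, V2 ~ topic, value.var = "prop", fill = 0)
wide
```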
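You won't get much help unless you provide the real data structure... one with numeric proportions. Use dput(head(dat,20))
Thanks for the tip, I've added some. I should also add that clearing objects with rm(list=ls(all=TRUE)) before trying my friend's code changed the problem slightly, so that at the end of his code block the error message becomes "Error in `[.data.frame`(dat2, I:z) : undefined columns selected". Still, I think @Ramnath's answer is a promising alternative.
Thanks for the quick suggestion. I can reproduce your result. How can it be generalized to 30 (or 100 or more) topics?
If the column names are very regular, the "varying" argument could be something like topics=paste("V", seq(1,100,by=2), sep="") and props=paste("V", seq(2,100,by=2), sep="")
Thanks for the quick help. Unfortunately I can't work out why your suggestion doesn't work for me, but @Ramnath's code does, so I'm happy to close the case.
Thanks again for your suggestion. I was able to reproduce your example and get it working on my full data set.
What if we change dat_l
Minor edit to remove having to input the number of topics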
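Both points raised in these comments (scaling to many topics, and not hard-coding the topic count) can be handled by deriving the number of topic/proportion pairs from the column count. A minimal base R sketch, assuming every row has the same number of pairs after the two identifier columns; the names `ntopics` and `wide` are illustrative:

```r
dat <- read.table(text = "0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928",
  header = FALSE, stringsAsFactors = FALSE)

# Number of topic/proportion pairs, derived instead of typed in
ntopics <- (ncol(dat) - 2) %/% 2

# num, text, topic1, proportion1, topic2, proportion2, ...
nam <- c("num", "text",
         paste0(c("topic", "proportion"), rep(seq_len(ntopics), each = 2)))

# sep = "" lets reshape split "topic1" into variable "topic" and time 1
dat_l <- reshape(setNames(dat, nam), varying = 3:length(nam),
                 sep = "", direction = "long")

# Documents in rows, topics in columns; missing combinations become 0
wide <- tapply(dat_l$proportion, list(dat_l$text, dat_l$topic), sum)
wide[is.na(wide)] <- 0
wide
```

The same code should work unchanged for 30 or 100 pairs, since only `ncol(dat)` drives the column names.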
Output of the data.table approach:
> tout
0 2 7 8 10 12 13 16 18
1: 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000
2: 0.00000000 0.0000000 0.0000000 0.1699586 0.0000000 0.20991530 0.1692293 0.0000000 0.0000000
3: 0.00000000 0.1747023 0.0000000 0.0000000 0.1360454 0.07507119 0.0000000 0.3341721 0.0000000
4: 0.01867091 0.0000000 0.0000000 0.2255179 0.0000000 0.53661480 0.0000000 0.0000000 0.1388561
5: 0.22144410 0.0000000 0.1776052 0.0000000 0.0000000 0.00000000 0.0000000 0.2363206 0.0000000
20 21 23 24 27
1: 0.1315621 0.03632624 0.3040853 0.0000000 0.4560785
2: 0.2660085 0.00000000 0.0000000 0.0000000 0.0000000
3: 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000
4: 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000
5: 0.0000000 0.00000000 0.0000000 0.1914769 0.0000000