R 如何通过重新构造MALLET输出文件来创建表？_R_Dataframe_Mallet

R 如何通过重新构造MALLET输出文件来创建表？

r dataframe

R 如何通过重新构造MALLET输出文件来创建表？,r,dataframe,mallet,R,Dataframe,Mallet,我使用for topic analysis，它将结果输出到文本文件（“topics.txt”）中，文本文件包含数千行和大约100行，其中每行由制表符分隔的变量组成，如下所示： Num1 text1 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc. Num2 text2 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc. Num3 text3 t

我使用for topic analysis，它将结果输出到文本文件（“topics.txt”）中，文本文件包含数千行和大约100行，其中每行由制表符分隔的变量组成，如下所示：

Num1 text1 topic1 proportion1 topic2 proportion2 topic3 proportion3,  etc.
Num2 text2 topic1 proportion1 topic2 proportion2 topic3 proportion3,  etc.
Num3 text3 topic1 proportion1 topic2 proportion2 topic3 proportion3,  etc.

下面是实际数据的一个片段：

> dat[1:5,1:10]

  V1 V2 V3    V4 V5        V6 V7        V8 V9        V10
1  0 10.txt   27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
2  1 1001.txt 20 0.2660085 12 0.2099153  8 0.1699586 13 0.16922928
3  2 1002.txt 16 0.3341721  2 0.1747023 10 0.1360454 12 0.07507119
4  3 1003.txt 12 0.5366148  8 0.2255179 18 0.1388561  0 0.01867091
5  4 1005.txt 16 0.2363206  0 0.2214441 24 0.1914769  7 0.17760521

我试图使用R将此输出转换为一个数据表，其中主题是列标题，每个主题都包含变量“proporty”的值，该变量直接位于每个变量“topic”的右侧，对应于每个“text”的值。像这样：

      topic1       topic2       topic3
text1 proportion1  proportion2  proportion3
text2 proportion1  proportion2  proportion3

或者使用上面的数据片段，如下所示：

           0         2         7         8         10        12        13        16        18       20        21         23        24         27
10.txt     0         0         0         0         0         0         0         0         0        0.1315621 0.03632624 0.3040853 0          0.4560785        
1001.txt   0         0         0         0.1699586 0         0.2099153 0.1692292 0         0        0.2660085 0          0         0          0
1002.txt   0         0.1747023 0         0         0.1360454 0.0750711 0         0.3341721 0        0         0          0         0          0
1003.txt   0.0186709 0         0         0.2255179 0         0.5366148 0         0         0.138856 0         0          0         0          0
1005.txt   0.2214441 0         0.1776052 0         0         0         0         0.2363206 0        0         0          0         0.1914769  0

这是我必须完成这项工作的R代码，由一位朋友发送，但它对我不起作用（我自己也不知道如何修复它）：

##########################################
dat您可以将其转换为长格式，但需要进一步的实际数据。
在提供数据后编辑。仍然不确定MALLET的整体结构，但至少展示了R函数。这种方法的“特点”是，如果存在重叠的主题，则会将比例相加。这取决于数据布局是否有利
dat <-read.table(textConnection("  V1 V2 V3  V4 V5  V6 V7  V8 V9  V10
1  0 10.txt   27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
2  1 1001.txt 20 0.2660085 12 0.2099153  8 0.1699586 13 0.16922928
3  2 1002.txt 16 0.3341721  2 0.1747023 10 0.1360454 12 0.07507119
4  3 1003.txt 12 0.5366148  8 0.2255179 18 0.1388561  0 0.01867091
5  4 1005.txt 16 0.2363206  0 0.2214441 24 0.1914769  7 0.17760521
"), 
          header=TRUE)
 ldat <- reshape(dat, idvar=1:2, varying=list(topics=c("V3", "V5", "V7", "V9"), 
                                          props=c("V4", "V6", "V8", "V10")), 
                       direction="long")
####------------------####
    > ldat
             V1       V2 time V3         V4
0.10.txt.1    0   10.txt    1 27 0.45607850
1.1001.txt.1  1 1001.txt    1 20 0.26600850
2.1002.txt.1  2 1002.txt    1 16 0.33417210
3.1003.txt.1  3 1003.txt    1 12 0.53661480
4.1005.txt.1  4 1005.txt    1 16 0.23632060
0.10.txt.2    0   10.txt    2 23 0.30408530
1.1001.txt.2  1 1001.txt    2 12 0.20991530
2.1002.txt.2  2 1002.txt    2  2 0.17470230
3.1003.txt.2  3 1003.txt    2  8 0.22551790
4.1005.txt.2  4 1005.txt    2  0 0.22144410
0.10.txt.3    0   10.txt    3 20 0.13156210
1.1001.txt.3  1 1001.txt    3  8 0.16995860
2.1002.txt.3  2 1002.txt    3 10 0.13604540
3.1003.txt.3  3 1003.txt    3 18 0.13885610
4.1005.txt.3  4 1005.txt    3 24 0.19147690
0.10.txt.4    0   10.txt    4 21 0.03632624
1.1001.txt.4  1 1001.txt    4 13 0.16922928
2.1002.txt.4  2 1002.txt    4 12 0.07507119
3.1003.txt.4  3 1003.txt    4  0 0.01867091
4.1005.txt.4  4 1005.txt    4  7 0.17760521

这里有一种解决问题的方法
 dat <-read.table(as.is = TRUE, header = FALSE, textConnection(
  "Num1 text1 topic1 proportion1 topic2 proportion2 topic3 proportion3
   Num2 text2 topic1 proportion1 topic2 proportion2 topic3 proportion3
   Num3 text3 topic1 proportion1 topic2 proportion2 topic3 proportion3"))

 NTOPICS = 3 
 nam <- c('num', 'text', 
   paste(c('topic', 'proportion'), rep(1:NTOPICS, each = 2), sep = ""))

 dat_l <- reshape(setNames(dat, nam), varying = 3:length(nam), direction = 'long',
   sep = "")
 reshape2::dcast(dat_l, num + text ~ topic, value_var = 'proportion')

num  text      topic1      topic2      topic3
1 Num1 text1 proportion1 proportion2 proportion3
2 Num2 text2 proportion1 proportion2 proportion3
3 Num3 text3 proportion1 proportion2 proportion3

dat回到这个问题，我发现重塑
函数对内存的要求太高，所以我使用了data.table
方法。只需再执行几个步骤，但速度要快得多，占用的内存也要少得多
dat <- read.table(text = "V1 V2 V3    V4 V5        V6 V7        V8 V9        V10
1  0 10.txt   27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
2  1 1001.txt 20 0.2660085 12 0.2099153  8 0.1699586 13 0.16922928
3  2 1002.txt 16 0.3341721  2 0.1747023 10 0.1360454 12 0.07507119
4  3 1003.txt 12 0.5366148  8 0.2255179 18 0.1388561  0 0.01867091
5  4 1005.txt 16 0.2363206  0 0.2214441 24 0.1914769  7 0.17760521")

dat$V11 <- rep(NA, 5) # my real data has this extra unwanted col
dat <- data.table(dat)

# get document number
docnum <- dat$V1
# get text number
txt <- dat$V2

# remove doc num and text num so we just have topic and props
dat1 <- dat[ ,c("V1","V2", paste0("V", ncol(dat))) := NULL]
# get topic numbers
n <- ncol(dat1) 
tops <- apply(dat1, 1, function(i) i[seq(1, n, 2)])
# get props 
props <- apply(dat1, 1, function(i) i[seq(2, n, 2)])

# put topics and props together
tp <- lapply(1:ncol(tops), function(i) data.frame(tops[,i], props[,i]))
names(tp) <- txt
# make into long table
dt <- data.table::rbindlist(tp)
dt$doc <- unlist(lapply(txt, function(i) rep(i, ncol(dat1)/2)))
dt$docnum <- unlist(lapply(docnum, function(i) rep(i, ncol(dat1)/2)))

# reshape to wide
library(data.table)
setkey(dt, tops...i., doc)
out <- dt[CJ(unique(tops...i.), unique(doc))][, as.list(props...i.), by=tops...i.]
setnames(out, c("topic", as.character(txt)))

# transpose to have table of docs (rows) and columns (topics) 
tout <- data.table(t(out))
setnames(tout, unname(as.character(tout[1,])))
tout <- tout[-1,]
row.names(tout) <- txt 

# replace NA with zero
tout[is.na(tout)] <- 0

除非您提供真实的数据结构，否则您不会得到太多帮助。。。。一个有数字的比例。使用dput（head（dat，20））谢谢你的提示，我已经添加了一些。我还应该补充一点，在尝试我朋友的代码之前，用rm（list=ls（all=TRUE））
删除对象稍微改变了问题，这样在他的代码块结束时，错误消息变为“error in[.data.frame`（dat2，I:z）：未定义的列已选定”。尽管如此，我认为@Ramnath的答案是一个很有希望的选择。感谢您的快速建议。我可以复制您的结果。如何将其推广到30个（或100个或更多）主题？如果列名非常规则，那么“可变”参数可以是类似topics=paste（“V”，seq（1100，by=2），sep=”“）
和props=粘贴（“V”，seq（2100，by=2），sep=”“）
感谢您的快速帮助。不幸的是，我不明白为什么您的建议对我不起作用，但是@Ramnath的代码起作用，所以我很高兴结束这个案例。再次感谢您的建议，我可以复制您的示例并使其在我的全套数据上工作。如果我们更改dat_l Minor edit以删除hav，怎么样ing输入主题数###########
rm（list=ls（all=TRUE））
dat
 dat <-read.table(as.is = TRUE, header = FALSE, textConnection(
  "Num1 text1 topic1 proportion1 topic2 proportion2 topic3 proportion3
   Num2 text2 topic1 proportion1 topic2 proportion2 topic3 proportion3
   Num3 text3 topic1 proportion1 topic2 proportion2 topic3 proportion3"))

 NTOPICS = 3 
 nam <- c('num', 'text', 
   paste(c('topic', 'proportion'), rep(1:NTOPICS, each = 2), sep = ""))

 dat_l <- reshape(setNames(dat, nam), varying = 3:length(nam), direction = 'long',
   sep = "")
 reshape2::dcast(dat_l, num + text ~ topic, value_var = 'proportion')

num  text      topic1      topic2      topic3
1 Num1 text1 proportion1 proportion2 proportion3
2 Num2 text2 proportion1 proportion2 proportion3
3 Num3 text3 proportion1 proportion2 proportion3

dat <- read.table(text = "V1 V2 V3    V4 V5        V6 V7        V8 V9        V10
1  0 10.txt   27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
2  1 1001.txt 20 0.2660085 12 0.2099153  8 0.1699586 13 0.16922928
3  2 1002.txt 16 0.3341721  2 0.1747023 10 0.1360454 12 0.07507119
4  3 1003.txt 12 0.5366148  8 0.2255179 18 0.1388561  0 0.01867091
5  4 1005.txt 16 0.2363206  0 0.2214441 24 0.1914769  7 0.17760521")

dat$V11 <- rep(NA, 5) # my real data has this extra unwanted col
dat <- data.table(dat)

# get document number
docnum <- dat$V1
# get text number
txt <- dat$V2

# remove doc num and text num so we just have topic and props
dat1 <- dat[ ,c("V1","V2", paste0("V", ncol(dat))) := NULL]
# get topic numbers
n <- ncol(dat1) 
tops <- apply(dat1, 1, function(i) i[seq(1, n, 2)])
# get props 
props <- apply(dat1, 1, function(i) i[seq(2, n, 2)])

# put topics and props together
tp <- lapply(1:ncol(tops), function(i) data.frame(tops[,i], props[,i]))
names(tp) <- txt
# make into long table
dt <- data.table::rbindlist(tp)
dt$doc <- unlist(lapply(txt, function(i) rep(i, ncol(dat1)/2)))
dt$docnum <- unlist(lapply(docnum, function(i) rep(i, ncol(dat1)/2)))

# reshape to wide
library(data.table)
setkey(dt, tops...i., doc)
out <- dt[CJ(unique(tops...i.), unique(doc))][, as.list(props...i.), by=tops...i.]
setnames(out, c("topic", as.character(txt)))

# transpose to have table of docs (rows) and columns (topics) 
tout <- data.table(t(out))
setnames(tout, unname(as.character(tout[1,])))
tout <- tout[-1,]
row.names(tout) <- txt 

# replace NA with zero
tout[is.na(tout)] <- 0

tout

            0         2         7         8        10         12        13        16        18
1: 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000
2: 0.00000000 0.0000000 0.0000000 0.1699586 0.0000000 0.20991530 0.1692293 0.0000000 0.0000000
3: 0.00000000 0.1747023 0.0000000 0.0000000 0.1360454 0.07507119 0.0000000 0.3341721 0.0000000
4: 0.01867091 0.0000000 0.0000000 0.2255179 0.0000000 0.53661480 0.0000000 0.0000000 0.1388561
5: 0.22144410 0.0000000 0.1776052 0.0000000 0.0000000 0.00000000 0.0000000 0.2363206 0.0000000
          20         21        23        24        27
1: 0.1315621 0.03632624 0.3040853 0.0000000 0.4560785
2: 0.2660085 0.00000000 0.0000000 0.0000000 0.0000000
3: 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000
4: 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000
5: 0.0000000 0.00000000 0.0000000 0.1914769 0.0000000