R: turn a set of strings into a data frame ordered by group, with sequence information
I'm struggling to build an organized data frame from strings. Given this input:
text = c('I do not want to do this thing anymore','you do not know what I mean','I will not do this thing','do not want anymore','you will see')
[1] "I do not want to do this thing anymore" "you do not know what I mean"
[3] "I will not do this thing" "do not want anymore"
[5] "you will see"
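I want to build a data frame that looks like a kind of document-term table with sequence information, but I don't know how to achieve this. It is neither a document-term matrix nor simply the data frame that the following code (which needs the stringi package) produces: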
as.data.frame(t(stri_list2matrix(strsplit(as.character(text),' '))))
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 I do not want to do this thing anymore
2 you do not know what I mean <NA> <NA>
3 I will not do this thing <NA> <NA> <NA>
4 do not want anymore <NA> <NA> <NA> <NA> <NA>
5 you will see <NA> <NA> <NA> <NA> <NA> <NA>
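Please see the code below and let me know whether it serves your purpose. The only difference is that the word order in the output data frame is not the same as yours.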
library(stringi)
text = c('I do not want to do this thing anymore','you do not know what I mean','I will not do this thing','do not want anymore','you will see')
# One row per sentence, one column per word position, padded with NA
tf = as.data.frame(t(stri_list2matrix(strsplit(as.character(text),' '))),stringsAsFactors = FALSE)
# Unique words across all sentences, in order of first appearance
strs = unlist(strsplit(as.character(text),' '))
fstrs = unique(strs)
# For each sentence, keep a word in its column if it occurs there, NA otherwise
log_out = data.frame()
for(i in seq_len(nrow(tf))){
  present = fstrs %in% as.character(tf[i,])
  log = as.data.frame(t(ifelse(present,fstrs,NA)),stringsAsFactors = FALSE)
  log_out = rbind(log_out,log)
}
The output will be:
log_out
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1 I do not want to this thing anymore <NA> <NA> <NA> <NA> <NA> <NA>
2 I do not <NA> <NA> <NA> <NA> <NA> you know what mean <NA> <NA>
3 I do not <NA> <NA> this thing <NA> <NA> <NA> <NA> <NA> will <NA>
4 <NA> do not want <NA> <NA> <NA> anymore <NA> <NA> <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> you <NA> <NA> <NA> will see
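For comparison, here is a minimal sketch of the same presence table built without the explicit loop, using base R only. The names `words`, `vocab`, `presence`, and `log_out2` are made up for this illustration. Like the loop above, it records each word once per sentence, so the second "do" in the first sentence is not represented.

```r
# Same input as in the question.
text <- c('I do not want to do this thing anymore',
          'you do not know what I mean',
          'I will not do this thing',
          'do not want anymore',
          'you will see')

words <- strsplit(text, ' ', fixed = TRUE)  # one character vector per sentence
vocab <- unique(unlist(words))              # column order = order of first appearance

# For each sentence, keep a vocabulary word where it occurs and NA elsewhere.
# sapply() returns one column per sentence, so transpose to one row per sentence.
presence <- t(sapply(words, function(w) ifelse(vocab %in% w, vocab, NA_character_)))
log_out2 <- as.data.frame(presence, stringsAsFactors = FALSE)
log_out2
```

The columns follow the order in which words first appear across all sentences, not the order within any one sentence, which is why the layout differs from the table in the question.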
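Unfortunately, it cannot handle the second sentence, "you do not know what I mean". It's cool, though.
Is your expectation that the words of each sentence are split in their existing order, with repeated words kept? If so, what rule do you want to apply here, for example when the repetition should stop?
Well... the rule is also part of what I'm looking for. It's really puzzling.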
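Since the rule for a wide layout is still open, one possible way to keep both the word order and the repeated words is a long table with one row per word occurrence. This is only a sketch under that assumption, not the layout the question asks for. The column names `doc`, `position`, and `occurrence` are hypothetical.

```r
# Same input as in the question.
text <- c('I do not want to do this thing anymore',
          'you do not know what I mean',
          'I will not do this thing',
          'do not want anymore',
          'you will see')

words <- strsplit(text, ' ', fixed = TRUE)  # one character vector per sentence

long <- data.frame(
  doc      = rep(seq_along(words), lengths(words)),  # sentence index
  position = sequence(lengths(words)),               # word position within the sentence
  word     = unlist(words),
  stringsAsFactors = FALSE
)

# Running count of each word within its sentence, so the second "do"
# in sentence 1 gets occurrence 2 instead of being dropped.
long$occurrence <- ave(seq_along(long$word), long$doc, long$word, FUN = seq_along)

head(long, 9)
```

A table like this can be reshaped into a wide layout later, once it is decided how repeated and out-of-order words should map to columns.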