How to create a quanteda corpus from a data.frame with multiple text columns? (R, quanteda)


Suppose I have the following:

x10 = data.frame(id = c(1,2,3), vars = c('top','down','top'),
     text1 = c('this is text', 'so is this', 'and this is too.'),
     text2 = c('we have more text here', 'and here too', 'and look at this, more text.'))

I want to create a dfm/corpus in quanteda using the following:

x1 = corpus(x10, docid_field = 'id', text_field = c(3:4), tolower = T)

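Obviously this fails, because text_field accepts only a single column. Is there a better way to handle this than building two corpora? Could I build two and then merge them on id? Is that even possible?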

First, let's recreate the data.frame without converting the character values to factors:

x10 = data.frame(id = c(1,2,3), vars = c('top','down','top'), 
                 text1 = c('this is text', 'so is this', 'and this is too.'),
                 text2 = c('we have more text here', 'and here too', 'and look at this, more text.'),
                 stringsAsFactors = FALSE)

Then we have two options.

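Method 1: Reshape to "long" format and create a single corpus

First "melt" the data so that there is a single text column, then import it as a corpus. (An alternative is tidyr::gather().)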
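The reshaping step described in Method 1 can be sketched roughly as follows. This is a minimal sketch, not the answerer's original code: it uses base R in place of a melt function, and the names `text_cols`, `x10_long`, `source`, and `doc_id` are illustrative assumptions. The `corpus()` call uses the `docid_field` and `text_field` arguments from the question, and it is guarded so that the reshaping part runs even when quanteda is not installed.

```r
# Recreate the example data from the thread.
x10 <- data.frame(id = c(1, 2, 3), vars = c("top", "down", "top"),
                  text1 = c("this is text", "so is this", "and this is too."),
                  text2 = c("we have more text here", "and here too",
                            "and look at this, more text."),
                  stringsAsFactors = FALSE)

# Columns holding text (illustrative name, not from the original answer).
text_cols <- c("text1", "text2")

# Stack the text columns into one "text" column, repeating id and vars
# once per text column. unlist() returns the columns one after another,
# which matches the order produced by rep(..., each = nrow(x10)).
x10_long <- data.frame(
  doc_id = paste(rep(text_cols, each = nrow(x10)), x10$id, sep = "_"),
  id     = rep(x10$id, times = length(text_cols)),
  vars   = rep(x10$vars, times = length(text_cols)),
  source = rep(text_cols, each = nrow(x10)),
  text   = unlist(x10[text_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Build a single corpus with one document per (id, text column) pair.
# Assumption: quanteda is installed; the guard skips this part otherwise.
if (requireNamespace("quanteda", quietly = TRUE)) {
  x10_corpus <- quanteda::corpus(x10_long, docid_field = "doc_id",
                                 text_field = "text")
  # Lowercasing is applied when building the dfm rather than in corpus().
  x10_dfm <- quanteda::dfm(quanteda::tokens(x10_corpus), tolower = TRUE)
}
```

Each document name combines the source column and the original id (for example `text2_1`), so the id and vars values stay available as document variables for later grouping.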

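At this stage, you can also use docnames(x10_corpus2) to inspect or reassign the document names.

Ah! I was thinking of building two corpora and merging them. Are there benefits to either approach? Say I have to scale this to 20k+ comments: is one approach better suited to scaling?

I would probably use method 1. 20k+ will work without any problem.

Awesome! One last question about method 1: why the row.names(x) step?

I just added a comment in the code to explain why. It is because the corpus() call will automatically use these as the document names.

We have opened an issue for this, but which behaviour would be more natural: concatenating the texts, or repeating the id variable to stack the text columns (as in the answer below)?

Yes, I see what you mean. I think the more straightforward approach is to repeat the ID variable, since concatenation could get messy. For example, our employee survey has 3 open-ended questions (positive experiences, negative experiences, anything else?), and combining those would be really odd.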
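For comparison, the concatenation behaviour raised in the comments above could look roughly like this. It is a sketch under assumptions: the name `x10_concat` and the blank-line separator are illustrative choices, not something the thread specifies, and the `corpus()` call is guarded in case quanteda is not installed.

```r
# Recreate the example data from the thread.
x10 <- data.frame(id = c(1, 2, 3), vars = c("top", "down", "top"),
                  text1 = c("this is text", "so is this", "and this is too."),
                  text2 = c("we have more text here", "and here too",
                            "and look at this, more text."),
                  stringsAsFactors = FALSE)

# Keep one row per id and paste the text columns together.
# The separator (a blank line) is an arbitrary, illustrative choice.
x10_concat <- x10[c("id", "vars")]
x10_concat$text <- paste(x10$text1, x10$text2, sep = "\n\n")

# One document per id; the source column of each sentence is no longer
# recoverable, which is the drawback mentioned in the comments.
if (requireNamespace("quanteda", quietly = TRUE)) {
  concat_corpus <- quanteda::corpus(x10_concat, docid_field = "id",
                                    text_field = "text")
}
```

Concatenation keeps three documents instead of six, but answers to different questions end up in the same document, so stacking with a repeated id is usually easier to analyse per question.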