R: How to specify a text column when reading a csv file?


I read the csv file this way:

Here is the str output:

$ an_id  : int  4840 41981 40482 37473 33278 29083 30940 29374 24023 23922 ...
The column is read as int, so I convert it to chr with the following command:

df$an_id <- paste0("doc_", df$an_id)
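Is there another way to read the file, or to pass the column as text?

If I save the following data to a csv file, read that file back, and run the commands, they work fine: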

dtext <- data.frame(
  id = c(1,2,3,4),
  text = c("here",
           "This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.",
           "The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.",
           "There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."),
  stringsAsFactors = FALSE)

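As @Nathalie mentioned in the comments, if the data is in a data.frame, the following works. docid_field refers to the column holding the document ids, and text_field refers to the column containing the text.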

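library(quanteda)  # corpus(), tokens()
library(magrittr)  # %>% pipe, in case it is not already available
# Use an_id as the document names and text as the document text, then tokenize
toks <- corpus(df, 
           docid_field = "an_id", 
           text_field = "text") %>% 
  tokens()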

str(toks)
List of 4
 $ doc_1: chr "here"
 $ doc_2: chr [1:39] "This" "dataset" "contains" "movie" ...
 $ doc_3: chr [1:36] "The" "core" "dataset" "contains" ...
 $ doc_4: chr [1:105] "There" "are" "two" "top-level" ...
 - attr(*, "types")= chr [1:102] "here" "This" "dataset" "contains" ...
 - attr(*, "padding")= logi FALSE
 - attr(*, "class")= chr "tokens"
 - attr(*, "what")= chr "word"
 - attr(*, "ngrams")= int 1
 - attr(*, "skip")= int 0
 - attr(*, "concatenator")= chr "_"
 - attr(*, "docvars")='data.frame': 4 obs. of  0 variables
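
On the original question of reading the column as text in the first place: a minimal sketch using base R's read.csv with a named colClasses vector. The inline CSV text and the names csv_text, df_guess and df_chr are assumptions made for illustration, not the asker's actual file; the value 007 is included only to show what type guessing does to leading zeros.

```r
# Hypothetical two-row CSV standing in for the real file
csv_text <- "an_id,text\n4840,here\n007,there"

# Default behaviour: the type of an_id is guessed as integer, so "007" becomes 7
df_guess <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# A named colClasses vector forces only an_id to character;
# the types of the remaining columns are still guessed
df_chr <- read.csv(text = csv_text,
                   colClasses = c(an_id = "character"),
                   stringsAsFactors = FALSE)

str(df_chr)

# The prefix can then be added without any int-to-chr conversion step
df_chr$an_id <- paste0("doc_", df_chr$an_id)
```

With a real file, the same colClasses argument would be passed alongside the file path instead of text =.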
Comments:

@phiver I'm not quite sure this is the answer, because when I use this command the structure of toks is not right.
@phiver I think that option is for the column with the text to be processed, not the id column.
I read it the classic way, with read.csv and stringsAsFactors = FALSE.
Yes, the last part is correct. It produces a tokens object, a list with 4 entries named doc_1 and so on.
@phiver Please post it as an answer if you like, since you helped me find the solution.

Data used in the answer:

df <- structure(list(an_id = c("doc_1", "doc_2", "doc_3", "doc_4"), 
    text = c("here", "This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.", 
    "The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.", 
    "There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."
    )), row.names = c(NA, -4L), class = "data.frame")
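
If the id column has already been read as int, it can also be tidied after the fact before the data frame is handed to corpus(). The sketch below is only one possible approach: prepare_ids is a hypothetical helper, not part of quanteda, and the small example data frame is made up for illustration.

```r
# Hypothetical helper (not a quanteda function): coerce an id column to
# character, check it, and add a prefix where it is not already present
prepare_ids <- function(data, id_col, prefix = "doc_") {
  ids <- as.character(data[[id_col]])
  if (anyNA(ids)) stop("id column contains missing values")
  if (anyDuplicated(ids) > 0) stop("id column contains duplicates")
  needs_prefix <- !startsWith(ids, prefix)
  ids[needs_prefix] <- paste0(prefix, ids[needs_prefix])
  data[[id_col]] <- ids
  data
}

# Made-up data with an integer id column, as in the question
example <- data.frame(an_id = c(4840L, 41981L),
                      text = c("here", "there"),
                      stringsAsFactors = FALSE)

example <- prepare_ids(example, "an_id")
str(example)
```

Checking for missing or duplicated ids here is a design choice made for the sketch, so that problems show up before tokenization rather than after.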