Separating punctuation in a list and coercing to a data.frame in R

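I have a corpus of words and punctuation. I am trying to turn it into a data frame so that I can use it later. The original dataset has 2,000,000 rows containing punctuation, but it is a list. I am having trouble parsing the punctuation apart from the other words in the list. I want a space between every punctuation mark and the words. I can do this easily with find-and-replace in Excel. I have an example called df, and the output I want in R is called output. I have attached the code and what I have so far below. I tried str_split on "How", but it removed "How" and returned nothing useful.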

#------ Upload first dataset and edit -------#
library("stringr")

sent1

Here is a rough idea of how to get the job done:

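First split the strings at every non-word character (inspired by a linked answer). Then find the maximum length and recycle the shorter vectors up to that length.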

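# Wrap every non-word character in "~", then split on "~"
# (df is assumed to be a character vector here; for a data.frame column, pass df$text)
dfsplt <- strsplit(gsub("([^\\w])", "~\\1~", df, perl = TRUE), "~")
# Drop the empty strings and lone spaces left over from the split
dfsplt <- lapply(dfsplt, function(x) x[!x %in% c("", " ")])
# Length of the longest token vector
n <- max(lengths(dfsplt))
# Recycle each vector up to length n
sapply(dfsplt, function(x) {x <- rep(x, ceiling(n / length(x))); x[1:n]})
# or, equivalently, with modular indexing
sapply(dfsplt, function(x) x[(1:n - 1) %% length(x) + 1])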

      [,1]     [,2]    [,3]     
 [1,] "How"    "Why"   "How"    
 [2,] "did"    "does"  "do"     
 [3,] "Quebec" "valve" "I"      
 [4,] "?"      "="     "use"    
 [5,] "1"      "."     "a"      
 [6,] "2"      "245"   "period" 
 [7,] "3"      "?"     "("      
 [8,] "How"    "."     "."      
 [9,] "did"    "66"    ")"      
[10,] "Quebec" "Why"   "comma"  
[11,] "?"      "does"  "["      
[12,] "1"      "valve" ","      
[13,] "2"      "="     "]"      
[14,] "3"      "."     "and"    
[15,] "How"    "245"   "hyphen" 
[16,] "did"    "?"     "{"      
[17,] "Quebec" "."     "-"      
[18,] "?"      "66"    "}"      
[19,] "1"      "Why"   "to"     
[20,] "2"      "does"  "columns"
[21,] "3"      "valve" "?"  
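To make the steps above concrete, here is a minimal sketch that runs them end to end on a small character vector. The vector txt is an assumption, reconstructed from the tokens in the printed matrix rather than taken from the asker's data, and the final coercion to a data.frame is only one possible way to finish.

```r
# Assumed input: reconstructed from the printed matrix, not the asker's real data
txt <- c("How did Quebec? 1 2 3",
         "Why does valve = .245? .66",
         "How do I use a period (.) comma [,] and hyphen {-} to columns?")

# Wrap every non-word character (spaces included) in "~" and split on "~"
tokens <- strsplit(gsub("([^\\w])", "~\\1~", txt, perl = TRUE), "~")

# Drop the empty strings and lone spaces created by the split
tokens <- lapply(tokens, function(x) x[!x %in% c("", " ")])

# Number of tokens per input string
lengths(tokens)

# Recycle each vector up to the longest length, then coerce to a data.frame
n <- max(lengths(tokens))
m <- sapply(tokens, function(x) rep(x, length.out = n))
output <- as.data.frame(m, stringsAsFactors = FALSE)
```

rep(x, length.out = n) does the same recycling as the two sapply() variants above in a single call; the columns of output get the default names V1, V2, V3.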

Here is an option where we put a space around each punctuation mark and then scan() each string separately.

do.call(cbind, lapply(gsub("([[:punct:]])", " \\1 ", 
       df$text), function(x) scan(text = x, what = "", quiet = TRUE)))
#      [,1]     [,2]    [,3]     
# [1,] "How"    "Why"   "How"    
# [2,] "did"    "does"  "do"     
# [3,] "Quebec" "valve" "I"      
# [4,] "?"      "="     "use"    
# [5,] "1"      "."     "a"      
# [6,] "2"      "245"   "period" 
# [7,] "3"      "?"     "("      
# [8,] "How"    "."     "."      
# [9,] "did"    "66"    ")"      
#[10,] "Quebec" "Why"   "comma"  
#[11,] "?"      "does"  "["      
#[12,] "1"      "valve" ","      
#[13,] "2"      "="     "]"      
#[14,] "3"      "."     "and"    
#[15,] "How"    "245"   "hyphen" 
#[16,] "did"    "?"     "{"      
#[17,] "Quebec" "."     "-"      
#[18,] "?"      "66"    "}"      
#[19,] "1"      "Why"   "to"     
#[20,] "2"      "does"  "columns"
#[21,] "3"      "valve" "?"    

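Is output your expected result of operating on df?

Yes, output = my expected output.

Your expected output works because words1 and words2 are exact multiples of words3 in length. Is your data always like this? I suspect this may be an XY problem: meta.stackexchange.com/questions/66377/what-is-the-XY-problem

Yes, words 1, 2 and 3 have different lengths in the original data. Filling with NAs instead of multiples would also work.

Nice. So cbind() does recycling! Do you have any comments on the pros and cons of scan() versus strsplit()?

@snoram cbind gives a warning while recycling. I think strsplit would be faster, but here I used scan because it returns a vector rather than a list (which may need unlist).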
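The comments mention that padding with NA would also be acceptable in place of recycling. A minimal sketch of that variant follows, assuming a character vector txt (hypothetical sample input, not the asker's data). It tokenises with scan() as in the answer above, then pads each vector with `length<-` before binding, so cbind() has nothing to recycle.

```r
# Hypothetical sample input, not the asker's data
txt <- c("How did Quebec? 1 2 3",
         "Why does valve = .245? .66")

# Put a space on both sides of every punctuation mark, then tokenise on whitespace
spaced <- gsub("([[:punct:]])", " \\1 ", txt)
tokens <- lapply(spaced, function(x) scan(text = x, what = "", quiet = TRUE))

# Pad every vector with NA up to the longest length, so nothing is recycled
n <- max(lengths(tokens))
padded <- lapply(tokens, function(x) { length(x) <- n; x })

# Equal-length columns bind without a recycling warning
output <- as.data.frame(do.call(cbind, padded), stringsAsFactors = FALSE)
```

One caveat with scan(): by default it treats single and double quotes as quoting characters, so text containing apostrophes may need quote = "" to be split as expected.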