Separate punctuation from words in a list and coerce the result to a data.frame in R
I have a corpus of words and punctuation, and I am trying to build a data frame from it so that I can use it later. The original dataset has 2,000,000 rows containing punctuation, but it is a list. I am having trouble separating the punctuation from the other words in the list. I want a space between each word and each punctuation mark. I can do this easily with find-and-replace in Excel. I have an example called df, and the output I want in R is called output. My code and what I have so far are below. I tried str_split on "How", but it removed "How" and did not return any result.

Tags: r, string, word

#------ Upload first dataset and edit ------#
library("stringr")
sent1

Here is a rough idea of how to get the job done:
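First split the strings at every non-word character (inspired by another post). Then take the maximum length and recycle the shorter vectors until they all have that length.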
dfsplt <- strsplit( gsub("([^\\w])","~\\1~", df, perl = TRUE), "~")
dfsplt <- lapply(dfsplt, function(x) x[!x %in% c("", " ")])
n <- max(lengths(dfsplt))
sapply(dfsplt, function(x) {x <- rep(x, ceiling(n / length(x))); x[1:n]})
# or
sapply(dfsplt, function(x) x[(1:n - 1) %% length(x) + 1])
[,1] [,2] [,3]
[1,] "How" "Why" "How"
[2,] "did" "does" "do"
[3,] "Quebec" "valve" "I"
[4,] "?" "=" "use"
[5,] "1" "." "a"
[6,] "2" "245" "period"
[7,] "3" "?" "("
[8,] "How" "." "."
[9,] "did" "66" ")"
[10,] "Quebec" "Why" "comma"
[11,] "?" "does" "["
[12,] "1" "valve" ","
[13,] "2" "=" "]"
[14,] "3" "." "and"
[15,] "How" "245" "hyphen"
[16,] "did" "?" "{"
[17,] "Quebec" "." "-"
[18,] "?" "66" "}"
[19,] "1" "Why" "to"
[20,] "2" "does" "columns"
[21,] "3" "valve" "?"
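Since the input is not shown in full, here is a self-contained sketch of the same split-and-recycle idea. The vector `txt` and the column names `words1`, `words2` are made up for illustration, and the approach assumes the strings never contain a literal `~`, because that character is used as a temporary separator.

```r
# Hypothetical input; the original data is not shown in full.
txt <- c("How did Quebec? 1 2 3", "Why does valve = .245? .66")

# Wrap every non-word character in "~", then split on "~".
parts <- strsplit(gsub("([^\\w])", "~\\1~", txt, perl = TRUE), "~")
# Drop the empty strings and lone spaces left over from the split.
parts <- lapply(parts, function(x) x[!x %in% c("", " ")])

# Recycle each token vector up to the length of the longest one.
n <- max(lengths(parts))
recycled <- sapply(parts, function(x) x[(seq_len(n) - 1) %% length(x) + 1])

# Coerce the character matrix to a data.frame (column names are assumed).
out <- as.data.frame(recycled, stringsAsFactors = FALSE)
names(out) <- paste0("words", seq_along(parts))
out
```

Note that `\\w` treats the underscore as a word character, so `_` would stay attached to its word with this pattern.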
Here is an option where we put a space around each punctuation character and then scan each string separately:
do.call(cbind, lapply(gsub("([[:punct:]])", " \\1 ",
df$text), function(x) scan(text = x, what = "", quiet = TRUE)))
# [,1] [,2] [,3]
# [1,] "How" "Why" "How"
# [2,] "did" "does" "do"
# [3,] "Quebec" "valve" "I"
# [4,] "?" "=" "use"
# [5,] "1" "." "a"
# [6,] "2" "245" "period"
# [7,] "3" "?" "("
# [8,] "How" "." "."
# [9,] "did" "66" ")"
#[10,] "Quebec" "Why" "comma"
#[11,] "?" "does" "["
#[12,] "1" "valve" ","
#[13,] "2" "=" "]"
#[14,] "3" "." "and"
#[15,] "How" "245" "hyphen"
#[16,] "did" "?" "{"
#[17,] "Quebec" "." "-"
#[18,] "?" "66" "}"
#[19,] "1" "Why" "to"
#[20,] "2" "does" "columns"
#[21,] "3" "valve" "?"
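Comments:

Is output your expected result of operating on df?

Yes, output = my expected output.

Your expected output works because words1 and words2 are exact multiples of words3 in length. Is your data always like that? I suspect this may be an XY problem: meta.stackexchange.com/questions/66377/what-is-the-XY-problem

Yes, words 1, 2 and 3 have different lengths in the original data. Filling with NAs instead of the multiples would also be fine.

Nice. So cbind() does recycling! Do you have any comments on the pros and cons of scan() versus strsplit()?

@snoram cbind issues a warning when it recycles. I think strsplit would be faster, but here I used scan because it returns a vector rather than a list (which might need unlist).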