R: Split a text string into columns based on variables
I have a data frame with a text column that I want to split into multiple columns, because each text string contains several variables such as location, education, distance, etc.
The data frame:
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
df = data.frame(text.string)
df
text.string
1 &location=NY&distance=30&education=University
2 &location=CA&distance=30&education=Highschool&education=University
3 &location=MN&distance=10&industry=Healthcare
4 &location=VT&distance=30&education=University&industry=IT&industry=Business
I can split the strings using cSplit:
cSplit(df, 'text.string', sep = "&")
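The problem is that a string may contain the same variable more than once, and some strings lack a variable altogether. With cSplit, the variables then end up mixed across the columns. I would like to avoid that and keep each variable grouped together, so that the result looks like this (education and industry no longer spread over multiple columns):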
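  text.string_1 text.string_2 text.string_3                             text.string_4                 text.string_5 text.string_6
1            NA   location=NY   distance=30                      education=University                          <NA>            NA
2            NA   location=CA   distance=30 education=Highschool education=University                          <NA>            NA
3            NA   location=MN   distance=10                                      <NA>           industry=Healthcare            NA
4            NA   location=VT   distance=30                      education=University industry=IT industry=Business            NA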
Taking @NicE's comment into account, here is one approach based on your example:
library(data.table)
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
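# Split on "&" and "=" so that names and values alternate in one vector.
clean <- strsplit(text.string, "&|=")
out <- lapply(clean, function(x) {
  # Drop the empty piece before the leading "&"; row 1 = names, row 2 = values.
  ma <- data.table(matrix(x[x != ""], nrow = 2, byrow = FALSE))
  setnames(ma, as.character(ma[1, ]))
  ma[-1, ]
})
# Stack the per-string tables; columns absent from a string are filled with NA.
out <- rbindlist(out, fill = TRUE)
out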
   location distance  education  education   industry industry
1:       NY       30 University         NA         NA       NA
2:       CA       30 Highschool University         NA       NA
3:       MN       10         NA         NA Healthcare       NA
4:       VT       30 University         NA         IT Business
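The output above still spreads education and industry over two columns each. If the layout from the question is wanted (one column per variable, with repeats kept in the same cell), a minimal base R sketch could look like the following. It assumes every piece has the form key=value; group_pairs, grouped, all_keys and wide are names made up for this illustration, and joining repeats with a space is an arbitrary choice.

```r
# Example input, copied from the question.
text.string <- c("&location=NY&distance=30&education=University",
                 "&location=CA&distance=30&education=Highschool&education=University",
                 "&location=MN&distance=10&industry=Healthcare",
                 "&location=VT&distance=30&education=University&industry=IT&industry=Business")
df <- data.frame(text.string, stringsAsFactors = FALSE)

# Hypothetical helper: split one string into "key=value" pieces and
# paste together the pieces that share a key.
group_pairs <- function(s) {
  pieces <- strsplit(s, "&", fixed = TRUE)[[1]]
  pieces <- pieces[nzchar(pieces)]        # drop the empty piece before the leading "&"
  keys <- sub("=.*$", "", pieces)         # text before the first "="
  # Keep keys in order of first appearance rather than alphabetical order.
  by_key <- split(pieces, factor(keys, levels = unique(keys)))
  vapply(by_key, paste, character(1), collapse = " ")
}

grouped <- lapply(df$text.string, group_pairs)
all_keys <- unique(unlist(lapply(grouped, names)))

# One column per key; a key that is missing from a string becomes NA.
cells <- vapply(grouped, function(g) unname(g[all_keys]),
                character(length(all_keys)))
wide <- as.data.frame(t(cells), stringsAsFactors = FALSE)
names(wide) <- all_keys
wide
```

Because the columns are built from all_keys, a variable that never occurs in a given string simply shows up as NA in that row instead of shifting the other variables sideways.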
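How about doing a dcast? You would need to add an id variable to keep track of which row each variable came from.

Thanks for your time and effort! But is there a way for strsplit to take a column of a data frame? Since my strings are in a data frame, I would very much prefer to do something like clean ...

I don't know if I understood you correctly, but you can also use lapply on each column, something like lapply(dataframe, strsplit). If that is not what you mean, please provide a complete reproducible example.

Sorry, maybe I was not very clear. Essentially, the strings that need to be split are in a single column of a data frame. In my post, for example, that column would be df$text.string. I hope it is clearer now.

Hmm... try this: data.frame(text.string = c("&location=NY&distance=30&education=University", "&location=CA&distance=30&education=Highschool&education=University", "&location=MN&distance=10&industry=Healthcare", "&location=VT&distance=30&education=University&industry=IT&industry=Business"), stringsAsFactors = FALSE)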
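The dcast idea from the comments can be sketched like this, reading the strings straight from df$text.string. This is only an illustration under assumptions: every piece is assumed to have the form key=value, the names pieces, long and wide are made up, and pasting repeated values with ", " is an arbitrary choice.

```r
library(data.table)

df <- data.frame(
  text.string = c("&location=NY&distance=30&education=University",
                  "&location=CA&distance=30&education=Highschool&education=University",
                  "&location=MN&distance=10&industry=Healthcare",
                  "&location=VT&distance=30&education=University&industry=IT&industry=Business"),
  stringsAsFactors = FALSE
)

# strsplit works directly on the data frame column (it must be character).
pieces <- strsplit(df$text.string, "&", fixed = TRUE)

# Long format: one row per key=value piece; id records the source row.
long <- rbindlist(lapply(seq_along(pieces), function(i) {
  p <- pieces[[i]]
  p <- p[nzchar(p)]                       # drop the empty piece before the leading "&"
  data.table(id = i,
             variable = sub("=.*$", "", p),      # text before the first "="
             value = sub("^[^=]*=", "", p))      # text after the first "="
}))

# Wide format: one column per variable; repeated values of a variable
# within the same id are pasted into a single cell.
wide <- dcast(long, id ~ variable, value.var = "value",
              fun.aggregate = function(v) paste(v, collapse = ", "))
wide
```

The id column is what lets dcast put each value back on the row it came from, which is the point raised in the first comment.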