R: Splitting a text string into columns based on variables

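I have a data frame with a text column that I want to split into multiple columns, because each text string contains several variables such as location, education, distance, and so on.

Data frame: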

text.string = c("&location=NY&distance=30&education=University", 
                "&location=CA&distance=30&education=Highschool&education=University", 
                "&location=MN&distance=10&industry=Healthcare", 
                "&location=VT&distance=30&education=University&industry=IT&industry=Business") 

df = data.frame(text.string)
df


                                                                  text.string
1                               &location=NY&distance=30&education=University
2          &location=CA&distance=30&education=Highschool&education=University
3                                &location=MN&distance=10&industry=Healthcare
4 &location=VT&distance=30&education=University&industry=IT&industry=Business
I can split the strings using cSplit (from the splitstackshape package):

cSplit(df, 'text.string', sep = "&")

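The problem is that a text string may contain the same variable more than once, and some strings are missing a variable altogether. With cSplit the pieces are assigned to columns by position, so different variables end up mixed in the same column. I would like to avoid that and keep each variable grouped together.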
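To make the misalignment concrete, here is a minimal base R sketch of what purely positional splitting does to these strings. It is only an illustration of the problem, not a reproduction of what cSplit does internally; the names pieces, width and by_position are made up for this example.

```r
# Same strings as in the question
text.string <- c("&location=NY&distance=30&education=University",
                 "&location=CA&distance=30&education=Highschool&education=University",
                 "&location=MN&distance=10&industry=Healthcare",
                 "&location=VT&distance=30&education=University&industry=IT&industry=Business")

# Split on "&" only and drop the empty piece produced by the leading "&"
pieces <- lapply(strsplit(text.string, "&", fixed = TRUE),
                 function(x) x[x != ""])

# Pad every row with NA to the same length and bind the pieces by position
width <- max(lengths(pieces))
by_position <- t(sapply(pieces, function(x) c(x, rep(NA, width - length(x)))))

# The third position holds education for rows 1, 2 and 4 but industry for row 3
by_position[, 3]
```

Because row 3 has no education entry, its industry value slides into the position that holds education for the other rows.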

So the result would look something like this (education and industry no longer spread across multiple columns):

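  text.string_1 text.string_2 text.string_3                             text.string_4                 text.string_5 text.string_6
1            NA   location=NY   distance=30                      education=University                          <NA>            NA
2            NA   location=CA   distance=30 education=Highschool education=University                          <NA>            NA
3            NA   location=MN   distance=10                                      <NA>           industry=Healthcare            NA
4            NA   location=VT   distance=30                      education=University  industry=IT industry=Business            NA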

Taking @NicE's comment into account and following your example, here is one approach:

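    library(data.table)

    text.string = c("&location=NY&distance=30&education=University",
                    "&location=CA&distance=30&education=Highschool&education=University",
                    "&location=MN&distance=10&industry=Healthcare",
                    "&location=VT&distance=30&education=University&industry=IT&industry=Business")

    # Split on both "&" and "=", giving alternating names and values
    clean <- strsplit(text.string, "&|=")

    out <- lapply(clean, function(x) {
      # Drop the empty piece left by the leading "&", then fill a 2-row
      # matrix column by column: row 1 holds names, row 2 holds values
      ma <- data.table(matrix(x[x != ""], nrow = 2, byrow = FALSE))
      setnames(ma, as.character(ma[1, ]))
      ma[-1, ]
    })

    # Stack the one-row tables, filling missing variables with NA
    out <- rbindlist(out, fill = TRUE)
    out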
       location distance  education  education   industry industry
    1:       NY       30 University         NA         NA       NA
    2:       CA       30 Highschool University         NA       NA
    3:       MN       10         NA         NA Healthcare       NA
    4:       VT       30 University         NA         IT Business
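The output above spreads repeated variables over columns that share a name. If the goal is the layout asked for in the question, with one column per variable and repeated values kept together in a single cell, a base R sketch along these lines may be closer. It is not the answerer's method: parse_one, parsed, vars and wide are names invented for this illustration, and the cells hold only the values (the variable name becomes the column name) rather than the key=value text shown in the question.

```r
text.string <- c("&location=NY&distance=30&education=University",
                 "&location=CA&distance=30&education=Highschool&education=University",
                 "&location=MN&distance=10&industry=Healthcare",
                 "&location=VT&distance=30&education=University&industry=IT&industry=Business")

# Parse one string into a named character vector: one entry per variable,
# with repeated values joined by a space
parse_one <- function(s) {
  pairs  <- strsplit(s, "&", fixed = TRUE)[[1]]
  pairs  <- pairs[pairs != ""]            # drop the piece before the leading "&"
  keys   <- sub("=.*$", "", pairs)        # text before the first "="
  values <- sub("^[^=]*=", "", pairs)     # text after the first "="
  groups <- split(values, factor(keys, levels = unique(keys)))
  vapply(groups, paste, character(1), collapse = " ")
}

parsed <- lapply(text.string, parse_one)

# Every variable seen in any row, in order of first appearance
vars <- unique(unlist(lapply(parsed, names)))

# Look each variable up in each row; variables a row lacks come back as NA
wide <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) unname(p[vars]))),
  stringsAsFactors = FALSE
)
names(wide) <- vars
wide
```

Joining repeats with a space keeps one row per input string; a different separator could be chosen if the values themselves may contain spaces.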
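How about using dcast? You would need to add an id variable to keep track of which row each variable came from.

Thanks for your time and effort! But is there a way for strsplit to take a column of a data frame? My strings are in a data frame, so I would very much like to do something like the clean step above.

I don't know if I understood you correctly, but you can also use lapply on each column, something like lapply(dataframe, strsplit). If I didn't, please provide a complete reproducible example.

Sorry, maybe I wasn't very clear: essentially, the strings that need to be split are in a single column of a data frame. For example, in my post that column would be df$text.string. I hope it is clearer now.

Hmm... try this: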
data.frame(text.string = c("&location=NY&distance=30&education=University",
                           "&location=CA&distance=30&education=Highschool&education=University",
                           "&location=MN&distance=10&industry=Healthcare",
                           "&location=VT&distance=30&education=University&industry=IT&industry=Business"),
           stringsAsFactors = F)
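The stringsAsFactors argument matters here because strsplit only accepts character input, and before R 4.0 data.frame converted strings to factors by default. A minimal base R sketch of splitting the data frame column directly, while also recording an id for the source row as suggested in the dcast comment, could look like this. The names pieces and long and the column names id, variable and value are assumptions made for this example.

```r
df <- data.frame(
  text.string = c("&location=NY&distance=30&education=University",
                  "&location=CA&distance=30&education=Highschool&education=University",
                  "&location=MN&distance=10&industry=Healthcare",
                  "&location=VT&distance=30&education=University&industry=IT&industry=Business"),
  stringsAsFactors = FALSE
)

# as.character() guards against the column having been stored as a factor,
# which strsplit() would reject
pieces <- strsplit(as.character(df$text.string), "&", fixed = TRUE)

# One row per key=value pair, tagged with the row of df it came from
long <- do.call(rbind, lapply(seq_along(pieces), function(i) {
  pairs <- pieces[[i]][pieces[[i]] != ""]
  data.frame(id       = i,
             variable = sub("=.*$", "", pairs),
             value    = sub("^[^=]*=", "", pairs),
             stringsAsFactors = FALSE)
}))

# All pairs that came from the second row of df
long[long$id == 2, ]
```

In this long form repeated variables simply occupy several rows with the same id, so nothing is lost, and the table can be reshaped to a wide layout afterwards.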