Split strings into new rows in R
I have a dataset like the following:
Country Region Molecule Item Code
IND NA PB102 FR206985511
THAI AP PB103 BA-107603 / F000113361 / 107603
LUXE NA PB105 1012701 / SGP-1012701 / F041701000
IND AP PB106 AU206985211 / CA-F206985211
THAI HP PB107 F034702000 / 1010701 / SGP-1010701
BANG NA PB108 F000007970/25781/20009021
I want to split the string values in the Item Code column on "/" and create a new row for each entry.
For example, the desired output would be:
Country Region Molecule Item.Code
IND NA PB102 FR206985511
THAI AP PB103 BA-107603
THAI AP PB103 F000113361
THAI AP PB103 107603
LUXE NA PB105 1012701
LUXE NA PB105 SGP-1012701
LUXE NA PB105 F041701000
IND AP PB106 AU206985211
IND AP PB106 CA-F206985211
THAI HP PB107 F034702000
THAI HP PB107 1010701
THAI HP PB107 SGP-1010701
BANG NA PB108 F000007970
BANG NA PB108 25781
BANG NA PB108 20009021
I tried the code below:
library(splitstackshape)
df2=concat.split.multiple(df1,"Plant.Item.Code","/", direction="long")
but I got this error:
"Error: memory exhausted (limit reached?)"
When I tried strsplit(), I got the error message below:
Error in strsplit(df1$Plant.Item.Code, "/") : non-character argument
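Try the cSplit function (since you are already using @Ananda's package). Note that it returns a data.table object, so make sure that package is installed. You can convert the result back to a data.frame by doing something like setDF(df2).
library(splitstackshape)
df2 <- ...
Try something like this: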
d <- structure(list(Country = c("A", "B", "C"), `Item Code` = c("FR206985511",
"BA-107603/F000113361/107603", "1012701/SGP-1012701/F041701000")),
.Names = c("Country", "Item Code"), row.names = c(NA, -3L),
class = "data.frame")
d
# Country Item code
# A FR206985511
# B BA-107603/F000113361/107603
# C 1012701/SGP-1012701/F041701000
codes <- strsplit(d$"Item Code", "/")
code.lengths <- sapply(codes, length)
new.d <- d[rep(1:nrow(d), code.lengths), ]
new.d$"Item Code" <- unlist(codes)
new.d
# Country Item Code
#1 A FR206985511
#2 B BA-107603
#2.1 B F000113361
#2.2 B 107603
#3 C 1012701
#3.1 C SGP-1012701
#3.2 C F041701000
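The strsplit/rep idea above can be wrapped in a small helper. This is only a sketch: the function name split_to_rows, its arguments, and the three-row df1 below are made up for illustration. It assumes the delimiter may or may not be surrounded by spaces (as in the question's data) and converts the column with as.character() in case it is a factor.

```r
# Hypothetical helper (not from the original answer): expand one delimited
# column of a data frame into one row per piece, using base R only.
split_to_rows <- function(df, col, pattern = " */ *") {
  # as.character() guards against factor columns, which strsplit() rejects
  pieces <- strsplit(as.character(df[[col]]), pattern)
  # repeat each row once per piece found in that row
  out <- df[rep(seq_len(nrow(df)), lengths(pieces)), , drop = FALSE]
  out[[col]] <- unlist(pieces)
  rownames(out) <- NULL  # drop the 2.1, 2.2, ... style row names
  out
}

# Small stand-in for the question's data (illustrative subset)
df1 <- data.frame(
  Country = c("IND", "THAI", "BANG"),
  Region = c(NA, "AP", NA),
  Molecule = c("PB102", "PB103", "PB108"),
  Plant.Item.Code = c("FR206985511",
                      "BA-107603 / F000113361 / 107603",
                      "F000007970/25781/20009021"),
  stringsAsFactors = FALSE
)

res <- split_to_rows(df1, "Plant.Item.Code")
print(res)
```

Because the rows are indexed once with rep() rather than grown in a loop, this should stay reasonably cheap on larger tables, though that has not been measured here.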
Another approach in base R:
as.data.frame(do.call(rbind, apply(df1, 1, function(x) {
do.call(expand.grid, strsplit(x, " */ *"))
})))
The result is:
Country Region Molecule Item.Code
1 IND <NA> PB102 FR206985511
2 THAI AP PB103 BA-107603
3 THAI AP PB103 F000113361
4 THAI AP PB103 107603
5 LUXE <NA> PB105 1012701
6 LUXE <NA> PB105 SGP-1012701
7 LUXE <NA> PB105 F041701000
8 IND AP PB106 AU206985211
9 IND AP PB106 CA-F206985211
10 THAI HP PB107 F034702000
11 THAI HP PB107 1010701
12 THAI HP PB107 SGP-1010701
13 BANG <NA> PB108 F000007970
14 BANG <NA> PB108 25781
15 BANG <NA> PB108 20009021
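To see why the expand.grid approach yields one row per code, it may help to run it on a single row. The sketch below uses a hand-built named vector (one_row and two_multi are illustrative names, not part of the answer). It also shows a caveat: expand.grid builds every combination, so if more than one field contains "/", the result is a Cartesian product rather than a row-by-row split.

```r
# One row of the data as a named character vector, as apply(df1, 1, ...) would pass it
one_row <- c(Country = "THAI", Region = "AP", Molecule = "PB103",
             Item.Code = "BA-107603 / F000113361 / 107603")

# strsplit() keeps the names, giving a named list with one vector per field
parts <- strsplit(one_row, " */ *")

# expand.grid() crosses the fields; three of them have a single value,
# so the output has exactly one row per item code
expanded <- do.call(expand.grid, c(parts, stringsAsFactors = FALSE))
print(expanded)

# Caveat: two multi-valued fields are crossed (2 regions x 3 codes = 6 rows)
two_multi <- c(Country = "THAI", Region = "AP / HP", Molecule = "PB103",
               Item.Code = "BA-107603 / F000113361 / 107603")
crossed <- do.call(expand.grid,
                   c(strsplit(two_multi, " */ *"), stringsAsFactors = FALSE))
print(nrow(crossed))
```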
For the second error, you can use strsplit(as.character(df1$Plant.Item.Code), "/"), assuming the column is a factor.
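A minimal reproduction of that second error, assuming the column really is a factor (the vector codes below is made up for illustration): strsplit() only accepts character input, so it fails on the factor and works once the factor is converted.

```r
# A factor column, as read.csv() produced by default before R 4.0
codes <- factor(c("FR206985511", "BA-107603/F000113361/107603"))

# strsplit() on a factor fails; capture the message instead of stopping
err <- tryCatch(strsplit(codes, "/"), error = function(e) conditionMessage(e))
print(err)

# Converting to character first makes the split work
pieces <- strsplit(as.character(codes), "/")
print(pieces)
```

Checking class(df1$Plant.Item.Code) first is a quick way to confirm whether this is the cause.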
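I second David's answer below. It will be more efficient. The function you are currently using relies on the reshape() function, which is slow and can run into memory problems.
Thanks a lot, David. That really worked, and it is a super fast solution. Is there a way to add a progress bar to this function?
@510947, it seems you have already filed a request on GitHub, haven't you?
Well, yes... just for a bigger audience :)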
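For reference, a sketch of what the cSplit call recommended in this thread might look like. The arguments simply mirror the ones the question passed to concat.split.multiple, and the three-row df1 is a made-up stand-in for the real data, so treat this as an assumption rather than the answer's exact code. It requires the splitstackshape and data.table packages.

```r
library(splitstackshape)

# Illustrative subset of the question's data
df1 <- data.frame(
  Country = c("IND", "THAI", "BANG"),
  Plant.Item.Code = c("FR206985511",
                      "BA-107603 / F000113361 / 107603",
                      "F000007970/25781/20009021"),
  stringsAsFactors = FALSE
)

# direction = "long" stacks the pieces into new rows instead of new columns
df2 <- cSplit(df1, "Plant.Item.Code", sep = "/", direction = "long")

# cSplit() returns a data.table; convert in place if a data.frame is needed
data.table::setDF(df2)
print(df2)
```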