Conditional string splitting in data.table in R


r regex


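Following on from this question, I'd like to know whether there is an efficient way to conditionally split a text string depending on the contents of the row.

Suppose I have the following table: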

Comments                  Eaten
001 Testing my computer   No
0026 Testing my fridge    No
Testing my car            Yes

I would like to get this:

ID   Comments             Eaten
001  Testing my computer  No
0026 Testing my fridge    No
NA   Testing my car       Yes

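where NA means the ID is empty.

Is this possible in data.table?

A comment is supposed to have an ID, but because the ID is optional, I only want to extract it when the comment starts with a number.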
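The table above is not given in a reproducible form, so here is a minimal sketch of how it could be built for testing. The object name dt is taken from the answers; the column types (plain character) are an assumption.

```r
library(data.table)

# Sample data from the question; character columns are an assumption
dt <- data.table(
  Comments = c("001 Testing my computer",
               "0026 Testing my fridge",
               "Testing my car"),
  Eaten = c("No", "No", "Yes")
)

dt
```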

This can be done with the extract function from tidyr, which lets you specify a regular expression pattern:

tidyr::extract(dt, Comments, c("ID", "Comments"), regex = "^(\\d+)?\\s?(.*)$")
#     ID            Comments Eaten
#1:  001 Testing my computer    No
#2: 0026   Testing my fridge    No
#3:   NA      Testing my car   Yes

If you want the extracted columns converted to more sensible types, you can add the argument convert = TRUE.

Another option, using only base R and data.table, is

dt[grepl("^\\d+", Comments),                          # only rows that start with an ID
   `:=`(ID = sub("^(\\d+).*", "\\1", Comments),       # extract the ID from Comments
        Comments = sub("^\\d+\\s*", "", Comments))    # remove the ID and the whitespace after it
]
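To make the individual regex steps easier to follow, here is a minimal base R sketch of the same idea on a plain character vector. The variable names x, has_id, id and comments are illustrative only, and the last step also strips the whitespace that follows the ID.

```r
# Same comments as in the question, as a plain character vector
x <- c("001 Testing my computer",
       "0026 Testing my fridge",
       "Testing my car")

# Step 1: which comments start with digits?
has_id <- grepl("^\\d+", x)

# Step 2: keep the leading digits where present, NA otherwise
id <- ifelse(has_id, sub("^(\\d+).*", "\\1", x), NA_character_)

# Step 3: drop the leading digits and any whitespace after them
comments <- sub("^\\d+\\s*", "", x)

data.frame(ID = id, Comments = comments, stringsAsFactors = FALSE)
```

Because the ID is kept as a character string, leading zeros such as in 001 are preserved.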

That said, the tidyr syntax seems a little easier to me in this case. There may also be a way to do it with data.table's tstrsplit function and a fancy lookaround regex.

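So do you know in advance that there should be an ID and a comment, or should that be detected automatically?

Having an ID is optional, but if the comment starts with a number, then that number should be the ID.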

transpose(regmatches(x, regexec("^(\\d+)? ?(.*)", x)))

or something similar, I guess. Untested, since the OP's data is not reproducible... R 3.4.0 also has a strcapture function that could fit here:

strcapture("^(\\d+)\\s(.*)$", dt$Comments, data.frame(ID = "", Comment = ""))

@alexis_laz That looks interesting, but those empty strings are a bit odd too. I haven't upgraded to 3.4.0 yet.
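For reference, the strcapture suggestion can be sketched as follows, assuming R 3.4.0 or later. The prototype here uses zero-length character columns instead of empty strings, and the final line is an added assumption about how to handle comments without an ID: with this pattern, rows that do not match come back as NA in every column, so the original text is restored for them.

```r
x <- c("001 Testing my computer",
       "0026 Testing my fridge",
       "Testing my car")

# Prototype: zero-length columns give the names and types of the result
proto <- data.frame(ID = character(), Comments = character(),
                    stringsAsFactors = FALSE)

# Pattern from the comment: digits, one whitespace character, then the rest
res <- strcapture("^(\\d+)\\s(.*)$", x, proto)

# Non-matching rows are NA in both columns; restore their original text
res$Comments[is.na(res$ID)] <- x[is.na(res$ID)]

res
```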