Loop in R to match and extract data from another sheet


I'm new to R, so please bear with me. I have two data frames, dfA and dfB.

dfA

dfB

Expected output

Type         No
Test1 11000   1
11000 Test2   1
Test3 11000   1
11000 Test4   1
Test5 11001   2
Test6 11002   3
Test7 11003   4
Test8 11004   5
Test9 11004   5
Test10 11006  7

I believe a for loop and grepl are needed. It would be very helpful if someone could help me write the for loop.

Hope this helps

Type <- c("Test1 11000","11000 Test2","Test3 11000","11000 Test4","Test5 11001",
          "Test6 11002","Test7 11003","Test8 11004","Test9 11004","Test10 11006")

Asset_NO <- seq(11000,11009,1)   
No <- seq(1,10,1)

dfA <- data.frame(Type)
dfB <- data.frame(Asset_NO,No)

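library(stringr)  # provides str_split()

parts <- str_split(dfA$Type, " ")

# sapply() already walks over every row, so no explicit for loop is needed
f <- sapply(parts, "[", 1)  # first token of each Type
s <- sapply(parts, "[", 2)  # second token of each Type
# If the first token contains a letter, the asset number is the second token
# (an earlier version tested grepl("Test", f); changed as per the new comment)
v <- ifelse(grepl("[a-zA-Z]", f), s, f)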

dfA$Asset_NO <- v
dfA$Asset_NO <- as.numeric(dfA$Asset_NO)

m <- merge(dfA, dfB, by="Asset_NO")
m

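I first create a new column for dfA called NEW and use gsub to replace all letters with the empty string "". This leaves two numbers per row: one is the test index and the other is the value found in the Asset_NO column of dfB. I pick the meaningful one by sorting the numbers in decreasing order and keeping the first: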
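dfA$NEW <- NA
for (i in seq_len(nrow(dfA))) {
    # Drop the letters, split on the space, and convert both pieces to numbers
    temp <- as.numeric(strsplit(gsub("[[:alpha:]]", "", dfA$Type[i]), split = " ")[[1]])
    # Keep the larger number, assumed here to be the asset number
    dfA$NEW[i] <- sort(temp, decreasing = TRUE)[1]
}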

I make sure both columns are now numeric, so that we are not comparing apples and oranges:
dfB$Asset_NO = as.numeric(dfB$Asset_NO)

Then all I need to do is merge them:
df_new = merge(x=dfA, y=dfB, by.x  ="NEW" , by.y="Asset_NO")[,-1]
print(df_new)

# Type No
# 1   Test1 11000  1
# 2   11000 Test2  1
# 3   Test3 11000  1
# 4   11000 Test4  1
# 5   Test5 11001  2
# 6   Test6 11002  3
# 7   Test7 11003  4
# 8   Test8 11004  5
# 9   Test9 11004  5
# 10 Test10 11006  7
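
The row-by-row loop above can also be written without an explicit loop. The following is only a sketch of that idea, not part of the original answer. It rebuilds the sample dfA from this thread and keeps the same assumption as the loop: the asset number is always the larger of the two numbers in Type.

```r
# Sample data as defined earlier in this thread
Type <- c("Test1 11000", "11000 Test2", "Test3 11000", "11000 Test4", "Test5 11001",
          "Test6 11002", "Test7 11003", "Test8 11004", "Test9 11004", "Test10 11006")
dfA <- data.frame(Type, stringsAsFactors = FALSE)

# Strip the letters and split each string on the space
tokens <- strsplit(gsub("[[:alpha:]]", "", dfA$Type), " ", fixed = TRUE)

# Keep the largest number per row (assumed to be the asset number)
dfA$NEW <- vapply(tokens, function(tok) max(as.numeric(tok)), numeric(1))
dfA$NEW
```

If a test index could ever exceed the asset number, this shortcut (like the sorting in the loop) would pick the wrong value.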

You can use gregexpr and a simple ifelse statement to create the Asset_NO column in dfA and then merge with dfB, as follows:
dfA$Asset_NO <- ifelse(sapply(gregexpr('[A-Za-z]+', dfA$Type), '[', 1) > 1, 
                          gsub('\\s+.*', '', dfA$Type), gsub('.*\\s+', '', dfA$Type))

merge(dfA, dfB)

#   Asset_NO         Type No
#1     11000  Test1 11000  1
#2     11000  11000 Test2  1
#3     11000  Test3 11000  1
#4     11000  11000 Test4  1
#5     11001  Test5 11001  2
#6     11002  Test6 11002  3
#7     11003  Test7 11003  4
#8     11004  Test8 11004  5
#9     11004  Test9 11004  5
#10    11006 Test10 11006  7

If dfB$No is just a row number, I would simply do
match(as.integer(sub(".*(\\b\\d+\\b).*", "\\1", dfA$Type)), dfB$Asset_NO)
## [1] 1 1 1 1 2 3 4 5 5 7

This captures only the integer in dfA$Type (delimited by word boundaries) and then matches it back against dfB$Asset_NO.

Otherwise, with a slight modification, you could do
indx <- match(as.integer(sub(".*(\\b\\d+\\b).*", "\\1", dfA$Type)), dfB$Asset_NO)
dfB[indx, "No"]
## [1] 1 1 1 1 2 3 4 5 5 7
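
To see what the word-boundary pattern above does and where it stops working, here is a small sketch on three hand-picked strings. The vector x is made up for illustration, and perl = TRUE is added only to pin down the regex engine; neither is part of the original answer.

```r
x <- c("Test1 11000", "11000 Test2", "Test311000")

# Capture a digit run that has a word boundary on both sides
extracted <- sub(".*(\\b\\d+\\b).*", "\\1", x, perl = TRUE)
extracted

# "Test311000" has no standalone digit run, so sub() finds no match and
# returns the string unchanged; as.integer() then turns it into NA
ids <- suppressWarnings(as.integer(extracted))
ids
```

So this approach depends on the asset number being separated from the letters, for example by a space.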

Here is another approach using data.table and stringr:
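library(data.table)
# Assumes dfA and dfB are already data.tables (e.g. via setDT()) and that
# the asset column in dfB is named Asset, as the join below requires
dfA[, Asset := as.integer(stringr::str_extract(Type, "(^\\d{5})|(\\d{5}$)"))]
dfB[dfA, on = "Asset", .(Type, No)]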
#            Type No
# 1:  Test1 11000  1
# 2:  11000 Test2  1
# 3:  Test3 11000  1
# 4:  11000 Test4  1
# 5:  Test5 11001  2
# 6:  Test6 11002  3
# 7:  Test7 11003  4
# 8:  Test8 11004  5
# 9:  Test9 11004  5
#10: Test10 11006  7
#11:   Test111000  1
#12:   11000Test2  1
#13:   Test311000  1
#14:   11000Test4  1
#15:   Test511001  2
#16:   Test611002  3
#17:   Test711003  4
#18:   Test811004  5
#19:   Test911004  5
#20:  Test1011006  7
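
The same anchored five-digit extraction can be sketched in base R, for readers without data.table or stringr. This is an illustration under assumptions, not the answer's own code: the Type vector is a hand-picked subset, dfB uses the Asset_NO column name from the earlier answers, and regexpr()/regmatches() stand in for str_extract().

```r
Type <- c("Test1 11000", "11000 Test2", "Test311000", "11000Test4", "Test1011006")
dfB <- data.frame(Asset_NO = 11000:11009, No = 1:10)

# Locate a five-digit run anchored at the start or the end of each string
m <- regexpr("^[0-9]{5}|[0-9]{5}$", Type)

# regmatches() returns only the matched pieces, so fill them in by position
asset <- rep(NA_integer_, length(Type))
asset[m > 0] <- as.integer(regmatches(Type, m))

# Look up No for each extracted asset number
no <- dfB$No[match(asset, dfB$Asset_NO)]
data.frame(Type, asset, no)
```

Strings that neither start nor end with five digits are left as NA rather than guessed.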

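Note that all the answers differ in the regular expression used to extract the asset number from Type. This is due to the poor specification given in the original question.

The regular expression used here assumes that the asset number always consists of five digits and that Type either starts or ends with the asset number. It does not rely on word boundaries, so it also works when Type contains no spaces.

After the extracted asset number has been assigned to a new column, dfA is joined with dfB on Asset.

Discussion

I just realized that the OP has revealed important information in various comments:

@etienne it's one row with no columns, so ideally a single word "11000Test2" "Test311000". The reason for using grepl and searching from the second df is that the number is not in its own row, so a search is needed to match, or else perhaps remove the letters and match

and

Can't use grepl on "test", because not all types are labelled test. This is just an example of the real data. So basically the search has to be on the number: search for the first number from dfB and match it against dfA, then take the output from column b of dfB; if it doesn't match, pick the second number from the second df, and so on

However, as long as the OP doesn't provide a more realistic sample of the production data, and as long as the asset number can be extracted unambiguously from Type with a regular expression, there is no need to look up each given asset number to see whether it has a match in Type.

In dfA, is it normal that Test2 and Test4 are in the right column rather than the left? In dfA, is Type a single column with data such as "Test1 11000"? It would be best to post the output of dput(dfA) and dput(dfB) to give a better idea of what your data looks like.

@etienne it's one row with no columns, so ideally a single word "11000Test2" "Test311000". The reason for using grepl and searching from the second df is that the number is not in its own row, so a search is needed to match, or else perhaps remove the letters and match.

@Hardik Yes, it is a single column.

Can't use grepl on "test", because not all types are labelled test. This is just an example; the real data is completely different. So basically the search has to be on the number: search for the first number from dfB and match it against dfA, then take the output from column b of dfB; if it doesn't match, pick the second number from the second df, and so on.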
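
For completeness, the lookup the OP describes in the comments (take each asset number from dfB in turn and search for it in dfA$Type) could be sketched as follows. This is only an illustration under assumptions: the sample strings are made up, the first matching asset number wins, and the search is a plain substring test with grepl(fixed = TRUE).

```r
Type <- c("Test111000", "11000Test2", "Test5 11001", "Test1011006", "NoAssetHere")
dfA <- data.frame(Type, stringsAsFactors = FALSE)
dfB <- data.frame(Asset_NO = 11000:11009, No = 1:10)

dfA$No <- NA_integer_
for (i in seq_len(nrow(dfB))) {
  # Rows of dfA whose Type contains this asset number as a literal substring
  hit <- grepl(as.character(dfB$Asset_NO[i]), dfA$Type, fixed = TRUE)
  # Fill only rows that have not been matched yet (first match wins)
  fill <- hit & is.na(dfA$No)
  dfA$No[fill] <- dfB$No[i]
}
dfA
```

Because this is a substring search, it can produce false matches when the digits of the test index run into the asset number and happen to form another valid asset number, which is why extracting the number with a regular expression is preferable whenever that is possible.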