R: join, then mutate using data.table (without intermediate tables)
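I am a beginner with data.table and have been searching for how to do a join and then mutate columns. I found a thread, but could not get any further.
Note that I can do what I want with dplyr, but because of the size of the data, running that code on the real data is not feasible. For the same reason, I cannot create intermediate tables.
Below are my data and my dplyr solution.
Input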
DFI = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234",
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1",
"S1", "S1", "S2", "S2", "S2", "S3", "S4", "S7", "S10"), F_Year = c(2012,
2012, 2012, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X",
"385X", "385X", "450X", "450X", "900X", "3700", "3700", "A11U",
"2700"), Revenue = c(1, 2, 3, 34, 34, 6, 7, 88, 9, 100), Quantity = c(1,
2, 3, 8, 8, 6, 7, 8, 9, 40), Location1 = c("MA", "NY", "WA",
"NY", "WA", "NY", "IL", "IL", "MN", "CA")), .Names = c("PO_ID",
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity", "Location1"
), row.names = c(NA, 10L), class = "data.frame")
Lookup table
DF_Lookup = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P2345",
"P2345", "P3456", "P4567"), SO_ID = c("S1", "S2", "S2", "S3",
"S4", "S7", "S10"), F_Year = c(2012, 2013, 2013, 2011, 2011,
2014, 2015), Product_ID = c("385X", "450X", "900X", "3700", "3700",
"A11U", "2700"), Revenue = c(50, 70, 35, 100, -50, 50, 100),
Quantity = c(3, 20, 20, 20, -10, 20, 40)), .Names = c("PO_ID",
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity"), row.names = c(NA,
7L), class = "data.frame")
Output
DFO = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234",
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1",
"S1", "S1", "S2", "S2", "S2", "S3", "S4", "S7", "S10"), F_Year = c(2012,
2012, 2012, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X",
"385X", "385X", "450X", "450X", "900X", "3700", "3700", "A11U",
"2700"), Revenue = c(16.6666666666667, 16.6666666666667, 16.6666666666667,
35, 35, 35, 100, -50, 50, 100), Quantity = c(1, 1, 1, 10, 10,
20, 20, -10, 20, 40), Location1 = c("MA", "NY", "WA", "NY", "WA",
"NY", "IL", "IL", "MN", "CA")), .Names = c("PO_ID", "SO_ID",
"F_Year", "Product_ID", "Revenue", "Quantity", "Location1"), row.names = c(NA,
10L), class = "data.frame")
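Here is my code using dplyr.
I use two libraries here: dplyr and compare.
I use a left join to bring the lookup table's values into DFI. Then I divide the Revenue and Quantity columns by the number of rows in each group, because I want to prevent the numbers from being inflated when the rows are grouped.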
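library(dplyr)  # provides %>% and n()

DF_Generated <- DFI %>%
  # clashing columns come back as Revenue.x/.y and Quantity.x/.y
  dplyr::left_join(DF_Lookup, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
  dplyr::group_by(PO_ID, SO_ID, F_Year, Product_ID) %>%
  dplyr::mutate(Count = n()) %>%  # number of rows sharing the key
  dplyr::ungroup() %>%
  # spread each lookup value evenly over the rows that share its key
  dplyr::mutate(Revenue = Revenue.y / Count, Quantity = Quantity.y / Count) %>%
  dplyr::select(PO_ID:Product_ID, Location1, Revenue, Quantity)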
I would sincerely appreciate any help.
It is more efficient to simply add columns to DFI (in an "update join") rather than create a new table:
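library(data.table)
setDT(DFI); setDT(DF_Lookup)  # convert the data.frames to data.tables by reference

DFI[DF_Lookup, on=.(PO_ID, SO_ID, F_Year, Product_ID),
  `:=`(newrev = i.Revenue/.N, newqty = i.Quantity/.N)
, by=.EACHI]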
    PO_ID SO_ID F_Year Product_ID Revenue Quantity Location1    newrev newqty
 1: P1234    S1   2012       385X       1        1        MA  16.66667      1
 2: P1234    S1   2012       385X       2        2        NY  16.66667      1
 3: P1234    S1   2012       385X       3        3        WA  16.66667      1
 4: P1234    S2   2013       450X      34        8        NY  35.00000     10
 5: P1234    S2   2013       450X      34        8        WA  35.00000     10
 6: P1234    S2   2013       900X       6        6        NY  35.00000     20
 7: P2345    S3   2011       3700       7        7        IL 100.00000     20
 8: P2345    S4   2011       3700      88        8        IL -50.00000    -10
 9: P3456    S7   2014       A11U       9        9        MN  50.00000     20
10: P4567   S10   2015       2700     100       40        CA 100.00000     40
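This is a fairly natural extension of the Q&A linked in the OP.
by=.EACHI groups by each row of i in x[i, on=, j]; .N is the number of rows in that group, i.e. the number of rows of x matched by that row of i.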
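To see what by=.EACHI and .N do in isolation, here is a minimal sketch on made-up toy tables. The names x, i, id, val, total and share are hypothetical and not part of the question's data. It first counts the matches per row of i, then uses that same count to spread each total evenly over the matching rows.

```r
library(data.table)

# Hypothetical toy tables (assumption: not the question's data)
x <- data.table(id = c("a", "a", "a", "b"), val = c(1, 2, 3, 4))
i <- data.table(id = c("a", "b"), total = c(30, 8))

# One result row per row of i; .N is the number of rows of x that row matched
counts <- x[i, on = .(id), .N, by = .EACHI]
print(counts)   # id "a" matches 3 rows of x, id "b" matches 1

# Same grouping inside an update join: each total is split over its matches
x[i, on = .(id), share := i.total / .N, by = .EACHI]
print(x)        # share is 10, 10, 10 for "a" and 8 for "b"
```

Because `:=` modifies x by reference, no copy of the table is made here.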
If you want to overwrite the Revenue and Quantity columns instead, use
`:=`(Revenue=i.Revenue/.N, Quantity=i.Quantity/.N)
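As a rough illustration of the overwriting variant, the sketch below runs it on small made-up tables. DFI_toy and Lookup_toy are hypothetical stand-ins with a single key column, and the columns are stored as doubles so the divided values fit. It also checks that the per-key sums equal the lookup totals again, which is the point of dividing by .N.

```r
library(data.table)

# Hypothetical stand-ins for DFI and DF_Lookup (assumption: one key column only)
DFI_toy <- data.table(PO_ID    = c("P1", "P1", "P2"),
                      Revenue  = c(1, 2, 7),
                      Quantity = c(1, 2, 7))
Lookup_toy <- data.table(PO_ID    = c("P1", "P2"),
                         Revenue  = c(50, 100),
                         Quantity = c(4, 20))

# Update join that overwrites the existing columns in place
DFI_toy[Lookup_toy, on = .(PO_ID),
        `:=`(Revenue = i.Revenue / .N, Quantity = i.Quantity / .N),
        by = .EACHI]

# Summing by key gives back the lookup values, so nothing is inflated
totals <- DFI_toy[, .(Revenue = sum(Revenue), Quantity = sum(Quantity)), by = PO_ID]
print(totals)
```

One thing to keep in mind: an update join only touches rows of the left table that have a match, so unmatched rows would keep their old Revenue and Quantity rather than becoming NA.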
OK. All I did was a left join, followed by mutating columns in the result. Let me write a few lines about it, though. I have added the logic.
Awesome, thank you. How do I get rid of the old Revenue and Quantity? I would appreciate it if you could add that part as well.
@watchtower You can use their names inside `:=`() and those columns will be overwritten. (I am usually reluctant to overwrite like that.)
Thanks, Frank. That is helpful. I am reading an article on .EACHI. How do I know what the grouping is in the code above? I am guessing it is PO_ID, SO_ID, F_Year, Product_ID? If so, how can I modify your code so that it groups on only 3 of the 4 columns, e.g. PO_ID, SO_ID, F_Year? I know this is a different question and I can create a new thread for it. Please let me know.
@watchtower It is not grouping by those variables, but by each row of the table in the i position. In principle, you could have repeated rows there, and each one would be handled separately by .EACHI. For example, I can join on the first row twice: DFI[DF_Lookup[c(1,1)], on=.(PO_ID, SO_ID, F_Year, Product_ID), .N, by=.EACHI]. As for joining on one set of columns but counting based on a smaller set, yes, unfortunately that functionality is not yet available in update joins. For this case there are actually some pretty simple workarounds (given that you are only using .N), but yes, I think it warrants a separate question.
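The last comment can be sketched on toy data. The first part shows that a repeated row in i produces its own result row under by=.EACHI. The second part is one possible workaround for counting on fewer columns than the join key; it is an assumption on my part and not necessarily the workaround the commenter had in mind. All table and column names (x, lk, n_po, newrev) are hypothetical.

```r
library(data.table)

# Hypothetical toy tables (assumption: two key columns instead of four)
x  <- data.table(PO_ID      = c("P1", "P1", "P1", "P2"),
                 Product_ID = c("A", "A", "B", "C"))
lk <- data.table(PO_ID      = c("P1", "P1", "P2"),
                 Product_ID = c("A", "B", "C"),
                 Revenue    = c(30, 60, 10))

# Repeating the first row of i: .EACHI returns one row per row of i,
# so the same key shows up twice, each time with its own .N
dup <- x[lk[c(1, 1)], on = .(PO_ID, Product_ID), .N, by = .EACHI]
print(dup)

# Workaround sketch: count on the smaller column set first (adds a helper
# column by reference), then join on the full key and divide by that count
x[, n_po := .N, by = .(PO_ID)]
x[lk, on = .(PO_ID, Product_ID), newrev := i.Revenue / n_po]
x[, n_po := NULL]   # drop the helper column again
print(x)
```

This avoids an intermediate table, at the cost of temporarily carrying one extra column on x.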