R 群迭代的非连续因子类的欧氏距离
这个问题是这个问题的延伸。 上一个问题的相同解释也适用于这里。我想根据以下公式计算每个公司基于专利类别的连续年份之间的欧几里德距离:R 群迭代的非连续因子类的欧氏距离,r,data.table,euclidean-distance,rowwise,R,Data.table,Euclidean Distance,Rowwise,这个问题是这个问题的延伸。 上一个问题的相同解释也适用于这里。我想根据以下公式计算每个公司基于专利类别的连续年份之间的欧几里德距离: 其中席代表了T年中属于某一特定类别的专利数量,并且YI代表了前一年属于特定类别的专利数量(T-1)。 不同之处在于我想添加另一个假设: 如果中间缺少一年/几年,我想实现一个假设,即该公司与最近一个未缺少的年份在相同的专利类别中一直活跃。例如,在以下数据集中: > set.seed(123) > df <- data.table(firm =
其中席代表了T年中属于某一特定类别的专利数量,并且YI代表了前一年属于特定类别的专利数量(T-1)。 不同之处在于我想添加另一个假设: 如果中间缺少一年/几年,我想实现一个假设,即该公司与最近一个未缺少的年份在相同的专利类别中一直活跃。例如,在以下数据集中:
> set.seed(123)
> df <- data.table(firm = rep(c("A","B"),each=5),
year = c(1979,1981,1981,1984,1984,1959,1960,1963,1963,1965),
patent = sample(3800000:4200000,10,replace = FALSE),
class = c("410","73","52","250","252","105","454","380","380","60")
)
> df
firm year patent class
1: A 1979 3988941 410
2: A 1981 3934057 73
3: A 1981 3924021 52
4: A 1984 3960996 250
5: A 1984 4026317 252
6: B 1959 4165208 105
7: B 1960 3924506 454
8: B 1963 3993626 380
9: B 1963 3845403 380
10: B 1965 3865160 60
在这里,对于公司A来说,由于1979年是开始的一年,因此该年没有欧几里德距离(应该生产NAs)。展望1980年,距离为零。1981年,今年(1981年)和上一年(1980年)的不同等级分别为73、52和410。因此,上述公式在这三个不同的类上求和(有三个不同的“i”)。因此,公式的输出将是:
为了进一步澄清,解释了1984年的计算:
1984年,公司A共拥有250类和252类两项专利(各一项)。紧接着的前一年是1983年,这一年原本不存在,但在应用上述假设后,现在它拥有两项专利,分别属于73类和52类。由于距离仅在连续两年之间,因此计算1984年的距离时,仅考虑1984年和1983年。因此,我们总共有四个类(250、252、73和52),这意味着总和是在四个“i”上完成的。从第一个I(250级)开始,这个类的专利总数为1个,1984个,0个1983个,所以席西等于1,Yi等于0。同样的逻辑也适用于252(Xi=1,Yi=0)。现在,对于第三’I或73类,专利总数是1984的0和1983的1,所以席等于0,Yi等于1。同样的逻辑也适用于52类。因此,距离由下式给出:
按照相同的计算并对公司进行重申,最终产量应为:
> df
firm year patent class El_Dist
1: A 1979 4108578 410 NA
2: A 1980 4108578 410 0.000000
3: A 1981 3859133 73 1.224745
4: A 1981 3983203 52 1.224745
5: A 1982 3859133 73 0.000000
6: A 1982 3983203 52 0.000000
7: A 1983 3859133 73 0.000000
8: A 1983 3983203 52 0.000000
9: A 1984 4158992 250 1.000000
10: A 1984 3945254 252 1.000000
11: B 1959 4077323 105 NA
12: B 1960 3889708 454 1.414214
13: B 1961 3889708 454 0.000000
14: B 1962 3889708 454 0.000000
15: B 1963 3830537 380 1.118034
16: B 1963 4025588 380 1.118034
17: B 1964 3830537 380 0.000000
18: B 1964 4025588 380 0.000000
19: B 1965 3944607 60 1.118034
为了提高速度,我最好寻找data.table解决方案(我的原始数据包含大约700万行)
非常感谢您的帮助。扩展可以按以下方式进行:
df1 <- df[, lapply(.SD, list), .(firm, year)][df[,
.(year = min(year):max(year)), firm], on = .(firm, year)]
df1[, grp := cumsum(sapply(patent, Negate(is.null))), .(firm)]
df1[, c('patent', 'class') := list(patent[1], class[1]), .(firm, grp)]
df1[, .(patent = unlist(patent), class = unlist(class)), .(firm, year)]
编辑:澄清后更新 这是一种主要(但不是完全)矢量化的方法,带有“隐式扩展”:
foo = function(x, y) {
sqrt(sum((x - y)^2)) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
bar = function(x, y) {
y = unlist(y, use.names = FALSE)
vals = union(x, y)
list(
x = sapply(vals, function(v) sum(x == v)),
y = sapply(vals, function(v) sum(y == v))
)
}
x = df[, .(prev_class = list(class)), by = .(year, firm)]
df[x,
on = .(firm, year > year),
prev_class := i.prev_class]
df[, dist := {
temp = bar(class, prev_class[1L])
foo(temp$x, temp$y)
}, by = .(firm, year)]
df
# firm year patent class prev_class dist
# 1: A 1979 3988941 410 Inf
# 2: A 1981 3934057 73 410 1.224745
# 3: A 1981 3924021 52 410 1.224745
# 4: A 1984 3960996 250 73,52 1.000000
# 5: A 1984 4026317 252 73,52 1.000000
# 6: B 1959 4165208 105 Inf
# 7: B 1960 3924506 454 105 1.414214
# 8: B 1963 3993626 380 454 1.118034
# 9: B 1963 3845403 380 454 1.118034
# 10: B 1965 3865160 60 380,380 1.118034
原始答复: 要使用“隐式展开”计算距离,可以使用以下方法。然而,我的结果与OP公司在1963年和1965年对B公司的预期产出不同。我不清楚OP是如何计算这些结果的
foo = function(x, y) {
sqrt(sum((x - y)^2)) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
bar = function(x, y) {
y = unlist(y, use.names = FALSE)
vals = union(x, y)
list(
x = as.integer(vals %in% x),
y = as.integer(vals %in% y)
)
}
x = df[, .(prev_class = list(unique(class))), by = .(year, firm)]
df[x,
on = .(firm, year > year),
prev_class := i.prev_class]
df[, dist := {
temp = bar(class, prev_class)
foo(temp$x, temp$y)
}, by = .(firm, year)]
df
# firm year patent class prev_class dist op_expected
# 1: A 1979 3988941 410 Inf NA
# 2: A 1981 3934057 73 410 1.224745 1.224745
# 3: A 1981 3924021 52 410 1.224745 1.224745
# 4: A 1984 3960996 250 73,52 1.000000 1.000000
# 5: A 1984 4026317 252 73,52 1.000000 1.000000
# 6: B 1959 4165208 105 Inf NA
# 7: B 1960 3924506 454 105 1.414214 1.414214
# 8: B 1963 3993626 380 454 1.414214 1.118034
# 9: B 1963 3845403 380 454 1.414214 1.118034
# 10: B 1965 3865160 60 380 1.414214 1.118034
不是答案,而是对所提供答案的扩展 以下是阻止我的修改成为最终答案的原因:
foo = function(x, y) {
sqrt(sum((x - y)^2)) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
bar = function(x, y) {
x = unlist(x, use.names = FALSE)
y = unlist(y, use.names = FALSE)
vals = c(x, y)
xl = length(unique(x))
yl = length(unique(y))
ul = length(union(x,y))
list(
x = c(table(vals)[names(table(vals)) %in% x],rep(0,(ul-xl))),
y = c(rep(0,(ul-yl)),table(vals)[names(table(vals)) %in% y])
)
}
x1 = df[, .(prev_class = list(class)), by = .(year, firm)]
x2 = df[, .(curr_class = list(class)), by = .(year, firm)]
x1
x2
x = merge(x1,x2)
df[x,
on = .(firm, year > year),
prev_class := i.prev_class]
df[x,
on = .(firm, year),
curr_class := curr_class]
df[, dist := {
temp = bar(unique(curr_class), unique(prev_class))
foo(temp$x, temp$y)
}, by = .(firm, year)]
df
firm year patent class prev_class curr_class dist
1: A 1979 3988941 410 410 Inf
2: A 1981 3934057 73 410 73,52 1.224745
3: A 1981 3924021 52 410 73,52 1.224745
4: A 1984 3960996 250 73,52 250,252 1.000000
5: A 1984 4026317 252 73,52 250,252 1.000000
6: B 1959 4165208 105 105 Inf
7: B 1960 3924506 454 105 454 1.414214
8: B 1963 3993626 380 454 380,380 1.118034
9: B 1963 3845403 380 454 380,380 1.118034
10: B 1965 3865160 60 380,380 60 1.118034
@Akrun在1984年使用了250、252类[属于1984年]和73和52类[属于1983年]。1和0使用的公式对于那些缺少的不清楚cases@akrun我只是补充了进一步的解释。如果有帮助,请告诉我。我可以在没有“El_Dist”的情况下获取您的19行数据集,因为计算不确定me@akrun谢谢,确实有用。对不起,我不能更有效地解释逻辑。非常感谢你的回答。差异的原因是我的公式也考虑了每一类的专利数量。1965年有两个班(60班和380班)。对于60,Xi=1,Yi=0。但是对于360,Xi=0和Yi=2,因为上一年有两项专利在380上。这就是造成差异的原因。“dist”和“op_expected”在只有一项专利的情况下同样有效。@lovestacksflow感谢您的澄清。我已经更新了我的答案,现在得到了样本数据的正确结果谢谢!我还在我的原始数据集(大约700万行)上检查了它,速度非常快!
foo = function(x, y) {
sqrt(sum((x - y)^2)) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
bar = function(x, y) {
y = unlist(y, use.names = FALSE)
vals = union(x, y)
list(
x = as.integer(vals %in% x),
y = as.integer(vals %in% y)
)
}
x = df[, .(prev_class = list(unique(class))), by = .(year, firm)]
df[x,
on = .(firm, year > year),
prev_class := i.prev_class]
df[, dist := {
temp = bar(class, prev_class)
foo(temp$x, temp$y)
}, by = .(firm, year)]
df
# firm year patent class prev_class dist op_expected
# 1: A 1979 3988941 410 Inf NA
# 2: A 1981 3934057 73 410 1.224745 1.224745
# 3: A 1981 3924021 52 410 1.224745 1.224745
# 4: A 1984 3960996 250 73,52 1.000000 1.000000
# 5: A 1984 4026317 252 73,52 1.000000 1.000000
# 6: B 1959 4165208 105 Inf NA
# 7: B 1960 3924506 454 105 1.414214 1.414214
# 8: B 1963 3993626 380 454 1.414214 1.118034
# 9: B 1963 3845403 380 454 1.414214 1.118034
# 10: B 1965 3865160 60 380 1.414214 1.118034
foo = function(x, y) {
sqrt(sum((x - y)^2)) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
bar = function(x, y) {
x = unlist(x, use.names = FALSE)
y = unlist(y, use.names = FALSE)
vals = c(x, y)
xl = length(unique(x))
yl = length(unique(y))
ul = length(union(x,y))
list(
x = c(table(vals)[names(table(vals)) %in% x],rep(0,(ul-xl))),
y = c(rep(0,(ul-yl)),table(vals)[names(table(vals)) %in% y])
)
}
x1 = df[, .(prev_class = list(class)), by = .(year, firm)]
x2 = df[, .(curr_class = list(class)), by = .(year, firm)]
x1
x2
x = merge(x1,x2)
df[x,
on = .(firm, year > year),
prev_class := i.prev_class]
df[x,
on = .(firm, year),
curr_class := curr_class]
df[, dist := {
temp = bar(unique(curr_class), unique(prev_class))
foo(temp$x, temp$y)
}, by = .(firm, year)]
df
firm year patent class prev_class curr_class dist
1: A 1979 3988941 410 410 Inf
2: A 1981 3934057 73 410 73,52 1.224745
3: A 1981 3924021 52 410 73,52 1.224745
4: A 1984 3960996 250 73,52 250,252 1.000000
5: A 1984 4026317 252 73,52 250,252 1.000000
6: B 1959 4165208 105 105 Inf
7: B 1960 3924506 454 105 454 1.414214
8: B 1963 3993626 380 454 380,380 1.118034
9: B 1963 3845403 380 454 380,380 1.118034
10: B 1965 3865160 60 380,380 60 1.118034