Julia: How to compute group means using IndexedTables.jl's aggregate?


I am trying to use the aggregate function to compute the mean of a variable by group:

using Distributions, PooledArrays

N=Int64(2e9/8); K=100;

pool = [@sprintf "id%03d" k for k in 1:K]
pool1 = [@sprintf "id%010d" k for k in 1:(N/K)]

function randstrarray(pool, N)
    PooledArray(PooledArrays.RefArray(rand(UInt8(1):UInt8(K), N)), pool)
end

using JuliaDB
DT = IndexedTable(Columns([1:N;]), Columns(
  id1 = randstrarray(pool, N),
  v3 =  rand(round.(rand(Uniform(0,100),100),4), N) # numeric e.g. 23.5749
 ));

res = IndexedTables.aggregate(mean, DT, by=(:id1,), with=:v3)

but I am getting the error

MethodError: no method matching mean(::Float64, ::Float64)
Closest candidates are:
  mean(!Matched::Union{Function, Type}, ::Any) at statistics.jl:19
  mean(!Matched::AbstractArray{T,N} where N, ::Any) where T at statistics.jl:57
  mean(::Any) at statistics.jl:34
in  at base\<missing>
in #aggregate#144 at IndexedTables\src\query.jl:119
in aggregate_to at IndexedTables\src\query.jl:148

works fine

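I would really like to help you, but it took me 10 minutes to install all the packages and a few more minutes to run the code and figure out what it actually does (or doesn't). It would be great if you could provide a "minimal working example" that focuses on the problem. In fact, the only requirements for reproducing it seem to be IndexedTables and two random arrays.

(Sorry, this is not a complete answer, but it is too long to be a comment.)

Anyway, if you read the docstring of IndexedTables.aggregate, you will see that it requires a function that takes two arguments and evidently returns a single value: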

help?> IndexedTables.aggregate
  aggregate(f::Function, arr::IndexedTable)

  Combine adjacent rows with equal indices using the given 2-argument
  reduction function, returning the result in a new array.

This matches the error message you posted:

no method matching mean(::Float64, ::Float64)

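Since I don't know what you expect to compute, I will assume for now that you want the mean of two numbers. In that case you can define another method for mean() that takes two numbers, which satisfies the function signature that aggregate requires. But I'm not sure that this is what you want.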
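As a minimal sketch of what such a two-argument function could look like, and of why it may not be what you want: the helper name pairmean below is made up for illustration (it is not part of IndexedTables), and the example folds a plain vector rather than a table. It assumes Julia 0.7 or later, where mean lives in the Statistics standard library (on 0.6 it was in Base).

```julia
using Statistics  # provides mean on Julia 0.7+

# Hypothetical helper: a two-argument "mean", i.e. the shape of function
# that a pairwise reduction expects.
pairmean(x, y) = (x + y) / 2

vals = [1.0, 2.0, 3.0, 4.0]

# Folding pairwise from left to right halves the weight of earlier values
# at every step, so later values dominate the result.
folded = foldl(pairmean, vals)   # ((1+2)/2 + 3)/2 = 2.25, then (2.25+4)/2 = 3.125

# The ordinary mean weights all values equally.
plain = mean(vals)               # 2.5

println("pairwise fold: ", folded, "  mean: ", plain)
```

So a pairwise reduction with this function runs without error, but for groups with more than two rows it generally does not return the group mean.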
You need to tell it how to reduce two numbers to one; mean is for arrays. So just use an anonymous function:

res = IndexedTables.aggregate((x,y)->(x+y)/2, DT, by=(:id1,), with=:v3)

EDIT:

res = IndexedTables.aggregate_vec(mean, DT, by=(:id1,), with=:v3)

From the help:

help?> IndexedTables.aggregate_vec

  aggregate_vec(f::Function, x::IndexedTable)

  Combine adjacent rows with equal indices using a function from vector to
  scalar, e.g. mean.
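To make the "vector to scalar" idea concrete, here is a rough sketch in plain Julia with no IndexedTables involved. The helper groupapply is a made-up name, and it simply groups by key instead of relying on equal indices being adjacent, so treat it as an illustration of the idea rather than of how the library does it. It assumes Julia 0.7 or later for the Statistics standard library.

```julia
using Statistics  # provides mean on Julia 0.7+

# Hypothetical helper (not an IndexedTables function): collect the values
# belonging to each key into a vector, then apply the vector-to-scalar
# function f to every group.
function groupapply(f, ks, vs)
    T = eltype(vs)
    groups = Dict{eltype(ks), Vector{T}}()
    for (k, v) in zip(ks, vs)
        # get! creates an empty vector the first time a key is seen
        push!(get!(() -> Vector{T}(), groups, k), v)
    end
    return Dict(k => f(g) for (k, g) in groups)
end

id1 = ["id001", "id002", "id001", "id002", "id001"]
v3  = [1.0, 10.0, 2.0, 20.0, 6.0]

res = groupapply(mean, id1, v3)
# res["id001"] is 3.0 and res["id002"] is 15.0
```

Because f receives the whole group at once, mean can be passed directly, which is what a two-argument reduction cannot do.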

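OLD ANSWER:

(I am keeping it because it was a pleasant exercise (for me) in how to create a helper type and functions when something doesn't work the way we would like. Maybe it can help someone in the future :)

I don't know which kind of mean you want. My idea was to compute the "center of gravity" of points with equal mass.

Center of two points: G = (A+B)/2

Adding (aggregating) a third point C gives (2G+C)/3 (2G because the mass of G is the mass of A plus the mass of B)

and so on.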
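A sketch of that mass-weighted merging, so that a two-argument reduction does end up at the ordinary mean. The names WeightedCenter and merge_centers are made up for this illustration, and the example folds a plain vector instead of going through aggregate.

```julia
# Hypothetical type for this sketch: a running center together with the
# number of equal-mass points it represents.
struct WeightedCenter
    center::Float64
    mass::Int
end

# A single data point has mass 1.
WeightedCenter(x::Real) = WeightedCenter(Float64(x), 1)

# Merge two centers, weighting each by its mass:
# (m1*G1 + m2*G2) / (m1 + m2)
function merge_centers(a::WeightedCenter, b::WeightedCenter)
    m = a.mass + b.mass
    return WeightedCenter((a.mass * a.center + b.mass * b.center) / m, m)
end

points = [1.0, 2.0, 6.0]

# G = (1 + 2)/2 = 1.5 with mass 2, then (2*1.5 + 6)/3 = 3.0 with mass 3
g = foldl(merge_centers, WeightedCenter.(points))
# g.center is 3.0 and g.mass is 3
```

Carrying the mass along is what makes the pairwise step give every point the same weight.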

Test:

For the aggregate function we need a little more work:

import Base.convert

"""
We need a method that converts an Atractor to Float64, because the
aggregate function wants to store the result as a Float64.
"""
convert(::Type{Float64}, x::Atractor) = x.center

Now it (probably :P) works.

I hope you see that an aggregated mean has an impact on precision! (There are more summation and division operations.)

I'm pretty sure that is not what the OP wants. I think he wants the mean of the v3 values for each group given by id1, i.e. tapply(DT$v3, DT$id1, mean) in R, or [mean(subdf[:v3]) for subdf in groupby(DT, :id1)] in DataFrames. I would love to have tapply in Julia (I know it is in RLEVectors.jl, but it feels a bit clunky).

@MichaelK.Borregaard Yes, I am thinking of taking the mean of a value by group.