How do I find the mean of DataFrame rows based on the values in a column in Julia?

I have the following DataFrame in Julia:

using DataFrames 
data = DataFrame(Value = [23, 56, 10, 48, 51], Type = ["A", "B", "A", "B", "B"])

5×2 DataFrame
│ Row │ Value │ Type   │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 23    │ A      │
│ 2   │ 56    │ B      │
│ 3   │ 10    │ A      │
│ 4   │ 48    │ B      │
│ 5   │ 51    │ B      │

How can I get the mean of the Value column for each value of the Type column?
Use the function by() to group the rows by the Type column, then apply the function mean() (from Statistics). This produces one value per type: the mean for type A and the mean for type B.

using DataFrames
using Statistics

data = DataFrame(Value = [23, 56, 10, 48, 51], Type = ["A", "B", "A", "B", "B"]);

by(data, [:Type], df -> mean(df[:, :Value]))

2×2 DataFrame
│ Row │ Type   │ x1      │
│     │ String │ Float64 │
├─────┼────────┼─────────┤
│ 1   │ A      │ 16.5    │
│ 2   │ B      │ 51.6667 │

For more information on data frames in Julia, see the DataFrames.jl documentation.

A more concise way to write it is:

julia> by(data, :Type, :Value => mean)
2×2 DataFrame
│ Row │ Type   │ Value_mean │
│     │ String │ Float64    │
├─────┼────────┼────────────┤
│ 1   │ A      │ 16.5       │
│ 2   │ B      │ 51.6667    │

Note that when you group by a single variable, you do not need to pass an array as the second argument. Also, the function to apply can be passed directly as a pair of the column name and the function. This extends to several functions at once:

julia> by(data, :Type, :Value => mean, :Value => median)
2×3 DataFrame
│ Row │ Type   │ Value_mean │ Value_median │
│     │ String │ Float64    │ Float64      │
├─────┼────────┼────────────┼──────────────┤
│ 1   │ A      │ 16.5       │ 16.5         │
│ 2   │ B      │ 51.6667    │ 51.0         │

This creates new columns whose names are formed automatically by appending the function name to the name of the source column. You can override these default names by passing a new column name like this:

julia> by(data, :Type, my_new_column = :Value => mean)
2×2 DataFrame
│ Row │ Type   │ my_new_column │
│     │ String │ Float64       │
├─────┼────────┼───────────────┤
│ 1   │ A      │ 16.5          │
│ 2   │ B      │ 51.6667       │

If you want performance, consider the following options.

julia> using DataFrames

julia> using Statistics

julia> using BenchmarkTools

julia> data = DataFrame(Value = rand(1:10, 10^6), Type = categorical(rand(["A", "B"], 10^6)));

Note that I generate the :Type column as categorical, because this makes the later aggregation much faster.

First, a timing of the approach from the answer above:

julia> @benchmark by($data, [:Type], df -> mean(df[:, :Value]))
BenchmarkTools.Trial:
  memory estimate:  30.53 MiB
  allocs estimate:  212
  --------------
  minimum time:     12.173 ms (0.00% GC)
  median time:      13.305 ms (3.63% GC)
  mean time:        14.229 ms (4.30% GC)
  maximum time:     20.491 ms (2.98% GC)
  --------------
  samples:          352
  evals/sample:     1

Here is the timing after I change df[:, :Value] to df.Value. The difference is that df.Value does not copy the data unnecessarily. This alone cuts the minimum run time by over 10%:

julia> @benchmark by($data, :Type, df -> mean(df.Value))
BenchmarkTools.Trial:
  memory estimate:  22.90 MiB
  allocs estimate:  203
  --------------
  minimum time:     10.926 ms (0.00% GC)
  median time:      13.151 ms (1.92% GC)
  mean time:        13.093 ms (3.53% GC)
  maximum time:     16.933 ms (3.25% GC)
  --------------
  samples:          382
  evals/sample:     1

And here is an efficient way to write it. This syntax says that we pass the column :Value to the function mean:

julia> @benchmark by($data, :Type, :Value => mean)
BenchmarkTools.Trial:
  memory estimate:  15.27 MiB
  allocs estimate:  190
  --------------
  minimum time:     8.326 ms (0.00% GC)
  median time:      8.667 ms (0.00% GC)
  mean time:        9.599 ms (2.74% GC)
  maximum time:     17.364 ms (3.57% GC)
  --------------
  samples:          521
  evals/sample:     1

Finally, when the :Type column is a plain Vector{String} rather than categorical (the approach given in the other answer), the same call is about three times slower than the recommended approach.

Also note that

julia> by(data, :Type, :Value => mean)
2×2 DataFrame
│ Row │ Type   │ Value_mean │
│     │ String │ Float64    │
├─────┼────────┼────────────┤
│ 1   │ B      │ 5.50175    │
│ 2   │ A      │ 5.49524    │

produces a nicer default name for the generated column (because it knows the source column name and the transformation function name).

Ah, we were writing at the same time. A small note: in DataFrames.jl 0.21, the my_new_column = :Value => mean syntax will most likely be deprecated in favor of :Value => mean => :my_new_column.
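The note about DataFrames.jl 0.21 connects to a larger change. To my knowledge, by itself was deprecated in that release and is no longer available in DataFrames.jl 1.x, where the same split-apply-combine step is written with groupby and combine. The following is a minimal sketch under that assumption, not something taken from the answers above. The output column names mean_value and median_value and the variable names grouped, result and stats are illustrative only.

```julia
using DataFrames
using Statistics

data = DataFrame(Value = [23, 56, 10, 48, 51], Type = ["A", "B", "A", "B", "B"])

# Split the rows into one group per distinct value of :Type.
grouped = groupby(data, :Type)

# Apply mean to :Value within each group; the result column is named Value_mean.
result = combine(grouped, :Value => mean)

# Several aggregations at once, with explicit (illustrative) output names
# using the source => function => target form mentioned in the note above.
stats = combine(grouped,
    :Value => mean => :mean_value,
    :Value => median => :median_value)
```

On the five-row example this should give the same per-type means as the by calls above, 16.5 for A and about 51.67 for B. Passing the column as :Value => mean rather than an anonymous function on the sub-data-frame keeps the property the benchmarks highlight: the library knows which column is needed and can name the output for you.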