Checking for elements across multiple columns of a DataFrame in Julia

I have a question about using conditions inside a loop when operating on a DataFrame.

Say I have a data frame df:

a b c
1 2 5
3 4 3
2 1 7
6 3 6
5 1 9

I am trying to write a loop that, on each iteration, checks two columns (a and b): if the value i is present in either column or in both, it should take the corresponding values from column c and store them in an array. With that array I can later run statistical operations, such as computing its mean.
I have written a simplified code snippet for this task:
for i in 1:5
  result1 = Float64[]
  result2 = Float64[]
  if (df[:, :a] = i) 
      push!(result1, df[:, :c])
  elseif (df[:, :b] = i)
      push!(result2, df[:, :c])
  end

  unique!(result1)
  unique!(result2)

  result = vcat(result1, result2)

  global mean_val = mean(result)
end

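Here i ranges from 1 to 5. For each value, columns a and b are checked for that value; if it is present, the values from column c should be pushed to the corresponding result array.
I also tried other suggestions from the community, such as:
Code example 1: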

for i in 1:5
  mean_val = mean(df[:, :c] for i in ("a", "b")
end

Code example 2:
for i in 1:5
  df.row = axes(df, 1)
  mean_val = mean((filter(x->x[:a] == i || x[:b] == i ,df))[:c])
end

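However, none of these work or return the desired output.
Please advise on what I am doing wrong in my code.
Also, please point me to any documentation that explains how to combine multiple conditions in a statement and how to access DataFrame elements for other operations in Julia.
Thank you in advance.

A first way to achieve what (I think) you want is to use indexing to take a subset of the data frame: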
julia> using DataFrames
julia> df = DataFrame(a = rand(1:5, 10), b = rand(1:5, 10), c = rand(1:100, 10))
10×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      2     25
   2 │     5      4     72
   3 │     4      3     37
   4 │     4      3     46
   5 │     3      2     31
   6 │     3      5     43
   7 │     5      1     35
   8 │     5      2     54
   9 │     1      1     64
  10 │     1      4     57
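
julia> filtered_c = df[(df.a .== 3) .| (df.b .== 3), :c];

You can then compute any statistic on the resulting filtered values:
julia> using Statistics

julia> mean(filtered_c)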
39.25


Another way to do the same thing is to use filter to select the rows to keep:
julia> filtered_df = filter(row -> (row.a==3 || row.b==3), df)
4×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     4      3     37
   2 │     4      3     46
   3 │     3      2     31
   4 │     3      5     43

# This way of writing things is equivalent to the previous one, but
# might be more readable in cases where the condition you're checking
# is more complex
julia> filtered_df = filter(df) do row
           row.a == 3 || row.b == 3
       end
4×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     4      3     37
   2 │     4      3     46
   3 │     3      2     31
   4 │     3      5     43

julia> mean(filtered_df.c)
39.25
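
A minimal sketch of how the row selection above could be extended to the loop in the question, assuming the five-row data frame from the question and one mean per value i in 1:5. The name means is hypothetical, and unlike the question's code this sketch does not call unique! on the collected values.

```julia
using DataFrames
using Statistics

# The small data frame from the question.
df = DataFrame(a = [1, 3, 2, 6, 5],
               b = [2, 4, 1, 3, 1],
               c = [5, 3, 7, 6, 9])

# `means` is a hypothetical name: it maps each value i to the mean of c
# over the rows where a == i or b == i.
means = Dict{Int, Float64}()
for i in 1:5
    # Element-wise comparison (.==) combined with element-wise OR (.|).
    # Note `==` for comparison; `=` is assignment.
    mask = (df.a .== i) .| (df.b .== i)
    any(mask) || continue          # skip values absent from both columns
    means[i] = mean(df.c[mask])
end
```

With the question's data every i from 1 to 5 occurs in at least one of the two columns, so means ends up with five entries; a value found in neither column would simply be skipped.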

As a small efficiency note on François Févotte's excellent answer, the same operation can be done faster as follows:
julia> filter([:a, :b] => (a,b) -> a == 3 || b == 3, df, view=true)
4×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3      5      1
   2 │     3      5      9
   3 │     4      3     74
   4 │     4      3     63

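This matters if you have a very large data frame. There are two differences here:
1. I use the [:a, :b] => (a, b) -> a == 3 || b == 3 syntax, which is type-stable (so it iterates over rows faster).
2. I use view=true, which produces a view of the source data frame and allocates much less memory (this can matter for very large data frames).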
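
To make the second point concrete, here is a small sketch, assuming the five-row data frame from the question rather than a large one, that contrasts the default copying behaviour of filter with view=true. The names copied and viewed are hypothetical.

```julia
using DataFrames

# The small data frame from the question.
df = DataFrame(a = [1, 3, 2, 6, 5],
               b = [2, 4, 1, 3, 1],
               c = [5, 3, 7, 6, 9])

# Default: the selected rows are copied into a new DataFrame.
copied = filter([:a, :b] => (a, b) -> a == 3 || b == 3, df)

# view=true: the result is a SubDataFrame that refers back to df
# instead of copying the selected rows.
viewed = filter([:a, :b] => (a, b) -> a == 3 || b == 3, df, view = true)

is_copy   = copied isa DataFrame
is_view   = viewed isa SubDataFrame
same_data = parent(viewed) === df   # the view's parent is the original df
```

Both results contain the same rows; the difference is only in whether new column vectors are allocated.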
Here is a small example comparing different row-subsetting options on a larger data frame:
julia> df = DataFrame(a=rand(1:3, 10^8), b=rand(1:3, 10^8), c=rand(10^8));

julia> function test()
           @time filter(row -> (row.a==3 || row.b==3), df)
           @time df[(df.a .== 3) .| (df.b .== 3), :]
           @time @view df[(df.a .== 3) .| (df.b .== 3), :]
           @time filter([:a, :b] => (a,b) -> a == 3 || b == 3, df)
           @time filter([:a, :b] => (a,b) -> a == 3 || b == 3, df, view=true)
           return nothing
       end
test (generic function with 1 method)

julia> test()
 19.912672 seconds (333.67 M allocations: 6.652 GiB, 5.71% gc time, 0.41% compilation time)
  1.152460 seconds (29 allocations: 1.667 GiB, 14.88% gc time)
  0.515334 seconds (15 allocations: 435.807 MiB, 40.49% gc time)
  1.066756 seconds (412.82 k allocations: 1.689 GiB, 5.56% gc time, 12.54% compilation time)
  0.646710 seconds (382.98 k allocations: 455.835 MiB, 31.27% gc time, 23.02% compilation time)

julia> test()
 18.194791 seconds (333.34 M allocations: 6.635 GiB, 4.87% gc time)
  1.018816 seconds (29 allocations: 1.667 GiB, 15.34% gc time)
  0.469027 seconds (15 allocations: 435.807 MiB, 41.19% gc time)
  0.912572 seconds (30 allocations: 1.667 GiB, 5.32% gc time)
  0.480374 seconds (16 allocations: 435.807 MiB, 41.15% gc time)
