Checking for elements in multiple columns of a DataFrame in Julia
I have a question about using conditions inside a loop when operating on a DataFrame. For example, I have this DataFrame:
df:
a b c
1 2 5
3 4 3
2 1 7
6 3 6
5 1 9
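I am trying to write a loop that checks two columns (a and b) on each iteration: if the value i is present in either column, or in both, the loop should take the corresponding value from column c and store it in an array.
With that array I can later perform statistical operations, such as computing its mean.
I have written a simplified code snippet for this task: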
for i in 1:5
    result1 = Float64[]
    result2 = Float64[]
    if (df[:, :a] = i)
        push!(result1, df[:, :c])
    elseif (df[:, :b] = i)
        push!(result2, df[:, :c])
    end
    unique!(result1)
    unique!(result2)
    result = vcat(result1, result2)
    global mean_val = mean(result)
end
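Here, i ranges from 1 to 5. For each value, columns a and b are checked for its presence, and if the value is present, the value from column c should be pushed to the corresponding result array.
I also tried other suggestions from the community, such as:
Code example 1: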
for i in 1:5
    mean_val = mean(df[:, :c] for i in ("a", "b")
end
Code example 2:
for i in 1:5
    df.row = axes(df, 1)
    mean_val = mean((filter(x->x[:a] == i || x[:b] == i ,df))[:c])
end
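However, neither of these works or returns the desired output.
Please advise on what I am doing wrong in my code.
Also, please point me to any documentation that explains how to combine multiple conditions in a statement and how to access DataFrame elements for other operations in Julia.
Thanks in advance.
A first way of achieving what (I think) you want is to use indexing to get a subset of your DataFrame: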
julia> using DataFrames
julia> df = DataFrame(a = rand(1:5, 10), b = rand(1:5, 10), c = rand(1:100, 10))
10×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 25
2 │ 5 4 72
3 │ 4 3 37
4 │ 4 3 46
5 │ 3 2 31
6 │ 3 5 43
7 │ 5 1 35
8 │ 5 2 54
9 │ 1 1 64
10 │ 1 4 57
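The step that defines filtered_c does not appear above. The following is a minimal sketch, assuming it was built by Boolean indexing on the two columns (the same pattern that appears in the benchmark further down). The table printed above is typed in by hand so that the result is reproducible, and mask is a name introduced here for illustration.

```julia
using DataFrames
using Statistics

# Same values as the randomly generated table printed above, typed in by hand.
df = DataFrame(
    a = [1, 5, 4, 4, 3, 3, 5, 5, 1, 1],
    b = [2, 4, 3, 3, 2, 5, 1, 2, 1, 4],
    c = [25, 72, 37, 46, 31, 43, 35, 54, 64, 57],
)

# `.==` compares element by element and `.|` combines the two Boolean vectors,
# so `mask` is true for every row where a == 3 or b == 3.
mask = (df.a .== 3) .| (df.b .== 3)

# Index the rows with the mask and keep only column c
# (assumed definition of filtered_c).
filtered_c = df[mask, :c]   # [37, 46, 31, 43]

mean(filtered_c)            # 39.25
```

Note that the scalar `||` operator cannot be applied to whole columns, since it only accepts single Bool values. That is why the element-wise forms `.==` and `.|` are used here.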
You can then compute any statistic on the resulting filtered values:
julia> using Statistics
julia> mean(filtered_c)
39.25
Another way of doing the same thing is to use filter to select the rows you want to keep:
julia> filtered_df = filter(row -> (row.a==3 || row.b==3), df)
4×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 3 37
2 │ 4 3 46
3 │ 3 2 31
4 │ 3 5 43
# This way of writing things is equivalent to the previous one, but
# might be more readable in cases where the condition you're checking
# is more complex
julia> filtered_df = filter(df) do row
row.a == 3 || row.b == 3
end
4×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 3 37
2 │ 4 3 46
3 │ 3 2 31
4 │ 3 5 43
julia> mean(filtered_df.c)
39.25
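Applied to the loop in the question, the same filtering can be repeated for each value of i. This is only a sketch under stated assumptions: the data is the five-row df from the question, means is a hypothetical name for a dictionary that collects one mean per value, and, unlike the question's code, duplicates in column c are not removed with unique!.

```julia
using DataFrames
using Statistics

# The five-row DataFrame from the question.
df = DataFrame(a = [1, 3, 2, 6, 5], b = [2, 4, 1, 3, 1], c = [5, 3, 7, 6, 9])

# Hypothetical container: maps each value i to the mean of the matching c values.
means = Dict{Int, Float64}()

for i in 1:5
    # Keep the rows where i appears in column a, column b, or both.
    sub = filter(row -> row.a == i || row.b == i, df)
    # Skip values that match no row, since the mean of an empty column is undefined.
    nrow(sub) == 0 && continue
    means[i] = mean(sub.c)
end

# means == Dict(1 => 7.0, 2 => 6.0, 3 => 4.5, 4 => 3.0, 5 => 9.0)
```

Storing the results in a dictionary avoids the `global mean_val` assignment, which in the original loop would keep only the value from the last iteration.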
As a small efficiency note on François Févotte's excellent answer, the same thing can be done faster with:
julia> filter([:a, :b] => (a,b) -> a == 3 || b == 3, df, view=true)
4×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 5 1
2 │ 3 5 9
3 │ 4 3 74
4 │ 4 3 63
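This matters mainly if you have a very large DataFrame. There are two differences here:
I use the [:a, :b] => (a, b) -> a == 3 || b == 3 syntax, which is type stable (so it iterates over the rows faster).
I use view=true to create a view of the source DataFrame, which allocates much less memory (this can matter for very large DataFrames).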
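To make the second point concrete, here is a small sketch on a hypothetical three-row DataFrame. The data and the names cond, copied and v are made up for illustration. It compares the result of filter with and without view=true.

```julia
using DataFrames

# Hypothetical toy data, not the DataFrame used elsewhere in this thread.
df = DataFrame(a = [3, 1, 4], b = [5, 3, 2], c = [10, 20, 30])

# Column-based condition: only columns a and b are passed to the function.
cond = [:a, :b] => (a, b) -> a == 3 || b == 3

copied = filter(cond, df)            # new DataFrame, the selected rows are copied
v = filter(cond, df, view = true)    # SubDataFrame that refers to rows of df

copied isa DataFrame    # true
v isa SubDataFrame      # true
parent(v) === df        # true: the view wraps the original table
v.c == copied.c         # true: both hold the same selected values
```

Because v shares storage with df, it is cheap to create, but changes made through it would also affect df. Use the copying form when an independent table is needed.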
Here is a small comparison of the different row-subsetting options on a larger DataFrame:
julia> df = DataFrame(a=rand(1:3, 10^8), b=rand(1:3, 10^8), c=rand(10^8));
julia> function test()
@time filter(row -> (row.a==3 || row.b==3), df)
@time df[(df.a .== 3) .| (df.b .== 3), :]
@time @view df[(df.a .== 3) .| (df.b .== 3), :]
@time filter([:a, :b] => (a,b) -> a == 3 || b == 3, df)
@time filter([:a, :b] => (a,b) -> a == 3 || b == 3, df, view=true)
return nothing
end
test (generic function with 1 method)
julia> test()
19.912672 seconds (333.67 M allocations: 6.652 GiB, 5.71% gc time, 0.41% compilation time)
1.152460 seconds (29 allocations: 1.667 GiB, 14.88% gc time)
0.515334 seconds (15 allocations: 435.807 MiB, 40.49% gc time)
1.066756 seconds (412.82 k allocations: 1.689 GiB, 5.56% gc time, 12.54% compilation time)
0.646710 seconds (382.98 k allocations: 455.835 MiB, 31.27% gc time, 23.02% compilation time)
julia> test()
18.194791 seconds (333.34 M allocations: 6.635 GiB, 4.87% gc time)
1.018816 seconds (29 allocations: 1.667 GiB, 15.34% gc time)
0.469027 seconds (15 allocations: 435.807 MiB, 41.19% gc time)
0.912572 seconds (30 allocations: 1.667 GiB, 5.32% gc time)
0.480374 seconds (16 allocations: 435.807 MiB, 41.15% gc time)