Julia: Accessing columns in a CSV.Row by index


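I have a very large (~120 GB) CSV file with ~100 columns. I would like to use CSV.File to iterate over the file row by row and aggregate specific ranges of columns. However, the CSV.Row type does not appear to have a getindex method. Here is a simplified example: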

using CSV
using DataFrames

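# Column names are x1 through x10. On DataFrames.jl 1.0 and later, a matrix needs
# an explicit naming argument: DataFrame(reshape(1:60, 6, 10), :auto)
df = DataFrame(reshape(1:60, 6, 10))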
CSV.write("test_data.csv", df)

file = CSV.File("test_data.csv")

row1 = first(file)
row1.x3 # Works fine

# Both of these throw method errors:
row1[4]
row1[4:7]

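Suppose that for each row I want to sum columns [1:3; 8:10] into a variable a, and sum columns 4:7 into a variable b. The final output should be a DataFrame with columns a and b. Is there a simple way to do this when iterating over CSV.Rows?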

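You can use Tables.eachcolumn from Tables.jl. Because CSV.File supports the Tables interface, Tables.eachcolumn will create an iterator over the columns of a row. You can then use a combination of Iterators.take and Iterators.drop to access the desired column ranges: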
using CSV
using Tables
using DataFrames
using Base.Iterators: take
using Base.Iterators: drop


function aggregate_file(path)
    file = CSV.File(path)

    a, b = Int64[], Int64[]
    for row in file
        cols = Tables.eachcolumn(row)

        sum1to3 = sum(take(cols, 3))
        sum8to10 = sum(drop(cols, 7))
        push!(a, sum1to3 + sum8to10)

        sum4to7 = sum(drop(take(cols, 7), 3))
        push!(b, sum4to7)
    end

    DataFrame(a = a, b = b)
end
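
To see which columns each take/drop combination selects, here is a minimal sketch that runs on a plain vector instead of a CSV.Row. The vector `vals` is an assumption made for illustration: it holds the values of the first row of the sample data (columns x1 through x10), and no CSV file is read.

```julia
using Base.Iterators: take, drop

# Assumed stand-in for the column iterator of the first row of test_data.csv.
vals = [1, 7, 13, 19, 25, 31, 37, 43, 49, 55]

cols_1to3  = collect(take(vals, 3))           # keep the first 3 values
cols_8to10 = collect(drop(vals, 7))           # skip the first 7 values
cols_4to7  = collect(drop(take(vals, 7), 3))  # keep 7, then skip 3 of those

a = sum(cols_1to3) + sum(cols_8to10)
b = sum(cols_4to7)
println((a, b))
```

Reading the nested call from the inside out helps: take(cols, 7) limits the iterator to columns 1:7, and drop(..., 3) then discards columns 1:3, leaving 4:7.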

If you need to aggregate over an arbitrary set of column indices, you can use enumerate on the column iterator:
inds = [2, 4, 7]
sum(j for (i, j) in enumerate(cols) if i in inds)
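
As a self-contained check of the enumerate pattern, the following sketch replaces the CSV.Row with a NamedTuple. The tuple `row` and the generator `cols` are stand-ins assumed for illustration; they mimic the first row of the sample file, not the actual CSV.jl types.

```julia
# Assumed stand-in for a CSV.Row: a NamedTuple with columns x1 through x10.
row = (x1 = 1, x2 = 7, x3 = 13, x4 = 19, x5 = 25,
       x6 = 31, x7 = 37, x8 = 43, x9 = 49, x10 = 55)

# Assumed stand-in for the column iterator: yields the values in column order.
cols = (getproperty(row, name) for name in propertynames(row))

inds = [2, 4, 7]

# Keep only the values whose position is listed in inds, then sum them.
total = sum(j for (i, j) in enumerate(cols) if i in inds)
println(total)
```

For a long list of indices, making `inds` a Set would probably make the membership test cheaper, at the cost of a little setup.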

Edit:
I ran a performance comparison between my answer and @IanFiske's answer. His version appears to be faster and to allocate less memory:
julia> using BenchmarkTools

julia> @btime aggregate_file("test_data.csv");
  118.687 μs (550 allocations: 24.42 KiB)

julia> @btime aggregate_file("test_data.csv", [1:3; 8:10], 4:7);
  62.416 μs (236 allocations: 14.48 KiB)

Thanks! If only I could use index notation on a row... the syntax would be nicer. Maybe I'll file an issue on CSV.jl and see what the maintainers think. :)

By the way, [1:3; 8:10] works fine; there is no need to splat the ranges into vcat as in [1:3...; 8:10...].

Cool! You could solve this in Tables.jl rather than in CSV, so that getindex would work with any data structure that conforms to the Tables interface. It looks like this issue is in ...

Here is a version that lets you avoid translating the column ranges into take/drop logic:

using CSV, Tables, DataFrames

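# Column names are x1 through x10. On DataFrames.jl 1.0 and later, a matrix needs
# an explicit naming argument: DataFrame(reshape(1:60, 6, 10), :auto)
df = DataFrame(reshape(1:60, 6, 10))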
CSV.write("test_data.csv", df)

function aggregate_file(path, a_inds, b_inds)
    file = CSV.File(path)

    a, b = Int64[], Int64[]
    a_cols = propertynames(file)[a_inds]
    b_cols = propertynames(file)[b_inds]

    for row in file
        push!(a, sum(getproperty.(Ref(row), a_cols)))
        push!(b, sum(getproperty.(Ref(row), b_cols)))
    end

    DataFrame(a = a, b = b)
end

julia> aggregate_file("test_data.csv", [1:3; 8:10], 4:7)
6×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 168   │ 112   │
│ 2   │ 174   │ 116   │
│ 3   │ 180   │ 120   │
│ 4   │ 186   │ 124   │
│ 5   │ 192   │ 128   │
│ 6   │ 198   │ 132   │
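
The same name-based lookup can be tried without CSV.jl. The sketch below is a variant under stated assumptions rather than the answer's exact code: `sum_columns` is a hypothetical helper, the rows are NamedTuples standing in for CSV.Rows, and a function is passed to sum instead of broadcasting getproperty, which should avoid building a temporary array for every row.

```julia
# Hypothetical helper: sum the fields of `row` whose names are listed in `cols`.
sum_columns(row, cols) = sum(name -> getproperty(row, name), cols)

# Assumed stand-in rows: NamedTuples with columns x1 through x4 (not real CSV.Rows).
rows = [(x1 = i, x2 = i + 3, x3 = i + 6, x4 = i + 9) for i in 1:3]

# Resolve column positions to names once, outside the loop.
colnames = collect(propertynames(first(rows)))
a_cols = colnames[[1; 3:4]]  # columns 1, 3, and 4
b_cols = colnames[2:2]       # column 2 only

a = [sum_columns(row, a_cols) for row in rows]
b = [sum_columns(row, b_cols) for row in rows]
println(a)
println(b)
```

Whether this is measurably faster on a real 120 GB file would need to be benchmarked; the point of the sketch is only that the index-to-name translation happens once, before iteration starts.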