How do I perform fast grouping operations in Julia?

In particular, I want something similar to d[, function(…), by=key] in R's data.table. Using the answer to another Stack Overflow question, I have a solution:

using DataFrames

df = DataFrame(Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
               Class = ["H", "L", "H", "L", "L", "H", "H", "L", "L", "M"],
               Address = ["12 Silver", "10 Fak", "12 Silver", "1 North", "10 Fak", "2 Fake", "1 Red", "1 Dog", "2 Fake", "1 White"],
               Score = ["4", "5", "3", "2", "1", "5", "4", "3", "2", "1"])


julia> by(df, :Location, d -> DataFrame(count=nrow(d)))
4x2 DataFrames.DataFrame
| Row | Location | count |
|-----|----------|-------|
| 1   | "DC"     | 1     |
| 2   | "NY"     | 3     |
| 3   | "SF"     | 3     |
| 4   | "TX"     | 3     |
This works well, but it is very slow on large datasets. Is there a faster solution?

For counting, the following solution is faster but less readable:

using StatsBase  # countmap is provided by StatsBase

cmap = countmap(df[:Location])
res = DataFrame(Location=collect(keys(cmap)), count=collect(values(cmap)))
Or, more generally (again for counting):

This gives:

julia> countdf(df,:Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ count │
├─────┼──────────┼───────┤
│ 1   │ "DC"     │ 1     │
│ 2   │ "SF"     │ 3     │
│ 3   │ "NY"     │ 3     │
│ 4   │ "TX"     │ 3     │
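
The counting idea can also be written without any packages. The following is a minimal sketch, not the helper whose output is shown above: `count_by` is a hypothetical name, and it assumes the table exposes its columns through `getproperty` (true for a NamedTuple of vectors, and also for a DataFrame).

```julia
# Hypothetical helper (not from the original answer): count rows per key
# in one pass over a single column, using only Base.
function count_by(tbl, col::Symbol)
    column = getproperty(tbl, col)             # the grouping column
    counts = Dict{eltype(column),Int}()
    for key in column
        counts[key] = get(counts, key, 0) + 1  # unseen keys start at 0
    end
    # Return a column table: a NamedTuple with the key column and a :count column.
    return NamedTuple{(col, :count)}((collect(keys(counts)), collect(values(counts))))
end

# Small stand-in for df: a NamedTuple holding the Location column.
tbl = (Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],)
res = count_by(tbl, :Location)
```

As with the Dict-based results above, the row order follows the Dict's iteration order and is not sorted.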
For other aggregation functions (ones that can be computed sequentially), we can define the following functions:

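# Fold the rows of df into a Dict keyed by df[col]; each entry starts at v0
# and is updated with op(accumulator, row).
foldmap(op, v0, df, col) =
  foldl((x,y)->setindex!(x,op(get(x,y[col],v0),y),y[col]),
  Dict{eltype(df[col]),typeof(v0)}(), eachrow(df))
# Same fold, returned as a two-column DataFrame with columns col and :res.
folddf(op, v0, df, col) =
  (h = foldmap(op, v0, df, col) ;
   DataFrame(collect.([keys(h),values(h)]),[col,:res]) )

# Reducers take (accumulator, row):
inc1(x,y) = x+1                    # count rows
sumScore(x,y) = x+y[:Score]        # sum of Score
maxScore(x,y) = max(x,y[:Score])   # maximum of Score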
With these definitions:

julia> eltype(df[:Score])<:Real || ( df[:Score] = parse.(Float64, df[:Score]) );

julia> foldmap(inc1, 0, df, :Location)
Dict{String,Int64} with 4 entries:
  "DC" => 1
  "SF" => 3
  "NY" => 3
  "TX" => 3

julia> folddf(sumScore, 0.0, df, :Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ res  │
├─────┼──────────┼──────┤
│ 1   │ "DC"     │ 1.0  │
│ 2   │ "SF"     │ 11.0 │
│ 3   │ "NY"     │ 9.0  │
│ 4   │ "TX"     │ 9.0  │

julia> folddf(maxScore, 0.0, df, :Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ res │
├─────┼──────────┼─────┤
│ 1   │ "DC"     │ 1.0 │
│ 2   │ "SF"     │ 5.0 │
│ 3   │ "NY"     │ 4.0 │
│ 4   │ "TX"     │ 4.0 │
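
The fold above amounts to a single pass that keeps one accumulator per key. A minimal Base-only sketch of that idea follows; `fold_by` and the plain `locations` and `scores` vectors are hypothetical stand-ins for the DataFrame columns, and the reducers here receive the value rather than the whole row.

```julia
# Hypothetical illustration: one accumulator per key, updated in a single pass.
function fold_by(op, v0, keys_col, vals_col)
    acc = Dict{eltype(keys_col),typeof(v0)}()
    for (k, v) in zip(keys_col, vals_col)
        acc[k] = op(get(acc, k, v0), v)   # start from v0 the first time a key is seen
    end
    return acc
end

locations = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"]
scores    = [4.0, 5.0, 3.0, 2.0, 1.0, 5.0, 4.0, 3.0, 2.0, 1.0]

counts = fold_by((x, _) -> x + 1, 0, locations, scores)  # rows per location
sums   = fold_by(+, 0.0, locations, scores)              # total score per location
maxima = fold_by(max, 0.0, locations, scores)            # highest score per location
```

Starting the maximum from 0.0, as in the maxScore call above, is only safe when every score is non-negative; -Inf would be a more general initial value.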
Also see this (somewhat messy) post and the related blog post.