Columns of different data types in a DataFrame are not supported by Impute (missing-value imputation) in Julia


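I ran a small experiment and found that this happens only because the CSV contains columns of different data types. See the following code: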

julia> using DataFrames

julia> df = DataFrame(:a => [1.0, 2, missing, missing, 5.0], :b => [1.1, 2.2, 3, missing, 5],:c => [1,3,5,missing,6])
5×3 DataFrame
│ Row │ a        │ b        │ c       │
│     │ Float64? │ Float64? │ Int64?  │
├─────┼──────────┼──────────┼─────────┤
│ 1   │ 1.0      │ 1.1      │ 1       │
│ 2   │ 2.0      │ 2.2      │ 3       │
│ 3   │ missing  │ 3.0      │ 5       │
│ 4   │ missing  │ missing  │ missing │
│ 5   │ 5.0      │ 5.0      │ 6       │

julia> df
5×3 DataFrame
│ Row │ a        │ b        │ c       │
│     │ Float64? │ Float64? │ Int64?  │
├─────┼──────────┼──────────┼─────────┤
│ 1   │ 1.0      │ 1.1      │ 1       │
│ 2   │ 2.0      │ 2.2      │ 3       │
│ 3   │ missing  │ 3.0      │ 5       │
│ 4   │ missing  │ missing  │ missing │
│ 5   │ 5.0      │ 5.0      │ 6       │

julia> using Impute

julia> Impute.interp(df)
ERROR: InexactError: Int64(5.5)
Stacktrace:
 [1] Int64 at ./float.jl:710 [inlined]
 [2] convert at ./number.jl:7 [inlined]
 [3] convert at ./missing.jl:69 [inlined]
 [4] setindex! at ./array.jl:826 [inlined]
 [5] (::Impute.var"#58#59"{Int64,Array{Union{Missing, Int64},1}})(::Impute.Context) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors/interp.jl:67
 [6] (::Impute.Context)(::Impute.var"#58#59"{Int64,Array{Union{Missing, Int64},1}}) at /home/synerzip/.julia/packages/Impute/GmIMg/src/context.jl:227
 [7] _impute!(::Array{Union{Missing, Int64},1}, ::Impute.Interpolate) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors/interp.jl:49
 [8] impute!(::Array{Union{Missing, Int64},1}, ::Impute.Interpolate) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:84
 [9] impute!(::DataFrame, ::Impute.Interpolate) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:172
 [10] #impute#17 at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:76 [inlined]
 [11] impute at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:76 [inlined]
 [12] _impute(::DataFrame, ::Type{Impute.Interpolate}) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:58
 [13] #interp#105 at /home/synerzip/.julia/packages/Impute/GmIMg/src/Impute.jl:84 [inlined]
 [14] interp(::DataFrame) at /home/synerzip/.julia/packages/Impute/GmIMg/src/Impute.jl:84
 [15] top-level scope at REPL[15]:1

This error does not occur when I run the code below:

julia> df = DataFrame(:a => [1.0, 2, missing, missing, 5.0], :b => [1.1, 2.2, 3, missing, 5])
5×2 DataFrame
│ Row │ a        │ b        │
│     │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1   │ 1.0      │ 1.1      │
│ 2   │ 2.0      │ 2.2      │
│ 3   │ missing  │ 3.0      │
│ 4   │ missing  │ missing  │
│ 5   │ 5.0      │ 5.0      │

julia> Impute.interp(df)
5×2 DataFrame
│ Row │ a        │ b        │
│     │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1   │ 1.0      │ 1.1      │
│ 2   │ 2.0      │ 2.2      │
│ 3   │ 3.0      │ 3.0      │
│ 4   │ 4.0      │ 4.0      │
│ 5   │ 5.0      │ 5.0      │
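
Now I know the cause, but I don't know how to fix it. I can't specify the eltype of each column when reading the CSV, because my dataset has 171 columns, and each one is typically either Int or Float. I'm stuck on how to convert all the columns to Float64.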

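I assume that:

  • you want something simple that does not have to be maximally efficient
  • all of your columns are numeric (possibly with missing values)

Then just write: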

    julia> df
    5×3 DataFrame
    │ Row │ a        │ b        │ c       │
    │     │ Float64? │ Float64? │ Int64?  │
    ├─────┼──────────┼──────────┼─────────┤
    │ 1   │ 1.5      │ 1.65     │ 1       │
    │ 2   │ 3.0      │ 3.3      │ 3       │
    │ 3   │ missing  │ 4.5      │ 5       │
    │ 4   │ missing  │ missing  │ missing │
    │ 5   │ 7.5      │ 7.5      │ 6       │
    
    julia> float.(df)
    5×3 DataFrame
    │ Row │ a        │ b        │ c        │
    │     │ Float64? │ Float64? │ Float64? │
    ├─────┼──────────┼──────────┼──────────┤
    │ 1   │ 1.5      │ 1.65     │ 1.0      │
    │ 2   │ 3.0      │ 3.3      │ 3.0      │
    │ 3   │ missing  │ 4.5      │ 5.0      │
    │ 4   │ missing  │ missing  │ missing  │
    │ 5   │ 7.5      │ 7.5      │ 6.0      │
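
For context on why the conversion helps, here is a minimal sketch using only Base Julia. It assumes (based on the `InexactError: Int64(5.5)` and the `setindex!` frame in the stack trace) that the interpolated value for the gap between 5 and 6 in column `c` is 5.5; the variable names are mine and purely illustrative.

```julia
# Integer column with a gap, like column `c` in the question.
c = Union{Missing, Int64}[1, 3, 5, missing, 6]

# Linearly interpolating between the neighbours 5 and 6 gives 5.5.
filled = (c[3] + c[5]) / 2

# Storing 5.5 in an Int64 slot is not possible, so the assignment throws.
err = try
    c[4] = filled
    nothing
catch e
    e
end
println(err isa InexactError)

# After converting the elements to Float64 the same assignment succeeds.
cf = float.(c)
cf[4] = filled
println(cf)
```

So the imputation itself is fine; it is only the integer element type of the column that cannot hold the result.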
    
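It is possible to be more efficient (i.e. convert only the integer columns of the source data frame), but that requires more code; please comment if you need such a solution.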
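
As a rough sketch of what that more selective variant could look like (an illustration under my own assumptions, not a definitive implementation): loop over the columns and replace only those whose non-missing element type is an integer. The data frame below just mirrors the one from the question.

```julia
using DataFrames

df = DataFrame(a = [1.0, 2, missing, missing, 5.0],
               b = [1.1, 2.2, 3, missing, 5],
               c = [1, 3, 5, missing, 6])

# Replace only the integer-typed columns; float columns are left untouched.
# Note: Bool <: Integer in Julia, so Bool columns would be converted as well.
for name in names(df)
    col = df[!, name]
    if nonmissingtype(eltype(col)) <: Integer
        df[!, name] = float.(col)
    end
end

println(eltype.(eachcol(df)))
```

With 171 columns this avoids allocating new vectors for the columns that are already `Float64`, at the cost of a few extra lines.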

Edit

Also note that CSV.jl has a keyword argument that should allow you to handle this issue when reading the data in.
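
The sentence above does not name the keyword. I believe the one meant is `typemap`, so treat that as my assumption and check the CSV.jl documentation for the version you have installed. Under that assumption, a minimal sketch that reads a small in-memory CSV (made-up data mirroring the question) and maps every detected `Int64` column to `Float64` might look like this:

```julia
using CSV, DataFrames

# Hypothetical sample data; the row ",," has all three fields missing.
csv_text = """
a,b,c
1.0,1.1,1
2.0,2.2,3
,3.0,5
,,
5.0,5.0,6
"""

# Assumption: `typemap` is the keyword the answer refers to. It asks CSV.jl
# to parse every column it would detect as Int64 as Float64 instead.
df = CSV.read(IOBuffer(csv_text), DataFrame; typemap = Dict(Int64 => Float64))

println(eltype.(eachcol(df)))
```

If this behaves as expected, no per-column type specification is needed, which matters when the file has 171 columns.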