R: Aggregating data, but some aggregated observations need to be split

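I have two data frames that I need to merge. Below is a synthetic example of each. These are school districts: the first data frame holds revenue, the second holds grades.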



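The goal is to merge the two data frames and aggregate the final result to the same level as the second (grade) data frame. I had to build a data dictionary to merge them, because the district names differ between the two (I removed that here to keep things simple), but the dictionary also has to handle the aggregation. I planned to set it up as follows: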

School_dist1    School_dist2
Richland 1      Richland 1
Richland 2      Richland 2
?????           Richland Board
Charleston      Charleston
Greenville      Greenville
Greenville      Greenville Board
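Then I would simply aggregate on the School_dist1 column. As you can see, the problem is that while Greenville Board can simply be aggregated into Greenville, Richland Board needs to be split (evenly) between the two Richland districts.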
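One way to make the dictionary express the split is to give it one row per (source, target) pair plus a share column, so Richland Board appears twice with a share of 0.5 each. The following is only a sketch under that assumption: the `weight` column, the `rev_df` and `dict` names, and the data frame layout are hypothetical and not part of the original question; the revenue figures are the synthetic ones from this thread.

```r
library(dplyr)

# Synthetic revenue data, keyed by the names used in the first data frame
rev_df <- data.frame(
  School_dist2 = c("Richland 1", "Richland 2", "Richland Board",
                   "Charleston", "Greenville", "Greenville Board"),
  revenue = c(8702, 3749, 892, 6324, 1245, 371)
)

# Hypothetical dictionary: one row per (target, source) pair.
# `weight` is the share of the source's revenue assigned to the target,
# so the weights for any one source must sum to 1.
dict <- data.frame(
  School_dist1 = c("Richland 1", "Richland 2", "Richland 1", "Richland 2",
                   "Charleston", "Greenville", "Greenville"),
  School_dist2 = c("Richland 1", "Richland 2", "Richland Board", "Richland Board",
                   "Charleston", "Greenville", "Greenville Board"),
  weight = c(1, 1, 0.5, 0.5, 1, 1, 1)
)

# Attach revenue to every dictionary row, scale by the share, then aggregate
result <- dict %>%
  inner_join(rev_df, by = "School_dist2") %>%
  group_by(School_dist1) %>%
  summarise(revenue = sum(revenue * weight), .groups = "drop")

result
```

Because every source's weights sum to 1, the total revenue after aggregation equals the total before it, which is a cheap check that nothing was dropped or double counted.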

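I have searched for a solution with every keyword I can think of, but because of the odd nature of the problem I have found nothing. The gist is that I need to aggregate data, but some observations need to be split and then shared among other observations, which are themselves being aggregated.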


Is there a way to do this? Am I making sense? I am completely stumped on this one.

This takes the long way home, but it will get you there:

# your data; don't use spaces in column names or values
df1 <- read.table(text = "School_district     revenue
                 Richland_1          8702
                 Richland_2          3749
                 Richland_Board       892
                 Charleston          6324
                 Greenville          1245
                 Greenville_Board     371", header = TRUE)

df2 <- read.table(text = "School_district     grade
                 Richland_1          A
                 Richland_2          A+
                 Charleston          B
                 Greenville          D", header = TRUE)

library(tidyverse)
# split df1 into boards and non-boards; strip the "_Board" suffix so the
# board rows carry the bare district name
boards <- dplyr::filter(df1, grepl("Board", School_district)) %>%
    dplyr::mutate(School_district = gsub("_Board", "", School_district))
df1 <- dplyr::filter(df1, !grepl("Board", School_district))

# count how many remaining districts in df1 match each board's name
boards$num_splits <- map_int(boards$School_district,
                             ~ length(grep(.x, df1$School_district)))
# share of board revenue that each matching district receives
boards <- transmute(boards,
                    match_name = School_district,
                    add_value = revenue / num_splits)

# if I knew how to use fuzzy_join you could probably drop this part:
# the join key is everything before the first underscore
df1$match_name <- gsub("_.*", "", df1$School_district)

full_join(df1, boards, by = "match_name") %>%
    rowwise() %>%
    # na.rm = TRUE keeps districts without a board (add_value is NA)
    mutate(new_revenue = sum(revenue, add_value, na.rm = TRUE)) %>%
    select(-match_name) %>%
    full_join(df2, by = "School_district")

# A tibble: 4 × 5
  School_district revenue add_value new_revenue  grade
            <chr>   <int>     <dbl>       <dbl> <fctr>
1      Richland_1    8702       446        9148      A
2      Richland_2    3749       446        4195     A+
3      Charleston    6324        NA        6324      B
4      Greenville    1245       371        1616      D
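The same split can also be written without `rowwise()` and without regex-matching board names against district names, by deriving the key once and counting districts per key. This is a sketch rather than the answer's method; it assumes that everything before the first underscore identifies the parent district and that board rows end in `_Board`. The `keyed`, `districts` and `board_revenue` names are made up for illustration.

```r
library(dplyr)

df1 <- data.frame(
  School_district = c("Richland_1", "Richland_2", "Richland_Board",
                      "Charleston", "Greenville", "Greenville_Board"),
  revenue = c(8702, 3749, 892, 6324, 1245, 371)
)

# Assumed key: the part of the name before the first underscore
keyed <- df1 %>%
  mutate(match_name = sub("_.*", "", School_district))

# One row per board, carrying the revenue to be shared out
boards <- keyed %>%
  filter(grepl("_Board$", School_district)) %>%
  select(match_name, board_revenue = revenue)

districts <- keyed %>%
  filter(!grepl("_Board$", School_district)) %>%
  group_by(match_name) %>%
  mutate(num_splits = n()) %>%          # districts sharing this key
  ungroup() %>%
  left_join(boards, by = "match_name") %>%
  mutate(
    # districts without a board get 0 instead of NA
    add_value = coalesce(board_revenue / num_splits, 0),
    new_revenue = revenue + add_value
  ) %>%
  select(School_district, revenue, add_value, new_revenue)

# The split should neither create nor lose revenue
stopifnot(isTRUE(all.equal(sum(districts$new_revenue), sum(df1$revenue))))

districts
```

Unlike the output above, `add_value` here is 0 rather than `NA` for Charleston, which has no board. Joining `districts` to the grade data frame on `School_district` then proceeds as in the answer.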
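It would be helpful if you showed df1 and the code you tried.

I honestly don't know where to start. I don't even know whether this is possible, which is why I came here. I can usually search here and work it out along the way, but I couldn't find anyone asking how to do something like this.

It sounds like you may need one of the join functions from the dplyr package. What you probably want is a full_join. There is a good description here: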