R: How to create a repeating index based on groups rather than within groups

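I have an index with the values 5, 10, 15, 17. The index was generated from a large csv file and corresponds to the order in which these phrases appear in that file. In the end, I want to map the phrases back to the new columns that my loop generates.

Each index value is associated with one phrase. My code splits the phrases apart and creates columns based on the words in each phrase. I need to add another column to my data frame that holds the index number corresponding to each phrase.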

For example: 
    column 1          column 2            index
    phrase A            book                5
    phrase A            tree                5
    phrase B            tree                10

How can I get this result in a loop and make sure the index changes with each new entry in column 1?

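You can use group_indices() from the tidyverse (in dplyr 1.0.0 and later, cur_group_id() is the equivalent inside mutate()). Below is an example that groups the mpg dataset by manufacturer. The multiplication by 5 only turns the group numbers 1, 2, 3, ... into 5, 10, 15, ...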

library(tidyverse)

mpgGroupNbr <- mpg %>%
  arrange(manufacturer) %>%
  group_by(manufacturer) %>% 
  # cur_group_id() replaces the argument-less group_indices(), deprecated since dplyr 1.0.0
  mutate(groupNbr = cur_group_id() * 5)

#check coding - max/min should be the same if coded correctly
mpgGroupNbr %>% 
  group_by(manufacturer) %>%
  summarize(max = max(groupNbr), min = min(groupNbr))

Result:

   manufacturer   max   min
    <chr>        <dbl> <dbl>
 1 audi             5     5
 2 chevrolet       10    10
 3 dodge           15    15
 4 ford            20    20
 5 honda           25    25
 6 hyundai         30    30
 7 jeep            35    35
 8 land rover      40    40
 9 lincoln         45    45
10 mercury         50    50
11 nissan          55    55
12 pontiac         60    60
13 subaru          65    65
14 toyota          70    70
15 volkswagen      75    75
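
The asker's indices (5, 10, 15, 17) are not multiples of a fixed step, so the group number can instead be used as a position in a vector of indices. The following is only a sketch under assumptions: the names `phrases` and `index_list` are hypothetical, and `index_list` is assumed to hold one value per phrase in sorted phrase order, because dplyr numbers groups by their sorted keys.

```r
library(dplyr)

# Hypothetical data in the shape of the question's example
phrases <- tibble(
  column1 = c("phrase A", "phrase A", "phrase B"),
  column2 = c("book", "tree", "tree")
)

# Assumed: one index per phrase, listed in sorted phrase order
index_list <- c(5, 10)

result <- phrases %>%
  group_by(column1) %>%
  # cur_group_id() is the group's number; use it to pick that group's index
  mutate(index = index_list[cur_group_id()]) %>%
  ungroup()

result
```

If the phrases are not stored in sorted order in the source file, a lookup keyed by the phrase itself is less error-prone than relying on group position.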

Something like this?

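# Return one index per row of DF, constant within each group.
# Groups are numbered in order of first appearance; if index_list is given,
# the group number is used as a position in index_list instead.
index_by <- function(DF, group, index_list = NULL){
  g <- as.character(DF[[group]])
  # match() against the unique values gives 1, 2, 3, ... by first appearance
  i <- match(g, unique(g))
  if(is.null(index_list)) i else index_list[i]
}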

df1$index <- index_by(df1, "column1")
df1$index2 <- index_by(df1, "column1", c(5, 10, 15, 17))

df1
#    column1 index index2
#1  phrase 1     1      5
#2  phrase 1     1      5
#3  phrase 1     1      5
#4  phrase 1     1      5
#5  phrase 2     2     10
#6  phrase 2     2     10
#7  phrase 3     3     15
#8  phrase 3     3     15
#9  phrase 3     3     15
#10 phrase 4     4     17

Data creation code
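
The data creation code itself is not shown here. A minimal reconstruction, inferred only from the printed output above (ten rows, with "phrase 1" to "phrase 4" repeated 4, 2, 3 and 1 times), might look like this:

```r
# Assumed reconstruction of df1, based solely on the output shown above
df1 <- data.frame(
  column1 = rep(paste("phrase", 1:4), times = c(4, 2, 3, 1)),
  stringsAsFactors = FALSE
)

df1
```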

Possible duplicate.

index = c(5, 10, 15, 17); names(index) = c("phrase A", "phrase B", "phrase C", "phrase D"); your_data$index = index[your_data$column_1].

I don't think this is a duplicate of numbering within groups. OP wants the same index value for every row of a group.

@Gregor, did you try it? Did it not work? Multiplied by 5?

@Reeza OP wants the number to correspond to column 1, not column 2. I think they just want a join to a table that describes how phrases map to indices, because they say they need specific indices, not just any numbers.

They are not in increments of 5; the index is a completely arbitrary list of numbers based on the phrases.

Why does it matter whether it is 5 or not?

@aa710 Then the same approach works, just remove the multiplication.

The index is generated from the position of these phrases in the large csv, and eventually I need to map it back to that file, so the index of each phrase matters here.

Then you need to explain the underlying situation better. I feel this answers the question you asked, but the question you asked is not the problem you actually want to solve.
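
The named-vector lookup suggested in the comments can be sketched as follows. The data frame `your_data` and its contents are hypothetical, taken from the question's example; the `as.character()` call is an added precaution, since a factor column would otherwise index the vector by its integer codes rather than by name.

```r
# Lookup vector: names are the phrases, values are their positions in the csv
index <- c("phrase A" = 5, "phrase B" = 10, "phrase C" = 15, "phrase D" = 17)

# Hypothetical data in the shape of the question's example
your_data <- data.frame(
  column_1 = c("phrase A", "phrase A", "phrase B"),
  column_2 = c("book", "tree", "tree"),
  stringsAsFactors = FALSE
)

# Index the named vector by phrase; unname() drops the phrase names from the result
your_data$index <- unname(index[as.character(your_data$column_1)])

your_data
```

Because the lookup is keyed by the phrase rather than by group order, no loop is needed and each index can be any value.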