R: renaming messy factor level names across a data frame


r dataframe


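I have a "messy" data frame with inconsistent factor level names all over the place:

DF <- data.frame(V1 = factor(c("A.", "zB,", "Cs", "At", "Dp", "Df")),
                 V2 = factor(c("Af", "A_", "A_", ".A", "D.", "rB")))

Renaming the levels by hand is time-consuming for a large data frame.

Is there a way to do this automatically, so that every level containing the letter "A" is renamed to A (whether it is called A. or Af), and so on?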

Here is a fairly general solution:

#' @description Renames factor levels containing a pattern
#' @details If an input element matches more than one pattern, the first will be used.
#' @param f Factor or character vector to modify
#' @param pattern Pattern to match (regex optional)
#' @param label Label to assign for each pattern. Defaults to the pattern
#' @param ... Extra arguments passed to `grep`
#' @return Vector of the same type as `f`, with any elements matching a `pattern`
#' replaced by the corresponding `label`.
#'
#' @author Gregor Thomas
contain_relabel = function(f, pattern, label = pattern, ...) {
    if (length(pattern) != length(label)) stop("pattern and label must have the same length")
    is_input_factor = is.factor(f)
    f = as.character(f)
    # Relabel every element that matches each pattern in turn
    for (i in seq_along(pattern)) {
        f[grep(pattern[i], f, ...)] = label[i]
    }
    if (is_input_factor) return(factor(f))
    return(f)
}

V1 = factor(c("A.", "zB,", "Cs", "At", "Dp", "Df"))
contain_relabel(V1, "A")
# [1] A   zB, Cs  A   Dp  Df 
# Levels: A Cs Df Dp zB,

contain_relabel(V1, LETTERS[1:4])
# [1] A B C A D D
# Levels: A B C D

As with any other function that takes and returns a vector, you can use `lapply` on a data frame to apply it to all columns:

DF[] = lapply(DF, contain_relabel, pattern = LETTERS[1:4])
DF
#   V1 V2
# 1  A  A
# 2  B  A
# 3  C  A
# 4  A  A
# 5  D  D
# 6  D  B

To apply it only to the factor columns:

fc = sapply(DF, is.factor)
DF[fc] = lapply(DF[fc], contain_relabel, pattern = LETTERS[1:4])
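To see why restricting to factor columns can matter, here is a minimal sketch on a mixed data frame. The `id` column and the names `all_cols` and `only_fc` are made up for illustration, and `contain_relabel` is repeated from the answer so the snippet runs on its own. Because the function calls `as.character()` on its input, applying it to every column turns a numeric column into a character one.

```r
# contain_relabel() repeated from the answer above so this snippet is self-contained
contain_relabel <- function(f, pattern, label = pattern, ...) {
  if (length(pattern) != length(label)) stop("pattern and label must have the same length")
  is_input_factor <- is.factor(f)
  f <- as.character(f)
  for (i in seq_along(pattern)) {
    f[grep(pattern[i], f, ...)] <- label[i]
  }
  if (is_input_factor) return(factor(f))
  f
}

# Hypothetical mixed data frame: one factor column, one numeric column
DF <- data.frame(
  V1 = factor(c("A.", "zB,", "Cs")),
  id = c(10, 20, 30)
)

# Applying to every column coerces the numeric column to character
all_cols <- DF
all_cols[] <- lapply(DF, contain_relabel, pattern = LETTERS[1:4])
class(all_cols$id)

# Applying only to factor columns leaves the numeric column alone
fc <- sapply(DF, is.factor)
only_fc <- DF
only_fc[fc] <- lapply(DF[fc], contain_relabel, pattern = LETTERS[1:4])
class(only_fc$id)
```

If the data frame holds nothing but factors, the two forms should give the same result.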

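The generality is that by default it renames to the matched pattern, as above, but you can be more flexible. For example, if you want anything containing "A" to be renamed "Alpha", you can do `contain_relabel(x, "A", "Alpha")`. You can also use `...` to pass arguments to `grep`, in case you want to make it case-insensitive, use fixed patterns instead of regex, and so on.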
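A short sketch of those options, assuming a made-up character vector `x`. `contain_relabel` is repeated from the answer so the snippet runs on its own, and `ignore.case` and `fixed` are ordinary `grep` arguments forwarded through `...`.

```r
# contain_relabel() repeated from the answer above so this snippet is self-contained
contain_relabel <- function(f, pattern, label = pattern, ...) {
  if (length(pattern) != length(label)) stop("pattern and label must have the same length")
  is_input_factor <- is.factor(f)
  f <- as.character(f)
  for (i in seq_along(pattern)) {
    f[grep(pattern[i], f, ...)] <- label[i]
  }
  if (is_input_factor) return(factor(f))
  f
}

# Hypothetical input, chosen to show the differences between the calls
x <- c("a.", "zB,", "Cs", "At", "A+b")

# Custom label: anything containing an uppercase "A" becomes "Alpha"
r1 <- contain_relabel(x, "A", "Alpha")

# Case-insensitive matching: "a." is now relabelled as well
r2 <- contain_relabel(x, "A", "Alpha", ignore.case = TRUE)

# fixed = TRUE treats "." as a literal dot, so only "a." matches
r3 <- contain_relabel(x, ".", "dot", fixed = TRUE)

# Without fixed = TRUE, "." is a regex that matches any character
r4 <- contain_relabel(x, ".", "dot")
```

The last two calls show why `fixed = TRUE` is worth considering when the messy labels contain regex metacharacters such as `.` or `+`.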


For this example, you can do it quickly with `stringr::str_extract`, using `mutate_all` from `dplyr` to apply it to every column:

library(dplyr)
DF %>% mutate_all(stringr::str_extract, "[A-D]")
#   V1 V2
# 1  A  A
# 2  B  A
# 3  C  A
# 4  A  A
# 5  D  D
# 6  D  B


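Since you need to extract the uppercase letter A, B, C, or D from each element, this is a good fit for extracting matches of the regex "[A-D]".

Here are 3 ways, depending on your preference. The first uses `lapply` to apply `regmatches` and `regexpr` to each column. The second uses `lapply` to apply `str_extract` from `stringr`, which is a wrapper around `stri_extract` from `stringi`. The third skips `lapply` and instead uses `mutate_all` from `dplyr` to apply a function over all columns (or `mutate_at` if you need to apply it to a subset of columns), again with `str_extract`.

as.data.frame(lapply(DF, function(x) regmatches(as.character(x), regexpr("[A-D]", x))))
#>   V1 V2
#> 1  A  A
#> 2  B  A
#> 3  C  A
#> 4  A  A
#> 5  D  D
#> 6  D  B
as.data.frame(lapply(DF, function(x) stringr::str_extract(x, "[A-D]")))
#>   V1 V2
#> 1  A  A
#> 2  B  A
#> 3  C  A
#> 4  A  A
#> 5  D  D
#> 6  D  B
library(tidyverse)
DF %>% mutate_all(str_extract, "[A-D]")
#>   V1 V2
#> 1  A  A
#> 2  B  A
#> 3  C  A
#> 4  A  A
#> 5  D  D
#> 6  D  B

Created on 2018-05-10 by the reprex package (v0.2.0).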
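One caveat with the `regmatches`/`regexpr` route: `regmatches` drops elements that have no match, so a column containing an unmatched value comes back shorter than the others. Below is a minimal base-R sketch that fills those positions with `NA` instead. The helper name `extract_first` and the modified example data (the last value of `V1` changed to "xx") are assumptions for illustration, not part of the original answers.

```r
# Hypothetical helper: first match of `pattern` per element, NA where nothing matches
extract_first <- function(x, pattern) {
  x <- as.character(x)
  m <- regexpr(pattern, x)
  hit <- !is.na(m) & m > -1            # positions that actually matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)         # regmatches returns only the matched elements
  out
}

# Example data with one value ("xx") that contains none of A-D
DF <- data.frame(
  V1 = factor(c("A.", "zB,", "Cs", "At", "Dp", "xx")),
  V2 = factor(c("Af", "A_", "A_", ".A", "D.", "rB"))
)

res <- DF
res[] <- lapply(DF, extract_first, pattern = "[A-D]")
res
```

`str_extract` already returns `NA` for non-matching elements, so this only matters if you want to stay in base R.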
Do you just want to relabel using the first character? Or is there a different pattern?
@camille: A different pattern. What I basically want is to rename any level that has an "A" in it to A, and so on. I have edited the question to clarify.
@Gregor: E was a typo. I have removed it.
@KaC Now it doesn't work, because they have different lengths.
Sorry, silly mistake. Fixed.
This works great on the example, but since the solution only handles one factor at a time, it is still very time-consuming for a large data frame.
I didn't show you how to `lapply` it because you had `lapply` in your question: `DF[] = lapply(DF, contain_relabel, pattern = LETTERS[1:4])`.
This is a very elegant solution for this example, but I was hoping for something more general.