How to extract a substring as part of a dplyr::mutate pipeline
I have the following data frame:
library(tidyverse)
df
#> # A tibble: 10 x 3
#>    pfc_chr pfc_chr_st peak_name
#>    <chr>        <int> <chr>
#>  1 chr1       3046442 XXX-ad_peak_1
#>  2 chr1       3119671 XXX-ad_peak_2a
#>  3 chr1       3164756 PMN_peak_2
#>  4 chr1       3167322 Ytb_peak_3
#>  5 chr1       3210838 PMN_peak_3
#>  6 chr1       3212196 XXX-ad_peak_6
#>  7 chr1       3249068 XXX-ad_peak_8
#>  8 chr1       3268246 PMN_peak_5
#>  9 chr1       3444892 XXX-ad_peak_11
#> 10 chr1       3451544 XXX-ad_peak_12
What I want to do is extract a substring from peak_name as part of a dplyr pipeline. The final expected result is:
pfc_chr pfc_chr_st peak_name new_col
1 chr1 3046442 XXX-ad_peak_1 XXX-ad
2 chr1 3119671 XXX-ad_peak_2a XXX-ad
3 chr1 3164756 PMN_peak_2 PMN
4 chr1 3167322 Ytb_peak_3 Ytb
5 chr1 3210838 PMN_peak_3 PMN
6 chr1 3212196 XXX-ad_peak_6 XXX-ad
7 chr1 3249068 XXX-ad_peak_8 XXX-ad
8 chr1 3268246 PMN_peak_5 PMN
9 chr1 3444892 XXX-ad_peak_11 XXX-ad
10 chr1 3451544 XXX-ad_peak_12 XXX-ad
I tried this, but it failed:
> df %>% mutate(new_col = stringr::str_match(peak_name, "^(.*?)\\_peak\\_*?"))
Error in mutate_impl(.data, dots) :
  Column `new_col` must be length 10 (the number of rows) or one, not 20
What is the right way to do it?
Select the second column of the matrix returned by str_match():
df %>% mutate(new_col = stringr::str_match(peak_name, "^(.*?)\\_peak\\_*?")[, 2])
Output:
pfc_chr pfc_chr_st peak_name new_col
1 chr1 3046442 XXX-ad_peak_1 XXX-ad
2 chr1 3119671 XXX-ad_peak_2a XXX-ad
3 chr1 3164756 PMN_peak_2 PMN
4 chr1 3167322 Ytb_peak_3 Ytb
5 chr1 3210838 PMN_peak_3 PMN
6 chr1 3212196 XXX-ad_peak_6 XXX-ad
7 chr1 3249068 XXX-ad_peak_8 XXX-ad
8 chr1 3268246 PMN_peak_5 PMN
9 chr1 3444892 XXX-ad_peak_11 XXX-ad
10 chr1 3451544 XXX-ad_peak_12 XXX-ad
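To see why the `[, 2]` is needed, it can help to inspect what str_match() returns on its own. The sketch below is only an illustration: the vector `peaks` is a made-up sample of the peak_name values, and the pattern is a simplified variant of the one above. str_match() returns a character matrix with one row per input and one column per capture group plus the full match, so for 10 rows and one capture group it holds 20 values, which is where the "not 20" in the error message comes from.

```r
library(stringr)

# Hypothetical sample values mirroring the peak_name column
peaks <- c("XXX-ad_peak_1", "PMN_peak_2", "Ytb_peak_3")

# str_match() returns a character matrix:
# column 1 is the full match, column 2 is the first capture group
m <- str_match(peaks, "^(.*?)_peak_")
dim(m)   # 3 rows, 2 columns

# Taking column 2 gives one value per input element,
# which is the shape mutate() expects for a new column
prefix <- m[, 2]
prefix
```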
I suggest stringr::str_extract() with a lookahead:
df %>%
mutate(new_col = stringr::str_extract(peak_name, "^.*(?=_peak)"))
The result is:
> df %>%
+ mutate(new_col = stringr::str_extract(peak_name, "^.*(?=_peak)"))
# A tibble: 10 x 4
pfc_chr pfc_chr_st peak_name new_col
<chr> <int> <chr> <chr>
1 chr1 3046442 XXX-ad_peak_1 XXX-ad
2 chr1 3119671 XXX-ad_peak_2a XXX-ad
3 chr1 3164756 PMN_peak_2 PMN
4 chr1 3167322 Ytb_peak_3 Ytb
5 chr1 3210838 PMN_peak_3 PMN
6 chr1 3212196 XXX-ad_peak_6 XXX-ad
7 chr1 3249068 XXX-ad_peak_8 XXX-ad
8 chr1 3268246 PMN_peak_5 PMN
9 chr1 3444892 XXX-ad_peak_11 XXX-ad
10 chr1 3451544 XXX-ad_peak_12 XXX-ad
Note that a value such as "_peak_8" returns an empty string, while a value such as "peak_8" returns NA.
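The two edge cases can be checked directly on a small vector. This is a minimal sketch; the vector `edge` is invented for illustration and is not part of the original data.

```r
library(stringr)

# Hypothetical edge-case inputs: a normal value, a value that starts
# with "_peak", and a value with no "_peak" at all
edge <- c("XXX-ad_peak_1", "_peak_8", "peak_8")

# The lookahead (?=_peak) requires "_peak" to follow the match
# without including it in the extracted text
res <- str_extract(edge, "^.*(?=_peak)")
res
# "XXX-ad" for the normal value, "" when nothing precedes "_peak",
# and NA when "_peak" does not occur
```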
Try sub("^(.*)_peak_.*", "\\1", peak_name) instead of stringr::str_match(...),
or even sub("_peak.*$", "", peak_name)
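A minimal base R sketch of both sub() variants, assuming the patterns are meant to include the underscore before "peak". The vector `peaks` is a made-up sample. One difference from str_extract() worth knowing: when the pattern does not match, sub() returns the input unchanged rather than NA.

```r
# Hypothetical sample values mirroring the peak_name column
peaks <- c("XXX-ad_peak_1", "XXX-ad_peak_2a", "PMN_peak_2")

# Variant 1: capture everything before "_peak_" and keep only that group
a <- sub("^(.*)_peak_.*", "\\1", peaks)

# Variant 2: delete "_peak" and everything after it
b <- sub("_peak.*$", "", peaks)

a
b

# With no match, sub() leaves the string as it is
no_match <- sub("_peak.*$", "", "peak_8")
no_match
```

Either call can be used inside the pipeline, for example df %>% mutate(new_col = sub("_peak.*$", "", peak_name)).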