R: Extract strings from a filename and create new columns using mutate

Tags: r, dplyr, stringr, mutate


I have a data.frame with two columns. The second column holds a file name.

df  <- data.frame(paragraph = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
             filename = "./data/RevCon_2015_C1_Austria_05_06.txt", stringsAsFactors = FALSE)

We can use tidyr::separate to do the following:

library(tidyverse)
df %>%
    mutate(tmp = gsub("(\\./data/|\\.txt)", "", filename)) %>%
    separate(
        tmp,
        into = c("conference", "year", "ignored", "country", "month", "day")) %>%
    mutate(date = paste(day, month, year, sep = "/")) %>%
    select(-ignored, -month, -day)
#          paragraph                                filename conference year
#1 Lorem ipsum [...] ./data/RevCon_2015_C1_Austria_05_06.txt     RevCon 2015
#  country        date
#1 Austria  06/05/2015
Note that this assumes every filename follows the pattern
./data/{conference}_{year}_{ignored}_{country}_{month}_{day}.txt
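
Before splitting, it can help to check which filenames actually follow that pattern. The following is a minimal base R sketch, not part of the original answer: the helper name matches_pattern, the four-digit year, and the second (malformed) filename are assumptions made only for illustration.

```r
# Hypothetical helper: TRUE when a filename has six underscore-separated
# fields after "./data/", a four-digit year, two-digit month and day,
# and ends in ".txt".
matches_pattern <- function(filename) {
  grepl("^\\./data/[^_]+_[0-9]{4}_[^_]+_[^_]+_[0-9]{2}_[0-9]{2}\\.txt$",
        filename)
}

filenames <- c(
  "./data/RevCon_2015_C1_Austria_05_06.txt",  # filename from the question
  "./data/RevCon_2015_Austria.txt"            # made-up exception
)

matches_pattern(filenames)
# [1]  TRUE FALSE
```

Rows for which the check returns FALSE would need separate handling, because separate would otherwise produce misaligned columns for them.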


Here are two different approaches, using separate and extract from tidyr:

library(dplyr)
library(tidyr)

df %>%
  mutate(filename2 = gsub("^(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$", 
                          "\\1_\\2_\\3_\\5.\\4.\\2", basename(filename))) %>%
  separate(filename2, c("conference", "year", "country", "date"), sep = "_")
Or using extract:

df %>%
  extract(filename, c("conference", "year", "country", "day", "month"),
          "^.+/(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$",
          remove = FALSE) %>%
  unite(date, month, day, year, sep = ".", remove = FALSE) %>%
  select(paragraph, filename, conference, year, country, date)
Result:

                                                                   paragraph
1 Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
                                 filename conference year country       date
1 ./data/RevCon_2015_C1_Austria_05_06.txt     RevCon 2015 Austria 06.05.2015
Notes:

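  • The first method uses gsub with capture groups to match each "column" we want and to reorder the pieces as needed. Note that _ is inserted between the groups to keep the columns apart.
    • I use the basename function to extract everything after the last /
    • separate then splits the elements into actual columns, using _ as the separator
  • The second method uses essentially the same regular expression (with a leading ^.+/ in place of basename), but instead of rearranging, extract turns each capture group into a separate column
    • unite binds month, day, and year together into date without removing the original columns
    • Finally, select drops day and month and sets the column order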
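
To make the first method's intermediate step visible, here is a small base R sketch that applies the same pattern and replacement to the filename from the question. It uses sub with perl = TRUE, which is an assumption added here to make the lazy quantifier's behaviour explicit. The answer itself calls gsub with the default engine.

```r
# Filename from the question
filename <- "./data/RevCon_2015_C1_Austria_05_06.txt"

# Same pattern and replacement as in the first method. sub() is enough
# because the anchored pattern matches at most once per string.
pattern <- "^(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$"
rearranged <- sub(pattern, "\\1_\\2_\\3_\\5.\\4.\\2",
                  basename(filename), perl = TRUE)
rearranged
# [1] "RevCon_2015_Austria_06.05.2015"

# Splitting on "_" gives the four values that separate() turns into columns
pieces <- strsplit(rearranged, "_", fixed = TRUE)[[1]]
pieces
# [1] "RevCon"     "2015"       "Austria"    "06.05.2015"
```

The date keeps its dots because the replacement string joins groups 5, 4, and 2 with "." rather than "_", so the later split on "_" leaves it in one piece.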

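  • Do all entries in filename follow the same pattern, i.e. ./data/{conference}_{year}_{ignored}_{country}_{month}_{day}.txt?
  • Most of them do. There are some exceptions, but I think I can filter those out, extract the information from them in a second run, and then join the two datasets.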
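
The filter-then-recombine idea from the comments could look roughly like the sketch below. It is only an illustration under stated assumptions: the two-row df, the made-up second filename, and the choice of bind_rows to recombine are not from the thread, and the second pass over the exceptions is left out because their format is not described.

```r
library(dplyr)
library(tidyr)

# Illustrative data: one regular filename and one made-up exception
df <- data.frame(
  paragraph = c("first text", "second text"),
  filename  = c("./data/RevCon_2015_C1_Austria_05_06.txt",
                "./data/RevCon_2015_Austria.txt"),
  stringsAsFactors = FALSE
)

# Same regular expression as in the extract-based method
pattern <- "^.+/(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$"

# First run: only the rows whose filename matches the pattern
regular <- df %>%
  filter(grepl(pattern, filename, perl = TRUE)) %>%
  extract(filename, c("conference", "year", "country", "day", "month"),
          pattern, remove = FALSE)

# Rows that do not match are set aside for a separate second pass
exceptions <- df %>%
  filter(!grepl(pattern, filename, perl = TRUE))

# bind_rows() fills the columns missing from `exceptions` with NA
combined <- bind_rows(regular, exceptions)
```

Once the exceptions have been parsed with their own pattern, they would have the same columns as regular and could be bound back in the same way.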