R: Extract country names (or other entities) from a column

r dataframe

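I have a data.frame whose location column contains countries and cities. I would like to extract the countries by matching against world.cities$country.etc from library(maps), or against any other collection of country names.

Consider this example: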

df <- data.frame(location = c("Aarup, Denmark",
                              "Switzerland",
                              "Estonia: Aaspere"),
                 other_col = c(2,3,4))

But I have not been successful; I expect something like this:

          location other_col     country rest_location
1   Aarup, Denmark         2     Denmark       Aarup, 
2      Switzerland         3 Switzerland              
3 Estonia: Aaspere         4     Estonia     : Aaspere

You can try this as a starting point:

library(tidyverse)
library(maps)  # provides the world.cities data set
df %>% 
  rownames_to_column() %>% 
  separate_rows(location) %>% 
  mutate(gr = location %in% world.cities$country.etc) %>% 
  mutate(gr = ifelse(gr, "country", "rest_location")) %>% 
  spread(gr, location) %>% 
  right_join(df %>% 
              rownames_to_column(), 
              by = c("rowname", "other_col")) %>% 
  select(location, other_col, country, rest_location)
          location other_col     country rest_location
1   Aarup, Denmark         2     Denmark         Aarup
2      Switzerland         3 Switzerland          <NA>
3 Estonia: Aaspere         4     Estonia       Aaspere

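Note that this only works when each entry in the location column has no more than two words. If necessary, you have to specify a suitable separator, e.g. sep = ",|:".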
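To illustrate the separator point, here is a minimal sketch. The data frame df2 and its two-word country are hypothetical, and the separator used here also swallows the spaces around the comma or colon, which is an assumption on top of the sep value suggested above.

```r
library(tidyr)

# Hypothetical data with a two-word country name
df2 <- data.frame(location = c("Auckland, New Zealand", "Estonia: Aaspere"),
                  other_col = c(5, 6),
                  stringsAsFactors = FALSE)

# The default separator splits on every run of non-alphanumeric characters,
# so "New Zealand" is broken into "New" and "Zealand"
default_split <- separate_rows(df2, location)

# Splitting only on commas and colons (plus surrounding whitespace)
# keeps "New Zealand" together as one piece
custom_split <- separate_rows(df2, location, sep = "\\s*[,:]\\s*")
```

With the custom separator, each piece can then be compared against world.cities$country.etc exactly as in the pipeline above.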

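We can create a pattern by pasting all country names together, use str_extract_all to get every country name in location that matches the pattern, and remove the words that match a country name to get the rest of the location.

sapply with toString is used for country because, if a location contains several country names, they are all concatenated into one string.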
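The description above could be sketched as follows. This is not necessarily the answerer's original code: the names countries and pattern, the \Q...\E quoting of each name, the word-boundary anchors and the longest-first ordering are all assumptions made for this sketch.

```r
library(maps)     # provides the world.cities data set
library(stringr)

df <- data.frame(location = c("Aarup, Denmark", "Switzerland", "Estonia: Aaspere"),
                 other_col = c(2, 3, 4),
                 stringsAsFactors = FALSE)

# Unique country names, longest first so that longer names win over shorter ones
countries <- unique(world.cities$country.etc)
countries <- countries[order(nchar(countries), decreasing = TRUE)]

# One alternation of all names; \Q...\E makes each name match literally
pattern <- paste0("\\b(?:", paste0("\\Q", countries, "\\E", collapse = "|"), ")\\b")

# Every country name found in a location, joined into one string per row
df$country <- sapply(str_extract_all(df$location, pattern), toString)

# What remains after the country names are removed
df$rest_location <- trimws(str_remove_all(df$location, pattern))
```

Because the pattern is built from whole names rather than single words, this variant should also cope with multi-word country names, at the cost of a fairly long regular expression.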

Base R, apart from the maps package:

# Load maps, which provides the world.cities data set:
library(maps)

# Split each location string on whitespace:
country_city_vec <- strsplit(as.character(df$location), "\\s+")

# Repeat other_col once per word and strip punctuation from the words:
rolled_out_df <- data.frame(other_col = rep(df$other_col, lengths(country_city_vec)),
                            location = gsub("[[:punct:]]", "", unlist(country_city_vec)),
                            stringsAsFactors = FALSE)

# Keep the words that are country names and merge them back onto df:
matched_with_world_df <- merge(df,
                               setNames(rolled_out_df[rolled_out_df$location %in% world.cities$country.etc, ],
                                        c("other_col", "country")),
                               by = "other_col", all.x = TRUE)

# Remove the matched country names and punctuation to get the rest of the location
# (na.omit keeps the literal string "NA" out of the pattern when a row has no match):
matched_with_world_df$rest_location <- trimws(gsub("[[:punct:]]",
                                                   "",
                                                   gsub(paste0(na.omit(matched_with_world_df$country),
                                                               collapse = "|"),
                                                        "", matched_with_world_df$location)), "both")
