R: Extracting text between consecutive colons in a string


I am trying to extract some information from a variable in a data frame. I am using R 3.3.3.

The information is formatted as follows:

t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
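I have run into some trouble trying to extract this information. Related questions were very helpful: from them I learned that some form of stringr/gsub can be used to extract it, but I don't know how to specify a range in a gsub statement.

I have been able to work out how to pull out the first part: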

>test4 <- gsub("(.*{1})(:.*)","\\1", t)
My output is:

[1] "CHINA: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC"
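It would be great if I didn't have to clean the "DOMINCAN REPUBLIC" part off the end of the string.

In summary:

1. How can I extract the text between consecutive colons in a string (first to second colon, second to third, and so on)?

2. Is there a way to also keep the word in front of each colon?


Any information or guidance would be greatly appreciated.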

How about the following in base R?

# Your sample string
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region.";

# Get start positions of the "LABEL:" matches (computed once)
idx <- unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t))

# Each piece runs from one match start up to the next; the last runs to the end
matches <- data.frame(
    idx = idx,
    len = c(diff(idx), nchar(t))
)

# Get substrings based on positions and store them in a character vector
lst <- apply(matches, 1, function(x) {
    trimws(substr(t, x[1], sum(x) - 1))
})
lst

#[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
#[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
#[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
#[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
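
Regarding the second part of the question (keeping the word in front of the colon): once the string is split into pieces like the ones above, each piece can be separated at its first colon. This is only a minimal sketch on a shortened, made-up sample; the names `pieces`, `label`, `description` and `result` are mine and not part of the original data.

```r
# Hypothetical, shortened sample in the same "LABEL: description" shape
pieces <- c("GUAM: Is a territory.", "DOMINCAN REPUBLIC: Is a country.")

# Everything before the first colon is the label
label <- sub(":.*$", "", pieces)

# Everything after the first colon (and any whitespace following it) is the description
description <- sub("^[^:]*:[[:space:]]*", "", pieces)

# One row per piece, label and description in separate columns
result <- data.frame(label, description, stringsAsFactors = FALSE)
print(result)
```

This assumes the label itself never contains a colon, which holds for the sample string in the question.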
You can use strsplit with an appropriate regular expression:

strsplit(t, "\\.\\s(?=[\\w\\s]+:)", perl=TRUE)

Notes:

  • \\.\\s matches a literal dot followed by a whitespace character
  • (?=[\\w\\s]+:) is a positive lookahead that matches one or more word characters or whitespace characters followed by a colon
  • So \\.\\s(?=[\\w\\s]+:) matches a dot and a space only when they are immediately followed by one or more word characters or whitespace characters and then a colon. That is the end of each entry
  • Because the regex is used in strsplit, the string is split on whatever the regex matches, which results in a split at the end of each entry
  • perl=TRUE is required to enable lookaheads/lookbehinds

Result:

    [[1]]
    [1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"                                         
    [2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"                                                                                                         
    [3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
    [4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region." 
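
To see why the lookahead matters, it may help to try the same pattern on a tiny string. This is just an illustrative sketch; the sample string `x` and the name `parts` are made up. The ". " after "two" is not a split point, because what follows it ("Three.") hits another dot before it reaches a colon.

```r
# Made-up sample: the second entry contains two sentences
x <- "A: one. B C: two. Three. D: four."

# Split on ". " only when word characters/whitespace and then a colon follow
parts <- strsplit(x, "\\.\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]
print(parts)
# [1] "A: one"          "B C: two. Three" "D: four."
```

Note that the matched ". " is consumed by the split, so every piece except the last loses its trailing period.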
    

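This is great! I need to spend some time to fully understand how it splits the data, but thank you very much, I really appreciate it.

I also tried this, and it works well:

    stringr::str_split(t, "\\.\\s(?=[\\w\\s]+:)")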
    
    [[1]]
    [1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"                                         
    [2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"                                                                                                         
    [3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
    [4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
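
In both results only the last entry keeps its final period, because the ". " that was matched is removed by the split. If that matters, one possible variation (my own assumption, not something from the answers above) is to move the dot into a lookbehind so that only the whitespace is consumed. A minimal base R sketch on a made-up string:

```r
# Made-up sample in the same "LABEL: text" shape as the question
x <- "A: one. B C: two. Three. D: four."

# (?<=\\.) requires a dot before the whitespace but does not consume it;
# the lookahead is the same as in the pattern above
parts <- strsplit(x, "(?<=\\.)\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]
print(parts)
# [1] "A: one."          "B C: two. Three." "D: four."
```

As before, `perl = TRUE` is needed for the lookbehind and lookahead to work in strsplit.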