R: Extracting characters from a string between successive colons

I am trying to extract some information from a variable in a data frame. I am using R 3.3.3.

The information is formatted as follows:

t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

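I have had some difficulty extracting this information. Existing questions on similar topics were very helpful. From them I learned that some form of stringr/gsub can be used to extract this information, but I don't know how to specify a range in a gsub call.

I have been able to work out how to pull out the first part: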

>test4 <- gsub("(.*{1})(:.*)","\\1", t)

My overall problem is the result it returns:

[1] "CHINA: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC"

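It would be great if I didn't have to clean the "DOMINCAN REPUBLIC" part off the end of the string.

In summary:

1. How can I extract characters from a string between successive colons? (First to second colon, second to third, and so on.)

2. Is there a way to also keep the word that precedes each colon?

Any information or guidance would be greatly appreciated.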
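The gsub call above keeps everything up to the last colon because .* is greedy: it grabs as much as it can and only backs off to the final colon. A minimal sketch of the difference, assuming a shortened stand-in string s (not the original data) and perl = TRUE so that the lazy quantifier .*? is available:

```r
# Shortened stand-in for the string t above (an assumption, for brevity)
s <- "China: Officially the PRC. GUAM: Is a territory. MICRONESIA: Is a subregion."

# Greedy: .* runs up to the LAST colon, so group 1 is almost the whole string
greedy <- sub("(.*)(:.*)", "\\1", s, perl = TRUE)

# Lazy: .*? stops at the FIRST colon, so group 1 is only the first label
lazy <- sub("^(.*?):.*$", "\\1", s, perl = TRUE)

print(greedy)
print(lazy)  # "China"
```

This only explains the behaviour of the first attempt; it does not split the string into its segments.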

How about the following in base R?

# Your sample string
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region.";

# Get position of regexp matches
matches <- data.frame(
    idx = unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t)),
    len = c(diff(unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t))), nchar(t))
);

# Get substrings based on positions and store in list
lst <- apply(matches, 1, function(x) {
    trimws(substr(t, x[1], sum(x) - 1));
})
lst;

#[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
#[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
#[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
#[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
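The same position-based idea can be sketched with vectorised calls instead of apply. This is only an illustration under assumptions: s is a shortened stand-in string, and the label pattern [A-Za-z]+: is deliberately simplified to single-word labels, so it would not handle a two-word label such as "DOMINCAN REPUBLIC".

```r
# Shortened stand-in string (an assumption, not the original data)
s <- "China: Big country. GUAM: A territory."

# Start position of every "label:" match (simplified single-word pattern)
m <- gregexpr("[A-Za-z]+:", s)[[1]]
starts <- as.integer(m)

# Each segment ends one character before the next label starts;
# the last segment runs to the end of the string
ends <- c(starts[-1] - 1L, nchar(s))

# substring() is vectorised over first/last, so no apply() is needed
segments <- trimws(substring(s, starts, ends))
print(segments)
```

Computing the match positions once and passing them to substring avoids calling gregexpr twice.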

You can use strsplit with an appropriate regular expression:
strsplit(t, "\\.\\s(?=[\\w\\s]+:)", perl=TRUE)

Result:
[[1]]
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"                                         
[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"                                                                                                         
[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region." 

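Notes:

\\.\\s matches a literal dot followed by a whitespace character.

(?=[\\w\\s]+:) is a positive lookahead: it requires one or more word or whitespace characters followed by a colon, without consuming them.

So \\.\\s(?=[\\w\\s]+:) matches a dot and a whitespace character only when they are immediately followed by one or more word or whitespace characters and then a colon. That is the end of each paragraph.

Because the regular expression is passed to strsplit, the string is split on whatever the regex matches, which means it is split at the end of each paragraph.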

Enabling lookaheads/lookbehinds requires perl = TRUE.
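To see what the lookahead contributes, here is a small sketch comparing the split with and without it. The string s is a made-up stand-in, not the original data:

```r
# Made-up stand-in: the first segment contains two sentences
s <- "A: One. Two. B C: Three."

# Without the lookahead, every ". " is a split point, so sentences
# inside a segment are separated as well
plain <- strsplit(s, "\\.\\s", perl = TRUE)[[1]]

# With the lookahead, ". " is a split point only when a label
# (word/whitespace characters and then a colon) follows
labelled <- strsplit(s, "\\.\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]

print(plain)     # "A: One"  "Two"  "B C: Three."
print(labelled)  # "A: One. Two"  "B C: Three."
```

The lookahead text itself is not consumed, which is why the label "B C:" stays at the start of the second piece.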

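This is great! I need to spend some time to fully understand how it splits the data, but thank you very much, I really appreciate it! I also tried this, and it works well: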

stringr::str_split(t, "\\.\\s(?=[\\w\\s]+:)")

[[1]]
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"                                         
[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"                                                                                                         
[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
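Since the data come from a data frame, one possible follow-up is to separate each piece into the label before the first colon and the text after it. This is a sketch under assumptions: pieces is a shortened stand-in for the split result, and the column names label and text are made up for illustration.

```r
# Shortened stand-in for the pieces returned by the split (an assumption)
pieces <- c("China: Officially the PRC",
            "DOMINCAN REPUBLIC: Is a country.")

# Label: drop everything from the first colon onwards
label <- sub(":.*$", "", pieces)

# Text: drop everything up to and including the first colon, then trim
text <- trimws(sub("^[^:]*:", "", pieces))

# stringsAsFactors = FALSE keeps the columns as character vectors
# (the default was TRUE in R 3.3.3)
result <- data.frame(label = label, text = text, stringsAsFactors = FALSE)
print(result)
```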