R: Extracting substrings between consecutive colons in a string
I am trying to extract some information from a variable in a data frame. I am using R 3.3.3. The information is formatted as follows:
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
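I have had some difficulty extracting this information. Other questions on this topic were very helpful: from them I learned that some form of stringr/gsub can extract it, but I don't know how to specify a range in a gsub statement.
I have been able to work out how to pull out the first part: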
>test4 <- gsub("(.*{1})(:.*)","\\1", t)
My overall problem is that this returns everything up to the last colon:
[1] "CHINA: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC"
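It would be great if I didn't have to clean the "DOMINCAN REPUBLIC" part off the end of the string.
In summary:
1. How can I extract the text between consecutive colons in a string (first to second colon, second to third, and so on)?
2. Is there a way to also keep the words in front of each colon?
Any information or guidance would be greatly appreciated.
How about the following in base R: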
# Your sample string
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region.";
# Find where each label (e.g. "GUAM:" or "DOMINCAN REPUBLIC:") starts
idx <- unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t))
# Each piece runs up to the next label; the last piece runs to the end of the string
matches <- data.frame(
    idx = idx,
    len = c(diff(idx), nchar(t))
)
# Extract the substrings by position and trim surrounding whitespace
lst <- apply(matches, 1, function(x) {
    trimws(substr(t, x[1], sum(x) - 1))
})
lst
#[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
#[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
#[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
#[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
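The position-based idea is easier to follow on a shorter string. The following is a minimal sketch, not the answerer's code: the string `s`, the simplified pattern `\\w+:` (single-word labels only) and the names `starts`, `ends` and `pieces` are all assumptions made for illustration. It uses the vectorised `substring()` instead of `apply()`.

```r
# Hypothetical short input with single-word labels
s <- "AA: one. BB: two. CC: three."

# Start position of every "label:" match
starts <- as.integer(gregexpr("\\w+:", s)[[1]])

# Each piece ends one character before the next label starts;
# the last piece ends at the end of the string
ends <- c(starts[-1] - 1L, nchar(s))

# Cut the string at those positions and trim trailing spaces
pieces <- trimws(substring(s, starts, ends))
pieces
```

Because each piece is cut from the start of one label to just before the next, the label stays attached to its text, which is what the second part of the question asks for.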
You can use strsplit with an appropriate regular expression:
strsplit(t, "\\.\\s(?=[\\w\\s]+:)", perl=TRUE)
which returns:
[[1]]
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"
[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"
[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
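Notes:
\\.\\s matches a literal dot followed by a whitespace character.
(?=[\\w\\s]+:) is a positive lookahead: it requires that one or more word or whitespace characters, followed by a colon, come next, without consuming them.
So \\.\\s(?=[\\w\\s]+:) matches a dot and a space only when they are immediately followed by such a run of characters and a colon. That is the end of each paragraph.
Because the regular expression is used in strsplit, the string is split on whatever it matches, that is, at the end of each paragraph. The matched dot and space are dropped, while the label inside the lookahead is kept.
perl=TRUE is needed because lookahead is only supported by R's Perl-compatible regex engine.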
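To see the effect of the lookahead in isolation, here is a small sketch on a made-up string (the input `s` and the names `parts` and `naive` are assumptions for illustration). It contrasts the lookahead pattern with a plain split on every dot and space.

```r
# Hypothetical input: the first entry contains two sentences
s <- "A: first. Still first. B C: second. D: third."

# Split only where ". " is followed by a label and a colon
parts <- strsplit(s, "\\.\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]
parts

# Without the lookahead, every sentence boundary becomes a split point
naive <- strsplit(s, "\\.\\s")[[1]]
naive
```

With the lookahead, "Still first" stays attached to entry "A", whereas the plain pattern cuts it off as a separate piece. One caveat: a sentence inside a description that consists only of words and spaces before a colon of its own would also trigger a split, so this pattern suits text where colons appear only after labels.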
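This is great! I will need some time to fully understand how it splits the data, but thank you very much, I really appreciate it!
Thanks a lot. I also tried this, and it works well: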
stringr::str_split(t, "\\.\\s(?=[\\w\\s]+:)")
[[1]]
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"
[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"
[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
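Once the string has been split, the label before each colon can be separated from its description. The following is a minimal base R sketch, assuming every piece contains at least one colon. The shortened `pieces` vector and the column names `label` and `description` are hypothetical, not part of the answers above.

```r
# Hypothetical, shortened pieces in the same "LABEL: text" shape
pieces <- c("China: Officially the PRC.",
            "DOMINCAN REPUBLIC: Is a country.")

# Everything before the first colon is the label
label <- sub(":.*$", "", pieces)

# Everything after the first colon (and any following spaces) is the description
description <- sub("^[^:]*:\\s*", "", pieces)

# stringsAsFactors = FALSE keeps the columns as character in R versions before 4.0
result <- data.frame(label, description, stringsAsFactors = FALSE)
result
```

Both patterns anchor on the first colon only, so colons that appear later in a description are left untouched.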