Converting text to a matrix in R for export to .csv

Tags: r, export-to-csv, strsplit

I have the following text:

Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other
address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:
Atodo - Asociación de todo Address: calle 12 Bogota Colombia
Other address: Phone.: 12-1-23-32  E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.

I want to obtain a matrix with column names, so that I can convert it to a .csv file, like this:

Company, Address, Other Address, Tel, E-mail, Web page, Category, Sector, Notes

and rows:

Anada - Asociación de nada,calle 13 13 Medellin Colombia,,13-13-136131 13-13-13-1313,anada@13.co,,3,Private,

Atodo - Asociación de todo,calle 12 Bogota Colombia,,12-1-23-32,,www.atodoooo.com,99,Public,note that there are missing fields.

How can I do this using R?

This may be tedious, but it looks like some string manipulation is needed:

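library(stringr)

# Labels to strip; `text` is assumed to hold one record per element
splitlist = 'Address|Other address|Phone|E-mail|Web page|Category'
a = str_split(text[1], ':')  # split the record at every colon

# str_replace_all() is vectorised, so no explicit loop is needed
a[[1]] = str_replace_all(a[[1]], splitlist, "")
a

# Output when text[1] is the "Atodo" record:
# [[1]]
# [1] "Atodo - Asociación de todo "          " calle 12 Bogota Colombia "
# [3] " ."                                   " 12-1-23-32  "
# [5] " "                                    " www.atodoooo.com, "
# [7] " 99. Public sector Notes"             " note that there are missing fields."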

You can then extract each field with only a little more string handling.

I can't think of any simpler way than regular expressions in this case.
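As a rough illustration of that last extraction step, the sketch below splits a single record at the field labels themselves instead of at every colon, then names the pieces. It uses base R only; the variable names (`rec`, `labels`, `parts`, `row`) and the way Category and Sector are separated are assumptions made for this example, not part of the answer above.

```r
# One record, copied from the question
rec <- "Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32  E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields."

# Split at the labels (including their colons), then trim whitespace
labels <- "Other address:|Address:|Phone\\.:|E-mail:|Web page:|Category:|Notes:"
parts <- trimws(strsplit(rec, labels)[[1]])

# strsplit() drops a trailing empty field (e.g. an empty Notes), so pad to 8
length(parts) <- 8
parts[is.na(parts)] <- ""
names(parts) <- c("Company", "Address", "Other Address", "Tel",
                  "E-mail", "Web page", "Category", "Notes")

# "99. Public sector" -> Category "99" and Sector "Public"
category <- sub("\\..*$", "", parts[["Category"]])
sector <- sub("^[0-9]+\\.\\s*(.*?)\\s*sector$", "\\1",
              parts[["Category"]], perl = TRUE)

# Assemble the nine columns requested in the question
row <- c(parts[1:6], Category = category, Sector = sector, parts["Notes"])
row[["Web page"]] <- sub(",$", "", row[["Web page"]])  # drop the stray comma
```

Applying the same steps to every record and stacking the results with `rbind()` would give a character matrix that `write.csv()` can export.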

The following assumes that your records are on one line per entry, that is, that the data looks like this:

text <- c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:", 
          "Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32  E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")

From there, the approach is basically to:

Extract a list of the "header" parts
Extract a list of the related values
Recombine them into a vector
Split them apart again
Reshape the result from a "long" format to a "wide" format



The tools used are as follows:
library(devtools)
library(data.table)
library(reshape2)
source_gist("11380733") ## For cSplit (also available in the splitstackshape package)

The approach is similar to @won782's:
splitlist <- c("Address:", "Other address:", "Phone.:", "E-mail:", "Web page:",
               "Category:", "Public sector Notes:", "Private sector Notes:")
pattern <- paste0(splitlist, collapse = "|")
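
The code that builds `Combined` does not appear above. The following base-R sketch shows one way the first three steps could look; it is not necessarily the answerer's original code. The names `Headers` and `Values`, the added "Company:" label, and the padding of a missing trailing value are assumptions made for this example.

```r
text <- c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:",
          "Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32  E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")

splitlist <- c("Address:", "Other address:", "Phone.:", "E-mail:", "Web page:",
               "Category:", "Public sector Notes:", "Private sector Notes:")
pattern <- paste0(splitlist, collapse = "|")

# Step 1: the labels found in each record, with an assumed "Company:" label
# in front because the company name itself carries no label
Headers <- lapply(regmatches(text, gregexpr(pattern, text)),
                  function(h) c("Company:", h))

# Step 2: the values between the labels, trimmed of whitespace
Values <- lapply(strsplit(text, pattern), trimws)

# Step 3: recombine as "label: value"; strsplit() drops a trailing empty
# value (an empty Notes field), so pad the values to the header length first
Combined <- Map(function(h, v) {
  length(v) <- length(h)
  v[is.na(v)] <- ""
  paste(h, v)
}, Headers, Values)
```

Note that with this label list the sector ("Private", "Public") becomes part of the Notes label rather than a value of its own.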

The cSplit function works nicely with data.tables, so let's just use that:
## `Combined` is the list of recombined "label: value" strings from the earlier steps
DT <- data.table(V1 = unlist(Combined))       ## unlist the values
DT <- cSplit(DT, "V1", ":")                   ## Split by a colon
DT[, V1_1 := gsub("Public sector |Private sector ", "", V1_1)]  ## Just "notes"
DT[, id := cumsum(V1_1 == "Company")]         ## Add an id column
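
The final steps, splitting each "label: value" string again and reshaping from long to wide, are likewise not shown. Below is a base-R sketch of how they might look, using `reshape()` in place of `cSplit` and reshape2. The shortened `combined` vector and the names `field`, `value`, `long`, `wide`, and `out` are hypothetical and exist only for this example.

```r
# Hypothetical "label: value" strings, shortened to four fields per record
combined <- c("Company: Anada - Asociación de nada",
              "Address: calle 13 13 Medellin Colombia",
              "Phone.: 13-13-136131 13-13-13-1313",
              "Private sector Notes: ",
              "Company: Atodo - Asociación de todo",
              "Address: calle 12 Bogota Colombia",
              "Phone.: 12-1-23-32",
              "Public sector Notes: note that there are missing fields.")

# Split at the first colon only, so a value containing ":" stays intact
field <- sub(":.*$", "", combined)
value <- trimws(sub("^[^:]*:", "", combined))
field <- gsub("Public sector |Private sector ", "", field)  # just "Notes"

# Long format: one row per field, with an id that increments at each Company
long <- data.frame(id = cumsum(field == "Company"), field, value,
                   stringsAsFactors = FALSE)

# Wide format: one row per record, one column per field
wide <- reshape(long, idvar = "id", timevar = "field", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))

# Export without the helper id column
out <- tempfile(fileext = ".csv")
write.csv(wide[, -1], out, row.names = FALSE)
```

Replacing `tempfile()` with a real path writes the .csv file the question asks for.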

Thanks, but it doesn't solve my problem in multiple cases. What does solve my problem (except for the first item) is text

Thanks, but the assumption of one line per entry is incorrect. The example above contains two entries, but they are not separated into lines. All the entries are in one unformatted text. It should be easy to convert the text to one line per entry, though. What would be the easiest way?

@xav, is it likely that "Address:" would always be on the first line of an entry? If so, it should be an easy fix.

I've seen that the text you converted was already split. Here is the text as it should be: c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes: Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32 E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")
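
Following the "@xav" suggestion above, here is a minimal sketch of regrouping wrapped lines into one string per entry. It assumes the input is a character vector of lines (for example from `readLines()`) and that the first line of every entry contains "Address:" with a capital A. The line breaks below are the ones shown in the question, and the names `lines`, `starts`, and `entry_id` are hypothetical.

```r
# Lines as they appear in the question, wrapped mid-entry
lines <- c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other",
           "address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:",
           "Atodo - Asociación de todo Address: calle 12 Bogota Colombia",
           "Other address: Phone.: 12-1-23-32  E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")

# A line containing "Address:" starts a new entry; the match is
# case-sensitive, so "Other address:" does not count
starts <- grepl("Address:", lines, fixed = TRUE)
entry_id <- cumsum(starts)

# Paste the lines of each entry back together with single spaces
text <- unname(vapply(split(lines, entry_id), paste, character(1),
                      collapse = " "))
```

This does not help when the whole input is a single string with no line breaks, since nothing then marks where one entry's notes end and the next company name begins.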