Converting text to a matrix for export to .csv in R
I have the following text:
Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other
address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:
Atodo - Asociación de todo Address: calle 12 Bogota Colombia
Other address: Phone.: 12-1-23-32 E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.
I would like to obtain a matrix with column names that I can convert to a .csv file, with a header like this:
Company, Address, Other Address, Tel, E-mail, Web page, Category, Sector, Notes
and rows:
Anada - Asociación de nada, calle 13 13 Medellin Colombia, 13-13-136131 13-13-13-1313,anada@13.co,,3,Private,,
Atodo - Asociación de todo,calle 12 Bogota Colombia,,12-1-23-32,www.atodoooo.com,99,Public,note that there are missing fields.
How can I do this in R?
This may be tedious, but it seems to require string processing:
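library(stringr)
# Labels to strip from the pieces once the text has been split on colons
splitlist = 'Address|Other address|Phone|E-mail|Web page|Category'
a = str_split(text[1], ':')
# str_replace_all() is vectorised over its input, so no explicit loop is needed
a[[1]] = str_replace_all(a[[1]], splitlist, "")
a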
# [[1]]
# [1] "Atodo - Asociación de todo " " calle 12 Bogota Colombia "
# [3] " ." " 12-1-23-32 "
# [5] " " " www.atodoooo.com, "
# [7] " 99. Public sector Notes" " note that there are missing fields."
Then you can extract each field with less string processing.
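A minimal sketch of that extraction step, using base R only. It assumes every record carries the same eight labels in the same order, as the two samples in the question do, and that no value contains a colon. `parse_record()` and `label_rx` are hypothetical names introduced here, and the output path is a temporary file chosen for illustration.

```r
# Two records, one line per entry (copied from the question).
text <- c(
  "Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:",
  "Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32 E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields."
)

# After splitting on ":", each piece ends with the label of the NEXT field.
# "Phone\\." also removes the dot that the plain "Phone" label leaves behind.
label_rx <- " *(Address|Other address|Phone\\.|E-mail|Web page|Category|Notes)$"

# Hypothetical helper: turn one record into a one-row data frame.
parse_record <- function(rec) {
  pieces <- strsplit(rec, ":", fixed = TRUE)[[1]]
  length(pieces) <- 8                  # strsplit() drops a trailing empty piece
  pieces[is.na(pieces)] <- ""
  pieces <- trimws(sub(label_rx, "", pieces))
  data.frame(
    Company         = pieces[1],
    Address         = pieces[2],
    `Other Address` = pieces[3],
    Tel             = pieces[4],
    `E-mail`        = pieces[5],
    `Web page`      = sub(",$", "", pieces[6]),               # drop trailing comma
    Category        = sub("\\..*$", "", pieces[7]),           # "99. Public sector" -> "99"
    Sector          = sub("^[0-9]+\\. *(.*) sector$", "\\1", pieces[7]),
    Notes           = pieces[8],
    check.names = FALSE, stringsAsFactors = FALSE
  )
}

result <- do.call(rbind, lapply(text, parse_record))
write.csv(result, file.path(tempdir(), "companies.csv"), row.names = FALSE)
```

Padding the pieces to length eight matters for the first record, which ends in `Notes:` with nothing after it. A value that contains a colon (a URL with `http://`, for instance) would break the fixed positions used here.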
I can't think of any simpler way than regular expressions in this case.
The following assumes that your records are one line per entry, i.e. the input looks like this:
text <- c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:",
"Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32 E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")
From there, the approach is basically as follows:
- Extract a list of the "header" parts
- Extract a list of the corresponding values
- Put them back together into a vector
- Split them apart again
- Reshape the result from "long" format to "wide" format
library(devtools)
library(data.table)
library(reshape2)
source_gist("11380733") ## For cSplit
The approach is similar to @won782's:
splitlist <- c("Address:", "Other address:", "Phone.:", "E-mail:", "Web page:",
"Category:", "Public sector Notes:", "Private sector Notes:")
pattern <- paste0(splitlist, collapse = "|")
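As a sketch of the first two steps, the pattern can be used once to pull out the headers and once to split out the values. This uses only base R; the names `Names` and `Values` are assumptions made for this illustration and do not appear in the original text.

```r
# Two records, one line per entry (copied from the question).
text <- c(
  "Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:",
  "Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32 E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields."
)

splitlist <- c("Address:", "Other address:", "Phone.:", "E-mail:", "Web page:",
               "Category:", "Public sector Notes:", "Private sector Notes:")
pattern <- paste0(splitlist, collapse = "|")

# Headers found in each record, in order of appearance.
Names <- regmatches(text, gregexpr(pattern, text))

# Text between the headers; the first piece of each record is the company name.
Values <- strsplit(text, pattern)

lengths(Names)   # headers per record
lengths(Values)  # values per record
```

Note that `strsplit()` drops a trailing empty piece, so the first record, which ends right after `Notes:`, yields one value fewer than the second. Any step that pairs headers with values needs to pad for that.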
The cSplit function works nicely with data.tables, so let's use it directly:
DT <- data.table(V1 = unlist(Combined)) ## unlist the values
DT <- cSplit(DT, "V1", ":") ## Split by a colon
DT[, V1_1 := gsub("Public sector |Private sector ", "", V1_1)] ## Just "notes"
DT[, id := cumsum(V1_1 == "Company")] ## Add an id column
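The snippet above relies on `Combined`, which is never defined in the text as it stands, and it stops before the long-to-wide step. The following is a sketch of the whole pipeline under stated assumptions: `Combined` is taken to be a list of "Header:value" strings with a synthetic "Company:" header for the first piece, and base R (`sub()`, `reshape()`) stands in for `cSplit` and the reshaping packages. The intermediate names and the output file are illustrative, not the answer author's code.

```r
# Two records, one line per entry (copied from the question).
text <- c(
  "Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:",
  "Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32 E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields."
)
splitlist <- c("Address:", "Other address:", "Phone.:", "E-mail:", "Web page:",
               "Category:", "Public sector Notes:", "Private sector Notes:")
pattern <- paste0(splitlist, collapse = "|")

Names  <- regmatches(text, gregexpr(pattern, text))  # headers per record
Values <- strsplit(text, pattern)                    # values per record

# Assumed shape of `Combined`: "Header:value" strings, one vector per record.
Combined <- lapply(seq_along(text), function(i) {
  headers <- c("Company:", Names[[i]])
  values  <- Values[[i]]
  length(values) <- length(headers)    # strsplit() drops a trailing empty value
  values[is.na(values)] <- ""
  paste0(headers, values)
})

# Split each string at its first colon into header and value ("long" format).
V1 <- unlist(Combined)
long <- data.frame(
  header = sub("^(Public|Private) sector ", "", sub(":.*$", "", V1)),  # just "Notes"
  value  = trimws(sub("^[^:]*:", "", V1)),
  stringsAsFactors = FALSE
)
long$id <- cumsum(long$header == "Company")          # one id per record

# One row per record, one column per header ("wide" format).
wide <- reshape(long, idvar = "id", timevar = "header", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))
write.csv(wide, file.path(tempdir(), "companies_wide.csv"), row.names = FALSE)
```

As in the data.table snippet, the "Public sector"/"Private sector" prefix is stripped from the Notes header, so the sector is discarded and Category keeps its trailing dot ("3.", "99."). Getting the separate Sector column the question asks for would need one more extraction step.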
Thanks, but it doesn't solve my problem for multiple cases. What solves my problem (except for the first item) is text
Thanks, but the assumption of one line per entry is not correct. The example above contains two entries, but they are not split into lines. All the entries sit in one unformatted text, though converting that text to one line per entry should be easy. What is the simplest way to do it?
@xav, is "Address:" always on the first line of an entry? If so, this should be an easy fix.
I see that the text you converted has been split. Here is what the text should be: c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes: Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32 E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")
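If "Address:" does appear on the first line of every entry, as asked in the comment above, regrouping the raw lines into one string per entry could look like the sketch below. It assumes the input still has the line breaks shown in the question (for example as read by `readLines()`); it does not handle the case where everything has already been collapsed into a single string. The names `lines`, `starts` and `entry_id` are illustrative.

```r
# Raw lines as they appear in the question; one entry spans several lines.
lines <- c(
  "Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other",
  "address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:",
  "Atodo - Asociación de todo Address: calle 12 Bogota Colombia",
  "Other address: Phone.: 12-1-23-32 E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields."
)

# A line containing "Address:" (capital A) starts a new entry;
# "Other address:" does not match because the comparison is case-sensitive.
starts   <- grepl("Address:", lines, fixed = TRUE)
entry_id <- cumsum(starts)

# Paste the lines of each entry back together into one string per entry.
text <- unname(vapply(split(lines, entry_id), paste, character(1), collapse = " "))
length(text)
```

The resulting `text` vector has the one-line-per-entry shape that the answers above assume.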