For loop with readHTMLTable from the XML package
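I am trying to use a for loop to extract data from multiple URLs. The problem is that the data I need can be found in different tables. This follows on from my original question. The initial data I have: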
         Code Issuer         ISIN           Type                                          URL
1 NTK007_1915   NBRK KZW1KD079153 discount notes http://www.kase.kz/en/gsecs/show/NTK007_1915
2 NTK007_1917   NBRK KZW1KD079179 discount notes http://www.kase.kz/en/gsecs/show/NTK007_1917
3 NTK007_1918   NBRK KZW1KD079187 discount notes http://www.kase.kz/en/gsecs/show/NTK007_1918
4 NTK028_1896   NBRK KZW1KD288960 discount notes http://www.kase.kz/en/gsecs/show/NTK028_1896
5 NTK028_1903   NBRK KZW1KD289034 discount notes http://www.kase.kz/en/gsecs/show/NTK028_1903
6 NTK028_1909   NBRK KZW1KD289091 discount notes http://www.kase.kz/en/gsecs/show/NTK028_1909
I have been trying the following code:
library(XML)

wanted <- c("Nominal value in issue's currency" = "Nominal Value",
            "Number of bonds outstanding" = "# of Bonds Issue")

# function returns a data frame of wanted columns for given URL
# (used when the wanted table is the 4th table on the page)
getValues1 <- function (name, url) {
  # get the table and rename columns
  sp <- readHTMLTable(url, stringsAsFactors = FALSE)
  df <- sp[[4]]
  names(df) <- c("full_name", "value")
  # filter and remap wanted columns
  result <- df[df$full_name %in% names(wanted),]
  result$column_name <- sapply(result$full_name, function(x) {wanted[[x]]})
  # add the identifier to every row
  result$name <- name
  return (result[,c("name", "column_name", "value")])
}

# same as getValues1, but reads the 7th table on the page
getValues2 <- function (name, url) {
  # get the table and rename columns
  sp <- readHTMLTable(url, stringsAsFactors = FALSE)
  df <- sp[[7]]
  names(df) <- c("full_name", "value")
  # filter and remap wanted columns
  result <- df[df$full_name %in% names(wanted),]
  result$column_name <- sapply(result$full_name, function(x) {wanted[[x]]})
  # add the identifier to every row
  result$name <- name
  return (result[,c("name", "column_name", "value")])
}
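# invoke function for each name/URL pair and print the resulting data frame
for (i in seq_along(newd$URL)) {
  sp <- readHTMLTable(newd$URL[[i]], stringsAsFactors = FALSE)
  # use the 4th table if it has two columns, otherwise fall back to the 7th
  if (ncol(sp[[4]]) == 2) {
    columns <- getValues1(newd$Code[[i]], newd$URL[[i]])
  } else {
    columns <- getValues2(newd$Code[[i]], newd$URL[[i]])
  }
  print(columns)
}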
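Please help.

It is better to use CSS or XPath selectors to deliberately select the table you want. Otherwise you are forced into control-flow gymnastics, and you still stand a good chance of accidentally pulling in something you don't want. Also, changing the for loop to lapply will more naturally give you a tidy list.

I would appreciate it if you could take a look at these two pages. I want to get the data under the "Characteristics" tab. What are the CSS and/or XPath selectors for these two pages? Still struggling with selectorgadget...

Ah, I think I finally found it: html_nodes("main>div.right>div.float-wrapper-right>div>table.content table.top") %>% html_table()

You can use h %>% html_nodes('table.top') %>% html_table(), where h is the parsed page. But it does not handle the header rows in the middle of the table, so you will have to split it afterwards.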
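The selector-plus-lapply suggestion could look roughly like the sketch below. It assumes rvest is installed; get_values and make_page are hypothetical helpers, and the markup is made up to stand in for a real page (the actual structure on kase.kz may differ). Instead of splitting the table at its mid-table header rows, this version simply filters on the wanted labels, which those header rows never match.

```r
library(rvest)

# Labels to keep, mapped to short column names (copied from the question).
wanted <- c("Nominal value in issue's currency" = "Nominal Value",
            "Number of bonds outstanding"       = "# of Bonds Issue")

# Hypothetical helper: pull the wanted rows out of the first table.top node.
get_values <- function(name, page) {
  tables <- html_table(html_nodes(page, "table.top"), header = FALSE)
  df <- as.data.frame(tables[[1]], stringsAsFactors = FALSE)[, 1:2]
  names(df) <- c("full_name", "value")
  # Header rows in the middle of the table never match a wanted label,
  # so this filter drops them as well.
  result <- df[df$full_name %in% names(wanted), ]
  data.frame(name        = rep(name, nrow(result)),
             column_name = unname(wanted[result$full_name]),
             value       = as.character(result$value),
             stringsAsFactors = FALSE)
}

# Made-up markup standing in for a real page (assumption, not the real site).
make_page <- function(nominal, outstanding) {
  paste0("<html><body>",
         "<table class='other'><tr><td>a</td><td>b</td><td>c</td></tr></table>",
         "<table class='top'>",
         "<tr><th colspan='2'>Characteristics</th></tr>",
         "<tr><td>Nominal value in issue's currency</td><td>", nominal, "</td></tr>",
         "<tr><th colspan='2'>Trade parameters</th></tr>",
         "<tr><td>Number of bonds outstanding</td><td>", outstanding, "</td></tr>",
         "</table></body></html>")
}
pages <- list(NTK007_1915 = make_page(100, 5000),
              NTK007_1917 = make_page(100, 7000))

# lapply returns one data frame per page; rbind stacks them into one.
results <- lapply(names(pages), function(code) {
  get_values(code, read_html(pages[[code]]))
})
combined <- do.call(rbind, results)
print(combined)
```

With real data, read_html() would be given each URL from newd$URL rather than an HTML string, and the selector may need to be more specific if a page has more than one table with class top.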
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
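One way this message can arise is when NULL reaches readHTMLTable as its first argument, for example when a URL is looked up in a list that has no such element. The sketch below reproduces that situation under the assumption that x is a list without a "URL" entry (the real x is not shown in the question), and adds a hypothetical is_valid_url guard.

```r
library(XML)

# Hypothetical stand-in for the `x` used inside the loop:
# a list without a "URL" element.
x <- list(name = "NTK007_1915")
url <- x[["URL"]]  # extracting a missing list element yields NULL

# Capture the error message instead of stopping the script.
msg <- tryCatch({
  readHTMLTable(url)
  NA_character_
}, error = function(e) conditionMessage(e))
print(msg)

# A simple guard: only call readHTMLTable() on a single non-empty string.
is_valid_url <- function(u) {
  is.character(u) && length(u) == 1 && !is.na(u) && nzchar(u)
}
```

Checking is_valid_url(newd$URL[[i]]) before each call would at least turn the dispatch error into an explicit, easier-to-trace condition.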