How do I scrape data from a website in R?


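I am using the following code to fetch information from a website, but I don't know how to get a data frame that contains "date, open, close, high, low". Any help would be appreciated.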

thepage = readLines('http://q.stock.sohu.com/hisHq?code=cn_000002&start=20131120&end=20140318&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.8740235545448934&0.28161772061461654')

How do I get a data frame from this?

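It looks like you are trying to scrape data from a website.

To do that, besides the "usual steps" needed to scrape a site, you also have to parse and manipulate JSON data: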

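# x is the document parsed with htmlTreeParse(), as shown further down
library(XML)
json <- getNodeSet(x, "//body/p")   # select the <p> node(s) under <body>
json <- xmlValue(json[[1]])         # extract the text of the first node

# parse the JSON text into a nested R list
library(jsonlite)
fromJSON(json, simplifyVector=FALSE)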

[[1]]
[[1]]$status
[1] 0

[[1]]$hq
[[1]]$hq[[1]]
[[1]]$hq[[1]][[1]]
[1] "2014-03-18"

[[1]]$hq[[1]][[2]]
[1] "7.76"

[...]

Usual approach

If your HTML contains a table that is easy to scrape, this should work:

require("XML")
x <- readHTMLTable(
    doc="www.someurl.com"   # placeholder URL
)

Approach including JSON data

Parse the HTML code:

x <- htmlTreeParse(
    file="http://q.stock.sohu.com/hisHq?code=cn_000002&start=20131120&end=20140318&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.8740235545448934&0.28161772061461654",
    isURL=TRUE,
    useInternalNodes=TRUE
)

Now you need to bring this into a more data.frame-like order (do.call(), rbind() and cbind() come to mind).
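One way that reshaping step might look, as a minimal sketch. It assumes each element of hq is a list of character strings, as in the fromJSON() output above. The two sample records are copied from the output shown in this thread. The thread does not say which column holds which quantity, so the default column names V1..V10 are kept.

```r
# Two records shaped like the elements of hq (values taken from the thread's output)
hq <- list(
  list("2014-03-18", "7.76", "7.60", "-0.16", "-2.06%",
       "7.55", "7.76", "843900", "64268.06", "0.87%"),
  list("2014-03-17", "7.60", "7.76", "0.23", "3.05%",
       "7.59", "7.80", "1177079", "90829.30", "1.22%")
)

# Flatten each record to a character vector, then stack the records as rows
rows <- lapply(hq, unlist)
dat <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)

# Convert a few columns to proper types; the meaning of each column is an
# assumption left to the reader, so no descriptive names are assigned here
dat$V1 <- as.Date(dat$V1)
dat$V2 <- as.numeric(dat$V2)
dat$V3 <- as.numeric(dat$V3)

str(dat)
```

Percentage columns such as V5 stay as character strings here; they would need the "%" stripped before as.numeric() can be applied.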

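Encoding

Sooner or later (rather sooner than later, as this example shows) you will run into encoding issues (like "ÃÃ›¼Ã†：").

You can either specify a different encoding directly when parsing the HTML code (the encoding argument of htmlTreeParse()), or change the encoding of a character string "after the fact" via Encoding. However, I did not manage to get it quite right for your values. Encoding issues can be quite a pain.
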
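As a rough illustration of the "after the fact" route, the sketch below uses base R's Encoding(), enc2utf8() and iconv() on a one-byte sample string. The Latin-1 sample is my own assumption, chosen only because it is simple. The page in this question presumably uses a different encoding, which I have not verified.

```r
# A single byte 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own
x <- rawToChar(as.raw(0xe9))
Encoding(x)              # shows how R currently marks the string

# Option 1: declare the encoding the bytes are really in (the bytes do not change),
# then let R translate the string to UTF-8
Encoding(x) <- "latin1"
y <- enc2utf8(x)

# Option 2: convert explicitly from the assumed source encoding with iconv()
z <- iconv(rawToChar(as.raw(0xe9)), from = "latin1", to = "UTF-8")

charToRaw(y)             # inspect the resulting bytes
charToRaw(z)
```

Both routes only help if the source encoding is guessed correctly; a wrong guess simply produces a different kind of garbage.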
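General advice

I suggest you choose English-based examples in future (in this case, an English-language website); otherwise you greatly limit the number of people who can help you.

I don't know which parts of the returned JSON are the actual values you need, but I assume they are the components of the hq record.

This should work: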

library(RJSONIO)
library(RCurl)

# get the raw data
dat.json.raw <- getURL("http://q.stock.sohu.com/hisHq?code=cn_000002&start=20131120&end=20140318&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.8740235545448934&0.28161772061461654")
tt <- textConnection(dat.json.raw)
dat.json <- readLines(tt)
close(tt)

# remove callback
dat.json <- gsub("^historySearchHandler\\(", "", dat.json)
dat.json <- gsub("\\)$", "", dat.json)

# convert to R structure
dat.l <- fromJSON(dat.json)

# get the meaty part of the data into a data.frame
dat <- data.frame(t(sapply(dat.l[[1]]$hq, unlist)), stringsAsFactors=FALSE)
dat$X1 <- as.Date(dat$X1)
dat$X2 <- as.numeric(dat$X2)
dat$X3 <- as.numeric(dat$X3)
dat$X4 <- as.numeric(dat$X4)

str(dat)
## 'data.frame':    79 obs. of  10 variables:
##  $ X1 : Date, format: "2014-03-18" "2014-03-17" "2014-03-14" ...
##  $ X2 : num  7.76 7.6 7.68 7.58 7.48 7.19 7.22 7.34 6.76 6.92 ...
##  $ X3 : num  7.6 7.76 7.53 7.71 7.6 7.5 7.15 7.27 7.32 6.76 ...
##  $ X4 : num  -0.16 0.23 -0.18 0.11 0.1 0.35 -0.12 -0.05 0.56 -0.16 ...
##  $ X5 : chr  "-2.06%" "3.05%" "-2.33%" "1.45%" ...
##  $ X6 : chr  "7.55" "7.59" "7.50" "7.53" ...
##  $ X7 : chr  "7.76" "7.80" "7.81" "7.85" ...
##  $ X8 : chr  "843900" "1177079" "1303110" "1492359" ...
##  $ X9 : chr  "64268.06" "90829.30" "99621.34" "114990.40" ...
##  $ X10: chr  "0.87%" "1.22%" "1.35%" "1.54%" ...

head(dat)
##           X1   X2   X3    X4     X5   X6   X7      X8        X9   X10
## 1 2014-03-18 7.76 7.60 -0.16 -2.06% 7.55 7.76  843900  64268.06 0.87%
## 2 2014-03-17 7.60 7.76  0.23  3.05% 7.59 7.80 1177079  90829.30 1.22%
## 3 2014-03-14 7.68 7.53 -0.18 -2.33% 7.50 7.81 1303110  99621.34 1.35%
## 4 2014-03-13 7.58 7.71  0.11  1.45% 7.53 7.85 1492359 114990.40 1.54%
## 5 2014-03-12 7.48 7.60  0.10  1.33% 7.42 7.85 2089873 160315.88 2.16%
## 6 2014-03-11 7.19 7.50  0.35  4.90% 7.15 7.59 1892488 141250.94 1.96%

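Thanks for your advice. I don't have a suitable English-based example yet. If I have the data in "thepage", how can I convert it into a data frame that contains only: 2014-03-18, "2014-03-18", "7.76", "7.60", "-0.16", "-2.06%", "7.55", "7.76", "843900", "64268.06", "0.87%"?

Well, follow the workflow I outlined. If you try to work directly with the page returned by readLines() (which is plain text, not a parsed object), you won't get far and/or will need a lot of regular expressions. What you want is to work with the parsed HTML code, because it lets you navigate the node tree easily. If you want to do web scraping, you will need to learn some XPath and/or JSON, sorry.

Hopefully the callback function name doesn't change. If it does, you can adjust the first gsub() (I guess you could also make it more generic if needed). Getting information out of JSON is often challenging; this particular pattern works well for me in most cases (with tweaks).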
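A more generic version of the callback removal could look roughly like the sketch below. The helper name strip_jsonp and the sample response are hypothetical, made up for illustration. It assumes the response has the shape name( ... ), optionally followed by a semicolon.

```r
# Hypothetical helper: strip a JSONP wrapper of the form name( ... ) or name( ... );
strip_jsonp <- function(txt) {
  # join the lines in case readLines() returned more than one
  txt <- paste(txt, collapse = "")
  # drop the leading "callbackName(" whatever the callback is called
  txt <- sub("^\\s*[A-Za-z_$][A-Za-z0-9_$.]*\\s*\\(", "", txt)
  # drop the trailing ")" and an optional ";"
  sub("\\)\\s*;?\\s*$", "", txt)
}

# Made-up sample response in the same shape as the data discussed in this thread
resp <- 'historySearchHandler([{"status":0,"hq":[["2014-03-18","7.76"]]}])'
json_text <- strip_jsonp(resp)
print(json_text)
```

The returned string can then be passed to fromJSON() in place of the output of the two gsub() calls.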