R: How do I correctly extract a table with two header rows from a web page?
I want to download a table from a web page. The expected output was created by copying and pasting the content into Excel, typing in the column names by hand, and exporting it as txt.
I tried to import this table from HTML into R using rvest or XML, but failed.
rvest attempt:
In Chrome I extracted the XPath of the table node via right-click -> Inspect. I then tried to scrape the table with the following code, but the result was a table containing only the headers:
library(rvest)
ts.url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="
ts.page <- read_html(ts.url)
results <- html_table(html_node(ts.page, xpath='//*[@id="restable"]'), fill = TRUE)
# > results
# Target gene Representative 3' UTR 3' UTR expression profile All sites All sites All sites All sites
# 1 Target gene Representative 3' UTR 3' UTR expression profile total All sites All sites All sites
# Repre- sentative miRNA Total context+ score Links to sites in UTRs
# 1 8mer 7mer-m8 7mer-1A
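My question is: how do I correctly import the table from this web page with rvest or XML? (As long as the table contents are retrieved, the headers do not matter.)
I think this will get you very close to what you want. Nobody likes tables with uneven header levels, ha.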
library(rvest)
url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="
dat <- read_html(url)
# extract the "td" elements from table
x <- html_node(dat, xpath = '//*[@id="restable"]') %>%
  html_nodes("td") %>%
  html_text()
# put these in a character matrix -- be careful manually setting number of columns
my_matrix <- matrix(x, ncol = 10, byrow = TRUE)
# put these in a dataframe if you prefer that
my_df <- data.frame(my_matrix, stringsAsFactors = FALSE)
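On the "be careful manually setting number of columns" point: matrix() only warns and recycles values when the number of cells is not a multiple of ncol, so a wrong column count can silently produce a shifted table. A minimal sketch of a guard, assuming a made-up helper name (cells_to_df) and toy cell values rather than the real page content:

```r
# Hypothetical helper (not part of rvest): reshape a flat vector of cell
# texts into a data frame, stopping if the cells do not fill complete rows.
cells_to_df <- function(cells, n_cols) {
  if (length(cells) %% n_cols != 0) {
    stop("number of cells (", length(cells),
         ") is not a multiple of ", n_cols)
  }
  m <- matrix(cells, ncol = n_cols, byrow = TRUE)
  data.frame(m, stringsAsFactors = FALSE)
}

# Toy input standing in for the html_text() result: two rows of three cells.
cells <- c("wnt2ba", "3", "-0.66",
           "eef2a.1", "3", "-0.80")
toy_df <- cells_to_df(cells, n_cols = 3)
print(toy_df)
```

With the real page you would pass the x vector from above and n_cols = 10; if the site ever adds or drops a column, the call fails loudly instead of returning misaligned rows.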
The rows have a closing </tr> tag but no opening <tr> tag, so you can add the missing tags, and then readHTMLTable should work.
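library(XML)
# ts.url is the URL defined in the question
x <- readLines(ts.url)
# prepend the missing <tr> to every line that starts with a bare <td>
x <- gsub("^<td>", "<tr><td>", x)
y <- readHTMLTable(x, which=3, skip.rows=1)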
dim(y)
[1] 4184 10
head(y)
V1 V2 V3 V4 V5 V6 V7
1 si:ch73-269m14.4 ENSDARG00000086612.1 72h,Adult,Brain,Testis 7 0 7 0
2 eef2a.1 ENSDARG00000042094.1 24h,72h,Adult,Brain,Ovary,Testis 3 1 2 0
3 WFIKKN2 (2 of 2) ENSDARG00000059139.1 72h,Adult,Brain,Testis 3 1 2 0
4 si:ch211-59h6.1 ENSDARG00000013022.1 24h,72h,Adult,Brain,Ovary,Testis 3 1 2 0
5 RSF1 (3 of 3) ENSDARG00000074737.1 24h,Adult 6 1 4 1
6 wnt2ba ENSDARG00000005050.1 72h,Adult,Ovary,Testis 3 1 2 0
V8 V9 V10
1 dre-miR-430b -1.63 Sites in UTR
2 dre-miR-430b -0.80 Sites in UTR
3 dre-miR-430b -0.76 Sites in UTR
4 dre-miR-430b -0.68 Sites in UTR
5 dre-miR-430b -0.67 Sites in UTR
6 dre-miR-430a -0.66 Sites in UTR
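To see what the gsub() repair does without hitting the live site, here is a minimal sketch on a few made-up lines of HTML. The html_lines vector and the count_tag helper are illustrative assumptions, not content taken from the page; only base R is used.

```r
# Made-up lines imitating the page: data rows close with </tr>
# but start with a bare <td> instead of <tr><td>.
html_lines <- c(
  "<table>",
  "<tr><th>Target gene</th><th>Total context+ score</th></tr>",
  "<td>wnt2ba</td><td>-0.66</td></tr>",
  "<td>eef2a.1</td><td>-0.80</td></tr>",
  "</table>"
)

# Hypothetical helper: count literal occurrences of a tag across all lines.
count_tag <- function(lines, tag) {
  sum(lengths(regmatches(lines, gregexpr(tag, lines, fixed = TRUE))))
}

# Same substitution as in the answer: add <tr> where a line starts with <td>.
fixed <- gsub("^<td>", "<tr><td>", html_lines)

open_before  <- count_tag(html_lines, "<tr>")
open_after   <- count_tag(fixed, "<tr>")
close_after  <- count_tag(fixed, "</tr>")
print(c(open_before = open_before, open_after = open_after,
        close_after = close_after))
```

Comparing the opening and closing counts before and after the substitution is a quick way to confirm that the rows are balanced before handing the lines to a parser.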
This is great. I also found this useful tutorial.