R: How do I correctly extract a table with two header rows from a web page?


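I want to download a table from a web page. The expected output was created by copying and pasting the contents into Excel, entering the column names by hand, and exporting as txt.

I tried to import this table from HTML into R using rvest or XML, but failed.

rvest attempt: I got the XPath of the table node in Chrome via right-click -> Inspect. I then tried to scrape the table with the following code, but got a table that contains only the headers: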

library(rvest)
ts.url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="
ts.page <- read_html(ts.url)
results <- html_table(html_node(ts.page, xpath='//*[@id="restable"]'), fill = T)

# > results
#   Target gene Representative 3' UTR 3' UTR expression profile All sites All sites All sites All sites
# 1 Target gene Representative 3' UTR 3' UTR expression profile     total All sites All sites All sites
#   Repre- sentative miRNA Total context+ score Links to sites in UTRs
# 1                   8mer              7mer-m8                7mer-1A

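My question is: how do I correctly import the table from this web page with rvest or XML? (The headers don't matter as long as I get the table contents.)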

I think this will get you very close to what you want. Nobody likes tables with uneven header levels, ha.

library(rvest)

url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="


dat <- read_html(url)

# extract the "td" elements from table
x <- html_node(dat, xpath = '//*[@id="restable"]') %>%
    html_nodes("td") %>%
    html_text()


# put these in a character matrix -- be careful manually setting number of columns
my_matrix <- matrix(x, ncol = 10, byrow = TRUE)

# put these in a dataframe if you prefer that
my_df <- data.frame(my_matrix, stringsAsFactors = FALSE)
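
Because the column count is set by hand, it may be worth checking that the number of extracted cells actually fits it: matrix() only warns and recycles values when the length is not a multiple of ncol. The following is a minimal sketch of such a check in base R. The helper name cells_to_df and the sample cell text are made up for illustration and are not part of rvest or of the page being scraped.

```r
# Reshape a flat vector of cell text into a data frame, refusing to
# proceed when the cell count does not fit the assumed column count.
# `cells_to_df` is a hypothetical helper, not an rvest function.
cells_to_df <- function(cells, n_col) {
  if (length(cells) %% n_col != 0) {
    stop("Got ", length(cells), " cells, which is not a multiple of ", n_col)
  }
  # Fill row by row, as in the answer above
  m <- matrix(cells, ncol = n_col, byrow = TRUE)
  data.frame(m, stringsAsFactors = FALSE)
}

# Made-up cell text standing in for the html_text() result
cells <- c("geneA", "id1", "24h",
           "geneB", "id2", "72h")
demo_df <- cells_to_df(cells, n_col = 3)
dim(demo_df)
```

With the real page, the same helper could presumably be called as cells_to_df(x, n_col = 10), so a stray or missing cell stops the script instead of silently shifting every later row.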

The rows have a closing </tr> tag but no opening <tr> tag, so you can add the opening tags and then readHTMLTable should work.

library(XML)
x <- readLines(ts.url)
x <- gsub("^<td>", "<tr><td>", x)
y <- readHTMLTable(x, which=3, skip.rows=1)
dim(y)
[1] 4184   10
head(y)
                V1                   V2                               V3 V4 V5 V6 V7
1 si:ch73-269m14.4 ENSDARG00000086612.1           72h,Adult,Brain,Testis  7  0  7  0
2          eef2a.1 ENSDARG00000042094.1 24h,72h,Adult,Brain,Ovary,Testis  3  1  2  0
3 WFIKKN2 (2 of 2) ENSDARG00000059139.1           72h,Adult,Brain,Testis  3  1  2  0
4  si:ch211-59h6.1 ENSDARG00000013022.1 24h,72h,Adult,Brain,Ovary,Testis  3  1  2  0
5    RSF1 (3 of 3) ENSDARG00000074737.1                        24h,Adult  6  1  4  1
6           wnt2ba ENSDARG00000005050.1           72h,Adult,Ovary,Testis  3  1  2  0
            V8    V9          V10
1 dre-miR-430b -1.63 Sites in UTR
2 dre-miR-430b -0.80 Sites in UTR
3 dre-miR-430b -0.76 Sites in UTR
4 dre-miR-430b -0.68 Sites in UTR
5 dre-miR-430b -0.67 Sites in UTR
6 dre-miR-430a -0.66 Sites in UTR
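
Since the result comes back with generic names (V1 to V10) and text-like columns, a follow-up step could assign names and convert the numeric columns. The sketch below uses a small stand-in data frame built from the first two rows printed above. The column names are an assumption, read off the two header rows shown in the question's output, and are not taken from the page itself. as.character() is applied first so the conversion also behaves if a column happens to be a factor.

```r
# Stand-in for the first two rows of `y` (values copied from the output above)
y_demo <- data.frame(
  V1 = c("si:ch73-269m14.4", "eef2a.1"),
  V2 = c("ENSDARG00000086612.1", "ENSDARG00000042094.1"),
  V3 = c("72h,Adult,Brain,Testis", "24h,72h,Adult,Brain,Ovary,Testis"),
  V4 = c("7", "3"), V5 = c("0", "1"), V6 = c("7", "2"), V7 = c("0", "0"),
  V8 = c("dre-miR-430b", "dre-miR-430b"),
  V9 = c("-1.63", "-0.80"),
  V10 = c("Sites in UTR", "Sites in UTR"),
  stringsAsFactors = FALSE
)

# Assumed names, inferred from the two header rows in the question
names(y_demo) <- c("target_gene", "representative_utr", "utr_expression_profile",
                   "sites_total", "sites_8mer", "sites_7mer_m8", "sites_7mer_1a",
                   "representative_mirna", "total_context_score", "links")

# Convert the site counts to integer and the score to numeric
count_cols <- c("sites_total", "sites_8mer", "sites_7mer_m8", "sites_7mer_1a")
y_demo[count_cols] <- lapply(y_demo[count_cols],
                             function(col) as.integer(as.character(col)))
y_demo$total_context_score <- as.numeric(as.character(y_demo$total_context_score))

str(y_demo)
```

The same renaming and conversion should carry over to the full y, provided it really has these ten columns in this order.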

This is great. I also found this useful tutorial.