R: How do I correctly extract a table with two header rows from a web page? (r, xml, rvest)

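I want to download a table from a web page. The expected output was created by copying and pasting the content into Excel, typing the column names by hand, and exporting as txt.

I tried to import this table from HTML into R with rvest or XML, but failed.

rvest attempt: in Chrome I got the table node's XPath via right-click -> Inspect. I then tried to scrape the table with the following code, but got a table that contains only the headers: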

library(rvest)
ts.url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="
ts.page <- read_html(ts.url)
results <- html_table(html_node(ts.page, xpath='//*[@id="restable"]'), fill = T)

# > results
#   Target gene Representative 3' UTR 3' UTR expression profile All sites All sites All sites All sites
# 1 Target gene Representative 3' UTR 3' UTR expression profile     total All sites All sites All sites
#   Repre- sentative miRNA Total context+ score Links to sites in UTRs
# 1                   8mer              7mer-m8                7mer-1A

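My question: how do I correctly import the table from this web page with rvest or XML? (As long as the table content is retrieved, the headers do not matter.)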

I think this will get you very close to what you want. Nobody likes tables with uneven header levels, ha.

library(rvest)

url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="


dat <- read_html(url)

# extract the "td" elements from table
x <- html_node(dat, xpath = '//*[@id="restable"]') %>%
    html_nodes("td") %>%
    html_text()


# put these in a character matrix -- be careful manually setting number of columns
my_matrix <- matrix(x, ncol = 10, byrow = TRUE)

# put these in a dataframe if you prefer that
my_df <- data.frame(my_matrix, stringsAsFactors = FALSE)
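
The warning about setting the number of columns by hand can be made concrete: `matrix()` only warns and recycles values when the cell count does not fit, so a silent misalignment is possible. Below is a minimal sketch of a stricter reshape, assuming the cell texts are already in a character vector. The helper `cells_to_df`, the column names, and the toy values are all hypothetical and not part of rvest or the original answer.

```r
# Hypothetical helper: reshape a flat vector of cell texts into a data frame,
# failing loudly if the cell count does not divide evenly into rows.
cells_to_df <- function(cells, col_names) {
  n_col <- length(col_names)
  if (length(cells) %% n_col != 0) {
    stop("number of cells (", length(cells),
         ") is not a multiple of ", n_col, " columns")
  }
  m <- matrix(cells, ncol = n_col, byrow = TRUE)
  df <- data.frame(m, stringsAsFactors = FALSE)
  names(df) <- col_names
  df
}

# Toy input standing in for the html_text() result (values are made up)
toy_cells <- c("geneA", "7", "-1.63",
               "geneB", "3", "-0.80")
toy_df <- cells_to_df(toy_cells, c("gene", "total_sites", "score"))

# Cells come back as text, so convert numeric columns explicitly
toy_df$total_sites <- as.integer(toy_df$total_sites)
toy_df$score <- as.numeric(toy_df$score)
```

For the real table the same idea would apply with ten column names of your choosing; stopping on a length mismatch is a design choice that trades convenience for an early signal that the page layout differs from what was assumed.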

The rows have a closing but not an opening <tr> tag, so you can add the opening tags, and then readHTMLTable should work.

library(XML)

# ts.url is the URL defined in the question
x <- readLines(ts.url)
# prepend the missing <tr> to every line that starts with <td>
x <- gsub("^<td>", "<tr><td>", x)
y <- readHTMLTable(x, which=3, skip.rows=1)
dim(y)
# [1] 4184   10
head(y)
#                 V1                   V2                               V3 V4 V5 V6 V7
# 1 si:ch73-269m14.4 ENSDARG00000086612.1           72h,Adult,Brain,Testis  7  0  7  0
# 2          eef2a.1 ENSDARG00000042094.1 24h,72h,Adult,Brain,Ovary,Testis  3  1  2  0
# 3 WFIKKN2 (2 of 2) ENSDARG00000059139.1           72h,Adult,Brain,Testis  3  1  2  0
# 4  si:ch211-59h6.1 ENSDARG00000013022.1 24h,72h,Adult,Brain,Ovary,Testis  3  1  2  0
# 5    RSF1 (3 of 3) ENSDARG00000074737.1                        24h,Adult  6  1  4  1
# 6           wnt2ba ENSDARG00000005050.1           72h,Adult,Ovary,Testis  3  1  2  0
#             V8    V9          V10
# 1 dre-miR-430b -1.63 Sites in UTR
# 2 dre-miR-430b -0.80 Sites in UTR
# 3 dre-miR-430b -0.76 Sites in UTR
# 4 dre-miR-430b -0.68 Sites in UTR
# 5 dre-miR-430b -0.67 Sites in UTR
# 6 dre-miR-430a -0.66 Sites in UTR
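
To see what the `gsub` repair does without fetching the page, here is a small base-R sketch on made-up HTML lines. The toy markup and the helper `count_tag` are assumptions for illustration; the real page may differ, and the pattern only helps when each data row begins at the start of a line with `<td>`.

```r
# Toy HTML lines (made up): data rows end with </tr> but never open with <tr>
toy_lines <- c(
  "<table>",
  "<tr><th>gene</th><th>score</th></tr>",
  "<td>geneA</td><td>-1.63</td></tr>",
  "<td>geneB</td><td>-0.80</td></tr>",
  "</table>"
)

# Hypothetical helper: count occurrences of a fixed string across all lines
count_tag <- function(lines, tag) {
  hits <- gregexpr(tag, lines, fixed = TRUE)
  sum(vapply(hits, function(h) sum(h > 0), integer(1)))
}

# Same substitution as in the answer: add <tr> where a line starts with <td>
fixed_lines <- gsub("^<td>", "<tr><td>", toy_lines)

opens_before <- count_tag(toy_lines, "<tr>")
opens_after  <- count_tag(fixed_lines, "<tr>")
closes       <- count_tag(fixed_lines, "</tr>")
```

After the substitution the number of opening and closing row tags matches, which is the condition a table parser needs in order to recognise each row.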

This is great. I also found this useful tutorial.