How can I use R (the RCurl/XML packages?!) to scrape this web page?


I have a (somewhat complex) web-scraping challenge that I wish to accomplish, and I would love some direction (to whatever level you feel like sharing). Here goes:

I would like to go through all of the "species pages" on this site: http://gtrnadb.ucsc.edu/

So for each of them, I will go to:

  • The species page link (for example: )
  • Then the "Secondary Structures" page link (for example: )
  • Inside that link, I want to scrape the data on the page, so that I will end up with a long list containing this data (for example):

    chr.trna3 (1-77)   Length: 77 bp
    Type: Ala   Anticodon: CGC at 35-37 (35-37)   Score: 93.45
    Seq: GGGCCGGTAGCTGCGGAAGAGCCCGCCCTCGCACGGCGGAGCCCCGGGTTCAATCCCGGCCGGTCCACCA
    Str: >>>>>>>>>>>>>>>>>>>..>

This is an interesting problem, and I agree that R is cool, but somehow I find R a bit cumbersome for this purpose. I tend to prefer getting the data in an intermediate plain-text form first, so that I can verify the data is correct at every step... RCurl is very useful once the data is ready in its final form, or for uploading data somewhere.

Simplest, in my opinion, would be to (on linux/unix/mac, or in cygwin) mirror the entire site (using wget), take the files named /-structs.html, sed or awk out the data you want, and format it for reading into R.
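
A minimal sketch of that route (the wget flags, the accept pattern, and the local mirror directory name are my assumptions, not settings from the answer):

    # Mirror just the *-structs.html pages; run once from a shell, or via system().
    # Flags: -r recurse, -np don't ascend to the parent dir, -A accept pattern.
    system("wget -r -np -A '*-structs.html' http://gtrnadb.ucsc.edu/")

    # wget writes the mirror under a directory named after the host; collect the
    # structure pages and read them in for the sed/awk-style cleaning step
    struct_files <- list.files("gtrnadb.ucsc.edu", pattern = "-structs\\.html$",
                               recursive = TRUE, full.names = TRUE)
    raw_pages <- lapply(struct_files, readLines)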

I'm sure there are plenty of other ways to do it as well.

Tal, you could use R and the XML package to do this, but (damn) that is some poorly formed HTML you are trying to parse. In fact, in most cases you would want to be using the readHTMLTable() function.
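
If the HTML were well formed, that would usually be a one-liner; a minimal sketch (the URL is a placeholder, not one of the pages discussed here):

    library(XML)

    # readHTMLTable() walks every <table> node in a document and returns a list
    # of data frames -- which won't help with the malformed, table-less HTML here
    tables <- readHTMLTable("http://example.com/some-page.html")
    str(tables)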

Given this ugly HTML, however, we will have to use the RCurl package to pull the raw HTML and create some custom functions to parse it. This problem has two components:

  • Get all of the genome URLs from the base web page (http://gtrnadb.ucsc.edu/) using the getURLContent() function in the RCurl package and some regex magic :-)
  • Then take that list of URLs, scrape the data you are looking for, and stick it into a data.frame.

So, here goes...

    library(RCurl)
    
    ### 1) First task is to get all of the web links we will need ##
    base_url<-"http://gtrnadb.ucsc.edu/"
    base_html<-getURLContent(base_url)[[1]]
    links<-strsplit(base_html,"a href=")[[1]]
    
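    # Keep an "a href=" fragment only if it looks like a genome directory:
    # the name contains an upper-case letter and carries no "#" anchor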
    get_data_url<-function(s) {
        u_split1<-strsplit(s,"/")[[1]][1]
        u_split2<-strsplit(u_split1,'\\"')[[1]][2]
        ifelse(grep("[[:upper:]]",u_split2)==1 & length(strsplit(u_split2,"#")[[1]])<2,return(u_split2),return(NA))
    }
    
    # Extract only those elements that are relevant
    genomes<-unlist(lapply(links,get_data_url))
    genomes<-genomes[which(is.na(genomes)==FALSE)]
    
    ### 2) Now, scrape the genome data from all of those URLS ###
    
    # This requires two complementary functions that are designed specifically
    # for the UCSC website. The first parses the data from a -structs.html page
    # and the second collects that data into a multi-dimensional list
    parse_genomes<-function(g) {
        g_split1<-strsplit(g,"\n")[[1]]
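        # Keep lines 2-5 of the <PRE> chunk: the ID/length line, the
        # type/anticodon line, the Seq line, and the Str line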
        g_split1<-g_split1[2:5]
        # Pull all of the data and stick it in a list
        g_split2<-strsplit(g_split1[1],"\t")[[1]]
        ID<-g_split2[1]                             # Sequence ID
        LEN<-strsplit(g_split2[2],": ")[[1]][2]     # Length
        g_split3<-strsplit(g_split1[2],"\t")[[1]]
        TYPE<-strsplit(g_split3[1],": ")[[1]][2]    # Type
        AC<-strsplit(g_split3[2],": ")[[1]][2]      # Anticodon
        SEQ<-strsplit(g_split1[3],": ")[[1]][2]     # Sequence
        STR<-strsplit(g_split1[4],": ")[[1]][2]     # String
        return(c(ID,LEN,TYPE,AC,SEQ,STR))
    }
    
    # This will be a high dimensional list with all of the data, you can then manipulate as you like
    get_structs<-function(u) {
        struct_url<-paste(base_url,u,"/",u,"-structs.html",sep="")
        raw_data<-getURLContent(struct_url)
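        # Each tRNA entry sits in its own <PRE> block; split on the tag and
        # drop the first two chunks, which come before the entries themselves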
        s_split1<-strsplit(raw_data,"<PRE>")[[1]]
        all_data<-s_split1[seq(3,length(s_split1))]
        data_list<-lapply(all_data,parse_genomes)
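        # Append the genome name (u) to each entry so it ends up as the NAME column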
        for (d in 1:length(data_list)) {data_list[[d]]<-append(data_list[[d]],u)}
        return(data_list)
    }
    
    # Collect data, manipulate, and create data frame (with slight cleaning)
    genomes_list<-lapply(genomes[1:2],get_structs) # Limit to the first two genomes (Bdist & Spurp); a full scrape will take a LONG time
    genomes_rows<-unlist(genomes_list,recursive=FALSE) # The recursive=FALSE saves a lot of work; now we can just do a straightforward manipulation
    genome_data<-t(sapply(genomes_rows,rbind))
    colnames(genome_data)<-c("ID","LEN","TYPE","AC","SEQ","STR","NAME")
    genome_data<-as.data.frame(genome_data)
    genome_data<-subset(genome_data,ID!="</PRE>")   # Some malformed web pages produce bad rows, but we can remove them
    
    head(genome_data)
    
                                       ID   LEN TYPE                           AC                                                                       SEQ
    1     Scaffold17302.trna1 (1426-1498) 73 bp  Ala     AGC at 34-36 (1459-1461) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTTTCCA
    2   Scaffold20851.trna5 (43038-43110) 73 bp  Ala   AGC at 34-36 (43071-43073) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
    3   Scaffold20851.trna8 (45975-46047) 73 bp  Ala   AGC at 34-36 (46008-46010) TGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
    4     Scaffold17302.trna2 (2514-2586) 73 bp  Ala     AGC at 34-36 (2547-2549) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACAGGGATCGATGCCCGGGTTCTCCA
    5 Scaffold51754.trna5 (253637-253565) 73 bp  Ala AGC at 34-36 (253604-253602) CGGGGGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTCCTCCA
    6     Scaffold17302.trna4 (6027-6099) 73 bp  Ala     AGC at 34-36 (6060-6062) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGAGTTCTCCA
                                                                            STR  NAME
    1 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
    2 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
    3 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
    4 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>.>>>.......<<<.<<<<<<<<. Spurp
    5 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
    6 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<......>>>>.......<<<<.<<<<<<<. Spurp
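
A small usage sketch, not part of the original answer: genome_data is an ordinary data.frame, so standard subsetting applies (TYPE and SEQ are columns created above; "Ala" is one of the values visible in the output):

    # Tally the scraped tRNAs by isotype...
    table(genome_data$TYPE)

    # ...or keep only the alanine tRNAs and look at their sequences
    ala_trnas <- subset(genome_data, TYPE == "Ala")
    head(ala_trnas$SEQ)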

Just tried it using Mozenda. After roughly 10 minutes I had an agent that could scrape the data as you describe. You may well be able to get all of this data using just their free trial. Coding is fun if you have the time, but it looks like you may already have a solution coded for you. Nice job, Drew.

@Tal, if I may ask: is this legal? And if it is, wouldn't it be easier to arrange periodic access to the database with UCSC?

Hi Tal, either way, try dropping them a line. You might find them quite amenable; they may not even realize that people want to use the data the way you do. They might even be interested in how you'd like to use it.

Ahh, thank you. I hadn't searched the possible entries thoroughly enough.

Glad you got it working! The help for readHTMLTable provides methods for <pre> tables. Something like: u = ""; h = htmlParse(u); p = getNodeSet(h, "//pre"); con = textConnection(xmlValue(p[[2]])); readLines(con, n = 4)[-1] might help.
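
A runnable version of that last suggestion might look like the sketch below; it reuses base_url and genomes from the answer above to fill in the URL that the comment left blank, which is my assumption about what was intended:

    library(XML)

    # Point at one of the -structs.html pages, built the same way as in get_structs()
    u <- paste(base_url, genomes[1], "/", genomes[1], "-structs.html", sep = "")

    h   <- htmlParse(u)                       # parse the malformed HTML leniently
    p   <- getNodeSet(h, "//pre")             # all <PRE> nodes on the page
    con <- textConnection(xmlValue(p[[2]]))   # read one node's text like a file
    readLines(con, n = 4)[-1]                 # drop line 1, keep the entry's data lines
    close(con)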