Scraping temporary web queries in R

r web-scraping

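I have been working on an R script that automatically scrapes text from the Congressional Record hosted by THOMAS (thomas.loc.gov).

THOMAS lets users "browse" the daily issues of the Congressional Record. I think the best approach is to follow these steps:

1. List the links to each daily issue (for example, the issue published in the House on October 19, 1990).

2. For each link, get all the links to the individual "articles" (the articles are reached through a temporary search query that stays active for only 30 minutes).

3. Loop through the articles and follow the link to the "Printer Friendly" version. This seems to be the best option for various reasons. For example, if you look at the October 19, 1990 House issue mentioned above and click on item 37, "Conference Report on H.R. 5229, Department of Transportation and Related Agencies Appropriations Act, 1991", you are taken to a new page that lists links to sections of the debate on H.R. 5229. The "Printer Friendly" version lists all of the text in one place.

4. Scrape the text from the printer-friendly page.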
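
Steps 2 and 3 have to cope with article links that expire. A minimal sketch of that control flow follows. The names `scrape_issue`, `fetch_links` and `fetch_article` are hypothetical: the two fetchers are supplied by the caller, a `NULL` return is assumed to signal an expired query, and the refreshed link list is assumed to keep the same order as the old one.

```r
# Hypothetical helper: fetch_links() and fetch_article() are stand-ins passed
# in by the caller; they are not functions from the XML package or from THOMAS.
scrape_issue <- function(issue, fetch_links, fetch_article) {
  links <- fetch_links(issue)            # step 2: temporary article links
  texts <- character(length(links))
  for (k in seq_along(links)) {
    page <- fetch_article(links[k])      # step 3: follow one article link
    if (is.null(page)) {
      # Assumed convention: NULL means the temporary query has expired,
      # so rebuild the link list and retry this article once.
      links <- fetch_links(issue)
      page <- fetch_article(links[k])
    }
    texts[k] <- if (is.null(page)) NA_character_ else page
  }
  texts
}
```

Passing the fetchers in as arguments keeps the retry logic separate from the network calls, so it can be tried out with fake functions before pointing it at the live site.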

I am hung up on step 3. For some reason, it always returns a "404 Not Found" error.

The following code is most of what I use to do this:

setwd("U:/Congressional Record")
require(XML)

root <- "http://thomas.loc.gov/"
url <- "http://thomas.loc.gov/home/Browse.php?&n=Issues&c=101"

txt <- NULL ##container for scraped text
for (j in links) { ##begin loop through issues
    ##append a break for each day
    txt <- c(txt, '*#*#start new day#*#*', j)
    u <- paste(root, j, sep="")
    doc <- htmlParse(u)
    ##pull out the links
    l <- as.vector(xpathSApply(doc, "//a/@href"))
    ##find subset only the links that lead to text from CR
    s <- grep("query", l)

    ##get a list of titles for each entry
    t <- readLines(u)
    ##clean it up a little
    t <- gsub('</*\\w*/*>', '', t, perl=TRUE)
    ##find the titles
    tInds <- grep(title.ex, t)
    tEnds <- regexpr('<', t[tInds])
    titles <- substr(t[tInds], 1, tEnds-2)

    for (k in 1:length(s)) { ##begin loop through articles of the daily issue
        u <- paste(root, l[s[k]], sep='')
        t <- readLines(u)
        doc2 <- htmlParse(u)
        as.vector(xpathSApply(doc2, "//a/@href"))
        ##refresh the search if it has taken too long
        timed <- grep(timeout, t)
        if (length(timed)>0) {
            u <- paste(root, j, sep="")
            doc <- htmlParse(u)
            ##pull out the links
            l <- as.vector(xpathSApply(doc, "//a/@href"))
            u <- paste(root, l[k], sep='')
            t <- readLines(u)
        }

        ##find the 'printer friendly' link
        ##for some reason the printer link doesn't work when I try to
        ##automatically scrape it from the site...

        i <- grep('Printer Friendly Display', t)
        ##extract the link and follow it
        pr <- regexpr(link.ex, t[i], perl=TRUE)
        li <- paste(root, 
            substr(t[i], pr[1]+8, pr[1]+attr(pr, 'match.length')[1]-7),
            sep='')
        t <- readLines(li)

        ##clean the text
        t <- gsub('</*\\w*/*>', '', t, perl=TRUE)

        ##code to scrape the text...

    } ##end loop through articles

} ##end loop through issues

This line is your problem:

li <- paste(root, 
            substr(t[i], pr[1]+8, pr[1]+attr(pr, 'match.length')[1]-7),
            sep='')

It gives you:

> li
[1] "http://thomas.loc.gov/gi-bin/query/C?r101:./temp/~r101PhW9pi"

when you need (with "cgi-bin" rather than "gi-bin"):

"http://thomas.loc.gov/cgi-bin/query/C?r101:./temp/~r101PhW9pi"

Some other notes on the code:

Thanks for your comments! They were really helpful (I can't believe I didn't catch that silly mistake).

No problem; it happens to all of us. If you found this helpful, I would appreciate an upvote and/or accepting the answer.