Extracting certain parts of a web page with Python

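I have a data retrieval/entry project in which I want to extract a certain part of a web page and store it in a text file. I have a text file of URLs, and the program should extract the same part of the page for each URL.

Specifically, the program copies the legal statute that follows "Legal Authority:" on the page. On some pages only one statute is listed there. Other URLs, however, lead to pages that list several separate statutes.

My code works for the first kind of page: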

from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

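for line in input:
    # strip the trailing newline before requesting the URL
    pg = urlopen(line.strip()).read()
    statute = get_legal(pg)
    # write one statute per line of the output file
    output.write(statute + "\n")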

I recommend using BeautifulSoup to parse and search the HTML. It is much easier than basic string searching.

Here is an example (the requests and BeautifulSoup listing further down) that pulls the statutes from the page. It uses the requests library to fetch the page content; that is just a recommended, very easy-to-use alternative to urlopen.

findAll is still the tool you will probably want to use to filter the HTML.
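They also provide the data as XML. If you think you cannot download that many files (or the other end might not appreciate so many HTTP GET requests), I suggest asking their administrators whether they would be willing to give you a different way to access the data.

I have done this twice (with scientific databases). In one case the sheer size of the data set prohibited a download; they ran my SQL query and emailed me the results (having earlier offered to send a DVD or hard disk). In the other case I could have sent several million HTTP requests to a web service (and they were fine with that), each returning roughly 1 kB. That would have taken a long time and been quite inconvenient (it needs some error handling, because some of those requests will always time out) and, for some reason, would not have been atomic. I was sent a DVD instead.

I imagine the Office of Management and Budget might be similarly accommodating.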
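The remark about timeouts suggests wrapping each request in a small retry loop. Below is a minimal sketch of that idea, written for Python 3 with the standard library only. The name `fetch_with_retries` and its parameters (`attempts`, `timeout`, `pause`, `opener`) are hypothetical and not taken from the answer; `attempts` is assumed to be at least 1.

```python
import time
from urllib.request import urlopen


def fetch_with_retries(url, attempts=3, timeout=10, pause=1.0, opener=urlopen):
    """Return the body of `url`, retrying when a request fails or times out.

    `opener` defaults to urllib's urlopen; it is a parameter only so that
    the function can be tried out without network access.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except OSError as error:
            # URLError and socket timeouts are both subclasses of OSError.
            last_error = error
            if attempt < attempts - 1:
                time.sleep(pause)  # wait a little before the next try
    # Every attempt failed: re-raise the last error we saw.
    raise last_error
```

With several hundred URLs, a fixed pause between requests is also a simple way to stay polite towards the server.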
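The pages you are looking at provide these handy "Download RIN Data in XML" links. Whatever a RIN is, that gives you some clean XML. Can't you use that instead?

With Python's ElementTree library and @tiwo's suggestion, parsing the XML should be dead simple.

@SimpleJ I just noticed the XML links, thanks. But it looks like I would need to download each XML file, and I have hundreds of unique RINs to process. Is there Python code that can download the XML efficiently?

Thanks, that looks like a useful module.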
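To illustrate the ElementTree suggestion, here is a minimal Python 3 sketch. The structure of the site's XML is not shown anywhere in this thread, so the element name `LEGAL_AUTHORITY` is purely a placeholder assumption; it would have to be replaced with whatever tag the downloaded files really use. The function name `statutes_from_xml` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET


def statutes_from_xml(xml_text, tag="LEGAL_AUTHORITY"):
    """Return the text of every element named `tag` in an XML document.

    "LEGAL_AUTHORITY" is a hypothetical element name, not the real schema.
    """
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so nesting depth does not matter.
    return [el.text.strip() for el in root.iter(tag)
            if el.text and el.text.strip()]
```

Because the function takes the XML as text, it can be combined with any download method, one file per RIN.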
def get_legal(page):
    # this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while (end_link != '</a>&nbsp;'):
        start_legal = page.find('">', start_link+1)

        end_link = page.find('<', start_legal+1)
        end2 = page.find('</a>&nbsp;', end_link+1)
        legal += page[start_legal+2: end_link] 
        if 
        break
    return legal
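
The loop above is left unfinished. For comparison, here is a minimal Python 3 sketch of how a pure string-search approach could collect several statutes. It assumes, based on the comments in the BeautifulSoup example, that the statutes are `<a>` links inside the same `<td>` cell as the "Legal Authority:" label; the name `get_legal_all` is hypothetical.

```python
def get_legal_all(page):
    """Return the text of every link that follows 'Legal Authority:'
    inside the same table cell (assumed markup, see lead-in)."""
    start = page.find('Legal Authority:')
    if start == -1:
        return []
    # Assumption: the list of statutes ends where the enclosing cell closes.
    cell_end = page.find('</td>', start)
    if cell_end == -1:
        cell_end = len(page)
    statutes = []
    pos = start
    while True:
        open_link = page.find('<a ', pos, cell_end)
        if open_link == -1:
            break  # no more links in this cell
        text_start = page.find('>', open_link, cell_end)
        if text_start == -1:
            break
        text_end = page.find('</a>', text_start, cell_end)
        if text_end == -1:
            break
        statutes.append(page[text_start + 1:text_end].strip())
        pos = text_end + len('</a>')
    return statutes
```

Returning a list rather than one concatenated string keeps the statutes separate, so they can be written one per line.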

import requests
from BeautifulSoup import BeautifulSoup

# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)

# parse the html
html = BeautifulSoup(response.content)

# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})


def fetch_parent_tag(tags):
    # fetch the parent <td> tag of the first <a> tag
    # whose "previous sibling" is the <b>Legal Authority:</b> tag.
    for tag in tags:
        sibling = tag.findPreviousSibling()
        if not sibling:
            continue
        if sibling.getText() == 'Legal Authority:':
            return tag.findParent()

# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')

for tag in tags_you_want:
    print 'statute: ' + tag.getText()
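
If installing BeautifulSoup is not an option, roughly the same idea can be sketched in Python 3 with the standard library's `html.parser`. This is a different technique from the listing above, and it relies on the same assumption about the markup: the statute links sit in the same `<td>` as the "Legal Authority:" label. The names `LegalAuthorityParser` and `extract_statutes` are hypothetical.

```python
from html.parser import HTMLParser


class LegalAuthorityParser(HTMLParser):
    """Collect the text of <a> tags that share a <td> with 'Legal Authority:'."""

    def __init__(self):
        super().__init__()
        self.in_target_cell = False  # True after the label, until </td>
        self.in_link = False         # True while inside a wanted <a> tag
        self.statutes = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.in_target_cell:
            self.in_link = True
            self.statutes.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "td":
            self.in_target_cell = False

    def handle_data(self, data):
        if data.strip() == "Legal Authority:":
            self.in_target_cell = True
        elif self.in_link:
            self.statutes[-1] += data


def extract_statutes(html):
    """Return the statutes listed after 'Legal Authority:' in `html`."""
    parser = LegalAuthorityParser()
    parser.feed(html)
    return [s.strip() for s in parser.statutes]
```

Nested tables or a label split across several tags would defeat this simple state machine, which is one reason a full parser such as BeautifulSoup is the more comfortable choice.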