Python BeautifulSoup: screen-scraping lists with improperly nested <ul>s


python web-scraping


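I am (very) new to BeautifulSoup, and for the past three days I have been trying to get a list of churches from a diocesan directory page.

The data does not appear to be properly nested; it is marked up purely for presentation. Supposedly, the hierarchy is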

Parishes
    District
    (data)
        Vicariate
        (data)
            Church
            (data)

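However, what I see is that each church entry starts with a bullet, and entries are separated by two line breaks. The field names I am after are italicized and separated from the actual data by a ":". Each unit entry (District | Vicariate | Parish) may have one or more data fields.

So far I can tease out some of the data, but I cannot get the name of the entity to show up.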

soup=BeautifulSoup(page)
for e in soup.table.tr.findAll('i'):
    print e.string, e.nextSibling

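Ultimately, I would like to transform the data into columns:

District, Vicariate, Parish, Address, Phone, Titular, Parish Priest, ...

I would appreciate a nudge in the right direction.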

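Unfortunately, this is going to be a bit complicated, because the format holds some of the data you need without any clear markup.

Data model

Also, your understanding of the nesting is not quite right. The actual Catholic church structure (not this document's structure) is more like: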

District (also called deanery or vicariate. In this case they all seem to be Vicariates Forane.)
    Cathedral, Parish, Oratory

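Note that although a parish usually belongs to a district/vicariate, nothing requires it to. I think the document is saying that everything listed after a District belongs to that district, but you cannot be sure.

There is also one entry that is not a church but a community (San Lorenzo Filipino-Chinese Community). Such a community has no distinct identity or governance in the church (i.e., it is not a building); rather, it is a non-territorial group of people that a chaplain is assigned to care for.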

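Parsing

I think you should take an incremental approach:

1. Find all li elements; each one is an "item".
2. The name of the item is its first text node.
3. Find all i elements: these are the keys, attribute names, column headings, etc.
4. All the text up to the next i (separated by br) is the value for that key.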
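The incremental approach above can be sketched on a small inline sample. This is only an illustration under stated assumptions: it uses Python 3 and the standard library's html.parser instead of BeautifulSoup, the SAMPLE markup is hypothetical and merely imitates the layout described in the question, and ItemParser and parse_items are names made up for this sketch.

```python
from html.parser import HTMLParser

# Hypothetical markup imitating the layout described in the question;
# the real page may differ.
SAMPLE = """
<li>Example District<br>
<i>Address:</i> 1 Example Street<br>
<i>Phone:</i> 000-0000<br><br>
<li>Example Parish<br>
<i>Address:</i> 2 Example Road<br>Example City<br>
<i>Parish Priest:</i> Fr. Example<br><br>
"""


class ItemParser(HTMLParser):
    """Collect one list of (key, value) pairs per <li>."""

    def __init__(self):
        super().__init__()
        self.items = []       # one list of pairs per <li>
        self.in_key = False   # True while inside <i>...</i>
        self.key = None       # key that the following text belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # Step 1: every <li> starts a new item, nested or not.
            self.items.append([])
            self.key = None
        elif tag == "i":
            # Step 3: text inside <i> is a key.
            self.in_key = True

    def handle_endtag(self, tag):
        if tag == "i":
            self.in_key = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.items:
            return  # whitespace, or text before the first <li>
        current = self.items[-1]
        if self.in_key:
            self.key = text.rstrip(":")
        elif not current:
            # Step 2: the first text node of an item is its name.
            current.append(("_name", text))
        elif self.key is not None:
            # Step 4: text up to the next <i> is a value of the current key.
            current.append((self.key, text))
        # Text after the name but before the first <i> is ignored here.


def parse_items(html):
    parser = ItemParser()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.items


if __name__ == "__main__":
    for item in parse_items(SAMPLE):
        print(item)
```

Because each li start tag simply opens a new item, the sketch does not care whether a tree-building parser would have nested the li elements. A key with several text nodes (the two-line address in the sample) produces several pairs with the same key.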

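A particular problem with this page is that its html is very bad, and you need to use MinimalSoup to parse it correctly. In particular, BeautifulSoup thinks the li elements are nested, because there is no ol or ul anywhere in the document.

This code gives you a list of items, where each item is a list of ('key', 'value') tuples.

Once you have this data structure, you can normalize, transform, nest, etc. however you like, and leave the HTML behind.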

from BeautifulSoup import MinimalSoup
import urllib

fp = urllib.urlopen("http://www.ucanews.com/diocesan-directory/html/ordinary-of-philippine-cagayandeoro-parishes.html")
html = fp.read()
fp.close()

soup = MinimalSoup(html)

root = soup.table.tr.td

items = []
currentdistrict = None
# this loops through each "item"
for li in root.findAll(lambda tag: tag.name=='li' and len(tag.attrs)==0):
    attributes = []
    parishordistrict = li.next.strip()
    # look for string "district" to determine if district; otherwise it's something else under the district
    if parishordistrict.endswith(' District'):
        currentdistrict = parishordistrict
        attributes.append(('_isDistrict',True))
    else:
        attributes.append(('_isDistrict',False))

    attributes.append(('_name',parishordistrict))
    attributes.append(('_district',currentdistrict))

    # now loop through all attributes of this thing
    attributekeys = li.findAll('i')

    for i in attributekeys:
        key = i.string # normalize as needed. Will be 'Address:', 'Parochial Vicar:', etc
        # now continue among the siblings until we reach an <i> again.
        # these are "values" of this key
        # if you want a nested key:[values] structure, you can use a dict,
        # but beware of multiple <i> with the same name in your logic
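        # named 'sibling' so the builtin next() is not shadowed
        sibling = i.nextSibling
        while sibling is not None and getattr(sibling, 'name', None) != 'i':
            # only text nodes lack a 'name' attribute; tags such as <br> are skipped
            if not hasattr(sibling, 'name') and getattr(sibling, 'string', None):
                value = sibling.string.strip()
                if value:
                    attributes.append((key, value))
            sibling = sibling.nextSibling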
    items.append(attributes)

from pprint import pprint
pprint(items)
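As an illustration of the "normalize, transform" step, here is a minimal Python 3 sketch that collapses each item's tuples into one row and writes the rows as CSV. The items literal is hypothetical data in the same shape the code above produces, to_row and to_csv are names invented for this sketch, and joining repeated keys with "; " is just one possible choice.

```python
import csv
import io
from collections import defaultdict

# Hypothetical data in the same shape as `items` above.
items = [
    [("_isDistrict", True), ("_name", "Example District"),
     ("_district", "Example District"), ("Address:", "1 Example Street")],
    [("_isDistrict", False), ("_name", "Example Parish"),
     ("_district", "Example District"), ("Address:", "2 Example Road"),
     ("Address:", "Example City"), ("Parish Priest:", "Fr. Example")],
]


def to_row(attributes):
    """Collapse one item's (key, value) pairs into a flat dict."""
    grouped = defaultdict(list)
    for key, value in attributes:
        # Normalize 'Address:' to 'Address'; repeated keys are collected.
        grouped[key.rstrip(":").strip()].append(str(value))
    return {key: "; ".join(values) for key, values in grouped.items()}


def to_csv(items, columns):
    """Return CSV text with one line per non-district item."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns,
                            extrasaction="ignore", restval="")
    writer.writeheader()
    for attributes in items:
        if ("_isDistrict", True) in attributes:
            continue  # districts only supply the '_district' column
        writer.writerow(to_row(attributes))
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(items, ["_district", "_name", "Address", "Parish Priest"]))
```

Columns missing from an item are left empty, and keys not listed in columns are dropped, so the column list can be extended as more field names are found on the page.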

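Thank you, Francis. Please give me some time to study your code so that I can learn from it. By the way, when I mentioned the hierarchy, I meant the hierarchy in the context of the page. Thanks for the clarification.

Ah, in that case, the li nesting is only an artifact of how the BeautifulSoup parser interprets the bad html. Using the MinimalSoup parser turns the nested li elements into a flat list of li elements, which is how most browsers would build the DOM tree.

Francis, I made a few minor adjustments, but your code did all the (very) dirty work. You are right about the "pathologically bad" html. I thought so myself, but this may still be better than the first page I found. I appreciate your effort and your help with BeautifulSoup. Thank you very much.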