Retrieve all content between a closing and an opening HTML tag using Beautiful Soup


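I am parsing content using Python and Beautiful Soup and then writing it to a CSV file, and I have run into a problem getting a certain set of data. The data is run through an implementation of TidyHTML that I have crafted, and then other unneeded data is stripped out.

The issue is that I need to retrieve all of the data between a set of <h3> tags.

Sample data: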

<h3><a href="Vol-1-pages-001.pdf">Pages 1-18</a></h3>
<ul><li>September 13 1880. First regular meeting of the faculty;
 September 14 1880. Discussion of curricular matters. Students are
 debarred from taking algebra until they have completed both mental
 and fractional arithmetic; October 4 1880.</li><li>All members present.</li></ul>
 <ul><li>Moved the faculty henceforth hold regular weekkly meetings in the
 President's room of the University building; 11 October 1880. All
 members present; 18 October 1880. Regular meeting 2. Moved that the
 President wait on the property holders on 12th street and request
 them to abate the nuisance on their property; 25 October 1880.
 Moved that the senior and junior classes for rhetoricals be...</li></ul>
 <h3><a href="Vol-1-pages-019.pdf">Pages 19-33</a></h3>



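I need to retrieve all of the content between the first closing </h3> tag and the next opening <h3> tag. This should not be hard, but my thick head is not making the necessary connections. I can grab all of the <ul> tags, but that does not work because there is no one-to-one relationship between <h3> tags and <ul> tags.

The output I am looking to achieve is:

Pages 1-18|Vol-1-pages-001.pdf|content between the </h3> and <h3> tags

The first two parts have not been a problem, but the content between a set of tags is proving difficult for me.

My current code is as follows: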

import glob, re, os, csv
from BeautifulSoup import BeautifulSoup
from tidylib import tidy_document
from collections import deque

html_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1'
csv_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1\\archiveVol1.csv'

html_cleanup = {'\r\r\n':'', '\n\n':'', '\n':'', '\r':'', '\r\r': '', '<img src="UOSymbol1.jpg"    alt="" />':''}

for infile in glob.glob( os.path.join(html_path, '*.html') ):
    print "current file is: " + infile

    html = open(infile).read()

    for i, j in html_cleanup.iteritems():
            html = html.replace(i, j)

    #parse cleaned up html with Beautiful Soup
    soup = BeautifulSoup(html)

    #print soup
    html_to_csv = csv.writer(open(csv_path, 'a'), delimiter='|',
                      quoting=csv.QUOTE_NONE, escapechar=' ')  
    #retrieve the string that has the page range and file name
    volume = deque()
    fileName = deque()
    summary = deque()
    i = 0
    for title in soup.findAll('a'):
            if title['href'].startswith('V'):
             #print title.string
             volume.append(title.string)
             i+=1
             #print soup('a')[i]['href']
             fileName.append(soup('a')[i]['href'])
             #print html_to_csv
             #html_to_csv.writerow([volume, fileName])

    #retrieve the summary of each archive and store
    #for body in soup.findAll('ul') or soup.findAll('ol'):
    #        summary.append(body)
    for body in soup.findAll('h3'):
            body.findNextSibling(text=True)
            summary.append(body)

    #print out each field into the csv file
    for c in range(i):
            pages = volume.popleft()
            path = fileName.popleft()
            notes = summary
            if not summary: 
                    notes = "help"
            if summary:
                    notes = summary.popleft()
            html_to_csv.writerow([pages, path, notes])


If you are trying to extract the data between <ul><li> tags, lxml lets you do this with a CSSSelector:

import lxml.html
import urllib
data = urllib.urlopen('file:///C:/Users/ranveer/st.html').read()  # contains your html snippet
doc = lxml.html.fromstring(data)
elements = doc.cssselect('ul li')  # CSS path (found using the Firebug extension)
for element in elements:
    print element.text_content()

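After executing the above code you will get all of the text between the ul, li tags. It is much cleaner than Beautiful Soup.

If you plan to use lxml, you can also evaluate XPath expressions in the following way: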
import urllib
from lxml import etree
content = etree.HTML(urllib.urlopen("file:///C:/Users/ranveer/st.html").read())
# etree.HTML returns the root <html> element, so the first path is written as absolute
content_text = content.xpath("/html/body/h3[1]/a/@href | //ul[1]/li/text() | //ul[2]/li/text() | //h3[2]/a/@href")
print content_text

You can change the XPath according to your needs.
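
Selecting every ul on its own loses track of which heading each list belongs to. One way around that is to walk the headings' parent in document order and start a new group at each h3. The following is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it has no third-party dependencies, it needs well-formed markup (which the Tidy step is assumed to produce), and it assumes every h3 contains a link. The function name rows_between_headings, the wrapping <body> element, and the shortened sample text are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the sample data, wrapped in one root element
SAMPLE = """<body>
<h3><a href="Vol-1-pages-001.pdf">Pages 1-18</a></h3>
<ul><li>September 13 1880. First regular meeting of the faculty.</li><li>All members present.</li></ul>
<ul><li>Moved the faculty henceforth hold regular meetings.</li></ul>
<h3><a href="Vol-1-pages-019.pdf">Pages 19-33</a></h3>
<ul><li>Later entry.</li></ul>
</body>"""


def rows_between_headings(markup):
    """Return [pages, href, notes] rows, one per <h3>, in document order."""
    rows = []
    current = None  # row being filled; stays None until the first <h3>
    for child in ET.fromstring(markup):
        if child.tag == "h3":
            link = child.find("a")  # assumes each heading wraps a link
            current = [link.text, link.get("href"), ""]
            rows.append(current)
        elif current is not None:
            # Gather all nested text of this sibling and collapse whitespace
            text = " ".join(" ".join(child.itertext()).split())
            current[2] = (current[2] + " " + text).strip()
    return rows


rows = rows_between_headings(SAMPLE)

# Write the rows pipe-delimited, as in the question
buffer = io.StringIO()
csv.writer(buffer, delimiter="|", lineterminator="\n").writerows(rows)
print(buffer.getvalue())
```

Because the grouping is driven by document order rather than by counting lists, it does not matter how many ul elements sit between two headings.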

To extract the content between </h3> and <h3> tags:

from itertools import takewhile

h3s = soup('h3') # find all <h3> elements
for h3, h3next in zip(h3s, h3s[1:]):
  # get elements in between
  between_it = takewhile(lambda el: el is not h3next, h3.nextSiblingGenerator())
  # extract text
  print(''.join(getattr(el, 'text', el) for el in between_it))


The code assumes that all <h3> elements are siblings. If that is not the case, you could use h3.nextGenerator() instead of h3.nextSiblingGenerator().
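
One detail worth noting: zip(h3s, h3s[1:]) produces one pair fewer than there are headings, so whatever follows the final <h3> is never printed. A possible way to keep it is to pad the pairing with None using itertools.zip_longest. The sketch below demonstrates only the pairing logic; the list of (tag, text) tuples and the helper following() are hypothetical stand-ins for the parsed nodes and for the next-sibling generator, not part of Beautiful Soup.

```python
from itertools import dropwhile, islice, takewhile, zip_longest

# Hypothetical stand-in for the parsed sibling nodes: (tag, text) tuples
siblings = [
    ("h3", "Pages 1-18"),
    ("ul", "First regular meeting of the faculty."),
    ("ul", "Moved the faculty hold weekly meetings."),
    ("h3", "Pages 19-33"),
    ("ul", "Later entry."),
]


def following(nodes, node):
    """Yield the nodes after `node`, mimicking a next-sibling generator."""
    after = dropwhile(lambda n: n is not node, nodes)
    return islice(after, 1, None)  # skip `node` itself


h3s = [n for n in siblings if n[0] == "h3"]
sections = {}
# zip_longest pairs the final heading with None, so its content is kept;
# plain zip(h3s, h3s[1:]) would stop one heading early.
for h3, h3next in zip_longest(h3s, h3s[1:]):
    between = takewhile(lambda n: n is not h3next, following(siblings, h3))
    sections[h3[1]] = " ".join(text for _, text in between)

print(sections)
```

With real parsed nodes the same loop shape applies: no node is ever identical to None, so the takewhile for the last heading simply runs to the end of the siblings.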
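
Comments:

Try using this XPath expression: html/body/h3[1]/a/@href | //ul[1]/li/text() | //ul[2]/li/text() | //h3[2]/a/@href

No good, it did not return any results, but I did not know you could use XPath in findAll. I will play around with this. Thanks.

Why not try lxml, since BSoup is unmaintained, slow, and has an ugly API.

@RanRag: The maintainer says: "tl;dr: Use the 4.0 series instead. This page was originally written in March 2009. Since then, the 3.2 series has been released, replacing the 3.1 series, and development of the 4.x series has gotten underway. This page will remain up for historical purposes."

I tried the lxml library, and although it may be syntactically concise, it seems to just grab every ul element rather than the ul elements between the heading tags. I will keep working with the XPath input. The sample I provided is only a portion of the document, sorry if that was not clear, but as I iterate over the document, won't I get tripped up by multiple uls between heading tags?

Thanks for the help, RanRag. Using lxml's XPath does retrieve specific elements, but iteration does not seem to work well. XPath does not support ac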