Extracting multiple "next siblings" from HTML with BeautifulSoup
I have a set of HTML files that share the following structure:
<h1>ITEM NAME</h1>
<span class="standardLabel">Place of publication: </span>PLACENAME
<br /><span class="standardLabel">Publication dates: </span>DATE
<br /><span class="standardLabel">Notes: </span>NOTES
<br /><span class="standardLabel">Frequency: </span>FREQUENCY
The desired output is:
Title of file no. 1: About Town
Dungannon, Co. Tyrone
Title of file no. 10: Amárach: Guth na Gaeltachta
Dublin, Co. Dublin
Title of file no. 100: The Belfast Election
Belfast, Co. Antrim
You can use the CSS selector span:contains("...") to find the specific <span> tag and then take its .next_sibling. For example:
from bs4 import BeautifulSoup
txt = '''<h1>ITEM NAME</h1>
<span class="standardLabel">Place of publication: </span>PLACENAME
<br /><span class="standardLabel">Publication dates: </span>DATE
<br /><span class="standardLabel">Notes: </span>NOTES
<br /><span class="standardLabel">Frequency: </span>FREQUENCY'''
soup = BeautifulSoup(txt, 'html.parser')
title = soup.h1.text
place = soup.select_one('span:contains("Place of publication:")').next_sibling.strip()
dates = soup.select_one('span:contains("Publication dates:")').next_sibling.strip()
notes = soup.select_one('span:contains("Notes:")').next_sibling.strip()
freq = soup.select_one('span:contains("Frequency:")').next_sibling.strip()
print(title)
print(place)
print(dates)
print(notes)
print(freq)
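One caveat: recent versions of soupsieve (the selector engine behind BeautifulSoup's select_one) deprecate :contains in favour of the :-soup-contains spelling. A minimal sketch of the same lookup with the newer selector, assuming soupsieve 2.1 or later:

```python
from bs4 import BeautifulSoup

txt = '''<h1>ITEM NAME</h1>
<span class="standardLabel">Place of publication: </span>PLACENAME'''

soup = BeautifulSoup(txt, 'html.parser')
# :-soup-contains is the non-deprecated spelling of :contains in soupsieve 2.x
place = soup.select_one('span:-soup-contains("Place of publication:")').next_sibling.strip()
print(place)  # PLACENAME
```

The behaviour is identical; switching avoids the DeprecationWarning that :contains now emits.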
Using the code from Andrej Kesely's answer, I also added exception handling for missing attributes:
# import packages
from bs4 import BeautifulSoup
import os
from os.path import dirname, join

directory = "C:\\Users\\mobarget\\Google Drive\\ACADEMIA\\10_Data analysis_PhD\\NLI Newspaper DB"

# read downloaded HTML files
for infile in os.listdir(directory):
    filename = join(directory, infile)
    indata = open(filename, "r", encoding="utf-8", errors="ignore")
    contents = indata.read()
    soup = BeautifulSoup(contents, 'html.parser')
    newspaper = soup.find('h1')
    if newspaper:
        try:
            # read data from tags
            title = soup.h1.text
            place = soup.select_one('span:contains("Place of publication:")').next_sibling.strip()
            dates = soup.select_one('span:contains("Publication dates:")').next_sibling.strip()
            notes = soup.select_one('span:contains("Notes:")').next_sibling.strip()
            freq = soup.select_one('span:contains("Frequency:")').next_sibling.strip()
            # print results
            print("Title of file no.", str(infile), ": ", title)
            print(place)
            print(dates)
            print(notes)
            print(freq)
        # exception handling if attributes are missing
        except AttributeError:
            print("no data")
    else:
        continue
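The except AttributeError above discards the whole record as soon as any one label is missing. If you want the fields that do exist, one option is to look each label up individually and fall back to None when it is absent. A sketch, using a hypothetical get_field helper (not part of the original answer):

```python
from bs4 import BeautifulSoup

def get_field(soup, label):
    """Return the text following the labelled span, or None if the label is absent."""
    tag = soup.select_one(f'span:contains("{label}")')
    # Guard both a missing label and a label with nothing after it,
    # instead of letting .next_sibling raise AttributeError.
    if tag is None or tag.next_sibling is None:
        return None
    return tag.next_sibling.strip()

txt = '''<h1>ITEM NAME</h1>
<span class="standardLabel">Place of publication: </span>PLACENAME
<br /><span class="standardLabel">Frequency: </span>FREQUENCY'''

soup = BeautifulSoup(txt, 'html.parser')
print(get_field(soup, "Place of publication:"))  # PLACENAME
print(get_field(soup, "Notes:"))                 # None (label missing, no exception)
```

This way a file that lacks, say, the Notes field still yields its title, place, and dates.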
Thanks, the CSS selectors work perfectly. However, soup = BeautifulSoup(txt, 'html.parser') does not seem to work in Python 3: I get an error saying that "txt" is not defined, so I have to stick with "contents". @OnceUponATime Yes, I only defined txt for my example. If your program has contents, use that instead!
The example code prints:

ITEM NAME
PLACENAME
DATE
NOTES
FREQUENCY