Parsing HTML with Python and Beautiful Soup


I have several HTML documents that contain information of the following kind:

<td class="principal-col">
<div class="pr-person">
<div class="name"><span id="pr_person-icon" class="bullet-male-left"></span><span class="person-link">Thomas A /Dumpling/</span></div>
<table class="events" border="0">
<tr>
<td class="factLabel">event1:&nbsp;</td>
<td>
4 February 1940          
<br/>
</td>
</tr> 
<tr>
<td class="factLabel">event2:&nbsp;</td>
<td>
9 October 2002   
<br/>Laplata, Md
</td>
</tr>

Any pointers would be much appreciated.

The simplest way to find the first tag is the regular find method (select works too).
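The answer's original snippet did not survive on this page, so here is a minimal sketch of what such a lookup could look like. It assumes "the first tag" means the span holding the person's name; the SAMPLE string is a trimmed copy of the markup from the question, and html.parser is simply the parser chosen for this sketch.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumed to be representative).
SAMPLE = """
<td class="principal-col">
<div class="pr-person">
<div class="name"><span id="pr_person-icon" class="bullet-male-left"></span><span class="person-link">Thomas A /Dumpling/</span></div>
</div>
</td>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
name_tag = soup.find("span", class_="person-link")

# select_one() does the same job with a CSS selector.
same_tag = soup.select_one("span.person-link")

name = name_tag.get_text(strip=True)
print(name)
```

Both calls return None when the span is missing, so real documents probably need a check before calling get_text.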

In the CSS selector, you want the td tag that is the sibling of the td class="factLabel" tag.
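One way to express that sibling relationship, offered as a sketch and not as the answer's original code, is the adjacent sibling combinator (+). The SAMPLE string below is the events table from the question with its closing tag added.

```python
from bs4 import BeautifulSoup

# The events table from the question, closed so it stands alone.
SAMPLE = """
<table class="events" border="0">
<tr>
<td class="factLabel">event1:&nbsp;</td>
<td>
4 February 1940
<br/>
</td>
</tr>
<tr>
<td class="factLabel">event2:&nbsp;</td>
<td>
9 October 2002
<br/>Laplata, Md
</td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# "td.factLabel + td" matches a td that directly follows a td.factLabel.
values = [
    ", ".join(td.stripped_strings)  # join the text on either side of <br/>
    for td in soup.select("td.factLabel + td")
]
print(values)
```

Joining stripped_strings with a comma keeps the date and the place in one field even though a br tag separates them in the markup.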


If any of the syntax above is confusing, go straight to the Beautiful Soup documentation. It has a lot of good examples.

First, I turned your sample data into a valid HTML page and pretty-printed it. That makes it easier to see what is going on:

<html><body><table><tr>
<td class="principal-col">
  <div class="pr-person">
    <div class="name">
      <span id="pr_person-icon" class="bullet-male-left"></span>
      <span class="person-link">Thomas A /Dumpling/</span>
    </div>
    <table class="events" border="0">
      <tr>
        <td class="factLabel">event1:&nbsp;</td>
        <td>4 February 1940<br/></td>
      </tr> 
      <tr>
        <td class="factLabel">event2:&nbsp;</td>
        <td>9 October 2002<br/>Laplata, Md</td>
      </tr>
    </table>
  </div>
</td>
</tr></table></body></html>
All that is left is the actual parsing code:

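def get_string(node, default=''):
    # Join the node's text fragments, each stripped of surrounding whitespace;
    # fall back to the default when the node was not found.
    if node:
        return ', '.join(node.stripped_strings)
    else:
        return default

def get_data(td_princ):
    # The name is wrapped in slashes in the markup, e.g. "Thomas A /Dumpling/".
    name = get_string(td_princ.find('span', {'class':'person-link'})).replace('/', '')

    birth = hired = '(missing)'
    # Each event row holds a label cell followed by a value cell.
    for event in td_princ.find('table', {'class': 'events'}).find_all('tr'):
        cnt = [get_string(cell) for cell in event.find_all('td')]
        if len(cnt) == 2:
            if cnt[0] == "event1:":
                birth = cnt[1]
            elif cnt[0] == "event2:":
                hired = cnt[1]
    return (name, birth, hired)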
Run against the sample data, this produces a CSV file like the following:

Name,Born,Hired
Thomas A Dumpling,4 February 1940,"9 October 2002, Laplata, Md"
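The third field is quoted because it contains a comma. A small sketch of how csv.writer handles that, using an in-memory buffer instead of a file (the row is copied from the output above; buf and parsed are names introduced only for this illustration):

```python
import csv
import io

row = ("Thomas A Dumpling", "4 February 1940", "9 October 2002, Laplata, Md")

buf = io.StringIO()
writer = csv.writer(buf)  # default dialect quotes only the fields that need it
writer.writerow(["Name", "Born", "Hired"])
writer.writerow(row)
text = buf.getvalue()
print(text)

# Reading the text back yields the same three fields, embedded comma included.
parsed = list(csv.reader(io.StringIO(text)))
```

Because the quoting is handled by the csv module, the parsing code does not need to escape anything itself.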

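Thank you, Hugh! As you will see, I am really new to this... When I run your code I get the following error. Do you know what might be causing it? Many thanks... "data.append(get_data(person)) NameError: global name 'get_data' is not defined"

I figured it out, and it was a really silly mistake on my part. The parsing code actually needs to be executed first. It works like a charm now. Thanks again, Hugh!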
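For anyone hitting the same NameError: Python looks a function name up when the call executes, not when the calling function is defined. A toy sketch of that behaviour (helper and main here are hypothetical stand-ins, not the functions from the answer):

```python
def main():
    # helper is looked up only when this line runs, not when main is defined.
    return helper()

try:
    main()  # helper does not exist yet, so the lookup fails
    early_call = "ok"
except NameError:
    early_call = "NameError"

def helper():
    return "parsed"

late_call = main()  # helper is defined now, so the same call succeeds
print(early_call, late_call)
```

So the order of the definitions in the file does not matter, as long as every function exists by the time main() is actually called.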
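For reference, the main script that calls get_data:

from bs4 import BeautifulSoup
import csv
import glob
import os

DATA_PATH = "c:\\file_path\\"
FILESPEC  = "*.htm"
OUTFILE   = "data.csv"

# get_string() and get_data() must be defined before main() is called.
def main():
    data = []
    for fname in glob.glob(os.path.join(DATA_PATH, FILESPEC)):
        with open(fname) as inf:
            # Name the parser explicitly; html.parser is the built-in one,
            # swap in another installed parser if you prefer.
            pg = BeautifulSoup(inf.read(), "html.parser")
            for person in pg.find_all('td', {'class':'principal-col'}):
                data.append(get_data(person))
    data.sort()

    # On Python 3 the csv module wants a text-mode file opened with newline=''
    # (the Python 2 idiom was mode 'wb').
    with open(os.path.join(DATA_PATH, OUTFILE), 'w', newline='') as outf:
        outcsv = csv.writer(outf)
        outcsv.writerow(["Name", "Born", "Hired"])
        outcsv.writerows(data)

if __name__ == "__main__":
    main()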