Parsing a web page with robobrowser and BeautifulSoup in Python

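I am new to web scraping and am trying to parse a site after submitting a form with robobrowser. I get the correct data back (I can see it when I print(browser.parsed)), but I am having trouble parsing it. The relevant part of the page source looks like this: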

<div id="ii">
<tr>
  <td scope="row" id="t1a"> ID (ID Number)</a></td>
  <td headers="t1a">1234567 &nbsp;</td>
</tr>
<tr>
  <td scope="row" id="t1b">Participant Name</td>
  <td headers="t1b">JONES, JOHN                          &nbsp;</td>
</tr>
<tr>
  <td scope="row" id="t1c">Sex</td>
  <td headers="t1c">MALE   &nbsp;</td>
</tr>
<tr>
  <td scope="row" id="t1d">Date of Birth</td>
  <td headers="t1d">11/25/2016 &nbsp;</td>
</tr>
<tr>
  <td scope="row" id="t1e">Race / Ethnicity</a></td>
  <td headers="t1e">White                  &nbsp;</td>
</tr>

I get:

out: [<td id="t1b" scope="row">Inmate Name</td>]

Selecting on "tr" returns a list containing each of the 29 "tr" elements, which I could convert to text and search for the relevant information.

I also tried creating a BeautifulSoup object:

x = browser.select('#ii')
soup = BeautifulSoup(x[0].text, "html.parser")

But that loses all the tags and ids, so I don't know how to search within it.
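
For context, here is a minimal sketch of why the tags vanish. It assumes browser.select returns BeautifulSoup Tag objects, which the use of x[0].text suggests. The markup is an abbreviated stand-in for the page, and the names html, parsed, container, and text_only are illustrative only:

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for the page source (one row only).
html = """
<div id="ii">
<tr>
  <td scope="row" id="t1c">Sex</td>
  <td headers="t1c">MALE   &nbsp;</td>
</tr>
</div>
"""

# Stand-in for browser.parsed; "html.parser" keeps the bare <tr> inside the <div>.
parsed = BeautifulSoup(html, "html.parser")
container = parsed.select("#ii")[0]

# .text holds only the text content, so re-parsing it yields a soup with no elements.
text_only = BeautifulSoup(container.text, "html.parser")
print(text_only.find("td"))  # None

# The Tag itself still carries its structure and can be searched directly.
values = [cell.get_text(strip=True) for cell in container.select("td[headers]")]
print(values)  # ['MALE']
```

In other words, the element returned by select can probably be queried as it is, without building a second BeautifulSoup object from its text.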

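Is there an easy way to loop over each "tr" element and get the actual data rather than the label, instead of repeatedly converting to a string variable and searching it?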

Thanks

Get all the "label" td elements and collect the results into a dict:

from pprint import pprint
from bs4 import BeautifulSoup

data = """
<table>
    <tr>
      <td scope="row" id="t1a"> ID (ID Number)</a></td>
      <td headers="t1a">1234567 &nbsp;</td>
    </tr>
    <tr>
      <td scope="row" id="t1b">Participant Name</td>
      <td headers="t1b">JONES, JOHN                          &nbsp;</td>
    </tr>
    <tr>
      <td scope="row" id="t1c">Sex</td>
      <td headers="t1c">MALE   &nbsp;</td>
    </tr>
    <tr>
      <td scope="row" id="t1d">Date of Birth</td>
      <td headers="t1d">11/25/2016 &nbsp;</td>
    </tr>
    <tr>
      <td scope="row" id="t1e">Race / Ethnicity</a></td>
      <td headers="t1e">White                  &nbsp;</td>
    </tr>
</table>
"""

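# html5lib (a separate package) repairs the markup and ignores the stray </a> tags
soup = BeautifulSoup(data, 'html5lib')

# Map each label cell (scope="row") to the text of the td that follows it.
# Note that the name data is rebound from the HTML string to the result dict.
data = {
    label.get_text(strip=True): label.find_next_sibling("td").get_text(strip=True)
    for label in soup.select("tr > td[scope=row]")
}
pprint(data)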

This prints:

{'Date of Birth': '11/25/2016',
 'ID (ID Number)': '1234567',
 'Participant Name': 'JONES, JOHN',
 'Race / Ethnicity': 'White',
 'Sex': 'MALE'}
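
The same idea can be written as the explicit loop over "tr" elements that the question asks for. This is only a sketch under a few assumptions: the markup is an abbreviated copy of the question's snippet, the soup object stands in for browser.parsed, each row of interest has exactly two td cells, and the names html, container, and record are made up for illustration. It uses the built-in "html.parser" rather than html5lib, since that parser keeps the bare "tr" rows inside the div as written:

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the question's markup (three of the rows).
html = """
<div id="ii">
<tr>
  <td scope="row" id="t1b">Participant Name</td>
  <td headers="t1b">JONES, JOHN                          &nbsp;</td>
</tr>
<tr>
  <td scope="row" id="t1c">Sex</td>
  <td headers="t1c">MALE   &nbsp;</td>
</tr>
<tr>
  <td scope="row" id="t1d">Date of Birth</td>
  <td headers="t1d">11/25/2016 &nbsp;</td>
</tr>
</div>
"""

# Stand-in for browser.parsed after the form has been submitted.
soup = BeautifulSoup(html, "html.parser")
container = soup.select("#ii")[0]

record = {}
for row in container.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) != 2:
        continue  # skip rows that are not a simple label/value pair
    # strip=True trims surrounding whitespace, including the trailing &nbsp;
    label, value = (cell.get_text(strip=True) for cell in cells)
    record[label] = value

print(record)
# {'Participant Name': 'JONES, JOHN', 'Sex': 'MALE', 'Date of Birth': '11/25/2016'}
```

Searching the Tag directly keeps the ids and attributes available, so no conversion to a string is needed at any step.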