Python: scraping a non-standard web table with BeautifulSoup

python html web-scraping beautifulsoup

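I am trying to scrape data from several web pages in order to build a CSV of the data. The data is just nutritional information for products. I have written code that fetches the pages, but I cannot get the iteration right. The problem is that the site puts each product name in a DIV, and inside that DIV the name is wrapped in either a strong or a b tag, depending on the page. When I iterate, the product names all come out at once as a single list with their tags still attached, followed by the contents of the column I asked for, without tags. I am struggling to work out what I am doing wrong.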

Sample of the page source:

<div><strong>Product 1 Name</strong></div>

<table>
    <tbody>
        <tr>
            <td>Serving Size</td>
            <td>8 (fl. Oz.)</td>
        </tr>
        <tr>
            <td>Calories</td>
            <td>122 Calories</td>
        </tr>
        <tr>
            <td>Fat</td>
            <td>0 (g)</td>
        </tr>
        <tr>
            <td>Sodium</td>
            <td>0.2 (mg)</td>
        </tr>
        <tr>
            <td>Carbs</td>
            <td>8.8 (mg)</td>
        </tr>
        <tr>
            <td>Dietary Fiber</td>
            <td>0 (g)</td>
        </tr>
        <tr>
            <td>Sugar</td>
            <td>8.8 (g)<br />
            &nbsp;</td>
        </tr>
    </tbody>
</table>
&nbsp;

<div><strong>Product 2 Name</strong></div>

<table>
    <tbody>
        <tr>
            <td>Serving Size</td>
            <td>8 (fl. Oz.)</td>
        </tr>
        <tr>
            <td>Calories</td>
            <td>134 Calories</td>
        </tr>
        <tr>
            <td>Fat</td>
            <td>0 (g)</td>
        </tr>
        <tr>
            <td>Sodium</td>
            <td>0.0 (mg)</td>
        </tr>
        <tr>
            <td>Carbs</td>
            <td>8.4 (mg)</td>
        </tr>
        <tr>
            <td>Dietary Fiber</td>
            <td>0 (g)</td>
        </tr>
        <tr>
            <td>Sugar</td>
            <td>8.4 (g)<br />
            &nbsp;</td>
        </tr>
    </tbody>
</table>
&nbsp;

This gives me output like the following:

[<strong>Product 1 Name</strong>, <strong>Product 2 Name</strong>]
8 (fl. Oz.)
101 Calories
0 (g)
0.0 (mg)
0 (mg)
0 (g)
0 (g)
8 (fl. Oz.)
101 Calories
0 (g)
0.0 (mg)
0 (mg)
0 (g)
0 (g)
[]

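Find each table, then take the text from the previous strong tag and the second td from each tr, splitting the text once to drop the (g) and similar units: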

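from bs4 import BeautifulSoup

# html holds the page source, e.g. the sample markup from the question
soup = BeautifulSoup(html, "html.parser")

for table in soup.find_all("table"):
    # the product name is the closest <strong> that precedes the table
    name = [table.find_previous("strong").text]
    # second cell of each row, keeping only the number in front of the unit
    amounts = [td.text.split(None, 1)[0] for td in table.select("tr td + td")]
    print(name + amounts)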

This gives you:

['Product 1 Name', '8', '122', '0', '0.2', '8.8', '0', '8.8']
['Product 2 Name', '8', '134', '0', '0.0', '8.4', '0', '8.4']
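The split(None, 1)[0] step can be tried on its own, without any HTML. The sketch below is only an illustration: leading_amount is a hypothetical helper name, and the empty-cell guard is an addition that the answer's one-liner does not have (there, an empty cell would raise IndexError).

```python
def leading_amount(cell_text):
    """Return the first whitespace-separated token of a cell's text.

    Hypothetical helper, not part of the original answer.
    """
    # split(None, 1) splits on any run of whitespace, at most once, so the
    # unit and any trailing newline or non-breaking space land in the
    # second piece and are discarded.
    parts = cell_text.split(None, 1)
    # Guard added here: an all-whitespace cell yields an empty list.
    return parts[0] if parts else ""


print(leading_amount("8 (fl. Oz.)"))
print(leading_amount("122 Calories"))
print(leading_amount("8.8 (g)\n            \xa0"))
print(repr(leading_amount("   ")))
```

The third call mirrors the Sugar cell in the sample, where a line break and a non-breaking space follow the unit.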

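select("tr td + td") uses a CSS selector to get the second td from each tr (row). td + td matches any td that directly follows another td, and since every row here has exactly two cells, that is always the second one.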
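To see what the selector matches, here is a small sketch on inline markup. The three-column row is a made-up case that does not occur in the sample pages, and cell_texts is a hypothetical helper. It suggests that td + td would also pick up a third cell, while td:nth-of-type(2) stays on the second.

```python
from bs4 import BeautifulSoup

# Two-column row, as in the question's sample markup.
two_col = "<table><tr><td>Fat</td><td>0 (g)</td></tr></table>"
# Hypothetical three-column row, invented for illustration only.
three_col = "<table><tr><td>Fat</td><td>0 (g)</td><td>0%</td></tr></table>"


def cell_texts(html, selector):
    """Return the text of every element matched by a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text() for td in soup.select(selector)]


print(cell_texts(two_col, "tr td + td"))              # ['0 (g)']
print(cell_texts(three_col, "tr td + td"))            # ['0 (g)', '0%']
print(cell_texts(three_col, "tr td:nth-of-type(2)"))  # ['0 (g)']
```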

Or, using find_all and indexing, it would look like this:

for table in soup.find_all("table"):
    name = [table.find_previous("strong").text]
    # index 1 is the second cell of each row
    amounts = [tr.find_all("td")[1].text.split(None, 1)[0] for tr in table.find_all("tr")]
    print(name + amounts)

Because the name is not always in a strong tag but sometimes in a b (bold) tag, look for the strong first and fall back to the b:

from bs4 import BeautifulSoup
import requests
html = requests.get("http://beamsuntory.desk.com/customer/en/portal/articles/1676001-nutrition-information-cruzan").content
soup = BeautifulSoup(html, "html.parser")
for table in soup.select("div.article-content table"):
    name = table.find_previous("strong") or table.find_previous("b")
    amounts = [td.text.split(None, 1)[0] for td in table.select("tr td + td")]
    print([name.text] + amounts)

If table.find_previous("strong") finds nothing, it returns None, so the or expression evaluates its right-hand side and name is set to table.find_previous("b").
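The fallback can be checked on inline markup. The pages below are invented, as are the helper names names_with_or and names_nearest. The sketch also covers one case the answer does not discuss, under the assumption that a single page might mix both tags: the or form then keeps returning an earlier strong, whereas passing a list to find_previous takes whichever of the two tags is nearest.

```python
from bs4 import BeautifulSoup

# Invented pages: one uses <strong> for names, the other uses <b>.
strong_page = (
    "<div><strong>Product A</strong></div>"
    "<table><tr><td>Fat</td><td>0 (g)</td></tr></table>"
)
bold_page = (
    "<div><b>Product B</b></div>"
    "<table><tr><td>Fat</td><td>1 (g)</td></tr></table>"
)
# Assumed edge case: both styles on the same page.
mixed_page = strong_page + bold_page


def names_with_or(html):
    """Name each table using the strong-then-b fallback from the answer."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for table in soup.find_all("table"):
        # find_previous returns None when nothing matches, so `or` moves on.
        heading = table.find_previous("strong") or table.find_previous("b")
        names.append(heading.get_text())
    return names


def names_nearest(html):
    """Name each table by the nearest preceding <strong> or <b>."""
    soup = BeautifulSoup(html, "html.parser")
    return [t.find_previous(["strong", "b"]).get_text()
            for t in soup.find_all("table")]


print(names_with_or(strong_page))  # ['Product A']
print(names_with_or(bold_page))    # ['Product B']
print(names_with_or(mixed_page))   # ['Product A', 'Product A']
print(names_nearest(mixed_page))   # ['Product A', 'Product B']
```

For pages that consistently use one tag or the other, as described in the question, both forms give the same names.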

Now it works for both pages:

In [12]: html = requests.get("http://beamsuntory.desk.com/customer/en/portal/articles/1676001-nutrition-information-cruzan").content

In [13]: soup = BeautifulSoup(html, "html.parser")

In [14]: for table in soup.select("div.article-content table"):
   ....:         name = table.find_previous("strong") or table.find_previous("b")
   ....:         amounts = [td.text.split(None, 1)[0] for  td in table.select("tr td + td")]
   ....:         print([name.text] + amounts)
   ....:     
[u'Cruzan Banana Flavored Rum 42 proof', u'1.5', u'79', u'0', u'0.0', u'6.5', u'0', u'6.5']
[u'Cruzan Banana Flavored Rum 55 proof', u'1.5', u'95', u'0', u'0.0', u'6.5', u'0', u'6.5']
[u'Cruzan Black Cherry Flavored Rum 42 proof', u'1.5', u'80', u'0', u'0.0', u'6.9', u'0', u'6.9']
[u'Cruzan Citrus Flavored Rum 42 proof', u'1.5', u'99', u'0', u'0.0', u'2.8', u'0', u'2.6']
[u'Cruzan Coconut Flavored Rum 42 proof', u'1.5', u'78', u'0', u'0.1', u'6.9', u'0', u'6.5']
[u'Cruzan Coconut Flavored Rum 55 proof', u'1.5', u'95', u'0', u'0.1', u'6.1', u'0', u'0']
[u'Cruzan Guaza Flavored Rum 42 proof', u'1.5', u'78', u'0', u'0.1', u'6.5', u'0', u'6.5']
[u'Cruzan Key Lime Flavored Rum 42 proof', u'1.5', u'81', u'0', u'0.0', u'8.1', u'0', u'6']
[u'Cruzan Mango Flavored Rum 42 proof', u'1.5', u'85', u'0', u'0.0', u'8.5', u'0', u'8.5']
[u'Cruzan Mango Flavored Rum 55 proof', u'1.5', u'101', u'0', u'0.0', u'8.5', u'0', u'8.5']
[u'Cruzan Orange Flavored Rum 42 proof', u'1.5', u'76.77', u'0', u'0', u'6.4', u'0', u'6.4']
[u'Cruzan Passion Fruit Flavored Rum 42 proof', u'1.5', u'77', u'0', u'0.0', u'6.3', u'0', u'6.3']
[u'Cruzan Pineapple Flavored Rum 42 proof', u'1.5', u'78', u'0', u'0.0', u'6.5', u'0', u'6.5']
[u'Cruzan Pineapple Flavored Rum 55 proof', u'1.5', u'94', u'0', u'0.0', u'6.5', u'0', u'6.5']
[u'Cruzan Raspberry Flavored Rum 42 proof', u'1.5', u'92', u'0', u'0.0', u'10.1', u'0', u'10.1']
[u'Cruzan Raspberry Flavored Rum 55 proof', u'1.5', u'108', u'0', u'0.0', u'10.1', u'0', u'10.1']
[u'Cruzan Strawberry Flavored Rum 42 proof', u'1.5', u'76', u'0', u'0.0', u'6.1', u'0', u'6']
[u'Cruzan Vanilla Flavored Rum 42 proof', u'1.5', u'78', u'0', u'0.0', u'6.5', u'0', u'6.5']
[u'Cruzan Vanilla Flavored Rum 55 proof', u'1.5', u'94', u'0', u'0.0', u'6.5', u'0', u'6.5']
[u'Cruzan Estate Dark Rum 80 proof', u'1.5', u'101', u'0', u'0.0', u'0', u'0', u'0']
[u'Cruzan Estate Light Rum 80 proof', u'1.5', u'101', u'0', u'0.0', u'0', u'0', u'0']
[u'Cruzan Estate Single Barrel Rum 80 proof', u'1.5', u'99', u'0', u'0.0', u'0.9', u'0', u'0.9']

For the page that uses b (bold) tags:

In [20]: html = requests.get("http://beamsuntory.desk.com/customer/en/portal/articles/1790163-midori-nutrition-information").content

In [21]: soup = BeautifulSoup(html, "html.parser")

In [22]: for table in soup.select("div.article-content table"):
   ....:         name = table.find_previous("strong") or table.find_previous("b")
   ....:         amounts = [td.text.split(None, 1)[0] for  td in table.select("tr td + td")]
   ....:         print([name.text] + amounts)
   ....:     
[u'Midori', u'1.0', u'62.1', u'0', u'0.3', u'7.5', u'0', u'7.0']
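Since the stated goal is a CSV, rows shaped like the ones printed by the loop can be written out with the standard csv module. This is a minimal sketch under assumptions: the column names are taken from the first cell of each row in the sample markup, the "Product" header and the rows_to_csv helper are made up here, and rows of any other length are simply skipped.

```python
import csv
import io

# Column labels from the sample markup; "Product" is an assumed header.
HEADER = ["Product", "Serving Size", "Calories", "Fat",
          "Sodium", "Carbs", "Dietary Fiber", "Sugar"]


def rows_to_csv(rows, out):
    """Write the header and every well-formed row to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    for row in rows:
        if len(row) != len(HEADER):
            # A table with a different layout would misalign the columns.
            continue
        writer.writerow(row)


# Rows as printed for the question's sample markup.
rows = [
    ["Product 1 Name", "8", "122", "0", "0.2", "8.8", "0", "8.8"],
    ["Product 2 Name", "8", "134", "0", "0.0", "8.4", "0", "8.4"],
]

buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with newline="" (for example open("nutrition.csv", "w", newline="")), which the csv module documentation recommends so that line endings are not doubled on Windows.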

I must be doing something wrong. I get this error:

Traceback (most recent call last):
  File "htmlextraction.py", line 10, in
    name = [table.find_previous("strong").text]
AttributeError: 'NoneType' object has no attribute 'text'
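This error usually means find_previous("strong") returned None for some table, that is, no strong tag precedes it in the document. A minimal sketch of a guard, using made-up markup in which the first table has no heading before it:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the first table has no heading before it at all.
html = (
    "<table><tr><td>Fat</td><td>0 (g)</td></tr></table>"
    "<div><strong>Product 1 Name</strong></div>"
    "<table><tr><td>Fat</td><td>1 (g)</td></tr></table>"
)

soup = BeautifulSoup(html, "html.parser")
rows = []
for table in soup.find_all("table"):
    heading = table.find_previous("strong")
    if heading is None:
        # Accessing .text on None is what raises the AttributeError.
        continue
    amounts = [td.get_text().split(None, 1)[0]
               for td in table.select("tr td + td")]
    rows.append([heading.get_text(strip=True)] + amounts)

print(rows)  # [['Product 1 Name', '1']]
```

Whether to skip such a table or give it a placeholder name depends on the page. Skipping is just the simplest choice for this sketch.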

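Even tried adding html5lib.

@PDGill, can you share a link to the actual page?

Sure. [Here is a sample of one of the pages.]() Thanks for the help.

Your code works as expected on that page. Looking at it with fresh eyes sometimes makes a difference.

Works great. Thank you, sir. You are a gentleman and a scholar.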