Web scraping: parsing with BeautifulSoup and extracting the data into pandas

I'm trying to scrape some data from a website, but I'm new to Python/HTML and could use some help.

Here is the part of the code that works:

from bs4 import BeautifulSoup
import requests
page_link = 'http://www.some-website.com'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
data = page_content.find(id='yyy')
print(data)
This successfully grabs the data I'm after; printing it shows the following:

<div class="generalData" id="yyy">
<div class="generalDataBox">

<div class="rowText">
<label class="some-class-here" title="some-title-here">
Title Name
</label>
<span class="" id="">###</span>
</div>

<div class="rowText">
<label class="same-class-here" title="another-title-here">
Another Title Name
</label>
<span class="" id="">###2</span>
</div>

... more rows here ...

</div></div>

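What is the best way to get this into a DataFrame? Ideally it would have two columns: one with the label names (i.e. "Title Name" or "Another Title Name" above) and one with the data (i.e. ### and ###2 above).

Thanks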

First, the extraction part:

html = """<div class="generalData" id="yyy">
<div class="generalDataBox">

<div class="rowText">
<label class="same-class-here" title="some-title-here">Title Name</label>
<span class="" id="">###</span>
</div>

<div class="rowText">
<label class="same-class-here" title="another-title-here">Another Title Name</label>
<span class="" id="">###2</span>
</div>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

hashList = []
titleList = []

# Find the labels and spans once instead of re-searching on every iteration
labels = soup.find_all('label', class_="same-class-here")
spans = soup.find_all('span')

# Pair each label with the span at the same position
for label, span in zip(labels, spans):
    titleList.append(label.get_text())
    hashList.append(span.get_text())
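
The output below is a two-column DataFrame, but the step that builds it is not shown in the answer as captured here. A minimal sketch of that step, assuming pandas is installed and that titleList and hashList hold the values the loop above collects for the sample HTML (they are written out literally here only so the snippet runs on its own):

```python
import pandas as pd

# Assumption: these are the values the extraction loop yields for the sample HTML
titleList = ['Title Name', 'Another Title Name']
hashList = ['###', '###2']

# One column per list; the column names follow the output shown in the answer
df = pd.DataFrame({'Title': titleList, 'Hash': hashList})
print(df)
```

Both lists must have the same length, otherwise pandas raises a ValueError when building the frame.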
Output:

                Title  Hash
0          Title Name   ###
1  Another Title Name  ###2
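
As a variation on the approach above, the rows can also be walked one rowText block at a time. That keeps each label paired with the span in its own row, and get_text(strip=True) drops the line breaks around the label text in the markup from the question. This is only a sketch under assumptions: the sample HTML is copied from the question, the column names Title and Hash follow the output above, and pandas is assumed to be installed.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample markup shaped like the question's printed output
# (the label text sits on its own line, surrounded by newlines)
html = """
<div class="generalData" id="yyy">
<div class="generalDataBox">
<div class="rowText">
<label class="some-class-here" title="some-title-here">
Title Name
</label>
<span class="" id="">###</span>
</div>
<div class="rowText">
<label class="same-class-here" title="another-title-here">
Another Title Name
</label>
<span class="" id="">###2</span>
</div>
</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for row in soup.find_all("div", class_="rowText"):
    label = row.find("label")
    span = row.find("span")
    if label is None or span is None:
        continue  # skip rows that lack either element
    rows.append({
        "Title": label.get_text(strip=True),  # strip=True removes the surrounding newlines
        "Hash": span.get_text(strip=True),
    })

# Build the frame from the list of row dictionaries
df = pd.DataFrame(rows, columns=["Title", "Hash"])
print(df)
```

Searching inside each row avoids relying on the labels and spans appearing in the same order across the whole page, and it does not depend on the label's class name.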