Python: extracting both the text and the URLs from a table on a page with Beautiful Soup


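I'm trying to extract both the text and the URLs from a table on a website, but I only seem to get the text. I'm guessing this has something to do with the text.strip in my code, but I don't know how to clean up the HTML tags without also removing the URL links. Here is what I have put together so far: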

import requests
from bs4 import BeautifulSoup

start_number = 0
max_number = 5

urls=[]

for number in range(start_number, max_number + start_number):
    url = 'http://www.ispo-org.or.id/index.php?option=com_content&view=article&id=79:pengumumanpublik&catid=10&Itemid=233&showall=&limitstart=' + str(number)+ '&lang=en'
    urls.append(url)

data = []

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content,"html.parser")
    table = soup.find("table")
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele]) # Get rid of empty values

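Just extract the href from the a element. For the purposes of this answer I simplified the code so that it does not deal with the subsequent pages.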

from collections import namedtuple

import requests
from bs4 import BeautifulSoup

url = 'http://www.ispo-org.or.id/index.php?option=com_content&view=article&id=79:pengumumanpublik&catid=10&Itemid=233&showall=&limitstart=0&lang=en'

data = []
Record = namedtuple('Record', 'id company agency date pdf_link')

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
rows = soup.select('table > tbody > tr')

for row in rows[1:]:  # omit header row
    cols = row.find_all('td')
    fields = [td.text.strip() for td in cols if td.text.strip()]

    if fields:  # if the row is not empty
        pdf_link = row.find('a')['href']
        record = Record(*fields, pdf_link)
        data.append(record)
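The same idea can be tried offline on a small HTML fragment. The following is a minimal sketch, not the answer's exact code: SAMPLE_HTML and its contents are made up for illustration, and extract_rows is a hypothetical helper name. It keeps the cell text and the link together, and stores None instead of raising when a row has no a element.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page (an assumption).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>No</td><td>Company</td><td>Document</td></tr>
    <tr><td>1</td><td>PT Example</td><td><a href="images/notifikasi/example.pdf">Download</a></td></tr>
    <tr><td> </td><td></td><td></td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return one list per non-empty data row: cell texts plus the first href (or None)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select("table tr")[1:]:  # skip the header row
        texts = [td.get_text(strip=True) for td in row.find_all("td")]
        texts = [t for t in texts if t]  # drop empty cells
        if not texts:
            continue  # skip blank filler rows
        link = row.find("a", href=True)  # None when the row has no link
        records.append(texts + [link["href"] if link else None])
    return records


print(extract_rows(SAMPLE_HTML))
```

Checking the result of row.find before indexing it is what avoids a 'NoneType' object is not subscriptable error on rows that contain text but no link.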

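Comments:

This only extracts the text from each td element. If a cell contains an a element, what do you expect to get?

@Jatimir, I would like to get the link to the pdf as a separate element in the list. It looks like this: "images/notifikasi/619.%20Pengumuman%20Publik%20PT%20IGP.compressed.pdf". I would then use this together with a base URL to download the pdf.

Could you be a bit clearer about your expected output, @Funkeh Monkeh?

@SIM, basically I want a data frame containing all the information from the table, plus the link to the pdf, so that I can iterate over the links and download the files into a folder.

Thanks a lot, @Jatimir. Unfortunately, when I start looping over the URLs, I get this error:

'NoneType' object is not subscriptable

It seems there are sometimes some extra empty rows. I updated the code in my answer so that it handles those cases.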

>>> data[0].pdf_link
'images/notifikasi/619.%20Pengumuman%20Publik%20PT%20IGP.compressed.pdf'
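A relative link like the one shown above still has to be combined with a base URL before it can be downloaded. The sketch below uses only the standard library. BASE_URL assumes the links are relative to the site root, which the thread does not confirm, and absolute_pdf_url and local_filename are hypothetical helper names.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urljoin, urlparse

# Assumption: the pdf links are relative to the site root.
BASE_URL = "http://www.ispo-org.or.id/"


def absolute_pdf_url(pdf_link, base_url=BASE_URL):
    """Resolve a relative pdf link against the base URL."""
    return urljoin(base_url, pdf_link)


def local_filename(url):
    """Derive a readable file name from the URL path (percent-escapes decoded)."""
    return unquote(PurePosixPath(urlparse(url).path).name)


link = "images/notifikasi/619.%20Pengumuman%20Publik%20PT%20IGP.compressed.pdf"
full_url = absolute_pdf_url(link)
print(full_url)
print(local_filename(full_url))
```

The resulting URL could then be passed to requests.get and the response's content written to a file with that name.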