Best way in Python to extract specific parts of an HTML/JSON page?
A Python requests call returned the following:
{"error":{"ErrorMessage":"
<div>
<p>To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here
<a href=\\"http:\\/\\/www.southhams.gov.uk\\/wastequestion\\">www.southhams.gov.uk\\/wastequestion<\\/a><\\/p><\\/div>","CodeName":"Success","ErrorStatus":0},"calendar":{"calendar":"
<div class=\\"wsResponse\\">To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here
<a href=\\"http:\\/\\/www.southhams.gov.uk\\/wastequestion\\">www.southhams.gov.uk\\/wastequestion<\\/a><\\/div>"},"binCollections":{"tile":[["
<div class=\'collectionDiv\'>
<div class=\'fullwidth\'>
<h3>Organic Collection Service (Brown Organic Bin)<\\/h3><\\/div>
<div class=\\"collectionImg\\">
<img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/brown bin.png\\" \\/><\\/div>\\n
<div class=\'wdshDetWrap\'>Your brown organic bin collection is
<b>Fortnightly<\\/b> on a
<b>Thursday<\\/b>.
<br\\/> \\n Your next scheduled collection is
<b>Friday, 29 May 2020<\\/b>.
<br\\/>
<br\\/>
<a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3427\\">Read more about the Organic Collection Service ><\\/a><\\/div><\\/div>"],["
<div class=\'collectionDiv\'>
<div class=\'fullwidth\'>
<h3>Recycling Collection Service (Recycling Sacks)<\\/h3><\\/div>
<div class=\\"collectionImg\\">
<img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/SH_two_rec_sacks.png\\" \\/><\\/div>\\n
<div class=\'wdshDetWrap\'>Your recycling sacks collection is
<b>Fortnightly<\\/b> on a
<b>Thursday<\\/b>.
<br\\/> \\n Your next scheduled collection is
<b>Friday, 29 May 2020<\\/b>.
<br\\/>
<br\\/>
<a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3383\\">Read more about the Recycling Collection Service ><\\/a><\\/div><\\/div>"],["
<div class=\'collectionDiv\'>
<div class=\'fullwidth\'>
<h3>Refuse Collection Service (Grey Refuse Bin)<\\/h3><\\/div>
<div class=\\"collectionImg\\">
<img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/grey bin.png\\" \\/><\\/div>\\n
<div class=\'wdshDetWrap\'>Your grey refuse bin collection is
<b>Fortnightly<\\/b> on a
<b>Thursday<\\/b>.
<br\\/> \\n Your next scheduled collection is
<b>Thursday, 04 June 2020<\\/b>.
<br\\/>
<br\\/>
<a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3384\\">Read more about the Refuse Collection Service ><\\/a><\\/div><\\/div>"]]}}
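Since the response body is JSON, a natural first step is to decode it with the `json` module (or `response.json()`) and pull out the `binCollections` → `tile` list, so that only the small HTML fragments need parsing. A minimal sketch, using a trimmed stand-in for the response shown above:

```python
import json

# A trimmed stand-in for the response body above: same structure,
# with the HTML fragments shortened for readability.
raw = """{"binCollections": {"tile": [
  ["<div class='collectionDiv'><h3>Organic Collection Service (Brown Organic Bin)</h3></div>"],
  ["<div class='collectionDiv'><h3>Refuse Collection Service (Grey Refuse Bin)</h3></div>"]
]}}"""

payload = json.loads(raw)
# Each "tile" entry is a one-element list wrapping an HTML fragment.
fragments = [entry[0] for entry in payload["binCollections"]["tile"]]
print(len(fragments))
```

From here each fragment can be handed to an HTML parser individually instead of parsing the whole response as HTML.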
The HTML document for this particular site is badly formed. I still managed to get a working solution, but with something in the region of 1,000 tags it will be inefficient, so it could be improved:
# soup is the whole (malformed) response parsed with BeautifulSoup
headers = soup.find_all('h3')
# the parsed text still contains literal escaped markup, so cut each
# string at the first remaining '<'
names = [tag.text[:tag.text.find('<')] for tag in headers]
dates = [tag.find_all('b')[2].text[:tag.find_all('b')[2].text.find('<')] for tag in headers]
print(names)
print(dates)
#Output
['Organic Collection Service (Brown Organic Bin)', 'Recycling Collection Service (Recycling Sacks)', 'Refuse Collection Service (Grey Refuse Bin)']
['Friday, 29 May 2020', 'Friday, 29 May 2020', 'Thursday, 04 June 2020']
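If scanning every tag of a large malformed page is the concern, one possible refinement (a sketch, not part of the original question) is bs4's `SoupStrainer`, which tells the parser to build tree nodes only for the tags you need:

```python
from bs4 import BeautifulSoup, SoupStrainer

# A small fragment standing in for one collection tile from the page above.
html = """
<div class='collectionDiv'><h3>Organic Collection Service (Brown Organic Bin)</h3>
<b>Fortnightly</b> on a <b>Thursday</b>. Next: <b>Friday, 29 May 2020</b></div>
"""

# Only h3 and b elements are parsed into the tree; everything else is
# skipped, so the work is proportional to the tags of interest rather
# than the ~1000 tags on the page.
strainer = SoupStrainer(["h3", "b"])
soup = BeautifulSoup(html, "html.parser", parse_only=strainer)

names = [h3.get_text(strip=True) for h3 in soup.find_all("h3")]
dates = [b.get_text(strip=True) for b in soup.find_all("b")][2::3]  # every third <b> is a date
print(names)
print(dates)
```

The `[2::3]` slice assumes each tile contributes exactly three `<b>` tags (frequency, weekday, date), which holds for the fragments shown above.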
Fetch the JSON directly; then you can pull the HTML value out of it. Once you have that, parse the HTML with BeautifulSoup and print the text inside the tags where it is found:
import requests
from bs4 import BeautifulSoup

url = "https://southhams.fccenvironment.co.uk/mycollections"
response = requests.get(url)
cookiejar = response.cookies
for cookie in cookiejar:
    print(cookie.name, cookie.value)

url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"
payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
}

jsonData = requests.post(url, headers=headers, data=payload).json()

data = jsonData['binCollections']['tile']
for each in data:
    soup = BeautifulSoup(each[0], 'html.parser')
    collection = soup.find('div', {'class': 'collectionDiv'}).find('h3').text.strip()
    date = soup.find_all('b')[-1].text.strip()
    print(collection, date)
Output:
Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020
Recycling Collection Service (Recycling Sacks) Friday, 29 May 2020
Refuse Collection Service (Grey Refuse Bin) Thursday, 04 June 2020
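A possible tidy-up of the answer above (a sketch, untested against the live site): `requests.Session()` persists the `fcc_session_cookie` between requests automatically, so the `Cookie` header does not need to be rebuilt by hand. The URLs, field names, and `uprn` value are taken from the answer; `fetch_collections` is a hypothetical helper name.

```python
import requests
from bs4 import BeautifulSoup

def fetch_collections(uprn: str):
    """Return (collection, date) pairs for a property identified by its UPRN."""
    session = requests.Session()  # re-sends fcc_session_cookie automatically
    session.get("https://southhams.fccenvironment.co.uk/mycollections")
    token = session.cookies.get("fcc_session_cookie")
    resp = session.post(
        "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails",
        headers={"X-Requested-With": "XMLHttpRequest"},
        # passing a dict as data= sets the form-urlencoded Content-Type for us
        data={"fcc_session_token": token, "uprn": uprn},
    )
    results = []
    for tile in resp.json()["binCollections"]["tile"]:
        soup = BeautifulSoup(tile[0], "html.parser")
        collection = soup.find("h3").get_text(strip=True)
        date = soup.find_all("b")[-1].get_text(strip=True)
        results.append((collection, date))
    return results
```

Calling `fetch_collections("100040282539")` should produce the same pairs as the output above, assuming the site's endpoints behave as in the answer.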
Can you share the part of the code that returns this value — the actual request? @chitown88 Added to the bottom of the question, thanks.
Where does soup come from? @SaschaM78 It comes from using a library called … I don't know why he didn't include it in the solution. OP is already using it, so I just modified his "request code" snippet.