Scraping text from a long HTML element with bs4 and requests (Python 3.8)
I'm using Python 3.8.5 on Ubuntu 20.04. How can I scrape the HTML shown below into a dataframe? This is my code so far:
import pathlib
import sys
import lxml
import pandas as pd
import requests
from bs4 import BeautifulSoup

response = requests.get('http://nemweb.com.au/Reports/Current/')
soup = BeautifulSoup(response.text, 'lxml')
names = soup.find('body')
print(
    f"Type = {type(names)}\n"
    f"Length = {len(names)}\n"
)
name_list = names.find('pre')
print(name_list.text)
for elem in name_list.text:
    print(elem)
# Do I need to use regex here?
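(On the regex question: since the `<pre>` block is plain text, a regex over the whole block is one possible way to pull out the rows. The sketch below uses a made-up sample of the listing text and an assumed line format, so the pattern is illustrative, not taken from the live page.)

```python
import re

# Illustrative sample of the <pre> listing text (not fetched live)
sample = (
    "  Saturday, April 3, 2021  9:50 AM        <dir> Adjusted_Prices_Reports\n"
    "    Monday, April 5, 2021  8:00 AM        <dir> Alt_Limits\n"
)

# One row = date, time, a size or "<dir>", and a name
row = re.compile(
    r"(\w+, \w+ \d+, \d+)\s+(\d{1,2}:\d{2} [AP]M)\s+(<dir>|\d+)\s+(\S+)"
)

rows = row.findall(sample)
for date, time, kind, name in rows:
    print(date, time, kind, name)
```

Each match comes back as a `(date, time, kind, name)` tuple, which would drop straight into a dataframe.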
If a dataframe is what you're after, you might want to try the approach below.
By the way, this works for any report URL on nemweb.com.au - /Reports/Current/.
Note: I use .head(10) to show only the first 10 items of the resulting dataframe.
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

headers = ["Date", "Time", "Type", "URL"]


def make_soup(catalog_url: str) -> BeautifulSoup:
    return BeautifulSoup(requests.get(catalog_url).text, "lxml")


def process_soup(soup: BeautifulSoup) -> tuple:
    # Skip the page header tokens; the listing rows start after them
    text = soup.getText().split()[8:]
    # Skip the first link ("[To Parent Directory]")
    follow_urls = [a["href"] for a in soup.find_all("a", href=True)][1:]
    # Each listing row is 8 whitespace-separated tokens
    catalog = [text[i:i + 8] for i in range(0, len(text), 8)]
    return follow_urls, catalog


def build_dataframe(processed_soup: tuple) -> pd.DataFrame:
    follow_urls, catalog = processed_soup
    frame = []
    for index, item in enumerate(catalog):
        # Last token is the link text; the URL comes from follow_urls
        *date, hour, am, type_, _ = item
        frame.append(
            [
                " ".join(date),
                f"{hour} {am}",
                type_,
                f"http://nemweb.com.au{follow_urls[index]}",
            ]
        )
    return pd.DataFrame(frame, columns=headers)


def dump_to_csv(dataframe: pd.DataFrame, file_name: str = "default_name"):
    dataframe.to_csv(f"{file_name}.csv", index=False)
    print(f"File {file_name} saved!")


if __name__ == "__main__":
    target_url = "http://nemweb.com.au/Reports/Current/"
    df = build_dataframe(process_soup(make_soup(target_url)))
    print(tabulate(df.head(10), headers=headers, showindex=False, tablefmt="pretty"))
    dump_to_csv(df, file_name=target_url.rsplit("/")[-2])
Output:
+-----------------------------+----------+-------+-------------------------------------------------------------------+
| Date | Time | Type | URL |
+-----------------------------+----------+-------+-------------------------------------------------------------------+
| Saturday, April 3, 2021 | 9:50 AM | <dir> | http://nemweb.com.au/Reports/Current/Adjusted_Prices_Reports/ |
| Monday, April 5, 2021 | 8:00 AM | <dir> | http://nemweb.com.au/Reports/Current/Alt_Limits/ |
| Monday, April 5, 2021 | 1:12 AM | <dir> | http://nemweb.com.au/Reports/Current/Ancillary_Services_Payments/ |
| Monday, April 5, 2021 | 11:30 AM | <dir> | http://nemweb.com.au/Reports/Current/Auction_Units_Reports/ |
| Monday, April 5, 2021 | 4:43 AM | <dir> | http://nemweb.com.au/Reports/Current/Bidmove_Complete/ |
| Thursday, April 1, 2021 | 4:44 AM | <dir> | http://nemweb.com.au/Reports/Current/Bidmove_Summary/ |
| Wednesday, December 2, 2020 | 10:44 AM | <dir> | http://nemweb.com.au/Reports/Current/Billing/ |
| Monday, April 5, 2021 | 7:40 AM | <dir> | http://nemweb.com.au/Reports/Current/Causer_Pays/ |
| Thursday, February 4, 2021 | 9:10 PM | <dir> | http://nemweb.com.au/Reports/Current/Causer_Pays_Elements/ |
| Monday, November 28, 2016 | 7:50 PM | <dir> | http://nemweb.com.au/Reports/Current/Causer_Pays_Rslcpf/ |
+-----------------------------+----------+-------+-------------------------------------------------------------------+
File Current saved!
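One follow-up note: the Date and Time columns above are plain strings, so sorting or filtering on them would be lexicographic. A small sketch (column names as produced by the answer; the combined-format string is an assumption based on the output shown) of turning them into a real datetime with pandas:

```python
import pandas as pd

# Two rows copied from the table above, as the scraper would produce them
df = pd.DataFrame(
    {
        "Date": ["Saturday, April 3, 2021", "Monday, April 5, 2021"],
        "Time": ["9:50 AM", "8:00 AM"],
    }
)

# "%A, %B %d, %Y %I:%M %p" matches strings like "Saturday, April 3, 2021 9:50 AM"
df["Timestamp"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%A, %B %d, %Y %I:%M %p"
)

print(df.sort_values("Timestamp"))
```

With a proper Timestamp column, `df.sort_values("Timestamp")` orders reports chronologically rather than alphabetically by weekday name.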