
HTML: scraping text from a long element with bs4 and Python 3.8 requests


I'm using Python 3.8.5 on Ubuntu 20.04. How can I scrape the HTML shown below into a dataframe?

Here is my current code:

import pathlib
import sys

import lxml
import pandas as pd
import requests
from bs4 import BeautifulSoup

response = requests.get('http://nemweb.com.au/Reports/Current/')
soup = BeautifulSoup(response.text, 'lxml')
names = soup.find('body')
print(
    f"Type = {type(names)}\n"
    f"Length = {len(names)}\n"
)
name_list = names.find('pre')
print(name_list.text)
for elem in name_list.text:
    print(elem)
# Do I need to use regex here?
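
Note that name_list.text is a single string, so the loop above steps through it one character at a time; splitting the string is enough here, no regex required (the answer below does exactly that). A minimal sketch of tokenizing the listing instead, assuming the same <pre> block fetched above:

tokens = name_list.text.split()   # whitespace-separated fields of the listing
print(tokens[:8])                 # peek at the first few fields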
If it's a dataframe you're after, you might want to try the following approach:

By the way, this works for any report URL under nemweb.com.au/Reports/Current/.

Note: I use .head(10) to display the first 10 rows of the resulting dataframe.

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

headers = ["Date", "Time", "Type", "URL"]


def make_soup(catalog_url: str):
    return BeautifulSoup(requests.get(catalog_url).text, "lxml")


def process_soup(soup: BeautifulSoup) -> tuple:
    # Drop the page-header tokens, keeping only the listing fields.
    text = soup.getText().split()[8:]
    # Drop the first link ([To Parent Directory]); keep the report links.
    follow_urls = [a["href"] for a in soup.find_all("a", href=True)][1:]
    # Each listing row is 8 whitespace-separated fields.
    catalog = [text[i:i + 8] for i in range(0, len(text), 8)]
    return follow_urls, catalog


def build_dataframe(processed_soup: tuple) -> pd.DataFrame:
    follow_urls, catalog = processed_soup
    frame = []
    for index, item in enumerate(catalog):
        # The last field is the link text; the URL is rebuilt from the href instead.
        *date, hour, am, type_, _ = item
        frame.append(
            [
                " ".join(date),
                f"{hour} {am}",
                type_,
                f"http://nemweb.com.au{follow_urls[index]}",
            ]
        )
    return pd.DataFrame(frame, columns=headers)


def dump_to_csv(dataframe: pd.DataFrame, file_name: str = "default_name"):
    dataframe.to_csv(f"{file_name}.csv", index=False)
    print(f"File {file_name} saved!")


if __name__ == "__main__":
    target_url = "http://nemweb.com.au/Reports/Current/"
    df = build_dataframe(process_soup(make_soup(target_url)))
    print(tabulate(df.head(10), headers=headers, showindex=False, tablefmt="pretty"))
    dump_to_csv(df, file_name=target_url.rsplit("/")[-2])

Output:

+-----------------------------+----------+-------+-------------------------------------------------------------------+
|            Date             |   Time   | Type  |                                URL                                |
+-----------------------------+----------+-------+-------------------------------------------------------------------+
|   Saturday, April 3, 2021   | 9:50 AM  | <dir> |   http://nemweb.com.au/Reports/Current/Adjusted_Prices_Reports/   |
|    Monday, April 5, 2021    | 8:00 AM  | <dir> |         http://nemweb.com.au/Reports/Current/Alt_Limits/          |
|    Monday, April 5, 2021    | 1:12 AM  | <dir> | http://nemweb.com.au/Reports/Current/Ancillary_Services_Payments/ |
|    Monday, April 5, 2021    | 11:30 AM | <dir> |    http://nemweb.com.au/Reports/Current/Auction_Units_Reports/    |
|    Monday, April 5, 2021    | 4:43 AM  | <dir> |      http://nemweb.com.au/Reports/Current/Bidmove_Complete/       |
|   Thursday, April 1, 2021   | 4:44 AM  | <dir> |       http://nemweb.com.au/Reports/Current/Bidmove_Summary/       |
| Wednesday, December 2, 2020 | 10:44 AM | <dir> |           http://nemweb.com.au/Reports/Current/Billing/           |
|    Monday, April 5, 2021    | 7:40 AM  | <dir> |         http://nemweb.com.au/Reports/Current/Causer_Pays/         |
| Thursday, February 4, 2021  | 9:10 PM  | <dir> |    http://nemweb.com.au/Reports/Current/Causer_Pays_Elements/     |
|  Monday, November 28, 2016  | 7:50 PM  | <dir> |     http://nemweb.com.au/Reports/Current/Causer_Pays_Rslcpf/      |
+-----------------------------+----------+-------+-------------------------------------------------------------------+
File Current saved!
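
As noted above, the same pipeline should work for other listings under nemweb.com.au/Reports/Current/. A hypothetical usage sketch reusing the functions above (the Bidmove_Summary subdirectory is just one entry picked from the table):

target_url = "http://nemweb.com.au/Reports/Current/Bidmove_Summary/"
df = build_dataframe(process_soup(make_soup(target_url)))
dump_to_csv(df, file_name=target_url.rsplit("/")[-2])  # writes Bidmove_Summary.csv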