Python web scraping: dataset is empty after collecting the information
I want to create a dataset that includes information scraped from a website. I will explain below what I did and the expected output. What I get instead is an empty array of rows and columns, and then an empty dataset, and I do not understand why. I hope you can help me.

1) Create an empty data frame with a single column: this column should hold the list of URLs to use
data_to_use = pd.DataFrame([], columns=['URL'])
2) Select the URLs from a previous dataset
select_urls=dataset.URL.tolist()
This set of URLs looks like:
URL
0 www.bbc.co.uk
1 www.stackoverflow.com
2 www.who.int
3 www.cnn.com
4 www.cooptrasportiriolo.it
... ...
3) Fill the column with those URLs:
data_to_use['URL'] = select_urls
data_to_use['URLcleaned'] = data_to_use['URL'].str.replace(r'^(www\.)', '', regex=True)
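One detail worth checking here: in recent pandas versions `str.replace` treats the pattern as a literal string unless `regex=True` is passed, so the `^(www\.)` anchor would silently match nothing. A minimal sketch of the cleaning step (with my own sample URLs, not the asker's data):

```python
import pandas as pd

# Strip a leading "www." from each URL.
# In pandas >= 1.4 the regex flag must be set explicitly,
# otherwise the pattern is matched as a plain substring.
urls = pd.Series(['www.bbc.co.uk', 'stackoverflow.com', 'www.who.int'])
cleaned = urls.str.replace(r'^www\.', '', regex=True)
print(cleaned.tolist())  # ['bbc.co.uk', 'stackoverflow.com', 'who.int']
```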
4) Select a random sample for testing: column URL
data_to_use = data_to_use.loc[1:50, 'URL']
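A likely source of the later failure: `.loc[1:50, 'URL']` with a scalar column label returns a Series, not a DataFrame, so the subsequent `data_to_use['URLcleaned']` lookup no longer refers to a column. A sketch of the difference (with my own toy data):

```python
import pandas as pd

df = pd.DataFrame({'URL': ['www.bbc.co.uk', 'www.who.int', 'www.cnn.com']})

as_series = df.loc[0:1, 'URL']    # scalar column label -> Series
as_frame = df.loc[0:1, ['URL']]   # list of labels -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

Keeping the DataFrame form (`['URL']`) preserves column access such as `data_to_use['URLcleaned']` further down.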
5) Scrape the information
import requests
import time
from bs4 import BeautifulSoup

urls = data_to_use['URLcleaned'].tolist()
ares = []
for u in urls:  # the selection here may be wrong; I am not sure I am picking the right column
    print(u)
    url = 'https://www.urlvoid.com/scan/' + u
    r = requests.get(url)
    ares.append(r)
rows = []
cols = []
for ar in ares:
    soup = BeautifulSoup(ar.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    try:
        dat = tab[0].select('tr')
        line = []
        header = []
        for d in dat:
            row = d.select('td')
            line.append(row[1].text)
            new_header = row[0].text
            if new_header not in cols:
                cols.append(new_header)
        rows.append(line)
    except IndexError:
        continue

print(rows)  # this works fine and prints the rows; the issue comes from the next line
data_to_use = pd.DataFrame(rows, columns=cols)
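The shape mismatch in `pd.DataFrame(rows, columns=cols)` typically arises because `cols` accumulates headers across *all* pages while each `line` holds only one page's values; as soon as any page exposes a different set of fields, the lengths diverge. A more robust pattern (a sketch, assuming the same (header, value) pairs have been scraped; the sample data is mine) is to build one dict per page and let pandas align the columns:

```python
import pandas as pd

# Simulated scrape results: one list of (header, value) pairs per page.
# Pages may expose different subsets of fields.
pages = [
    [('Website Address', 'Bbc.co.uk'), ('Blacklist Status', '0/35'), ('City', 'London')],
    [('Website Address', 'Who.int'), ('Blacklist Status', '0/35')],  # no 'City' row
]

records = [dict(pairs) for pairs in pages]
df = pd.DataFrame(records)  # pandas aligns columns, filling gaps with NaN

print(df.columns.tolist())  # ['Website Address', 'Blacklist Status', 'City']
print(len(df))              # 2
```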
Unfortunately, something goes wrong in the steps above: I get no results, only [], or an error from data_to_use = pd.DataFrame(rows, columns=cols).
My expected output would be:
URL Website Address Last Analysis Blacklist Status \
bbc.co.uk Bbc.co.uk 9 days ago 0/35
stackoverflow.com Stackoverflow.com 7 days ago 0/35
Domain Registration IP Address Server Location ...
1996-08-01 | 24 years ago   151.101.64.81    (US) United States  ...
2003-12-26 | 17 years ago ...
Finally, I should save the created dataset to a csv file.

You can do this using pandas alone. Try the code below:
urllist = ['bbc.co.uk', 'stackoverflow.com', 'who.int', 'cnn.com']
dffinal = pd.DataFrame()
for url in urllist:
    df = pd.read_html("https://www.urlvoid.com/scan/" + url + "/")[0]
    values = df.values.tolist()  # renamed from `list`, which shadows the builtin
    rows = []
    cols = []
    for li in values:
        rows.append(li[1])
        cols.append(li[0])
    df1 = pd.DataFrame([rows], columns=cols)
    dffinal = dffinal.append(df1, ignore_index=True)  # use pd.concat in pandas >= 2.0
print(dffinal)
dffinal.to_csv("domain.csv", index=False)
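One caveat about the loop above: `DataFrame.append` was deprecated and removed in pandas 2.0. The current idiom is to collect the per-URL frames in a list and concatenate once at the end; a minimal sketch with placeholder data (no network access, my own sample values):

```python
import pandas as pd

# Collect one single-row frame per item, then concatenate once.
frames = []
for name, value in (('bbc.co.uk', '0/35'), ('who.int', '0/35')):
    frames.append(pd.DataFrame([[name, value]],
                               columns=['Website Address', 'Blacklist Status']))

dffinal = pd.concat(frames, ignore_index=True)
print(len(dffinal))  # 2
```

Concatenating once is also faster than appending inside the loop, since each `append` copied the whole accumulated frame.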
CSV snapshot: (screenshot omitted)
Updated with a try..except block, since some URLs return no data:
urllist=['gov.ie','','who.int', 'comune.staranzano.go.it', 'cooptrasportiriolo.it', 'laprovinciadicomo.it', 'asufc.sanita.fvg.it', 'canale7.tv', 'gradenigo.it', 'leggo.it', 'urbanpost.it', 'monitorimmobiliare.it', 'comune.villachiara.bs.it', 'ilcittadinomb.it', 'europamulticlub.com']
dffinal = pd.DataFrame()
for url in urllist:
    try:
        df = pd.read_html("https://www.urlvoid.com/scan/" + url + "/")[0]
        values = df.values.tolist()
        rows = []
        cols = []
        for li in values:
            rows.append(li[1])
            cols.append(li[0])
        df1 = pd.DataFrame([rows], columns=cols)
        dffinal = dffinal.append(df1, ignore_index=True)  # use pd.concat in pandas >= 2.0
    except ValueError:  # read_html raises ValueError when no tables are found
        continue
print(dffinal)
dffinal.to_csv("domain.csv", index=False)
Console:
Website Address ... Region
0 Gov.ie ... Dublin
1 Who.int ... Geneva
2 Comune.staranzano.go.it ... Unknown
3 Cooptrasportiriolo.it ... Unknown
4 Laprovinciadicomo.it ... Unknown
5 Canale7.tv ... Unknown
6 Leggo.it ... Milan
7 Urbanpost.it ... Ile-de-France
8 Monitorimmobiliare.it ... Unknown
9 Comune.villachiara.bs.it ... Unknown
10 Ilcittadinomb.it ... Unknown
[11 rows x 12 columns]
Just to add to @KunduK's solution: you can compress part of the code using pandas' .T (transpose). So you can turn this part:
df = pd.read_html("https://www.urlvoid.com/scan/" + url + "/")[0]
values = df.values.tolist()
rows = []
cols = []
for li in values:
    rows.append(li[1])
    cols.append(li[0])
df1 = pd.DataFrame([rows], columns=cols)
dffinal = dffinal.append(df1, ignore_index=True)
into just:
df=pd.read_html("https://www.urlvoid.com/scan/" + url + "/")[0].set_index(0).T
dffinal = dffinal.append(df, ignore_index=True)
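The equivalence is easy to check offline with a toy two-column table shaped like the ones `read_html` returns for this page (column 0 = field name, column 1 = value; the sample values are mine):

```python
import pandas as pd

# A two-column "key/value" table, as pd.read_html returns for the scan page.
raw = pd.DataFrame([['Website Address', 'Bbc.co.uk'],
                    ['Blacklist Status', '0/35']])

wide = raw.set_index(0).T  # field names become columns, values become one row

print(wide.columns.tolist())  # ['Website Address', 'Blacklist Status']
print(wide.iloc[0].tolist())  # ['Bbc.co.uk', '0/35']
```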
Leaving the csv conversion aside, let's do this:
import requests

urls = ['gov.ie', 'who.int', 'comune.staranzano.go.it', 'cooptrasportiriolo.it', 'laprovinciadicomo.it', 'asufc.sanita.fvg.it', 'canale7.tv', 'gradenigo.it', 'leggo.it', 'urbanpost.it', 'monitorimmobiliare.it', 'comune.villachiara.bs.it', 'ilcittadinomb.it', 'europamulticlub.com']
ares = []
for u in urls:
    url = 'https://www.urlvoid.com/scan/' + u
    r = requests.get(url)
    ares.append(r)
Note that 3 of these URLs have no data, so there should be only 11 rows in the dataframe.
Next comes the parsing step (the code appears further below). The output of my_df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 12 columns):
Website Address 11 non-null object
Last Analysis 11 non-null object
Blacklist Status 11 non-null object
Domain Registration 11 non-null object
Domain Information 11 non-null object
IP Address 11 non-null object
Reverse DNS 11 non-null object
ASN 11 non-null object
Server Location 11 non-null object
Latitude\Longitude 11 non-null object
City 11 non-null object
Region 11 non-null object
dtypes: object(12)
memory usage: 1.2+ KB
Comments:
- Many thanks, dear KunduK, for this convincing solution. To double-check I also tried another list: urllist=['gov.ie','who.int','comune.staranzano.go.it','cooptrasportiriolo.it'], and I got the following error: ValueError: No tables found. Could you check whether it works for you? Thanks.
- @val: yes, I get all the values.
- @val: I don't see any problem on my side. You will have to check on yours; I can only guide you on how to solve it.
- Thanks, KunduK, I will keep checking. Maybe it is a bug, because with the initial dataset it works fine. Thank you very much, you have already done a lot.
from bs4 import BeautifulSoup as bs

rows = []
cols = []
for ar in ares:
    soup = bs(ar.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    if len(tab) > 0:
        dat = tab[0].select('tr')
        line = []
        header = []
        for d in dat:
            row = d.select('td')
            line.append(row[1].text)
            new_header = row[0].text
            if new_header not in cols:
                cols.append(new_header)
        rows.append(line)

my_df = pd.DataFrame(rows, columns=cols)
my_df.info()