Python web scraping: dataset is empty after collecting the information


I would like to create a dataset that includes information scraped from a website. Below I explain what I did and the expected output. What I get are empty arrays for rows and columns, and then an empty dataset, and I do not understand the reason. I hope you can help me.

1) Create an empty dataframe with only one column: this column should contain the list of urls to use

data_to_use = pd.DataFrame([], columns=['URL'])
2) Select the urls from a previous dataset

select_urls = dataset.URL.tolist()
The set of urls looks like this:

                             URL
0                     www.bbc.co.uk
1             www.stackoverflow.com           
2                       www.who.int
3                       www.cnn.com
4         www.cooptrasportiriolo.it
...                             ...
3) Fill the column with these urls:

data_to_use['URL'] = select_urls
data_to_use['URLcleaned'] = data_to_use['URL'].str.replace(r'^(www\.)', '', regex=True)
4) Select a random sample to test, from column URL:

data_to_use = data_to_use.loc[1:50, 'URL']
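
(An editorial side note, since the comment in step 5 hints at it: .loc[1:50, 'URL'] replaces the dataframe with a Series holding only the raw URL column, so the later data_to_use['URLcleaned'] lookup cannot succeed. A minimal sketch of a selection that keeps the cleaned column, assuming that is what was intended:)

# keep a DataFrame and retain the cleaned column for the scraping step
data_to_use = data_to_use.loc[1:50, ['URL', 'URLcleaned']]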
5) Try to scrape the information

import requests
import time
from bs4 import BeautifulSoup

urls = data_to_use['URLcleaned'].tolist()

ares = []

for u in urls:  # in the selection there should be an error. I am not sure that I am selecting the right column
    print(u)
    url = 'https://www.urlvoid.com/scan/' + u
    r = requests.get(url)
    ares.append(r)

rows = []
cols = []

for ar in ares:
    soup = BeautifulSoup(ar.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    try:
        dat = tab[0].select('tr')
        line = []
        header = []
        for d in dat:
            row = d.select('td')
            line.append(row[1].text)
        new_header = row[0].text
        if new_header not in cols:
            cols.append(new_header)
        rows.append(line)
    except IndexError:
        continue

print(rows) # this works fine. It prints the rows. The issue comes from the next line

data_to_use = pd.DataFrame(rows, columns=cols)
Unfortunately, something goes wrong in the steps above, because I do not get any result, only

[]

and an error from data_to_use = pd.DataFrame(rows, columns=cols).
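
For reference, pd.DataFrame(rows, columns=cols) raises this kind of ValueError whenever the number of column labels does not match the width of the rows, which is what happens above when cols ends up shorter than each scraped line. A minimal reproduction with made-up data:

import pandas as pd

rows = [['Bbc.co.uk', '9 days ago', '0/35']]  # each row has 3 fields
cols = ['Website Address']                    # but only 1 header was collected
pd.DataFrame(rows, columns=cols)              # ValueError: 1 columns passed, passed data had 3 columns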

My expected output would be:

URL                Website Address     Last Analysis   Blacklist Status \
bbc.co.uk          Bbc.co.uk           9 days ago      0/35
stackoverflow.com  Stackoverflow.com   7 days ago      0/35

Domain Registration          IP Address      Server Location     ...
1996-08-01 | 24 years ago    151.101.64.81   (US) United States   ...
2003-12-26 | 17 years ago    ...

Finally, I should save the created dataset to a csv file.
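
A minimal sketch of that last step, assuming the rebuilt dataframe is called data_to_use and the file name is free to choose:

# hypothetical file name; index=False keeps the row index out of the csv
data_to_use.to_csv('scraped_websites.csv', index=False)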

You can do it with pandas alone. Try the code below:

import pandas as pd

urllist = ['bbc.co.uk', 'stackoverflow.com', 'who.int', 'cnn.com']

dffinal = pd.DataFrame()
for url in urllist:
    # read_html returns every table on the page; the first one holds
    # the label/value pairs we want
    df = pd.read_html("https://www.urlvoid.com/scan/" + url + "/")[0]
    values = df.values.tolist()  # avoid shadowing the built-in list
    rows = []
    cols = []
    for li in values:
        rows.append(li[1])  # second column: the value
        cols.append(li[0])  # first column: the label, used as header
    df1 = pd.DataFrame([rows], columns=cols)
    # DataFrame.append was removed in pandas 2.0; use concat instead
    dffinal = pd.concat([dffinal, df1], ignore_index=True)

print(dffinal)
dffinal.to_csv("domain.csv", index=False)
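
One caveat worth noting: pd.read_html needs an HTML parser such as lxml or html5lib installed (for example, pip install lxml); without one it raises ImportError.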
Csv snapshot: (screenshot of the resulting domain.csv omitted)


Updated with a try..except block, because some of the urls do not return data:

urllist=['gov.ie','','who.int', 'comune.staranzano.go.it', 'cooptrasportiriolo.it', 'laprovinciadicomo.it', 'asufc.sanita.fvg.it', 'canale7.tv', 'gradenigo.it', 'leggo.it', 'urbanpost.it', 'monitorimmobiliare.it', 'comune.villachiara.bs.it', 'ilcittadinomb.it', 'europamulticlub.com']

dffinal = pd.DataFrame()
for url in urllist:
    try:
        df = pd.read_html("https://www.urlvoid.com/scan/" + url + "/")[0]
        values = df.values.tolist()
        rows = []
        cols = []
        for li in values:
            rows.append(li[1])
            cols.append(li[0])
        df1 = pd.DataFrame([rows], columns=cols)
        dffinal = pd.concat([dffinal, df1], ignore_index=True)
    # some urls have no table (ValueError) or fail to load entirely
    except Exception:
        continue

print(dffinal)
dffinal.to_csv("domain.csv",index=False)
Console output:

            Website Address  ...         Region
0                     Gov.ie  ...         Dublin
1                    Who.int  ...         Geneva
2    Comune.staranzano.go.it  ...        Unknown
3      Cooptrasportiriolo.it  ...        Unknown
4       Laprovinciadicomo.it  ...        Unknown
5                 Canale7.tv  ...        Unknown
6                   Leggo.it  ...          Milan
7               Urbanpost.it  ...  Ile-de-France
8      Monitorimmobiliare.it  ...        Unknown
9   Comune.villachiara.bs.it  ...        Unknown
10          Ilcittadinomb.it  ...        Unknown

[11 rows x 12 columns]

Just adding to @KunduK's solution: you can condense part of the code using pandas' .T (the transpose).

So you can turn this part:

df = pd.read_html("https://www.urlvoid.com/scan/" + url + "/")[0]
values = df.values.tolist()
rows = []
cols = []
for li in values:
    rows.append(li[1])
    cols.append(li[0])
df1 = pd.DataFrame([rows], columns=cols)
dffinal = pd.concat([dffinal, df1], ignore_index=True)
into this:

df = pd.read_html("https://www.urlvoid.com/scan/" + url + "/")[0].set_index(0).T
dffinal = pd.concat([dffinal, df], ignore_index=True)
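
For context, a minimal sketch (with made-up values) of what set_index(0).T does to the two-column table that pd.read_html returns:

import pandas as pd

# hypothetical raw table: one (label, value) pair per row
raw = pd.DataFrame([['Website Address', 'Bbc.co.uk'],
                    ['Last Analysis', '9 days ago']])
wide = raw.set_index(0).T  # labels become the header, values become one row
print(wide)
# 0 Website Address Last Analysis
# 1       Bbc.co.uk    9 days ago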

Leaving the conversion to csv aside, let's do it like this:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

urls = ['gov.ie', 'who.int', 'comune.staranzano.go.it', 'cooptrasportiriolo.it', 'laprovinciadicomo.it', 'asufc.sanita.fvg.it', 'canale7.tv', 'gradenigo.it', 'leggo.it', 'urbanpost.it', 'monitorimmobiliare.it', 'comune.villachiara.bs.it', 'ilcittadinomb.it', 'europamulticlub.com']
ares = []
for u in urls:
    url = 'https://www.urlvoid.com/scan/' + u
    r = requests.get(url)
    ares.append(r)
Note that 3 of the urls return no data, so there should be only 11 rows in the dataframe. Next:

rows = []
cols = []
for ar in ares:
    soup = bs(ar.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    if len(tab) > 0:
        dat = tab[0].select('tr')
        line = []
        for d in dat:
            row = d.select('td')
            line.append(row[1].text)
            new_header = row[0].text
            if new_header not in cols:
                cols.append(new_header)
        rows.append(line)

my_df = pd.DataFrame(rows, columns=cols)
my_df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 12 columns):
Website Address        11 non-null object
Last Analysis          11 non-null object
Blacklist Status       11 non-null object
Domain Registration    11 non-null object
Domain Information     11 non-null object
IP Address             11 non-null object
Reverse DNS            11 non-null object
ASN                    11 non-null object
Server Location        11 non-null object
Latitude\Longitude     11 non-null object
City                   11 non-null object
Region                 11 non-null object
dtypes: object(12)
memory usage: 1.2+ KB


Many thanks dear KunduK for this convincing solution. To double-check, I also tried another list:
urlist=['gov.ie','who.int','comune.staranzano.go.it','cooptrasportiriolo.it']
, and I got the following error:
ValueError: No tables found. Could you check whether it works for you? Thanks.
@val: yes, I get all the values. I don't see any problem on my side; you will have to check on your end. I can only guide you on how to solve it.
Thanks KunduK, I will keep checking. Maybe it is a bug, because it works fine with the initial dataset. Thank you very much, you have already done a lot.
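
One way to narrow such failures down, sketched under the same assumptions as the code above: catch the exception per url and print which domain produced it, so the problematic entries can be inspected individually:

for url in urlist:
    try:
        df = pd.read_html("https://www.urlvoid.com/scan/" + url + "/")[0]
    except Exception as e:  # e.g. ValueError: No tables found
        print(url, '->', e)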