
Python: Creating a DataFrame from scraped content


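I need to create a dataset that shows a list of URLs and their images. The dataset should have two columns and as many rows as there are links:

Links Images
The code for scraping the images from the websites is the following: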

import requests
from bs4 import BeautifulSoup

list_images = []

for link in list_websites:
    res = requests.get(link)
    bs = BeautifulSoup(res.text, 'html.parser')
    images = bs.find_all('img')

    for image in images:
        list_images.append(image['src'])
To test the code, I am using the following list of websites: list_websites =
["http://news.m.istella.it/cluster?originalClust…","https://www.optimagazine.com/2020/03/25/","https://www.playhitmusic.it/2020/03/","https://www.zazoom.it/2020-03-26/","https://oggiscienza.it/2015/11/17/","https://www.msn.com/it-it/video/amici/italias-...","https://www.quotidiano.net"]

I tried using df['name_col'] = ..., but without success (the DataFrame is empty).


Could you tell me what is wrong with this approach?

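You can append a tuple containing the website link and the image information to your list_images, and then create a DataFrame from that list of values.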

import requests
from bs4 import BeautifulSoup
import pandas as pd

list_images = []

for link in list_websites:
    res = requests.get(link)
    bs = BeautifulSoup(res.text, 'html.parser')
    images = bs.find_all('img')

    # if you want all image links for a website in one row
    list_images.append((link, [image['src'] for image in images]))

    # or, if you want one row per (link, image URL) pair
    # for image in images:
    #     list_images.append((link, image['src']))

df = pd.DataFrame(list_images, columns = ['Link', 'Images'])
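
One caveat about the loop above: image['src'] raises a KeyError for any <img> tag that has no src attribute, which can happen with lazily loaded images. The sketch below is only an illustration on made-up inline HTML (the markup and paths are assumptions, not taken from the sites in the question); it uses Tag.get(), which returns None for a missing attribute, and then drops the empty entries.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for res.text (made up for illustration)
html = """
<html><body>
  <img src="/img/logo.png">
  <img data-src="/img/lazy.png">
  <img src="https://example.com/photo.jpg">
</body></html>
"""

bs = BeautifulSoup(html, "html.parser")

# Tag.get() returns None instead of raising KeyError when 'src' is missing
srcs = [img.get("src") for img in bs.find_all("img")]

# Keep only the tags that actually carried a src value
srcs = [s for s in srcs if s]

print(srcs)
```

In the scraping loop, the list comprehension over image['src'] could be swapped for this filtered list if some pages turn out to have such tags.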

Here is a solution to your problem: you forgot to initialize the DataFrame.

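Thank you very much, @lux7. May I ask whether it works the same way for an already existing DataFrame (one that already has a Link column), where I only want to add a new column 'Images'?

If I understand correctly, you want to do the following: df['Images'] = [image['src'] for image in images]? You need to pass an index that identifies the row of the existing DataFrame to which you want to assign the new list of values.

Thanks, lux7. So I need to replace list_images.append(...) with a new column (df['images']), don't I?

Suppose you have a DataFrame with a column 'Link'; then every row has an index. Now you have to add a new empty column, e.g. df['url'] = np.nan, and then assign the list of images using the index of the row, e.g. df['url'].iloc[0] = [image['src'] for image in images]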
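
The df['url'].iloc[0] = ... form suggested in the comments is a chained assignment, which pandas does not guarantee will modify the original DataFrame. As a hedged alternative, here is a minimal sketch that collects the images in a dict keyed by link and then fills the new column in one step with Series.map(). The URLs, file names and the images_by_link name are hypothetical; in practice the dict would be filled inside the requests/BeautifulSoup loop.

```python
import pandas as pd

# Existing DataFrame that already has a 'Link' column (made-up URLs)
df = pd.DataFrame({"Link": ["https://example.com/a", "https://example.com/b"]})

# Hypothetical scraping results keyed by link; in a real run this dict
# would be populated inside the scraping loop
images_by_link = {
    "https://example.com/a": ["a1.png", "a2.png"],
    "https://example.com/b": ["b1.png"],
}

# map() looks up each link in the dict, so every row gets its own list;
# links missing from the dict would end up as NaN
df["Images"] = df["Link"].map(images_by_link)

print(df)
```

Because the lookup is by link rather than by position, the rows of the existing DataFrame do not need to be in the same order as the scraping loop.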
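import pandas as pd

frames = []
for url in ['url', 'url2', 'url3']:
    list_img = ['img1', 'img2', 'img']  # result of your request for this url
    df_image = pd.DataFrame({'img': list_img})
    df_image['url'] = url
    frames.append(df_image)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
output = pd.concat(frames, ignore_index=True)
output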
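
The one-row-per-image layout built above can also be derived from the one-row-per-link layout of the first answer. The following is a minimal sketch using DataFrame.explode(); the tuples are made-up stand-ins for scraped results, not real output from the sites in the question.

```python
import pandas as pd

# Hypothetical scraped results: one (link, list_of_image_srcs) tuple per site
list_images = [
    ("https://example.com/a", ["a1.png", "a2.png"]),
    ("https://example.com/b", []),  # a page with no <img> tags
    ("https://example.com/c", ["c1.jpg"]),
]

# Wide layout: one row per link, each 'Images' cell holds a list
df = pd.DataFrame(list_images, columns=["Link", "Images"])

# Long layout: one row per (link, image) pair; a link with an empty
# list is kept as a single row whose 'Images' value is NaN
df_long = df.explode("Images", ignore_index=True)

print(df)
print(df_long)
```

Keeping the wide frame satisfies the "as many rows as links" requirement from the question, while the long frame is easier to filter or join on individual image URLs.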