Pandas Webscraping:抓取页面并将内容存储在数据框中

Pandas Webscraping:抓取页面并将内容存储在数据框中,pandas,dataframe,web-scraping,beautifulsoup,Pandas,Dataframe,Web Scraping,Beautifulsoup,以下代码可用于为三个给定的示例URL重现web抓取任务: 代码: import pandas as pd import requests import urllib.request from bs4 import BeautifulSoup # Would otherwise load a csv file with 100+ urls into a DataFrame # Example data: links = {'url': ['https://www.apple.com/educat

以下代码可用于为三个给定的示例URL重现web抓取任务:

代码:

import pandas as pd
import requests
import urllib.request
from bs4 import BeautifulSoup

# Would otherwise load a csv file with 100+ urls into a DataFrame
# Example data:
links = {'url': ['https://www.apple.com/education/', 'https://www.apple.com/business/', 'https://www.apple.com/environment/']}
urls = pd.DataFrame(data=links)

def scrape_content(url):

    r = requests.get(url)
    html = r.content
    soup = BeautifulSoup(html,"lxml")

    # Get page title
    title = soup.find("meta",attrs={"property":"og:title"})["content"].strip()
    # Get content from paragraphs
    content = soup.find("div", {"class":"section-content"}).find_all('p')

    print(title)

    for p in content:
        p = p.get_text(strip=True)
        print(p)
Education
Every child is born full of creativity. Nurturing it is one of the most important things educators do. Creativity makes your students better communicators and problem solvers. It prepares them to thrive in today’s world — and to shape tomorrow’s. For 40 years, Apple has helped teachers unleash the creative potential in every student. And today, we do that in more ways than ever. Not only with powerful products, but also with tools, inspiration, and curricula to help you create magical learning experiences.
Watch the keynote
Business
Apple products have always been designed for the way we work as much as for the way we live. Today they help employees to work more simply and productively, solve problems creatively, and collaborate with a shared purpose. And they’re all designed to work together beautifully. When people have access to iPhone, iPad, and Mac, they can do their best work and reimagine the future of their business.
Environment
We strive to create products that are the best in the world and the best for the world. And we continue to make progress toward our environmental priorities. Like powering all Apple facilities worldwide with 100% renewable energy. Creating the next innovation in recycling with Daisy, our newest disassembly robot. And leading the industry in making our materials safer for people and for the earth. In every product we make, in every innovation we create, our goal is to leave the planet better than we found it. Read the 2018 Progress Report

0    None
1    None
2    None
Name: url, dtype: object
对每个url应用刮取:

urls['url'].apply(scrape_content)
Out:

import pandas as pd
import requests
import urllib.request
from bs4 import BeautifulSoup

# Would otherwise load a csv file with 100+ urls into a DataFrame
# Example data:
links = {'url': ['https://www.apple.com/education/', 'https://www.apple.com/business/', 'https://www.apple.com/environment/']}
urls = pd.DataFrame(data=links)

def scrape_content(url):

    r = requests.get(url)
    html = r.content
    soup = BeautifulSoup(html,"lxml")

    # Get page title
    title = soup.find("meta",attrs={"property":"og:title"})["content"].strip()
    # Get content from paragraphs
    content = soup.find("div", {"class":"section-content"}).find_all('p')

    print(title)

    for p in content:
        p = p.get_text(strip=True)
        print(p)
Education
Every child is born full of creativity. Nurturing it is one of the most important things educators do. Creativity makes your students better communicators and problem solvers. It prepares them to thrive in today’s world — and to shape tomorrow’s. For 40 years, Apple has helped teachers unleash the creative potential in every student. And today, we do that in more ways than ever. Not only with powerful products, but also with tools, inspiration, and curricula to help you create magical learning experiences.
Watch the keynote
Business
Apple products have always been designed for the way we work as much as for the way we live. Today they help employees to work more simply and productively, solve problems creatively, and collaborate with a shared purpose. And they’re all designed to work together beautifully. When people have access to iPhone, iPad, and Mac, they can do their best work and reimagine the future of their business.
Environment
We strive to create products that are the best in the world and the best for the world. And we continue to make progress toward our environmental priorities. Like powering all Apple facilities worldwide with 100% renewable energy. Creating the next innovation in recycling with Daisy, our newest disassembly robot. And leading the industry in making our materials safer for people and for the earth. In every product we make, in every innovation we create, our goal is to leave the planet better than we found it. Read the 2018 Progress Report

0    None
1    None
2    None
Name: url, dtype: object
问题:

import pandas as pd
import requests
import urllib.request
from bs4 import BeautifulSoup

# Would otherwise load a csv file with 100+ urls into a DataFrame
# Example data:
links = {'url': ['https://www.apple.com/education/', 'https://www.apple.com/business/', 'https://www.apple.com/environment/']}
urls = pd.DataFrame(data=links)

def scrape_content(url):

    r = requests.get(url)
    html = r.content
    soup = BeautifulSoup(html,"lxml")

    # Get page title
    title = soup.find("meta",attrs={"property":"og:title"})["content"].strip()
    # Get content from paragraphs
    content = soup.find("div", {"class":"section-content"}).find_all('p')

    print(title)

    for p in content:
        p = p.get_text(strip=True)
        print(p)
Education
Every child is born full of creativity. Nurturing it is one of the most important things educators do. Creativity makes your students better communicators and problem solvers. It prepares them to thrive in today’s world — and to shape tomorrow’s. For 40 years, Apple has helped teachers unleash the creative potential in every student. And today, we do that in more ways than ever. Not only with powerful products, but also with tools, inspiration, and curricula to help you create magical learning experiences.
Watch the keynote
Business
Apple products have always been designed for the way we work as much as for the way we live. Today they help employees to work more simply and productively, solve problems creatively, and collaborate with a shared purpose. And they’re all designed to work together beautifully. When people have access to iPhone, iPad, and Mac, they can do their best work and reimagine the future of their business.
Environment
We strive to create products that are the best in the world and the best for the world. And we continue to make progress toward our environmental priorities. Like powering all Apple facilities worldwide with 100% renewable energy. Creating the next innovation in recycling with Daisy, our newest disassembly robot. And leading the industry in making our materials safer for people and for the earth. In every product we make, in every innovation we create, our goal is to leave the planet better than we found it. Read the 2018 Progress Report

0    None
1    None
2    None
Name: url, dtype: object
  • 代码当前仅输出每页第一段的内容。我喜欢获取给定选择器中每个
    p
    的数据
  • 对于最后的数据,我需要一个包含url、标题和内容的数据框架。因此,我想知道如何将收集到的信息写入数据框

  • 谢谢您的帮助。

    您的问题在这一行:

    content = soup.find("div", {"class":"section-content"}).find_all('p')
    
    find\u all()。因此,您将在第一个
    div.section\u content
    中获得所有
    标记。对于您的用例,正确的标准是什么还不清楚,但是如果您只是想要所有的
    标记,您可以使用:

    content = soup.find_all('p')
    
    然后,您可以创建
    scrape_url()
    合并
    标记文本,并将其与标题一起返回:

    content = '\r'.join([p.get_text(strip=True) for p in content])
    return title, content
    
    在函数外部,您可以构建数据帧:

    url_list = urls['url'].tolist()
    results = [scrape_url(url) for url in url_list]
    title_list = [r[0] for r in results]
    content_list = [r[1] for r in results]
    df = pd.DataFrame({'url': url_list, 'title': title_list, 'content': content_list})
    

    标准是我希望所有的p都在节内容中。如果我扩展到所有p元素,它也会刮去页脚和一些其他不需要的元素。我也尝试了代码,但它返回了
    MissingSchema:Invalid URL'URL':没有提供架构。也许你的意思是http://url?
    。问题必须出现在
    results=[scrape_content(url)for url in url]
    网页上的
    尝试
    查看源代码
    ——只有一个
    部分内容
    div,您已经获得了其中的所有段落。此外,我还修改了代码,使之与存储URL的方式相匹配。啊,太好了,这一个很有效。最后一个问题:在我自己的url示例中,内容有\n和\xa0。我怎样才能摆脱它?这些是unicode字符-请看,我可以通过添加
    content=content.replace(u“\n”,”)