Python: getting text from URLs returns an empty data frame


I am trying to get all the paragraphs from several websites using a for loop, but I end up with an empty data frame. The logic of the program is:

urls = []
texts = []

for r in my_list:
    try:
        # Get text
        url = urllib.urlopen(r)
        content = url.read()
        soup = BeautifulSoup(content, 'lxml')
        # Find all of the text between paragraph tags and strip out the html
        page = soup.find('p').getText()
        texts.append(page)
        urls.append(r)
    except Exception as e:
        print(e)
        continue

df = pd.DataFrame({"Urls": urls, "Texts": texts})

An example of a URL (in my_list) might be:

How can I correctly store the links and the text of each specific page (so without crawling the whole website)?

Expected output:

Urls                                                       Texts

https://www.ford.com.au/performance/mustang/         Nothing else offers the unique combination of classic style and exhilarating performance quite like the Ford Mustang. Whether it’s the Fastback or Convertible, 5.0L V8 or High Performance 2.3L, the Mustang has a heritage few other cars can match.
https://soperth.com.au/perths-best-fish-and-chips-46154 
https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html 
https://www.bbc.co.uk/programmes/b07d2wy4 
In Texts, for each URL I should have the paragraphs (i.e. all the <p> elements) contained in that page. Even pseudocode (so not exactly a fix of my code) would be very helpful for understanding where my error is. I think my current mistake may be at this step:

url = urllib.urlopen(r)

since I am getting no text back.
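Independently of the request itself, the loop has a second issue: soup.find('p') returns only the first paragraph, so even a successful fetch would store one <p>, not all of them. A minimal sketch with an inline HTML string (using the built-in html.parser, so no lxml is needed):

```python
from bs4 import BeautifulSoup

# Small inline document standing in for a fetched page.
html = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find() stops at the FIRST match...
first_only = soup.find("p").get_text()

# ...while find_all() returns every <p>, which can then be joined.
all_paras = " ".join(p.get_text() for p in soup.find_all("p"))

print(first_only)  # First paragraph.
print(all_paras)   # First paragraph. Second paragraph.
```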

I tried the following code (Python 3, hence urllib.request) and it works. A User-Agent header was added, since a plain urlopen call was hanging.

import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

urls = []
texts = []
my_list = ["https://www.ford.com.au/performance/mustang/", "https://soperth.com.au/perths-best-fish-and-chips-46154",
           "https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html", "https://www.bbc.co.uk/programmes/b07d2wy4"]

for r in my_list:
    try:
        # Get text
        req = urllib.request.Request(
            r,
            data=None,
            headers={
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
            }
        )
        url = urllib.request.urlopen(req)
        content = url.read()
        soup = BeautifulSoup(content, 'lxml')

        # Find all of the text between paragraph tags and strip out the html
        page = ''
        for para in soup.find_all('p'):
            page += para.get_text()
        print(page)
        texts.append(page)
        urls.append(r)
    except Exception as e:
        print(e)
        continue

df = pd.DataFrame({"Urls": urls, "Texts": texts})
print(df)
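As a hedged alternative sketch (not part of the original answer): if BeautifulSoup or lxml is unavailable, the same paragraph extraction can be done with the standard library's html.parser module alone:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects the text of every <p> element (stdlib-only sketch)."""

    def __init__(self):
        super().__init__()
        self.in_p = False      # True while between <p> and </p>
        self.paragraphs = []   # collected paragraph texts
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            self.paragraphs.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.in_p:
            self._buf.append(data)

parser = ParagraphExtractor()
parser.feed("<html><body><p>Hello</p><div>skip</div><p>world</p></body></html>")
print(parser.paragraphs)  # ['Hello', 'world']
```

This is less robust than BeautifulSoup (no tolerance for badly nested markup), but it avoids the third-party dependency entirely.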

Comments:

Any output? Please see how to ask and how to create a minimal reproducible example. There is no question in this post. – Peter

Hi Peter. No output: just Urls and Texts as headers, with no rows. I included the part of the code that creates the data frame holding the links and the paragraphs, and I have edited the question: it is about storing the links and texts correctly. I will add an example of the expected output.

If you want to find all paragraphs, you should use the find_all method, not find. – BhavyaParikh

Right, @BhavyaParikh. But even if I replace find with find_all, I still get an empty data frame.

Thanks @SubhashR. So you think the problem was in urllib.request and the User-Agent?

I believe it was the User-Agent issue.
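The empty data frame described in the comments follows directly from the except branch: when every iteration raises, nothing is appended to either list, and a DataFrame built from two empty lists has the headers but no rows. A minimal sketch (the failing fetch is simulated here, not a real request):

```python
import pandas as pd

urls = []
texts = []

def fetch(r):
    # Stand-in for the failing urllib call: always raises, the way a
    # blocked request without a User-Agent might.
    raise OSError("request blocked")

for r in ["https://example.invalid/one", "https://example.invalid/two"]:
    try:
        page = fetch(r)
        texts.append(page)
        urls.append(r)
    except Exception:
        continue  # the exception is swallowed and nothing was appended

df = pd.DataFrame({"Urls": urls, "Texts": texts})
print(df.empty)          # True: the columns exist, but there are no rows
print(list(df.columns))  # ['Urls', 'Texts']
```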