How do I extract quotes and authors from a website using Python?

I wrote the following code to extract quotes from a web page:

#importing python libraries

from bs4 import BeautifulSoup as bs
import pandas as pd
pd.set_option('display.max_colwidth', 500)
import time
import requests
import random
from lxml import html

#collect first page of quotes

page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")

#create a BeautifulSoup object

soup = bs(page.content, 'html.parser')  # BeautifulSoup was imported under the alias bs
soup

print(soup.prettify())

#find all quotes on the page

soup.find_all('ol')

#pull just the quotes and not the superfluous data

Quote=soup.find(id='post-')
Quote_list=Quote.find_all('ol')
Quote_list

At this point, I only want the text shown in the list, without the surrounding HTML tags.

Call .get_text() on each element returned by find_all().

Use quotes[0], quotes[1], ... to get the first, second, and later quotes.
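
As a minimal sketch of the get_text() suggestion, the snippet below parses a small inline HTML string instead of the live page. The sample markup and its contents are hypothetical and only mimic the `ol`/`li` layout the question's code searches for; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page
sample_html = """
<div id="post-">
  <ol>
    <li>First quote.</li>
    <li>Second quote.</li>
  </ol>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns Tag objects, so printing one shows its markup
items = soup.find(id="post-").find_all("li")
print(items[0])

# get_text() keeps only the text inside each tag
quotes = [li.get_text(strip=True) for li in items]
print(quotes[0])
print(quotes[1])
```

Indexing works on the plain list of strings, so quotes[0] and quotes[1] give the first and second entries.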
Even with code that I now believe is correct, 'html.parser' seemed to have some trouble. After switching to the 'lxml' parser, it appears to work:
from bs4 import BeautifulSoup as bs
import requests


page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.content, 'lxml')
quotes = []
post_id = soup.find(id='post-')
ordered_lists = post_id.find_all('ol')
quotes.extend([li.get_text()
               for ordered_list in ordered_lists  # outer loop first
                   for li in ordered_list.find_all('li')  # then the inner loop
              ])
print(len(quotes))
print(quotes[0]) # Get the first quote
print('-' * 80)
print(quotes[-1]) #print last quote

Prints:
22
“By definition all scientists are data scientists. In my opinion, they are half hacker, half analyst, they use data to build products and find insights. It’s Columbus meet Columbo―starry-eyed explorers and skeptical detectives.”
--------------------------------------------------------------------------------
“Once you have a certain amount of math/stats and hacking skills, it is much better to acquire a grounding in one or more subjects than in adding yet another programming language to your hacking skills, or yet another machine learning algorithm to your math/stats portfolio…. Clients will rather work with some data scientist A who understands their specific field than with another data scientist B who first needs to learn the basics―even if B is better in math/stats/hacking.”

Alternate coding
from bs4 import BeautifulSoup as bs
import requests


page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.content, 'lxml')
quotes = []
post_id = soup.find(id='post-')
ordered_lists = post_id.find_all('ol')
for ordered_list in ordered_lists:
    for li in ordered_list.find_all('li'):
        quotes.append(li.get_text())
print(len(quotes))
print(quotes[0]) # Get the first quote
print('-' * 80)
print(quotes[-1]) #print last quote
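
Since the comments report empty results, it may help to separate parsing from downloading so the extraction can be checked on a known input. The following is a rough sketch under assumptions, not the answer's own method: the function name `extract_quotes`, its `container_id` parameter, and the sample HTML are all hypothetical. It uses 'html.parser' only to avoid an extra dependency; the answer above uses 'lxml'.

```python
from bs4 import BeautifulSoup


def extract_quotes(html_text, container_id="post-"):
    """Return the text of every <li> in the <ol> lists under the container.

    Hypothetical helper: the name and the default id are assumptions
    taken from the id used in the answer's code.
    """
    soup = BeautifulSoup(html_text, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        # find() returns None when no element matches the id
        return []
    return [
        li.get_text(strip=True)
        for ordered_list in container.find_all("ol")
        for li in ordered_list.find_all("li")
    ]


# Hypothetical sample input standing in for the downloaded page
sample_html = '<div id="post-"><ol><li>Quote A</li><li>Quote B</li></ol></div>'
print(extract_quotes(sample_html))
print(extract_quotes("<p>no container here</p>"))
```

For the live page, one option is to call page.raise_for_status() after requests.get(...) and then pass page.text to the function, so a failed download raises an error instead of quietly producing an empty list.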

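Does this answer your question?
This doesn't return anything.
The quotes list contains all the quotes on the site! Run it again, it works fine.
I ran it several times... I don't get an error, it's just that nothing happens. I'm not sure whether this is a problem with my Python?
Did you try print(quotes) or print(quotes[0])? Which code editor are you using? IDLE, PyCharm, Atom, Sublime, etc.? This usually doesn't happen, but restart your machine once!
I tried print(quotes) and print(quotes[0]). With print(quotes) I got nothing; with print(quotes[0]) I got back two empty square brackets [].
I am using Jupyter.
I now get ... NameError: name 'ordered_list' is not defined
The code works for me. Are you sure you copied and pasted it correctly? I have included my Jupyter notebook cell and output.
I copied and pasted it directly from the code. I feel something is wrong with my Jupyter notebook... I wrote a bunch of code in the notebook earlier that was pulling results, and now when I rerun it, it only returns []... I added a picture above of what I get when I run your code, so you can see this is not a typing error.
I see what you mean. Something seems to be wrong with your Jupyter notebook, because the line for ordered_list in ordered_lists defines the name ordered_list. Try copying the source code into a ".py" file and running it with the python command. I will also update the answer with an alternate coding.
Any luck running it from the command line or using the alternate coding?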
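
The NameError reported in the comments is what Python raises when the `for` clauses of a nested list comprehension are written inner-first: the clauses are evaluated left to right, in the same order as nested for statements. One possible explanation for the code running in one notebook and not another is that the name already existed from an earlier cell. The sketch below uses a plain list of lists as a hypothetical stand-in for the `ol`/`li` structure; the names `groups`, `group`, and `inner` are illustrative only.

```python
# Hypothetical stand-in for the <ol>/<li> structure: a list of lists
groups = [["a", "b"], ["c"]]

# Outer loop first, inner loop second: same order as nested for statements
flat = [item for group in groups for item in group]
print(flat)

# Reversed clauses: `inner` is read before any loop has bound it
try:
    [item for item in inner for inner in groups]
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__
    print(exc)
```

The explicit nested loops in the alternate coding avoid this ordering question altogether.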