Python: scraping URLs from the References section of a Wikipedia page (python, python-3.x, beautifulsoup, bs4)


I am trying to write a program that scrapes the URLs from the References section of a Wikipedia page, but I am having trouble isolating that tag/class.

## Import required packages ##
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re

selectWikiPage = input(print("Please enter the Wikipedia page you wish to scrape from"))
isWikiFound = re.findall(selectWikiPage, 'wikipedia')
if "wikipedia" in selectWikiPage:
    print("Input accepted")
    html = urlopen(selectWikiPage)
    bsObj = BeautifulSoup(html, "lxml")
    findReferences = bsObj.findAll("#References")
    for wikiReferences in findReferences:
        print(wikiReferences.get_text())

else:
    print("Error: Please enter a valid Wikipedia URL")
This is the program's output:

    Please enter the Wikipedia page you wish to scrape from
Nonehttp://wikipedia.org/wiki/randomness
Input accepted

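I changed your code slightly to use the requests library.

I used the Wikipedia page entered in the output below as a test case.

If you only want to retrieve the links that the wiki page cites as sources for its text, do the following: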

import requests
from bs4 import BeautifulSoup

session = requests.Session()
# print() returns None, which input() then shows as its prompt;
# that is the stray "None" in the output.
selectWikiPage = input(print("Please enter the Wikipedia page you wish to scrape from"))

if "wikipedia" in selectWikiPage:
    # Fetch the page with GET; nothing is being submitted, so POST is not needed
    html = session.get(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    # The text of each citation sits in <span class="reference-text">
    findReferences = bsObj.findAll('span', {'class': 'reference-text'})
    href = BeautifulSoup(str(findReferences), "html.parser")
    # Collect the href of every anchor inside those spans
    links = [a["href"] for a in href.find_all("a", href=True)]
    for link in links:
        print("Link: " + link)
else:
    print("Error: Please enter a valid Wikipedia URL")
Output:

Please enter the Wikipedia page you wish to scrape from
Nonehttps://en.wikipedia.org/wiki/Randomness
Link: /wiki/Oxford_English_Dictionary
Link: http://www.people.fas.harvard.edu/~junliu/Workshops/workshop2007/
Link: /wiki/International_Standard_Book_Number_(identifier)
Link: /wiki/Special:BookSources/0-19-512332-8
Link: /wiki/International_Standard_Book_Number_(identifier)
Link: /wiki/Special:BookSources/0-674-01517-7
Link: /wiki/International_Standard_Book_Number_(identifier)
Link: /wiki/Special:BookSources/0-387-98844-0
Link: http://www.nature.com/nature/journal/v446/n7138/abs/nature05677.html
Link: /w/index.php?title=Bell%27s_aspect_experiment&action=edit&redlink=1
Link: /wiki/Nature_(journal)
Link: /wiki/John_Gribbin
Link: https://www.academia.edu/11720588/No_entailing_laws_but_enablement_in_the_evolution_of_the_biosphere
Link: /wiki/International_Standard_Book_Number
Link: /wiki/Special:BookSources/9781450311786
Link: /wiki/Digital_object_identifier
Link: //doi.org/10.1145%2F2330784.2330946
Link: https://www.academia.edu/11720575/Extended_criticality_phase_spaces_and_enablement_in_biology
Link: /wiki/Digital_object_identifier
Link: //doi.org/10.1016%2Fj.chaos.2013.03.008
Link: /wiki/PubMed_Identifier
Link: //www.ncbi.nlm.nih.gov/pubmed/7059501
Link: /wiki/Digital_object_identifier
Link: //doi.org/10.1111%2Fj.1365-2133.1982.tb00897.x
Link: http://webpages.uncc.edu/yonwang/papers/thesis.pdf
Link: http://www.lbl.gov/Science-Articles/Archive/pi-random.html
Link: http://www.ciphersbyritter.com/RES/RANDTEST.HTM
Link: http://dx.doi.org/10.1038/nature09008
Link: https://www.nytimes.com/2008/06/08/books/review/Johnson-G-t.html?_r=1
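The list above mixes absolute URLs, root-relative paths such as /wiki/..., and protocol-relative links such as //doi.org/.... If only the external sources are wanted, one possible follow-up step is to resolve every href against the page URL and drop anything hosted on Wikipedia itself. The sketch below uses only the standard library; the helper names to_absolute and external_only, and the "wikipedia.org" suffix check, are assumptions made for illustration and are not part of the answer.

```python
from urllib.parse import urljoin, urlparse


def to_absolute(page_url, hrefs):
    """Resolve each scraped href against the page it came from."""
    return [urljoin(page_url, h) for h in hrefs]


def external_only(urls, site_suffix="wikipedia.org"):
    """Keep URLs whose host is not the given site (suffix is an assumption)."""
    kept = []
    for u in urls:
        host = urlparse(u).hostname or ""
        if host != site_suffix and not host.endswith("." + site_suffix):
            kept.append(u)
    return kept


page = "https://en.wikipedia.org/wiki/Randomness"
# A few hrefs copied from the output above
sample = [
    "/wiki/Oxford_English_Dictionary",
    "//doi.org/10.1145%2F2330784.2330946",
    "http://www.ciphersbyritter.com/RES/RANDTEST.HTM",
]

absolute = to_absolute(page, sample)
for url in external_only(absolute):
    print("Link: " + url)
```

urljoin fills in the scheme for protocol-relative links and the scheme plus host for root-relative ones, so the filter can then compare hostnames instead of guessing from string prefixes.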
If you want to retrieve every URL link in the references list:

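import requests
from bs4 import BeautifulSoup

session = requests.Session()
# print() returns None, which input() then shows as its prompt;
# that is the stray "None" in the output.
selectWikiPage = input(print("Please enter the Wikipedia page you wish to scrape from"))

if "wikipedia" in selectWikiPage:
    # Fetch the page with GET; nothing is being submitted, so POST is not needed
    html = session.get(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    # The whole reference list is an <ol class="references">
    findReferences = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(findReferences), "html.parser")
    # Every anchor in the list, including the "#cite_ref-..." backlinks
    links = [a["href"] for a in href.find_all("a", href=True)]
    for link in links:
        print("Link: " + link)
else:
    print("Error: Please enter a valid Wikipedia URL")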
Output:

Please enter the Wikipedia page you wish to scrape from
Nonehttps://en.wikipedia.org/wiki/Randomness
Link: #cite_ref-1
Link: /wiki/Oxford_English_Dictionary
Link: #cite_ref-2
Link: http://www.people.fas.harvard.edu/~junliu/Workshops/workshop2007/
Link: #cite_ref-3
Link: /wiki/International_Standard_Book_Number_(identifier)
Link: /wiki/Special:BookSources/0-19-512332-8
Link: #cite_ref-4
Link: /wiki/International_Standard_Book_Number_(identifier)
Link: /wiki/Special:BookSources/0-674-01517-7
Link: #cite_ref-5
Link: /wiki/International_Standard_Book_Number_(identifier)
Link: /wiki/Special:BookSources/0-387-98844-0
Link: #cite_ref-6
Link: http://www.nature.com/nature/journal/v446/n7138/abs/nature05677.html
Link: /w/index.php?title=Bell%27s_aspect_experiment&action=edit&redlink=1
Link: /wiki/Nature_(journal)
Link: #cite_ref-7
Link: /wiki/John_Gribbin
Link: #cite_ref-8
Link: https://www.academia.edu/11720588/No_entailing_laws_but_enablement_in_the_evolution_of_the_biosphere
Link: /wiki/International_Standard_Book_Number
Link: /wiki/Special:BookSources/9781450311786
Link: /wiki/Digital_object_identifier
Link: //doi.org/10.1145%2F2330784.2330946
Link: #cite_ref-9
Link: https://www.academia.edu/11720575/Extended_criticality_phase_spaces_and_enablement_in_biology
Link: /wiki/Digital_object_identifier
Link: //doi.org/10.1016%2Fj.chaos.2013.03.008
Link: #cite_ref-10
Link: /wiki/PubMed_Identifier
Link: //www.ncbi.nlm.nih.gov/pubmed/7059501
Link: /wiki/Digital_object_identifier
Link: //doi.org/10.1111%2Fj.1365-2133.1982.tb00897.x
Link: #cite_ref-11
Link: http://webpages.uncc.edu/yonwang/papers/thesis.pdf
Link: #cite_ref-12
Link: http://www.lbl.gov/Science-Articles/Archive/pi-random.html
Link: #cite_ref-13
Link: #cite_ref-14
Link: #cite_ref-15
Link: http://www.ciphersbyritter.com/RES/RANDTEST.HTM
Link: #cite_ref-16
Link: http://dx.doi.org/10.1038/nature09008
Link: #cite_ref-NYOdds_17-0
Link: #cite_ref-NYOdds_17-1
Link: https://www.nytimes.com/2008/06/08/books/review/Johnson-G-t.html?_r=1
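Searching the whole reference list also returns the "#cite_ref-..." backlinks that jump back into the article, and some targets repeat. If those are not wanted, a small post-processing step can drop them. This is a minimal sketch with a hypothetical helper name, drop_backlinks, applied to a few hrefs copied from the output above.

```python
def drop_backlinks(hrefs):
    """Remove same-page fragment links and repeated hrefs, keeping order."""
    seen = set()
    result = []
    for h in hrefs:
        # "#cite_ref-..." style links only point back into the article
        if h.startswith("#") or h in seen:
            continue
        seen.add(h)
        result.append(h)
    return result


sample = [
    "#cite_ref-3",
    "/wiki/International_Standard_Book_Number_(identifier)",
    "/wiki/Special:BookSources/0-19-512332-8",
    "#cite_ref-4",
    "/wiki/International_Standard_Book_Number_(identifier)",
    "/wiki/Special:BookSources/0-674-01517-7",
]

cleaned = drop_backlinks(sample)
for link in cleaned:
    print("Link: " + link)
```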


Your findAll returns nothing. One approach is to select the references section first and then search within it, i.e.
bsObj.find("ol", {"class": "references"}).findAll('a')
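To see why the original call comes back empty, it may help to run both lookups against a small document. The HTML below is a hypothetical, heavily simplified stand-in for Wikipedia's markup (the real page is more nested), and example.com is a placeholder URL. The first argument of find_all is treated as a tag name, so "#References" matches nothing, whereas selecting the ol first and then searching inside it finds the anchors.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a Wikipedia references list
html = """
<h2 id="References">References</h2>
<ol class="references">
  <li id="cite_note-1">
    <a href="#cite_ref-1">^</a>
    <span class="reference-text"><a href="http://example.com/source">Source</a></span>
  </li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all("#References") looks for a tag literally named "#References"
no_match = soup.find_all("#References")

# Select the list first, then search for anchors inside it.
# find() returns None when the list is missing, so guard against that.
ref_list = soup.find("ol", {"class": "references"})
hrefs = [a["href"] for a in ref_list.find_all("a", href=True)] if ref_list else []

print(len(no_match))
print(hrefs)
```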
If the OP only needs all the hyperlinks, there is no need to look at the span; links = [a["href"] for a in soup.find_all("a", href=True)] should be enough.
Hi Tony, thanks for your reply. I just edited my answer so that it fits the scope of the OP's question.