Parsing URLs with Python


I'm using this code to get a list of URLs. The problem is that I need one column containing the URL and another containing the tag or text.

import requests
from bs4 import BeautifulSoup

getpage= requests.get

getpage_soup= BeautifulSoup(getpage.text, 'html.parser')

all_links= getpage_soup.findAll('a')

for link in all_links:
    print (link)

What I expect is a DataFrame similar to this:

pd.DataFrame({'link': ['https://drive.google.com/file/d/1t1hLPvUkfCde1wglfjAh--r8NpLONbRf/view?usp=sharing'], 'tag': ['Estatal 2020']})

Taking the first example you need, this might help you:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.gob.mx/sesnsp/acciones-y-programas/incidencia-delictiva-del-fuero-comun-nueva-metodologia?state=published"
data = []
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
div = soup.find('div', {'class': 'article body'})  # get the div "article body"
for ul in div.findAll('ul'):  # get all the "ul" tags in the div "article body"
    for li in ul.findAll('li'):  # get all the "li" inside each "ul"
        for link in li.findAll('a', href=True):  # get the 'a' tags inside each "li"
            data.append([link['href'], link.text])  # link['href'] = url | link.text = "Estatal 2020"
dataframe = pd.DataFrame(data, columns=['link', 'tag'])
print(dataframe)

You can try this:

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
getpage= requests.get('https://www.gob.mx/sesnsp/acciones-y-programas/incidencia-delictiva-del-fuero-comun-nueva-metodologia?state=published')

getpage_soup= BeautifulSoup(getpage.text, 'html.parser')

all_links= getpage_soup.findAll('a', attrs={'href': re.compile("(^http://)|(^https://)")})   #get all the urls with protocols http or https

data=[]
for link in all_links:
    if link.text.strip()=='':   #if the link doesn't have text, add the id
        data.append([link['href'], link.get('id')])
    else:
        data.append([link['href'], link.text.strip()])   #add the text without trailing and leading whitespaces

df=pd.DataFrame(data, columns=['link', 'tag'])   #create the dataframe
print(df)

Output:

df
                                                 link                                                tag
0                         https://coronavirus.gob.mx/        Información importante Coronavirus COVID-19
1                  https://www.gob.mx/busqueda?utf8=✓                                           botbusca
2   https://www.gob.mx/sesnsp/acciones-y-programas...                                      Transparencia
3   https://drive.google.com/file/d/1t1hLPvUkfCde1...                                       Estatal 2020
4   https://drive.google.com/open?id=17MnLmvY_YW5Z...                                       Estatal 2019
5   https://drive.google.com/open?id=11DcfF4Pvp_21...                                       Estatal 2018
6   https://drive.google.com/open?id=1Y0aqq6w2EQij...                                       Estatal 2017
7   https://drive.google.com/open?id=1mgFsF3rdoYLE...                                       Estatal 2016
8   https://drive.google.com/open?id=1RQhk58-fHNPr...                                       Estatal 2015
9   https://drive.google.com/file/d/1WIzrjJTF24DCX...                                       Estatal 2020
10  https://drive.google.com/open?id=1QtjDM7pczeST...                                       Estatal 2019
11  https://drive.google.com/open?id=15l9hl4eUmFCM...                                       Estatal 2018
12  https://drive.google.com/open?id=1FO4W0HK8cdPk...                                       Estatal 2017
13  https://drive.google.com/open?id=1tDEjJ1XLdFP8...                                       Estatal 2016
14  https://drive.google.com/open?id=1lCeFrMi_D-Gr...                                       Estatal 2015
15  https://drive.google.com/file/d/1q8AdhfxpLdF_l...                                Estatal 2015 - 2020
16  https://drive.google.com/file/d/1jopZOChRppi6Q...                                          Mayo 2020
17  https://drive.google.com/open?id=1CvHXHC48SYWT...                                       Febrero 2020
18  https://drive.google.com/open?id=1QxUe0HwLNNZH...                                         Enero 2020
19  https://drive.google.com/open?id=1KZzHGdTlH5ya...                                     Diciembre 2019
20  https://drive.google.com/open?id=119VQ5-1JPnWZ...                                     Noviembre 2019
21  https://drive.google.com/open?id=1CbNV3sTkSn3t...                                       Octubre 2019
22  https://drive.google.com/open?id=1gpMM2pi6Ta-r...                                    Septiembre 2019
23  https://drive.google.com/open?id=1dHUhpr-DbOPx...                                        Agosto 2019
24  https://drive.google.com/open?id=18CQlwY07tTaa...                                         Julio 2019
25  https://drive.google.com/open?id=1EnhF4IOFxqLr...                                         Junio 2019
26  https://drive.google.com/open?id=1wrTEwP5Q3xwZ...                                          Mayo 2019
27  https://drive.google.com/open?id=1ZuY20S-5Gi8l...                                         Abril 2019
28  https://drive.google.com/open?id=1P2Xvs7kLLclg...                                         Marzo 2019
29  https://drive.google.com/open?id=16FWEKbbJ83KL...                                       Febrero 2019
30  https://drive.google.com/open?id=1mIw1XKJBY8ZV...                                         Enero 2019
31  https://drive.google.com/open?id=1iTGBC1Ge4UWP...                                     Diciembre 2018
32  https://drive.google.com/open?id=1Kmtir0rhQLf7...                                     Noviembre 2018
33  https://drive.google.com/open?id=1r7SHNfKVXGfe...                                       Octubre 2018
34  https://drive.google.com/open?id=1IKpGJbJuNQKW...                                    Septiembre 2018
35  https://drive.google.com/open?id=1spqdNT0T0pen...                                        Agosto 2018
36  https://drive.google.com/open?id=1k07ZSk2c4irk...                                         Julio 2018
37  https://drive.google.com/open?id=1HX4SlChjRbMm...                                         Junio 2018
38  https://drive.google.com/open?id=1ErSyO9-rfHi3...                                          Mayo 2018
39  https://drive.google.com/open?id=1cK5lR33-mA6-...                                         Abril 2018
40  https://drive.google.com/open?id=1MaqJaSfq2KxB...                                         Marzo 2018
41  https://drive.google.com/open?id=1GaoDPWud-2Iy...                                       Febrero 2018
42  https://drive.google.com/open?id=1OXITYyRrUBwj...                                         Enero 2018
43  https://drive.google.com/file/d/1KwjGdNYez72_z...                                Estatal 2015 - 2020
44  https://drive.google.com/file/d/14fDk5sBry1DOo...                              Municipal 2015 - 2020
45  https://www.gob.mx/sesnsp/acciones-y-programas...  Regresar al menú principal de Incidencia Delic...
46  https://www.facebook.com/sharer/sharer.php?u=h...                                          Compartir
47                        http://www.participa.gob.mx                                          Participa
48                              https://datos.gob.mx/                                              Datos
49                   https://www.gob.mx/publicaciones                            Publicaciones Oficiales
50  https://www.infomex.org.mx/gobiernofederal/hom...                                    Sistema Infomex
51                             http://www.inai.org.mx                                               INAI
52                    http://www.ordenjuridico.gob.mx                                     Marco Jurídico
53                 https://www.facebook.com/gobmexico                                           Facebook
54                     https://twitter.com/GobiernoMX                                            Twitter
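
The row-building logic above (keep only absolute http/https links, use the link text as the tag, fall back to the `id` attribute when the text is empty) can be checked offline. The following is only a sketch that swaps BeautifulSoup for the standard library's `html.parser`; the `LinkCollector` class and the sample markup are hypothetical and are not taken from the answer or from the real page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect [href, label] rows from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished [href, label] pairs
        self._attrs = None   # attributes of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            href = attrs.get("href") or ""
            # Same filter as the regex above: keep absolute http(s) links only.
            if href.startswith(("http://", "https://")):
                self._attrs = attrs
                self._text = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._attrs is not None:
            label = "".join(self._text).strip()
            # Fall back to the id attribute when the link has no visible text.
            self.rows.append([self._attrs["href"], label or self._attrs.get("id")])
            self._attrs = None


# Made-up sample markup, not the real page.
sample_html = """
<ul>
  <li><a href="https://example.com/a.pdf"> Estatal 2020 </a></li>
  <li><a href="https://example.com/search" id="botbusca"></a></li>
  <li><a href="/relative/path">Skipped</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(sample_html)
collector.close()
for href, label in collector.rows:
    print(href, label)
```

The resulting `collector.rows` has the same list-of-pairs shape as `data` in the answer, so it could be passed to `pd.DataFrame(..., columns=['link', 'tag'])` in the same way.
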

If you only need the rows whose tag starts with "Estatal", you can add this to the code above:

import numpy as np
mask=np.where(df.tag.str.startswith('Estatal'), True, False)
print(df[mask])

Output:

                                                 link                  tag
3   https://drive.google.com/file/d/1t1hLPvUkfCde1...         Estatal 2020
4   https://drive.google.com/open?id=17MnLmvY_YW5Z...         Estatal 2019
5   https://drive.google.com/open?id=11DcfF4Pvp_21...         Estatal 2018
6   https://drive.google.com/open?id=1Y0aqq6w2EQij...         Estatal 2017
7   https://drive.google.com/open?id=1mgFsF3rdoYLE...         Estatal 2016
8   https://drive.google.com/open?id=1RQhk58-fHNPr...         Estatal 2015
9   https://drive.google.com/file/d/1WIzrjJTF24DCX...         Estatal 2020
10  https://drive.google.com/open?id=1QtjDM7pczeST...         Estatal 2019
11  https://drive.google.com/open?id=15l9hl4eUmFCM...         Estatal 2018
12  https://drive.google.com/open?id=1FO4W0HK8cdPk...         Estatal 2017
13  https://drive.google.com/open?id=1tDEjJ1XLdFP8...         Estatal 2016
14  https://drive.google.com/open?id=1lCeFrMi_D-Gr...         Estatal 2015
15  https://drive.google.com/file/d/1q8AdhfxpLdF_l...  Estatal 2015 - 2020
43  https://drive.google.com/file/d/1KwjGdNYez72_z...  Estatal 2015 - 2020
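
One case the filter may need to handle: a link with neither text nor an `id` attribute ends up with a tag of `None`, because `link.get('id')` returns `None` when the attribute is absent. For such rows `str.startswith` yields a missing value rather than True or False. A minimal sketch of one way to deal with that, using the `na=False` argument of `str.startswith`; the rows below are made up for illustration and do not come from the page.

```python
import pandas as pd

# Made-up rows; the None tag mimics a link with neither text nor an id attribute.
df = pd.DataFrame(
    [
        ["https://example.com/1", "Estatal 2020"],
        ["https://example.com/2", "Mayo 2020"],
        ["https://example.com/3", None],
        ["https://example.com/4", "Estatal 2015 - 2020"],
    ],
    columns=["link", "tag"],
)

# na=False makes missing tags count as "no match" instead of a missing value.
mask = df["tag"].str.startswith("Estatal", na=False)
estatal = df[mask]
print(estatal)
```

Here the mask is a plain boolean Series, so it can index the DataFrame directly without going through `np.where`.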

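Can you add one?
@MrNobody33 Sorry, fixed.
Please read the link I attached. You should add an expected output from the given sample input, along with what you have tried so far.
Please try item 2.
@MrNobody33, so clean!
I just added an answer, @AngelSerrano. Hope it works for you!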