
Python: how to append BeautifulSoup output to a dataframe


I am fairly new to Python. I intend to:

a) get the list of URLs from the following page (https://aviation-safety.net/database/), which links to the data from 1919 onwards

b) grab the data from 1919 up to the current year (date, type, registration, operator, fat., location, cat.)

However, I ran into some problems and am still stuck.

Any kind of help is greatly appreciated, thank you!

#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
   result = requests.get(mainurl)
   soup = BeautifulSoup(result.content, 'html.parser')
   datatable = soup.find('a', href = True)


#try clause to go through the content and grab the URLs
try:
   for row in datatable:
      cols = row.find_all("|")
      if len(cols) > 1:
         links.append(x, cols = cols)
except: pass


#place links into numpy array
links_array = np.asarray(links)
len(links_array)


#check if links are in dataframe
df = pd.DataFrame(links_array)

df.columns = ['url']
df.head(10)


I can't seem to get the URLs.

It would be great if the output could look something like this:

S/N   URL
1.
2.
3.

You aren't extracting the href attribute from the tags you're pulling. What you want to do is find all the tags that carry a link (which you did, but you need find_all, since find only returns the first one it finds) and then iterate through those tags. I chose to simply check whether the href contains the substring 'Year' and, if so, put it into the list.

#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

#start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)   #find_all gets every link, not just the first
    return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])

df.head(10)
Output:

df.head(10)
Out[24]: 
                                                 url
0  https://aviation-safety.net/database/dblist.ph...
1  https://aviation-safety.net/database/dblist.ph...
2  https://aviation-safety.net/database/dblist.ph...
3  https://aviation-safety.net/database/dblist.ph...
4  https://aviation-safety.net/database/dblist.ph...
5  https://aviation-safety.net/database/dblist.ph...
6  https://aviation-safety.net/database/dblist.ph...
7  https://aviation-safety.net/database/dblist.ph...
8  https://aviation-safety.net/database/dblist.ph...
9  https://aviation-safety.net/database/dblist.ph...
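
That covers part (a). For part (b), pulling the actual records into one dataframe, here is a minimal sketch along the same lines. It assumes the records sit in the first HTML table of each year page and arrive in the seven columns listed in the question; parseYearPage and COLS are hypothetical names, and the selectors will likely need adjusting to the site's real markup.

#import packages
import pandas as pd
import requests
from bs4 import BeautifulSoup

#hypothetical column names matching part (b) of the question
COLS = ['date', 'type', 'registration', 'operator', 'fat', 'location', 'cat']

def parseYearPage(yearurl):
    result = requests.get(yearurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    table = soup.find('table')              #assumption: the first table holds the records
    if table is None:
        return pd.DataFrame(columns=COLS)
    rows = []
    for tr in table.find_all('tr')[1:]:     #skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) >= len(COLS):         #ignore malformed or decorative rows
            rows.append(cells[:len(COLS)])
    return pd.DataFrame(rows, columns=COLS)

#append every year page to a single dataframe, reusing df['url'] from above
yearly = [parseYearPage(url) for url in df['url']]
alldata = pd.concat(yearly, ignore_index=True)
alldata.head(10)

Collecting a list of per-year frames and calling pd.concat once at the end is much faster than appending to a dataframe inside the loop, since each append would copy the whole frame.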