Python 3.x 如何使用bs4提取html表
我正在尝试从以下位置刮取第二个HTML表:'https://www.realclearpolitics.com/epolls/2020/president/pa/pennsylvania_trump_vs_biden-6861.html" 当我运行脚本时,我得到第一个Html表。如何从上面的URL中提取第二个表 我尝试使用以下代码:Python 3.x 如何使用bs4提取html表,python-3.x,beautifulsoup,request,Python 3.x,Beautifulsoup,Request,我正在尝试从以下位置刮取第二个HTML表:'https://www.realclearpolitics.com/epolls/2020/president/pa/pennsylvania_trump_vs_biden-6861.html" 当我运行脚本时,我得到第一个Html表。如何从上面的URL中提取第二个表 我尝试使用以下代码: from bs4 import BeautifulSoup import requests import pandas as pd import re Res =
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
Res = requests.get('https://www.realclearpolitics.com/epolls/2020/president/pa/pennsylvania_trump_vs_biden-6861.html')
Soup = BeautifulSoup(Res.text , 'html.parser')
table = Soup.find('table',{'class':'data large'})
list_of_rows = []
for row in table.findAll('tr'):
list_of_cells = []
for cell in row.findAll(["td"]):
try:
Cell = cell.find('a', {'class': 'mobile_pollster_name'})
text = re.sub(r'\n\t+', '', Cell.text)
list_of_cells.append(text)
except:
text = re.sub(r'\n\t+', '', cell.text)
list_of_cells.append(text)
pass
list_of_rows.append(list_of_cells)
for item in list_of_rows:
' '.join(item)
data = pd.DataFrame(list_of_rows)
下面的脚本将两个表作为列表中的数据帧提供给您
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
Res = requests.get('https://www.realclearpolitics.com/epolls/2020/president/pa/pennsylvania_trump_vs_biden-6861.html')
Soup = BeautifulSoup(Res.text , 'html.parser')
tables = Soup.find_all('table',{'class':'data large'})
list_of_dfs = []
for table in tables:
columns = [i.get_text(strip=True) for i in table.find_all('th')]
list_of_rows = []
for row in table.findAll('tr'):
list_of_rows.append([i.get_text(strip=True) for i in row.find_all('td')])
df = pd.DataFrame(list_of_rows, columns=columns)
print(df)
list_of_dfs.append(df)
print("---" * 10)
输出:
Poll Date Sample MoE Biden (D) Trump (R) Spread
0 None None None None None None None
1 RCP Average 7/15 - 7/26 -- -- 49.4 43.4 Biden +6.0
2 CNBC/Change Research (D)CNBC 7/24 - 7/26 382 LV -- 48 46 Biden +2
3 GravisGravis 7/22 - 7/24 1006 LV 3.1 48 45 Biden +3
4 Franklin & MarshallF&M 7/20 - 7/26 667 RV 5.5 50 41 Biden +9
5 FOX NewsFOX News 7/18 - 7/20 793 RV 3.5 50 39 Biden +11
6 Rasmussen ReportsRasmussen 7/15 - 7/16 750 LV 3.5 51 46 Biden +5
------------------------------
Poll Date Sample MoE Biden (D) Trump (R) Spread
0 None None None None None None None
1 RCP Average 7/15 - 7/26 -- -- 49.4 43.4 Biden +6.0
2 CNBC/Change Research (D)CNBC 7/24 - 7/26 382 LV -- 48 46 Biden +2
3 GravisGravis 7/22 - 7/24 1006 LV 3.1 48 45 Biden +3
4 Franklin & MarshallF&M 7/20 - 7/26 667 RV 5.5 50 41 Biden +9
5 FOX NewsFOX News 7/18 - 7/20 793 RV 3.5 50 39 Biden +11
6 Rasmussen ReportsRasmussen 7/15 - 7/16 750 LV 3.5 51 46 Biden +5
7 CNBC/Change Research (D)CNBC 7/10 - 7/12 743 LV -- 50 42 Biden +8
8 MonmouthMonmouth 7/9 - 7/13 401 LV 4.9 52 42 Biden +10
9 Trafalgar Group (R)Trafalgar 6/29 - 7/2 1062 LV 2.9 48 43 Biden +5
10 CNBC/Change Research (D)CNBC 6/26 - 6/28 760 LV -- 50 44 Biden +6
11 SusquehannaSusquehanna 6/15 - 6/23 715 LV 3.8 46 41 Biden +5
12 CNBC/Change Research (D)CNBC 6/12 - 6/14 491 LV -- 49 46 Biden +3
13 NY Times/SienaNYT/Siena 6/8 - 6/16 651 RV 4.2 50 40 Biden +10
14 CNBC/Change Research (D)CNBC 5/29 - 5/31 579 LV -- 46 50 Trump +4
15 Harper (R)Harper (R) 4/21 - 4/26 644 LV 3.9 49 43 Biden +6
16 FOX NewsFOX News 4/18 - 4/21 803 RV 3.5 50 42 Biden +8
17 SusquehannaSusquehanna 4/14 - 4/20 693 LV 3.7 48 42 Biden +6
18 Yahoo News/YouGovYouGov 3/6 - 3/8 RV -- 46 40 Biden +6
19 Morning CallMorning Call 2/12 - 2/20 424 RV 5.5 47 47 Tie
20 Univ. of Wis/State JournalU. of Wisc. 2/11 - 2/20 1249 LV 3.5 46 45 Biden +1
21 QuinnipiacQuinnipiac 2/12 - 2/18 845 RV 3.4 50 42 Biden +8
22 Morning CallMorning Call 11/4 - 11/9 410 RV 6.0 52 43 Biden +9
23 NY Times/SienaNYT/Siena 10/13 - 10/26 661 LV 4.4 46 45 Biden +1
24 QuinnipiacQuinnipiac 5/9 - 5/14 978 RV 4.2 53 42 Biden +11
25 EmersonEmerson 3/26 - 3/28 808 RV 3.4 55 45 Biden +10
------------------------------
嗨@bigbounty有没有办法忽略重复的文本。例如:Emerson Emerson,这里重复Emerson。我们如何消除这种情况一旦获得数据帧,就可以保存到csv文件,然后在数据帧上应用函数。这样,您就不会丢失已刮取的数据