Python scraping a table from a website: cannot address the correct table
I am a Python newbie, just starting to learn it, and have the following problem: I want to scrape portfolio data from a website (scroll down on the page and click "Portfolio"), but I cannot address the correct tr class "c-portfolio"; I always end up with the values of the first table on the right, "Erstemission 20.09.2019". I have tried more than 15 web tutorials and questions/answers from reddit/stackoverflow but could not solve it, so I suspect something peculiar about this site. Below is my most advanced code. Any suggestions are greatly appreciated! :)
Best, Julian
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
url ='https://www.wikifolio.com/de/de/w/wffalkinve'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
soup.find_all('tr')
# Print the first 10 rows for sanity check
rows = soup.find_all('tr')
print(rows[:10])
for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)
import re
list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = (re.sub(clean, '', str_cells))
    list_rows.append(clean2)
print(clean2)
type(clean2)
df = pd.DataFrame(list_rows)
df.head(10)
df1 = df[0].str.split(',', expand=True)
df1.head(10)
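The regex tag-stripping step in the loop above can be seen in isolation on one static row (the cell values here are made up for illustration):

```python
import re

# One stringified row of <td> cells, as produced by str(row.find_all('td')).
row_html = '[<td>ACME AG</td>, <td>12,3 %</td>]'
# The same pattern as above: non-greedily match any HTML tag and drop it.
clean = re.compile('<.*?>')
stripped = re.sub(clean, '', row_html)
print(stripped)  # [ACME AG, 12,3 %]
```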
Edit: an easier way to find the corresponding table: tr class c-portfolio
If the content you are trying to scrape is loaded via JavaScript, the Pandas/BeautifulSoup solutions will not work. I suggest a Selenium headless-Chrome solution.

I will give it a try; right now I do not know what Selenium is :D Thank you very much! Do you or anyone else know of resources for trying Selenium on tables? I could not find a suitable solution, but I may look at the related answers.
Another attempt:
# Create empty list
col = []
i = 0
# For each row, store each first element (header) and an empty list
# (tr_elements comes from an earlier lxml step, e.g. tr_elements = doc.xpath('//tr'))
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d: %s' % (i, name))
    col.append((name, []))
from bs4 import BeautifulSoup
import requests
a = requests.get("https://www.wikifolio.com/de/de/w/wffalkinve")
soup = BeautifulSoup(a.text, 'lxml')
# searching for the rows directly
rows = soup.find_all('tr', {'class': 'c-portfolio'})
print(rows[:100])
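Note that if the table is built by JavaScript, this `rows` list may come back empty from the server-delivered HTML. Once rows are actually retrieved (by whatever route), their cell texts can be assembled into a DataFrame; the column names and sample values below are assumptions for illustration, not the site's actual headers:

```python
import pandas as pd

# Hypothetical parsed rows: one list of cell texts per c-portfolio <tr>.
rows = [["ACME AG", "12,3 %"], ["Foo SE", "4,5 %"]]
df = pd.DataFrame(rows, columns=["Name", "Anteil"])
# Turn "12,3 %" into a float: drop the unit, swap the German decimal comma.
df["Anteil"] = (df["Anteil"].str.replace(" %", "", regex=False)
                            .str.replace(",", ".", regex=False)
                            .astype(float))
print(df)
```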