Python: how to slice HTML into a DataFrame
After importing some HTML (for example), I realised that what is supposed to be a "table" is not actually a table, so it seems I need to rebuild the table myself.
from bs4 import BeautifulSoup
import re
import pandas as pd
import os
soup_level1=BeautifulSoup(driver.page_source, 'lxml')
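The result I am hoping to end up with looks like this:
    Batsmen  R  B  4s  6s    SR
0  S Dhawan  4  8   0   0  50.0
Something to get you started (the code is followed by the output it produces):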
from bs4 import BeautifulSoup
import numpy as np
import requests

# Fetch the scorecard page and parse it
html_doc = requests.get(r'http://www.espncricinfo.com/series/18886/scorecard/1157372/').content
soup = BeautifulSoup(html_doc, 'html.parser')

# Collect the text of every stats cell
data = []
for div in soup.find_all('div', class_="cell runs"):
    data.append(div.text)

# Each batting row has five stats: R, B, 4s, 6s, SR
np.array(data).reshape(-1, 5)
array([['R', 'B', '4s', '6s', 'SR'],
       ['14', '12', '2', '0', '116.66'],
       ['68', '55', '5', '1', '123.63'],
       ['39', '30', '2', '2', '130.00'],
       ['2', '3', '0', '0', '66.66'],
       ['9', '6', '1', '0', '150.00'],
       ['0', '1', '0', '0', '0.00'],
       ['0', '2', '0', '0', '0.00'],
       ['1', '2', '0', '0', '50.00'],
       ['0', '1', '0', '0', '0.00'],
       ['17', '8', '2', '1', '212.50'],
       ['R', 'B', '4s', '6s', 'SR'],
       ['0', '3', '0', '0', '0.00'],
       ['4', '2', '1', '0', '200.00'],
       ['14', '15', '2', '0', '93.33'],
       ['2', '7', '0', '0', '28.57'],
       ['0', '2', '0', '0', '0.00'],
       ['1', '3', '0', '0', '33.33'],
       ['19', '23', '2', '0', '82.60'],
       ['34', '29', '6', '0', '117.24'],
       ['3', '7', '0', '0', '42.85'],
       ['6', '3', '1', '0', '200.00'],
       ['2', '7', '0', '0', '28.57']], dtype='<U6')
Can you link to the website? Surely bs4 can extract the text from the tags for you, so the result of soup_level1.find_all('div', class_="cell runs") probably has a method or attribute that returns just the content without the tags.
@Dan Thanks Dan, here is an example:
Look at the first page of the docs, something like: for div in soup.find_all('div', class_="cell runs"): print(div.string)
Ah, thanks! That's it. I should be able to work with it from there. Thanks Dan!
Perfect. It pulls both scorecards at once, but I can sort that out. Cheers!
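Since the reshaped array contains one header row per scorecard, one possible way to get from it to a DataFrame is sketched below. This is only an illustration under some assumptions: the `cells` list is a shortened sample of the values shown above rather than live data, and the `innings` column name is made up here to tell the two scorecards apart.

```python
import numpy as np
import pandas as pd

# Shortened sample of the cell texts shown above (not live data):
# two scorecards, each starting with its own header row.
cells = [
    "R", "B", "4s", "6s", "SR",
    "14", "12", "2", "0", "116.66",
    "68", "55", "5", "1", "123.63",
    "R", "B", "4s", "6s", "SR",
    "0", "3", "0", "0", "0.00",
    "4", "2", "1", "0", "200.00",
]

rows = np.array(cells).reshape(-1, 5)
header = rows[0].tolist()

# True for every row that repeats the header
is_header = (rows == rows[0]).all(axis=1)

# Keep only the data rows and use the first header row as column names
df = pd.DataFrame(rows[~is_header], columns=header)

# Each header row starts a new scorecard, so a running count labels them.
# "innings" is a hypothetical column name chosen for this sketch.
df["innings"] = is_header.cumsum()[~is_header]

# Convert the stats from strings to numbers; non-numeric cells such as "-" become NaN
df[header] = df[header].apply(pd.to_numeric, errors="coerce")

print(df)
```

Filtering on the `innings` column (for example `df[df["innings"] == 1]`) would then give one scorecard at a time. The batsmen names are not part of the "cell runs" divs, so they would have to be collected separately.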
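For reference, this is the raw find_all result from the question, before the text is extracted from the tags:
Stats = soup_level1.find_all('div', class_="cell runs")
pd.Series(Stats)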
0 <div class="cell runs" data-reactid="184">R</div>
1 <div class="cell runs" data-reactid="185">B</div>
2 <div class="cell runs" data-reactid="186">4s</...
3 <div class="cell runs" data-reactid="187">6s</...
4 <div class="cell runs" data-reactid="188">SR</...
5 <div class="cell runs" data-reactid="194">4</div>
6 <div class="cell runs" data-reactid="195">8</div>
7 <div class="cell runs" data-reactid="196">1</div>
8 <div class="cell runs" data-reactid="197">0</div>
9 <div class="cell runs" data-reactid="198">50.0...
...
94 <div class="cell runs" data-reactid="548">-</div>
Length: 95, dtype: object
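The Series above still holds whole tags rather than their text. A minimal sketch of stripping them, along the lines suggested in the comments, might look like this. The `html` string is a small stand-in modelled on the tags printed above, not the real page source.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Small stand-in for driver.page_source, modelled on the tags printed above
html = """
<div class="cell runs" data-reactid="184">R</div>
<div class="cell runs" data-reactid="185">B</div>
<div class="cell runs" data-reactid="194">4</div>
<div class="cell runs" data-reactid="195">8</div>
"""

soup = BeautifulSoup(html, "html.parser")
stats = soup.find_all("div", class_="cell runs")

# get_text() returns only the text inside each tag
texts = pd.Series([div.get_text(strip=True) for div in stats])
print(texts.tolist())
```

For simple divs like these, `div.string` and `div.text` should give the same result. `get_text(strip=True)` additionally trims surrounding whitespace.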