Python: scraping data that contains subscript <sub> tags
I'm currently playing with Scrapy and Python 3.6. My goal is to extract all the data from a table with the following HTML code:
<table class="table table-a">
<tbody><tr>
<td colspan="2">
<h2 class="text-center no-margin">Geometry</h2>
</td>
</tr>
<tr>
<td title="Depth of section">h = 267 mm</td>
<td rowspan="8" class="text-center">
<a href="http://www.staticstools.eu/assets/image/profile-ipea.png" target="_blank">
<img src="http://www.staticstools.eu/assets/image/profile-ipea-thumb.png" alt="Section IPEA" class="img-responsive">
</a>
</td>
</tr>
<tr>
<td title="Width of section">b = 135 mm</td>
</tr>
<tr>
<td title="Flange thickness">t<sub>f</sub> = 8.7 mm</td>
</tr>
<tr>
<td title="Web thickness">t<sub>w</sub> = 5.5 mm</td>
</tr>
<tr>
<td title="Radius of root fillet">r<sub>1</sub> = 15 mm</td>
</tr>
<tr>
<td title="Distance of centre of gravity along y-axis">y<sub>s</sub> = 67.5 mm</td>
</tr>
<tr>
<td title="Depth of straight portion of web">d = 219.6 mm</td>
</tr>
<tr>
<td title="Area of section">A = 3915 mm<sup>2</sup></td>
</tr>
<tr>
<td title="Painting surface per unit lenght">A<sub>L</sub> = 1.04 m<sup>2</sup>.m<sup>-1</sup></td>
<td title="Mass per unit lenght">G = 30.7 kg.m<sup>-1</sup></td>
</tr>
</tbody></table>
Extracting only the text that sits directly inside each td element gives this output:
['\n ',
'\n ',
'h = 267 mm',
'\n ',
'\n ',
'b = 135 mm',
't',
' = 8.7 mm',
't',
' = 5.5 mm',
'r',
' = 15 mm',
'y',
' = 67.5 mm',
'd = 219.6 mm',
'A = 3915 mm',
'A',
' = 1.04 m',
'.m',
'G = 30.7 kg.m']
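So everything is a bit of a mess: the cells are split into fragments and the subscripts and superscripts are missing. I can also include the text of the nested tags with the following command: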
response.css('table.table.table-a td *::text').extract()
The output then looks like this:
['\n ',
'Geometry',
'\n ',
'h = 267 mm',
'\n ',
'\n ',
'\n ',
'\n ',
'b = 135 mm',
't',
'f',
' = 8.7 mm',
't',
'w',
' = 5.5 mm',
'r',
'1',
' = 15 mm',
'y',
's',
' = 67.5 mm',
'd = 219.6 mm',
'A = 3915 mm',
'2',
'A',
'L',
' = 1.04 m',
'2',
'.m',
'-1',
'G = 30.7 kg.m',
'-1']
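I could of course post-process this data, but I wonder whether it is possible to get it right during scraping. I would like my output data to look like this: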
['h = 267 mm',
'b = 135 mm',
'tf = 8.7 mm',
'tw = 5.5 mm',
'r1 = 15 mm',
'ys = 67.5 mm',
'd = 219.6 mm',
'A = 3915 mm2',
'AL = 1.04 m2.m-1',
'G = 30.7 kg.m-1']
Yes, you can process the data however you like in the parse method of your spider class. Something like the following works here:
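import io

import pandas as pd
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        urls = [
            'http://www.example.com',  # Scrapy rejects URLs without a scheme
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # isolate the raw HTML of every table on the page
        data = response.xpath("//table").extract()
        # parse the first table into a DataFrame; StringIO avoids the
        # deprecation of literal HTML strings in newer pandas versions
        data = pd.read_html(io.StringIO(data[0]))[0]
        # perform any further data processing here
        # a DataFrame is not JSON serialisable, so yield plain lists
        yield {'data': data.values.tolist()}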
Run the following command to save the result as JSON (the spider name must match the name attribute of the class):
scrapy crawl myspider -o table.json
If you want a closer look at some code you could insert into the parse method, consider the following:
df = pd.read_html(html)[0]
df
                  0                1
0          Geometry              NaN
1        h = 267 mm              NaN
2        b = 135 mm              NaN
3       tf = 8.7 mm              NaN
4       tw = 5.5 mm              NaN
5        r1 = 15 mm              NaN
6      ys = 67.5 mm              NaN
7      d = 219.6 mm              NaN
8      A = 3915 mm2              NaN
9  AL = 1.04 m2.m-1  G = 30.7 kg.m-1
df = pd.DataFrame([i.split(r' ') for i in df[0].map(str)])
df.drop([1,3], axis=1, inplace=True)
df
          0      2
0  Geometry   None
1         h    267
2         b    135
3        tf    8.7
4        tw    5.5
5        r1     15
6        ys   67.5
7         d  219.6
8         A   3915
9        AL   1.04
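The split above keeps only the symbol and the number. If the unit should survive as well, a regular expression is one possible alternative. This is only a sketch using the standard library: `parse_property` and `PATTERN` are hypothetical names, and the pattern assumes every data string has the form `symbol = number unit`, as in the table from the question.

```python
import re

# Assumed layout: "<symbol> = <number> <unit>", e.g. "tf = 8.7 mm".
PATTERN = re.compile(
    r"^\s*(?P<symbol>\S+)\s*=\s*(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>.*?)\s*$"
)


def parse_property(text):
    """Split one cell string into (symbol, value, unit), or None if it does not match."""
    match = PATTERN.match(text)
    if match is None:
        return None  # e.g. the "Geometry" heading row
    return match["symbol"], float(match["value"]), match["unit"]


print(parse_property("tf = 8.7 mm"))  # ('tf', 8.7, 'mm')
```

Rows that do not follow the assumed layout, such as the heading, come back as None and can be filtered out before building a DataFrame.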
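Of course I can, but that is not the point. I would like to know whether it can be done with .css or .xpath. So you think it can only be done with pandas?
Your question was: "but I wonder whether it is possible to do it during scraping?" My answer is yes. You have to use xpath and css to isolate the raw HTML, and then you can process it with pandas in the parse method of your scraper class. The sooner I get back to pandas, the happier I am, which is why I suggest switching to pandas as soon as it makes sense.
OK, that satisfies me. Thanks.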