Python: scraping data that contains subscript <sub> tags

I'm currently playing with Scrapy and Python 3.6. My goal is to extract all the data from a table with the following HTML code:

<table class="table table-a">
                    <tbody><tr>
                        <td colspan="2">
                            <h2 class="text-center no-margin">Geometry</h2>
                        </td>
                    </tr>
                    <tr>
                        <td title="Depth of section">h = 267 mm</td>
                        <td rowspan="8" class="text-center">
                            <a href="http://www.staticstools.eu/assets/image/profile-ipea.png" target="_blank">
                                <img src="http://www.staticstools.eu/assets/image/profile-ipea-thumb.png" alt="Section IPEA" class="img-responsive">
                            </a>
                        </td>
                    </tr>
                    <tr>
                        <td title="Width of section">b = 135 mm</td>
                    </tr>
                    <tr>
                        <td title="Flange thickness">t<sub>f</sub> = 8.7 mm</td>
                    </tr>
                    <tr>
                        <td title="Web thickness">t<sub>w</sub> = 5.5 mm</td>
                    </tr>
                    <tr>
                        <td title="Radius of root fillet">r<sub>1</sub> = 15 mm</td>
                    </tr>
                    <tr>
                        <td title="Distance of centre of gravity along y-axis">y<sub>s</sub> = 67.5 mm</td>
                    </tr>
                    <tr>
                        <td title="Depth of straight portion of web">d = 219.6 mm</td>
                    </tr>
                    <tr>
                        <td title="Area of section">A = 3915 mm<sup>2</sup></td>
                    </tr>
                    <tr>
                        <td title="Painting surface per unit lenght">A<sub>L</sub> = 1.04 m<sup>2</sup>.m<sup>-1</sup></td>
                        <td title="Mass per unit lenght">G = 30.7 kg.m<sup>-1</sup></td>
                    </tr>
                </tbody></table>
Selecting only the text directly inside each td gives this output:

['\n                            ',
 '\n                        ',
 'h = 267 mm',
 '\n                            ',
 '\n                        ',
 'b = 135 mm',
 't',
 ' = 8.7 mm',
 't',
 ' = 5.5 mm',
 'r',
 ' = 15 mm',
 'y',
 ' = 67.5 mm',
 'd = 219.6 mm',
 'A = 3915 mm',
 'A',
 ' = 1.04 m',
 '.m',
 'G = 30.7 kg.m']
So everything is a bit of a mess. I can also include the text of the nested tags with the following command:

response.css('table.table.table-a td *::text').extract()
The output is as follows:

['\n                            ',
 'Geometry',
 '\n                        ',
 'h = 267 mm',
 '\n                            ',
 '\n                                ',
 '\n                            ',
 '\n                        ',
 'b = 135 mm',
 't',
 'f',
 ' = 8.7 mm',
 't',
 'w',
 ' = 5.5 mm',
 'r',
 '1',
 ' = 15 mm',
 'y',
 's',
 ' = 67.5 mm',
 'd = 219.6 mm',
 'A = 3915 mm',
 '2',
 'A',
 'L',
 ' = 1.04 m',
 '2',
 '.m',
 '-1',
 'G = 30.7 kg.m',
 '-1']
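I could of course post-process this data, but I'm wondering whether it's possible to do it during the scraping process. I'd like my output data to look like this: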

['h = 267 mm',
 'b = 135 mm',
 'tf = 8.7 mm',
 'tw = 5.5 mm',
 'r1 = 15 mm',
 'ys = 67.5 mm',
 'd = 219.6 mm',
 'A = 3915 mm2',
 'AL = 1.04 m2.m-1',
 'G = 30.7 kg.m-1']

Yes, you can process the data however you like inside the parse method of your spider class. Something like the following works here:

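import io

import pandas as pd
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        # Scrapy needs absolute URLs, including the scheme
        urls = [
            'http://www.example.com'
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # isolate the raw HTML of every table on the page
        tables = response.xpath("//table").extract()

        # pandas joins the text of nested <sub>/<sup> tags into each cell
        df = pd.read_html(io.StringIO(tables[0]))[0]

        # perform any further data processing here

        # a DataFrame is not JSON-serializable, so yield plain lists of strings
        yield {'data': df.fillna('').values.tolist()}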
Run the following command to save the result as JSON:

scrapy crawl myspider -o table.json
If you want a closer look at some code you could put into the parse method, see the following:

df = pd.read_html(html)[0]

df

    0               1
0   Geometry        NaN
1   h = 267 mm      NaN
2   b = 135 mm      NaN
3   tf = 8.7 mm     NaN
4   tw = 5.5 mm     NaN
5   r1 = 15 mm      NaN
6   ys = 67.5 mm    NaN
7   d = 219.6 mm    NaN
8   A = 3915 mm2    NaN
9   AL = 1.04 m2.m-1    G = 30.7 kg.m-1

df = pd.DataFrame([i.split(r' ') for i in df[0].map(str)])
df.drop([1,3], axis=1, inplace=True)

df

    0   2
0   Geometry    None
1   h   267
2   b   135
3   tf  8.7
4   tw  5.5
5   r1  15
6   ys  67.5
7   d   219.6
8   A   3915
9   AL  1.04
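
The split-on-spaces step above keeps the symbol and the number, but the number stays a string and the unit is dropped. As a plain-Python alternative, here is a minimal sketch that parses each joined string into symbol, numeric value, and unit with a regular expression. The name parse_quantity and the pattern itself are illustrative assumptions, not part of the original answer; rows without an "=" (such as the "Geometry" heading) yield None.

```python
import re

# Hypothetical pattern for "<symbol> = <number> <unit>", e.g. "tf = 8.7 mm"
QUANTITY_RE = re.compile(r'^\s*(\S+)\s*=\s*(-?\d+(?:\.\d+)?)\s*(\S+)\s*$')


def parse_quantity(text):
    """Return (symbol, value, unit), or None if the text does not match."""
    match = QUANTITY_RE.match(text)
    if match is None:
        return None
    symbol, value, unit = match.groups()
    return symbol, float(value), unit


rows = ['Geometry', 'h = 267 mm', 'tf = 8.7 mm', 'AL = 1.04 m2.m-1']
# Keep only the rows that look like a quantity
parsed = [q for q in map(parse_quantity, rows) if q is not None]
print(parsed)
```

Keeping the unit as a third field means it can still be stored or checked later, rather than being discarded with the dropped columns.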

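Comments:

Of course I can, but that's not the point. I'd like to know whether it can be done with .css or .xpath. So you think pandas can do it?

Your question was: "but I'm wondering whether it's possible to do it during the scraping process?" My answer is yes. You have to use xpath and css to isolate the raw HTML, and then you can process it with pandas in the parse method of your scraper class. The sooner I get back to pandas the happier I am, which is why I suggest switching to pandas as soon as it makes sense.

Okay, that satisfies me. Thanks.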