Python: how to get elements using lxml


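How do I get the text from each column, i.e. from the last three blocks, each of which has the class col col-currency-rate (see the HTML below)? I have got the table, but what do I do next?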

>>> tree.xpath('//div[@class="table-currency"]/div[@class="row"]')
[<Element div at 0x7fcac2a47ba8>, <Element div at 0x7fcac2a47c00>, <Element div at 0x7fcac2a47c58>, <Element div at 0x7fcac2a47cb0>, <Element div at 0x7fcac2a47d08>, <Element div at 0x7fcac2a47d60>, <Element div at 0x7fcac2a47db8>, <Element div at 0x7fcac2a47e10>, <Element div at 0x7fcac2a47e68>, <Element div at 0x7fcac2a47ec0>, <Element div at 0x7fcac2a47f18>, <Element div at 0x7fcac2a47f70>, <Element div at 0x7fcac2a47fc8>, <Element div at 0x7fcac2a4e050>, <Element div at 0x7fcac2a4e0a8>, <Element div at 0x7fcac2a4e100>, <Element div at 0x7fcac2a4e158>, <Element div at 0x7fcac2a4e1b0>, <Element div at 0x7fcac2a4e208>, <Element div at 0x7fcac2a4e260>, <Element div at 0x7fcac2a4e2b8>, <Element div at 0x7fcac2a4e310>, <Element div at 0x7fcac2a4e368>, <Element div at 0x7fcac2a4e3c0>, <Element div at 0x7fcac2a4e418>, <Element div at 0x7fcac2a4e470>, <Element div at 0x7fcac2a4e4c8>, <Element div at 0x7fcac2a4e520>]
>>> len(tree.xpath('//div[@class="table-currency"]/div[@class="row"]'))
28

HTML:

<div class="table-currency">
    <div class="row"><div class="col col-currency">
    2.&nbsp; &nbsp;
    <img rel="nofollow" src="https://st6.prosto.im/cache/st6/1/0/5/5/1055/1055.jpg" width="16" height="16" alt="">
    <a target="_blank" href="/spravochniki/reytingi_banka/2/1057">
    ForteBank
    </a></div><div class="col col-headery col-currency-rate"><p>Активы банков, тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост за июль 2019 года,  тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост с начала 2019 года,  тыс. тенге</p></div><div class="col col-currency-rate"><p>1 985 956 865</p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+89 298 547</p><p></p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+390 999 868</p><p></p></div></div>

    <div class="row"><div class="col col-currency">
    3.&nbsp; &nbsp;
    <img rel="nofollow" src="https://st6.prosto.im/cache/st6/1/0/9/5/1095/1095.png" width="16" height="16" alt="">
    <a target="_blank" href="/spravochniki/reytingi_banka/2/1076">
    Сбербанк России
    </a></div><div class="col col-headery col-currency-rate"><p>Активы банков, тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост за июль 2019 года,  тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост с начала 2019 года,  тыс. тенге</p></div><div class="col col-currency-rate"><p>1 983 840 092</p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+88 853 745</p><p></p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+119 145 827</p><p></p></div></div>
</div>



A complete solution using specific XPath expressions:

from lxml import html
import requests

url  = 'https://bankchart.kz/spravochniki/reytingi_cbr/2/2019/7'
doc = html.document_fromstring(requests.get(url).content)

for row in doc.xpath('//div[@class="table-currency"]/div[@class="row"]'):
    bank_name = row.xpath('descendant::a/text()')[0].strip()
    print(bank_name)
    for cur_rate in row.xpath('div[contains(@class, "col-currency-rate")][position() > last() - 3]'):
        print('-', cur_rate.text_content())
    print()
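
Details:

  • descendant::a/text()
    - an XPath expression that extracts the text nodes of the <a> element that is a child or deeper descendant of the current row.
  • div[contains(@class, "col-currency-rate")][position() > last() - 3]
    - an XPath expression that selects the div elements whose class attribute contains the given partial value, and keeps only those from the third-to-last position through the end. last() is the position of the last matching element, so position() > last() - 3 is true only for the last three.
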
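The two expressions described above can be tried offline on a reduced copy of one row from the question's HTML. This is only a minimal sketch: the SAMPLE string shortens the header text to placeholders, and the variable names are chosen for illustration.

```python
from lxml import html

# Reduced copy of one row from the question's HTML; the three header
# captions are replaced with placeholder text for brevity.
SAMPLE = """
<div class="table-currency">
  <div class="row">
    <div class="col col-currency"><a href="/spravochniki/reytingi_banka/2/1057">
    ForteBank
    </a></div>
    <div class="col col-headery col-currency-rate"><p>header 1</p></div>
    <div class="col col-headery col-currency-rate"><p>header 2</p></div>
    <div class="col col-headery col-currency-rate"><p>header 3</p></div>
    <div class="col col-currency-rate"><p>1 985 956 865</p></div>
    <div class="col col-currency-rate"><p></p><p class="arrow-up">+89 298 547</p><p></p></div>
    <div class="col col-currency-rate"><p></p><p class="arrow-up">+390 999 868</p><p></p></div>
  </div>
</div>
"""

doc = html.document_fromstring(SAMPLE)
row = doc.xpath('//div[@class="table-currency"]/div[@class="row"]')[0]

# Text of the <a> anywhere below the row, with surrounding whitespace removed.
name = row.xpath('descendant::a/text()')[0].strip()

# All child divs whose class contains "col-currency-rate" (headers included).
rate_cols = row.xpath('div[contains(@class, "col-currency-rate")]')

# The same node set, narrowed to its last three members.
last_three = row.xpath(
    'div[contains(@class, "col-currency-rate")][position() > last() - 3]'
)
values = [col.text_content().strip() for col in last_three]

print(name, len(rate_cols), values)
```

Six divs match the contains() test here, three headers and three values, so the positional predicate is what drops the headers.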
Output:

Народный банк Казахстана
- 8 729 518 087
- +101 401 107
- -190 957 466

ForteBank
- 1 985 956 865
- +89 298 547
- +390 999 868

Сбербанк России
- 1 983 840 092
- +88 853 745
- +119 145 827

Kaspi Bank
- 1 907 391 103
- +12 378 770
- +233 318 909

Банк ЦентрКредит
- 1 495 599 542
- +34 795 443
- -14 202 851

АТФБанк
- 1 314 405 536
- +1 661 967
- -19 558 254

First Heartland Jýsan Bank
- 1 217 617 065
- +52 641 777
- -553 564 176

Жилстройсбербанк Казахстана
- 1 148 974 349
- +7 721 823
- +261 041 394

Евразийский банк
- 1 040 820 999
- -25 910 447
- -25 911 373

Ситибанк Казахстан
- 758 117 020
- +48 724 924
- +82 877 576

Банк "Bank RBK"
- 618 310 738
- +21 856 874
- +62 626 834

Альфа-Банк
- 504 777 556
- +17 401 839
- +51 157 130

Altyn Bank («Народный банк Казахстана»)
- 421 018 633
- -20 058 555
- +33 720 048

Нурбанк
- 408 442 557
- +7 065 511
- -18 282 545

Хоум Кредит энд Финанс Банк
- 372 901 871
- -2 127 105
- +33 983 288

Банк Китая в Казахстане
- 324 386 349
- +11 609 880
- +4 997 316

Банк ВТБ
- 184 247 490
- +5 800 194
- +40 725 927

First Heartland Bank (Банк ЭкспоКреди)
- 173 058 018
- -17 261 535
- +16 047 168

Торгово-промышленный Банк Китая в Алматы
- 140 792 847
- +6 365 348
- -26 137 736

Банк Kassa Nova
- 133 910 512
- +954 985
- +4 039 523

Tengri Bank (Punjab National Bank)
- 133 721 602
- +1 136 896
- -485 570

Азия Кредит Банк
- 99 659 306
- -3 790 116
- -21 420 844

Capital Bank Kazakhstan
- 85 702 895
- -3 165 322
- +4 469 187

KZI Bank (Казахстан Зират Интернешнл)
- 65 240 704
- -3 412 060
- -126 750

Шинхан Банк Казахстан
- 43 323 406
- -7 588 366
- +722 399

Исламский Банк "Al-Hilal"
- 30 562 279
- +2 411 098
- -1 430 198

Заман-Банк
- 22 969 984
- -168 105
- +5 544 675

Национальный Банк Пакистана
- 4 705 084
- -20 113
- -131 233
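
The values above are strings that use spaces as thousands separators and carry an explicit sign. If numbers are needed for further processing, a conversion along the following lines might work. The helper name parse_amount is hypothetical, and the sketch assumes the separators are ordinary or non-breaking spaces.

```python
def parse_amount(text: str) -> int:
    """Convert a string such as '+89 298 547' or '-485 570' to an int."""
    # Remove regular and non-breaking spaces used as thousands separators.
    cleaned = text.strip().replace('\u00a0', '').replace(' ', '')
    # int() accepts a leading '+' or '-' sign.
    return int(cleaned)


samples = ['1 985 956 865', '+89 298 547', '-485 570']
amounts = [parse_amount(s) for s in samples]
print(amounts)  # [1985956865, 89298547, -485570]
```

According to the column headers in the question's HTML, the amounts are in thousands of tenge.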

Try this:

import requests
import bs4 as bs
base_url = 'https://bankchart.kz/spravochniki/reytingi_cbr/2/2019/7'
soup = bs.BeautifulSoup(requests.get(base_url).text, 'lxml')
res = soup.find_all('div', {'class': 'row'})

final = list()
# res[1:] to skip the header of the columns
for bank in res[1:]:
    bank_data = list()
    # Bank name
    bank_data.append(bank.find('a').text.strip('\n'))
    # Image
    bank_data.append(bank.find('img')['src'])
    # Use a separate name so the outer `res` list is not overwritten
    rate_cols = bank.find_all('div', {'class': 'col col-currency-rate'})
    for values in rate_cols:
        data = values.find_all('p')
        for x in data:
            if x.text:
                # All the three values
                bank_data.append(x.text)
    final.append(bank_data)
for x in final:
    print(x)

Check whether this works for you.


So you basically want the value of the third field for Fortebank, the one containing 390 999 868, for all the rows. Am I right?

@Nitin No, I want to get the values from the three columns, i.e. 1 985 956 865, +89 298 547 and +390 999 868, for every row of course (there are 28 such rows), plus a link to the bank's image and its name. But I think I will work that out myself.

Thank you very much. Could you comment on these expressions: 'descendant::a/text()' and 'div[contains(@class, "col-currency-rate")][position() > last() - 3]'? Also, why do I get [] when I try to get the image's src with row.xpath('div[@class="col-currency"]/img/@src')?

@pythoner, don't change the requirements on the fly. If you have a new issue, create a new question.

bs - it's not serious.

@pythoner Can you elaborate? I don't understand.
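
One plausible explanation for the empty list asked about in the comments, assuming the page's markup matches the HTML in the question: @class="..." compares the whole attribute string, and the first column's class attribute is "col col-currency", not "col-currency". The sketch below uses a reduced copy of that column to compare three expressions; the variable names are illustrative only.

```python
from lxml import html

# Reduced copy of the first column of one row from the question's HTML.
SAMPLE = """
<div class="row"><div class="col col-currency">
2.
<img rel="nofollow" src="https://st6.prosto.im/cache/st6/1/0/5/5/1055/1055.jpg" width="16" height="16" alt="">
<a target="_blank" href="/spravochniki/reytingi_banka/2/1057">
ForteBank
</a></div></div>
"""

doc = html.document_fromstring(SAMPLE)
row = doc.xpath('//div[@class="row"]')[0]

# Exact comparison against only part of the class attribute: no match.
exact_partial = row.xpath('div[@class="col-currency"]/img/@src')

# Exact comparison against the full attribute string: matches.
exact_full = row.xpath('div[@class="col col-currency"]/img/@src')

# Ignoring the class entirely and looking for any <img> below the row.
by_axis = row.xpath('descendant::img/@src')

print(exact_partial)  # []
print(exact_full)
print(by_axis)
```

Either of the last two forms returns the src value here. The descendant:: form mirrors the style used for the bank name in the lxml answer.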