Python: scraping with BeautifulSoup comes out on one line

Tags: python, python-3.x, web-scraping, beautifulsoup

I'm a beginner with Python, and what I'm trying to do is scrape a website with BeautifulSoup. This is a small part of the page's HTML source:

<table class="swift" width="100%">
   <tr>
     <th class="no">ID</th>
     <th>Bank or Institution</th>
     <th>City</th>
     <th class="branch">Branch</th>
     <th>Swift Code</th>
   </tr>   <tr>
     <td align="center">101</td>
     <td>BANK LEUMI ROMANIA S.A.</td>
     <td>CONSTANTA</td>
     <td>(CONSTANTA BRANCH)</td>
     <td align="center"><a href="/romania/dafbro22cta/">DAFBRO22CTA</a></td>
   </tr>
   <tr>
     <td align="center">102</td>
     <td>BANK LEUMI ROMANIA S.A.</td>
     <td>ORADEA</td>
     <td>(ORADEA BRANCH)</td>
     <td align="center"><a href="/romania/dafbro22ora/">DAFBRO22ORA</a></td>
   </tr>
Everything gets scraped into one line, when what I really want is this:

ID, Bank or Institution, City, Branch, Swift Code

101, BANK LEUMI ROMANIA S.A., CONSTANTA, (CONSTANTA BRANCH) ,DAFBRO22CTA

102, BANK LEUMI ROMANIA S.A., ORADEA, (ORADEA BRANCH), DAFBRO22ORA
This is my code:

import requests
from bs4 import BeautifulSoup

base_url = "https://www.theswiftcodes.com/"
nr = 0
page = 'page'
country = 'Romania'
while nr < 4:
    url_country = base_url + country + '/' + 'page' + "/" + str(nr) + "/"
    pages = requests.get(url_country)
    soup = BeautifulSoup(pages.text, 'html.parser')

    for script in soup.find_all('script'):
        script.extract()

    tabel = soup.find_all("table")
    text = ("".join([p.get_text() for p in tabel]))
    nr += 1
    print(text)

    file = open('swiftcodes.txt', 'a')
    file.write(text)
    file.close()

    file = open('swiftcodes.txt', 'r')
    for item in file:
        print(item)
    file.close()
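For what it's worth, the reason everything ends up on one line is that calling `get_text()` on the whole table concatenates the text of every cell with no useful separator. A minimal demonstration of the difference (a sketch using a trimmed-down version of the question's markup):

```python
from bs4 import BeautifulSoup

html = """<table>
  <tr><th>ID</th><th>City</th></tr>
  <tr><td>101</td><td>CONSTANTA</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# get_text() with no arguments glues all cell text together,
# keeping only the incidental whitespace between tags
print(repr(table.get_text()))

# a separator plus strip=True at least delimits the cells,
# though header rows and data rows still run together
print(table.get_text(", ", strip=True))
```

This is why the answers below iterate row by row instead of text-dumping the whole table at once.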

This should do it:

from bs4 import BeautifulSoup

html = """<table class="swift" width="100%">
   <tr>
     <th class="no">ID</th>
     <th>Bank or Institution</th>
     <th>City</th>
     <th class="branch">Branch</th>
     <th>Swift Code</th>
   </tr>   <tr>
     <td align="center">101</td>
     <td>BANK LEUMI ROMANIA S.A.</td>
     <td>CONSTANTA</td>
     <td>(CONSTANTA BRANCH)</td>
     <td align="center"><a href="/romania/dafbro22cta/">DAFBRO22CTA</a></td>
   </tr>
   <tr>
     <td align="center">102</td>
     <td>BANK LEUMI ROMANIA S.A.</td>
     <td>ORADEA</td>
     <td>(ORADEA BRANCH)</td>
     <td align="center"><a href="/romania/dafbro22ora/">DAFBRO22ORA</a></td>
   </tr>"""

soup = BeautifulSoup(html, "html.parser")  # don't shadow the builtin str

for row in soup.find_all("tr"):
    result = ""
    for cell in row.find_all("th"):  # header cells
        result += cell.text + ", "
    for cell in row.find_all("td"):  # data cells
        result += cell.text + ", "
    print(result.rstrip(", "))  # drop the trailing comma and space

Output:

ID, Bank or Institution, City, Branch, Swift Code
101, BANK LEUMI ROMANIA S.A., CONSTANTA, (CONSTANTA BRANCH), DAFBRO22CTA
102, BANK LEUMI ROMANIA S.A., ORADEA, (ORADEA BRANCH), DAFBRO22ORA

Comment: Could you try updating it inside my code? It's a bit hard to follow like this.

Reply: There are only two things going on in the code. Iterate over all the tr tags; inside each tr, iterate over its td or th tags and append the text values to the result variable, then print it at the end of each tr iteration. rstrip is just a string operation that removes the trailing comma. This code should go between your print(text) and file = open('swiftcodes.txt', 'a') lines. I don't know what the final goal is; you showed the desired output, and this code produces it. If you want to store it in a file, you can easily adapt the solution; if you want to store the header and the rest separately, that can be done by printing the values to the file instead of accumulating them in result.
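Putting the comment's advice together, here is one way the row extraction could be dropped into the original loop's file-writing step (a sketch, not the answerer's exact code: `swift_rows` is a hypothetical helper name, and the inline HTML string stands in for the fetched page):

```python
import csv
from bs4 import BeautifulSoup

html = """<table class="swift">
  <tr><th>ID</th><th>City</th><th>Swift Code</th></tr>
  <tr><td>101</td><td>CONSTANTA</td><td>DAFBRO22CTA</td></tr>
  <tr><td>102</td><td>ORADEA</td><td>DAFBRO22ORA</td></tr>
</table>"""

def swift_rows(html_text):
    """Yield one list of cell strings per <tr>, th and td alike."""
    soup = BeautifulSoup(html_text, "html.parser")
    for tr in soup.find_all("tr"):
        cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        if cells:  # skip rows that contain no cells at all
            yield cells

# append comma-separated rows to the file, as in the question's loop
with open("swiftcodes.txt", "a", newline="") as f:
    writer = csv.writer(f)
    for row in swift_rows(html):
        writer.writerow(row)
```

Using `csv.writer` instead of manual `", ".join(...)` also handles cell values that themselves contain commas.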
Another answer, restricting the search to the table with class swift:

from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.theswiftcodes.com/united-states/')
soup = BeautifulSoup(r.text, 'lxml')
rows = soup.find(class_="swift").find_all('tr')
th = [th.text for th in rows[0].find_all('th')]
print(th)
for row in rows[1:]:
    cell = [i.text for i in row.find_all('td', colspan=False)]
    print(cell)
Output:

['ID', 'Bank or Institution', 'City', 'Branch', 'Swift Code']
['1', '1ST CENTURY BANK, N.A.', 'LOS ANGELES,CA', '', 'CETYUS66']
['2', '1ST PMF BANCORP', 'LOS ANGELES,CA', '', 'PMFAUS66']
['3', '1ST PMF BANCORP', 'LOS ANGELES,CA', '', 'PMFAUS66HKG']
['4', '3M COMPANY', 'ST. PAUL,MN', '', 'MMMCUS44']
['5', 'ABACUS FEDERAL SAVINGS BANK', 'NEW YORK,NY', '', 'AFSBUS33']
[]
['6', 'ABBEY NATIONAL TREASURY SERVICES LTD US BRANCH', 'STAMFORD,CT', '', 'ANTSUS33']
['7', 'ABBOTT LABORATORIES', 'ABBOTT PARK,IL', '', 'ABTTUS44']
['8', 'ABBVIE, INC.', 'CHICAGO,IL', '', 'ABBVUS44']
['9', 'ABEL/NOSER CORP', 'NEW YORK,NY', '', 'ABENUS3N']
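Note the empty `[]` in the output above: the site's tables apparently contain separator rows whose single cell spans all columns, so the `colspan=False` filter leaves nothing behind. A small guard skips them entirely (a sketch using hypothetical markup with one such separator row, not the live page):

```python
from bs4 import BeautifulSoup

html = """<table class="swift">
  <tr><th>ID</th><th>Swift Code</th></tr>
  <tr><td colspan="2">A</td></tr>
  <tr><td>1</td><td>CETYUS66</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find(class_="swift").find_all("tr")

data = []
for row in rows[1:]:
    # colspan=False matches only td tags that lack a colspan attribute,
    # so the letter-separator row yields an empty list of cells
    cells = [td.get_text() for td in row.find_all("td", colspan=False)]
    if cells:  # drop the separator rows instead of printing []
        data.append(cells)

print(data)
```

The same `if cells:` guard can be added before the `print(cell)` in the answer above to suppress the stray empty lists.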