Python 3.x: How do I save a txt file for each set of links?
I am trying to scrape multiple Yellow Pages listings and store the printed output in txt files. I know that no login is required to get the data on these pages; I am just trying to get some practice logging in with requests.Session(). I want to store the title of each url in set_1 in one txt file, YP_set_1.txt, and do the same for the urls in set 2. Here is my code:
import requests
from bs4 import BeautifulSoup
import requests.cookies
import time
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
'Referer': "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_login&vrid=63dbd394-afff-4794-aeb0-51dd19957ebc&merge_history=true"}
url = "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_register&vrid=cc9cb936-50d8-493b-83c6-842ec2f068ed®ister=true"
page = s.get(url)  # fetch the login page once
soup = BeautifulSoup(page.content, "lxml")
# grab the CSRF token from the first <input> that carries a value attribute
csrf = soup.find("input", value=True)["value"]
USERNAME = '****.*****@*****.***'
PASSWORD = '*******'
cj = s.cookies  # session cookie jar, reused for the scraping requests below
login_data = dict(email=USERNAME, password=PASSWORD, _csrf=csrf)
s.post(url, data=login_data, headers=headers)
set_1 = "This is the first set."
targeted_pages = ['https://www.yellowpages.com/brookfield-wi/business',
'https://www.yellowpages.com/bronx-ny/cheap-party-halls',
'https://www.yellowpages.com/bronx-ny/24-hour-liquor-store',
'https://www.yellowpages.com/bronx-ny/24-hour-oil-change',
'https://www.yellowpages.com/bronx-ny/auto-insurance',
'https://www.yellowpages.com/bronx-ny/awnings-canopies',
'https://www.yellowpages.com/bronx-ny/golden-corral',
'https://www.yellowpages.com/bronx-ny/concrete-contractors',
'https://www.yellowpages.com/bronx-ny/automobile-salvage',
'https://www.yellowpages.com/bronx-ny/24-hour-daycare-centers',
'https://www.yellowpages.com/bronx-ny/movers',
'https://www.yellowpages.com/bronx-ny/nursing-homes',
'https://www.yellowpages.com/bronx-ny/signs'
]
for target_urls in targeted_pages:
    targeted_page = s.get(target_urls, headers=headers, cookies=cj)
    targeted_soup = BeautifulSoup(targeted_page.content, "lxml")
    for record in targeted_soup.findAll('title'):
        with open("YP_Set_1.txt", "w") as text_file:
            print(set_1 + '\n' + record.text, file=text_file)
    time.sleep(5)
set_2 = "This is the second set."
targeted_pages_2 = ['https://www.yellowpages.com/north-miami-beach-fl/attorneys',
'https://www.yellowpages.com/north-miami-beach-fl/employment-agencies',
'https://www.yellowpages.com/north-miami-beach-fl/dentists',
'https://www.yellowpages.com/north-miami-beach-fl/general-contractors',
'https://www.yellowpages.com/north-miami-beach-fl/electricians',
'https://www.yellowpages.com/north-miami-beach-fl/pawnbrokers',
'https://www.yellowpages.com/north-miami-beach-fl/lighting-fixtures',
'https://www.yellowpages.com/north-miami-beach-fl/towing'
]
for target_urls_2 in targeted_pages_2:
    targeted_page_2 = s.get(target_urls_2, headers=headers, cookies=cj)
    targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")
    for record in targeted_soup_2.findAll('title'):
        with open("YP_Set_2.txt", "w") as text_file:
            print(set_2 + '\n' + record.text, file=text_file)
When I run the code, this is the output in YP_Set_1.txt:
This is the first set.
Signs in Bronx, New York with Reviews & Ratings - YP.com
And this is the output in YP_Set_2.txt:
This is the second set.
Towing in North Miami Beach, Florida with Reviews & Ratings - YP.com
Is there a quick fix so that I can store the titles of all the urls in each set in the text file, instead of only getting the title of the last url in the set? Thanks for your input.

You keep overwriting the contents because you reopen the file inside the loop. You could reopen with "a", which appends, instead of "w", which overwrites, but it is easier to open the file just once, outside the loop:
with open("YP_Set_2.txt", "w") as text_file:
for target_urls_2 in targeted_pages_2:
targeted_page_2 = s.get(target_urls_2, headers=headers, cookies=cj)
targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")
for record in targeted_soup_2.find_all('title'):
text_file.write(set_2 + '\n' + record.text)
Do the same with both blocks.
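Since the two blocks differ only in the url list, the label, and the filename, one way to avoid repeating yourself is a small helper function. This is just a sketch that assumes the session s, headers, cookie jar cj, lists, and labels defined above; save_titles is a hypothetical name, not part of the original code:

def save_titles(urls, label, filename):
    # open once, outside the loop, so nothing gets overwritten
    with open(filename, "w") as text_file:
        for target_url in urls:
            page = s.get(target_url, headers=headers, cookies=cj)
            soup = BeautifulSoup(page.content, "lxml")
            for record in soup.find_all('title'):
                text_file.write(label + '\n' + record.text + '\n')
            time.sleep(5)  # be polite between requests

save_titles(targeted_pages, set_1, "YP_Set_1.txt")
save_titles(targeted_pages_2, set_2, "YP_Set_2.txt")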
Thanks again for your help.