Python 3.x: How do I save a txt file for each set of links?
I am trying to scrape multiple Yellow Pages listings and store the printed output in txt files. I know that no login is required to get the data on these pages; I am just trying to get some practice logging in with requests.Session(). I want to store the title of each url in set_1 in one txt file, YP_set_1.txt, and do the same for the urls in set 2. Here is my code:
import requests
from bs4 import BeautifulSoup
import requests.cookies
import time
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
'Referer': "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_login&vrid=63dbd394-afff-4794-aeb0-51dd19957ebc&merge_history=true"}
url = "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_register&vrid=cc9cb936-50d8-493b-83c6-842ec2f068ed®ister=true"
page = s.get(url)  # fetch the login page once
soup = BeautifulSoup(page.content, "lxml")
# grab the CSRF token from the first <input> that carries a value attribute
csrf = soup.find("input", value=True)["value"]
USERNAME = '****.*****@*****.***'
PASSWORD = '*******'
cj = s.cookies  # session cookie jar, reused for the scraping requests below
login_data = dict(email=USERNAME, password=PASSWORD, _csrf=csrf)
s.post(url, data=login_data, headers=headers)
set_1 = "This is the first set."
targeted_pages = ['https://www.yellowpages.com/brookfield-wi/business',
'https://www.yellowpages.com/bronx-ny/cheap-party-halls',
'https://www.yellowpages.com/bronx-ny/24-hour-liquor-store',
'https://www.yellowpages.com/bronx-ny/24-hour-oil-change',
'https://www.yellowpages.com/bronx-ny/auto-insurance',
'https://www.yellowpages.com/bronx-ny/awnings-canopies',
'https://www.yellowpages.com/bronx-ny/golden-corral',
'https://www.yellowpages.com/bronx-ny/concrete-contractors',
'https://www.yellowpages.com/bronx-ny/automobile-salvage',
'https://www.yellowpages.com/bronx-ny/24-hour-daycare-centers',
'https://www.yellowpages.com/bronx-ny/movers',
'https://www.yellowpages.com/bronx-ny/nursing-homes',
'https://www.yellowpages.com/bronx-ny/signs'
]
for target_urls in targeted_pages:
    targeted_page = s.get(target_urls, headers=headers, cookies=cj)
    targeted_soup = BeautifulSoup(targeted_page.content, "lxml")
    for record in targeted_soup.findAll('title'):
        with open("YP_Set_1.txt", "w") as text_file:
            print(set_1 + '\n' + record.text, file=text_file)
    time.sleep(5)
set_2 = "This is the second set."
targeted_pages_2 = ['https://www.yellowpages.com/north-miami-beach-fl/attorneys',
'https://www.yellowpages.com/north-miami-beach-fl/employment-agencies',
'https://www.yellowpages.com/north-miami-beach-fl/dentists',
'https://www.yellowpages.com/north-miami-beach-fl/general-contractors',
'https://www.yellowpages.com/north-miami-beach-fl/electricians',
'https://www.yellowpages.com/north-miami-beach-fl/pawnbrokers',
'https://www.yellowpages.com/north-miami-beach-fl/lighting-fixtures',
'https://www.yellowpages.com/north-miami-beach-fl/towing'
]
for target_urls_2 in targeted_pages_2:
    targeted_page_2 = s.get(target_urls_2, headers=headers, cookies=cj)
    targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")
    for record in targeted_soup_2.findAll('title'):
        with open("YP_Set_2.txt", "w") as text_file:
            print(set_2 + '\n' + record.text, file=text_file)
When I run the code, this is the output in YP_Set_1.txt:
This is the first set.
Signs in Bronx, New York with Reviews & Ratings - YP.com
And this is the output in YP_Set_2.txt:
This is the second set.
Towing in North Miami Beach, Florida with Reviews & Ratings - YP.com
Is there a quick fix so that I can store the titles of all the urls in each set in the text file, instead of only getting the title of the last url in the set? Thanks for your input.

You keep overwriting the contents because you reopen the file inside the loop. You could reopen with "a", which appends, instead of "w", which overwrites, but it is easier to open the file just once, outside the loop:
with open("YP_Set_2.txt", "w") as text_file:
for target_urls_2 in targeted_pages_2:
targeted_page_2 = s.get(target_urls_2, headers=headers, cookies=cj)
targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")
for record in targeted_soup_2.find_all('title'):
text_file.write(set_2 + '\n' + record.text)
Do the same with both blocks.
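Since the two blocks differ only in the url list, the label, and the filename, one way to avoid repeating yourself is a small helper function. This is just a sketch that assumes the session s, headers, cookie jar cj, lists, and labels defined above; save_titles is a hypothetical name, not part of the original code:

def save_titles(urls, label, filename):
    # open once, outside the loop, so nothing gets overwritten
    with open(filename, "w") as text_file:
        for target_url in urls:
            page = s.get(target_url, headers=headers, cookies=cj)
            soup = BeautifulSoup(page.content, "lxml")
            for record in soup.find_all('title'):
                text_file.write(label + '\n' + record.text + '\n')
            time.sleep(5)  # be polite between requests

save_titles(targeted_pages, set_1, "YP_Set_1.txt")
save_titles(targeted_pages_2, set_2, "YP_Set_2.txt")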
Thanks again for your help.