Python BeautifulSoup - unable to create csv and text files after scraping
I am trying to extract article URLs from every page of a website. Only the URLs from the first page get scraped, repeatedly, and stored in a csv file. The information from those links is then scraped in the same way and stored in a text file. I need some help with this problem.
import requests
from bs4 import BeautifulSoup
import csv
import lxml
import urllib2

base_url = 'https://www.marketingweek.com/?s=big+data'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

res = []

while 1:
    search_results = soup.find('div', class_='archive-constraint') #localizing search window with article links
    article_link_tags = search_results.findAll('a') #ordinary scheme goes further
    res.append([url['href'] for url in article_link_tags])
    #Automatically clicks next button to load other articles
    next_button = soup.find('a', text='>>')
    #Searches for articles till Next button is not found
    if not next_button:
        break
    res.append([url['href'] for url in article_link_tags])
    soup = BeautifulSoup(response.text, "lxml")

for i in res:
    for j in i:
        print(j)

####Storing scraped links in csv file###
with open('StoreUrl1.csv', 'w+') as f:
    f.seek(0)
    for i in res:
        for j in i:
            f.write('\n'.join(i))

#######Extracting info from URLs########
with open('StoreUrl1.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)
    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url), "lxml")
        with open('InfoOutput1.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')
I tried to print next_button, and it is None.

If we can go directly to a specific page by slightly modifying the URL, why do we need the next-button tag at all? In your code you do find the link to the next page, but you never take the URL out of the element returned by soup.find('a', text='>>'). Here is your modified code. Let me know whether it works for you.
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import urllib2

df = pd.DataFrame()
base_url = 'https://www.marketingweek.com/?s=big+data'

res = []
page = 1
while page < 362:
    page_url = 'https://www.marketingweek.com/page/'+str(page)+'/?s=big+data'
    response = requests.get(page_url)
    soup = BeautifulSoup(response.content, "html.parser")
    search_results = soup.find('div', class_='columns-flex full-ads') #localizing search window with article links
    try:
        article_link_tags = search_results.findAll('a') #ordinary scheme goes further
    except AttributeError:
        page += 1   #advance before continue, otherwise the loop never leaves this page
        continue
    res.append([url['href'] for url in article_link_tags])
    print('Found {} links on page {} '.format(len(res[-1]), page))
    df = df.append(res[-1])   #DataFrame.append returns a new frame, so reassign
    page += 1

####Storing scraped links in csv file###
df.to_csv('SampleUrl.csv', index=False)
I only know the URL-collection part:
import requests
from bs4 import BeautifulSoup

next_button = 'https://www.marketingweek.com/page/1/?s=big+data'

res = []
while 1:
    response = requests.get(next_button)
    soup = BeautifulSoup(response.text, "lxml")
    search_results = soup.find('div', class_='archive-constraint') #localizing search window with article links
    article_link_tags = search_results.findAll('a') #ordinary scheme goes further
    #added duplication drop from list of urls
    row = [url['href'] for url in article_link_tags]
    row = list(set(row))
    res.append(row)
    #Automatically clicks next button to load other articles
    next_button = soup.find('a', class_='next page-numbers')
    #Searches for articles till Next button is not found
    if not next_button:
        break
    next_button = next_button['href']

for i in res:
    for j in i:
        print(j)
EDIT: the site blocked the first version's parser bot. The fixes below may help (to be 100% sure, you would need to know how the bot blocker works on that site). I added a random sleep of 5 to 30 seconds between requests to the site, and a random user-agent change (desktop user agents only), to get past the blocking:
import numpy as np
import time
import requests
from bs4 import BeautifulSoup

user_agents = {
    0: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
    1: 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
    2: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
    3: 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36',
    4: 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1'}

next_button = 'https://www.marketingweek.com/page/1/?s=big+data'

res = []
while 1:
    response = requests.get(next_button, headers={'User-Agent': user_agents[np.random.randint(0, 5)]})
    soup = BeautifulSoup(response.text, "lxml")
    search_results = soup.find('div', class_='columns-flex full-ads') #localizing search window with article links
    article_link_tags = search_results.findAll('a') #ordinary scheme goes further
    #added duplication drop from list of urls
    row = [url['href'] for url in article_link_tags]
    row = list(set(row))
    res.append(row)
    #Automatically clicks next button to load other articles
    next_button = soup.find('a', class_='next page-numbers')
    #Searches for articles till Next button is not found
    if not next_button:
        break
    next_button = next_button['href']
    time.sleep((30 - 5) * np.random.random() + 5)

for i in res:
    for j in i:
        print(j)
If it still fails after a while, try doubling the sleep time, or increasing it even further.

A solution using lxml's html parser. There are 361 pages, and each page has 12 links on it. We can iterate to every page and extract the links with XPath.

XPath makes it possible to get:
- the text under a specific tag
- the value of a specific tag's attribute (here, the value of the 'href' attribute of the 'a' tag)
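The two XPath uses above can be illustrated on a small fragment. The markup below is made up to mirror the site's structure (the 'archive-constraint' div and 'hentry-title entry-title' heading class used in the script further down):

```python
from lxml import html

# Hypothetical fragment shaped like one search result on the site.
fragment = '''
<div class="archive-constraint">
  <h2 class="hentry-title entry-title">
    <a href="https://example.com/article-1">Big data article</a>
  </h2>
</div>
'''
tree = html.fromstring(fragment)

# 1) text under a specific tag
titles = tree.xpath('//h2[@class = "hentry-title entry-title"]/a/text()')
# 2) value of a specific attribute ('href' of the 'a' tag)
links = tree.xpath('//h2[@class = "hentry-title entry-title"]/a/@href')
print(titles, links)
```

Both queries return plain Python lists of strings, which is why the script below can loop directly over `page_links`.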
Any explanation? @ksai No. Only the article URLs on the first page are displayed and scraped.

Obviously it scrapes the first page, but how would it scrape the second page without clicking the Next button or navigating pages by URL? Compared with changing the URL, finding the 'a' tag of the Next button is somewhat harder.

@ksai The modification you suggested doesn't work. Is there another solution? Unfortunately the output screen is blank; nothing is printed.

Processing that site takes a lot of time; perhaps you didn't wait until the script finished? I stopped the script manually after about 30 seconds and got a res containing 6 lists of URLs.

OK, I'll do as suggested. Got this error: AttributeError: 'NoneType' object has no attribute 'findAll', at the line article_link_tags = search_results.findAll('a').

Did you copy and paste my code? This error means that search_results is empty for a particular page. It can easily be worked around with try/except, but then you obviously won't collect the links for that particular page, or a bigger workaround would be needed.

Finally I got this error: File "C:\Users\Rajesh Ramesh\documents\visual studio 2015\Projects\Final website scraping- 2\Final website scraping- 2\Final_website_scraping_2.py", line 381, in soup = BeautifulSoup(urllib2.urlopen(url)) File "C:\Python27\lib\urllib2.py", line 154, in urlopen return opener.open(url, data, timeout) File "C:\Python27\lib\urllib2.py", line 421, in open protocol = req.get_type() File "C:\Python27\lib\urllib2.py", line 283, in get_type raise ValueError, "unknown url type: %s" % self.__original ValueError: unknown url type: Link. Could you help me out here?
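The "unknown url type: Link" error in the last comment is most likely the header row being read back as data: the script below writes a header ["Sl. No.", "Page Number", "Link"], so when the CSV is re-read, the literal string "Link" ends up being passed to urlopen. A sketch of the fix, with a made-up row standing in for real scraped data:

```python
import csv

# Reproduce the shape of All_links.csv with one made-up data row.
with open('All_links.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(["Sl. No.", "Page Number", "Link"])
    writer.writerow([1, 1, "https://example.com/article-1"])

# The fix: skip the header row before treating the last column as a URL,
# so the string "Link" never reaches urlopen.
with open('All_links.csv') as f:
    reader = csv.reader(f)
    next(reader, None)                      # drop the header row
    urls = [row[2] for row in reader if row]
print(urls)
```

With the header skipped, every remaining value in the Link column is an actual URL.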
import csv
from lxml import html
from time import sleep
import requests
from random import randint

outputFile = open("All_links.csv", 'wb')
fileWriter = csv.writer(outputFile)

fileWriter.writerow(["Sl. No.", "Page Number", "Link"])

url1 = 'https://www.marketingweek.com/page/'
url2 = '/?s=big+data'

sl_no = 1

#iterating from 1st page through 361th page
for i in xrange(1, 362):
    #generating final url to be scraped using page number
    url = url1 + str(i) + url2

    #Fetching page
    response = requests.get(url)
    sleep(randint(10, 20))

    #using html parser
    htmlContent = html.fromstring(response.content)

    #Capturing all 'a' tags under h2 tag with class 'hentry-title entry-title'
    page_links = htmlContent.xpath('//div[@class = "archive-constraint"]//h2[@class = "hentry-title entry-title"]/a/@href')
    for page_link in page_links:
        fileWriter.writerow([sl_no, i, page_link])
        sl_no += 1

outputFile.close()