Python - Incomplete data (web scraping)

Tags: Python, Web Scraping, BeautifulSoup, urllib2

This is my code:

from bs4 import BeautifulSoup
import urllib2
import re
import sys


main_url = "http://sukhansara.com/سخن-سرا-پر-خوش-آمدید/newposts/parveenshakir/psghazals/"
test_url = urllib2.urlopen(main_url)
readHtml = test_url.read()
test_url.close()


soup = BeautifulSoup(readHtml, "html.parser")

url = soup.find('div',attrs={"class":"entry-content"}).findAll('div', attrs={"class":None})

count = 1

fobj = open('D:\Scrapping\parveen_again2.xml', 'w')
for getting in url:
   url = getting.find('a')
   if url.has_attr('href'):
          urls = url['href']       
          test_url = urllib2.urlopen(urls, timeout=36)
          readHtml = test_url.read()
          test_url.close()

          soup1 = BeautifulSoup(readHtml, "html.parser")

          title = soup1.find('title')
          title = title.get_text('+')
          title = title.split("|")

          author = soup1.find('div',attrs={"class":"entry-meta"}).find('span',attrs={"class":"categories-links"})


          author = author.findAll('a')

          fobj.write("<add><doc>\n")
          fobj.write("<field name=\"id\">sukhansara.com_pg1Author"+author[0].string.encode('utf8')+"Count"+str(count)+"</field>\n")
          fobj.write("<field name=\"title\">"+title[0].encode('utf8')+"</field>\n")
          fobj.write("<field name=\"content\">")

          count += 1


          poetry = soup1.find('div',attrs={"class":"entry-content"}).findAll('div')

          x=1
          check = True

          while check:
                 if poetry[x+1].string.encode('utf8') != author[0].string.encode('utf8'):
                        fobj.write(poetry[x].string.encode('utf8')+"|")
                        x+=1
                 else:
                        check = False
          fobj.write(poetry[x].string.encode('utf8'))

          fobj.write("</field>\n")
          fobj.write("<field name=\"group\">ur_poetry</field>\n")
          fobj.write("<field name=\"author\">"+author[0].string.encode('utf8')+"</field>\n")
          fobj.write("<field name=\"url\">"+urls+"</field>\n")
          fobj.write("<add><doc>\n\n")



fobj.close()

print "Done printing"
Sometimes I get 24 poems from 24 URLs, sometimes 81. But there are almost 100 URLs. Whenever I reach 81, I get this error:

AttributeError: 'NoneType' object has no attribute 'encode'

or sometimes a timeout error. What am I doing wrong?

Switching to requests and maintaining an open session should make it work:

import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    main_url = "http://sukhansara.com/سخن-سرا-پر-خوش-آمدید/newposts/parveenshakir/psghazals/"

    readHtml = session.get(main_url).content
    soup = BeautifulSoup(readHtml, "html.parser")

    # ...
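As for the `AttributeError` itself: BeautifulSoup's `.string` returns `None` whenever a tag has nested children (or no text at all), so calling `.encode()` on it crashes as soon as one of the ~100 pages has slightly different markup. A minimal sketch of a defensive helper (the `safe_text` name is illustrative, not part of the original code):

```python
from bs4 import BeautifulSoup

def safe_text(tag):
    # .string is None for a tag with nested children or no text,
    # so fall back to get_text(), which flattens all descendants.
    if tag is None:
        return ""
    text = tag.string if tag.string is not None else tag.get_text()
    return text.strip()

soup = BeautifulSoup("<div>plain</div><div><b>bold</b> mixed</div>", "html.parser")
divs = soup.find_all("div")
# divs[0].string == "plain", but divs[1].string is None because of the nested <b>
texts = [safe_text(d) for d in divs]
# texts -> ["plain", "bold mixed"]
```

With a guard like this, a `<div>` that happens to contain inline markup no longer stops the loop partway through the URL list; the intermittent timeouts are a separate issue and can be handled by passing `timeout=` to `session.get()` and retrying failed requests.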