Python 3 scraper doesn't parse the XPath till the end

I'm using the lxml.html module:

from lxml import html

page = html.parse('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')

# print(page.content)

# Pick the university names out of the <h3> tags in the left-hand column.
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')

print(len(unis))

with open('workfile.txt', 'w') as f:
    for uni in unis:
        f.write(uni + '\n')
The site (the URL in the code above) lists all the universities.

The problem is that it only parses up to the letter "H" (244 unis). I don't understand why, because as far as I can see it parses all of the HTML.


I also checked for myself that 244 isn't a limit for lists or anything else in Python 3.
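One way to confirm it's the parser truncating the tree rather than the XPath (a diagnostic sketch, not part of the original question): compare how many h3 elements lxml kept against how many literally appear in the downloaded bytes.

import urllib.request
from lxml import html

url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
raw = urllib.request.urlopen(url).read()

# Count the <h3> elements lxml's HTML parser actually kept in the tree.
tree = html.fromstring(raw)
print('h3 elements in parsed tree:', len(tree.xpath('//h3')))

# Count literal "<h3" occurrences in the raw bytes for comparison.
# If this number is larger, the parser is dropping part of the document.
print('"<h3" occurrences in raw bytes:', raw.count(b'<h3'))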

That HTML page isn't really HTML at all; it's completely broken. But the following will do what you want: it uses a parser that can cope with the broken markup. See the parser's documentation for more information.
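One such parser is lxml's soupparser module, which hands the markup to BeautifulSoup for repair and still gives back an lxml tree you can run the original XPath against. A minimal sketch, assuming beautifulsoup4 is installed:

import urllib.request
from lxml.html import soupparser

url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
raw = urllib.request.urlopen(url).read()

# soupparser delegates parsing to BeautifulSoup, which repairs the broken
# markup instead of silently truncating it.
root = soupparser.fromstring(raw)

# The same XPath as in the question, now evaluated on the full tree.
unis = root.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
print(len(unis))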

For web scraping, I'd recommend using BeautifulSoup. With bs4 this is easy to do:

from bs4 import BeautifulSoup
import urllib.request

universities = []
result = urllib.request.urlopen('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z')

soup = BeautifulSoup(result.read(), 'html.parser')

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        # Skip the A-Z section headers and the empty headers: a real
        # university name has more than 2 non-whitespace characters.
        headers = row.find_all(lambda tag: tag.name == 'h3'
                               and len(tag.text.strip()) > 2)
        for h in headers:
            universities.append(h.text)
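To mirror the question's final step, the collected names can then be written out the same way; printing the count is a quick check that the list now runs past "H":

# Write the names out as in the question, and sanity-check the count.
with open('workfile.txt', 'w') as f:
    for uni in universities:
        f.write(uni + '\n')

print(len(universities))  # should now cover the full A-Z list, not stop at 244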

Consider using requests and Beautiful Soup 4?

Again, as I said, it parses the HTML all the way to the end, so the problem isn't in the request function I'm using.

Is this meant for Python 3? It doesn't recognize the urlopen function.

Sorry, I tested it with Python 2. For Python 3 you need to add urllib.request; the answer has been updated. Note, though, that lxml can run into another problem: NameError: name 'unichr' is not defined. This has been fixed in later lxml versions. By the way, to make the XPath expression less dependent on the formatting, you should probably use //tr/td/h3[following-sibling::br]/text() instead. Or, to mimic Mad Matts' solution, use //tr/td/h3/text()[string-length(normalize-space()) > 0].
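A sketch of those two suggested expressions in use, assuming the lenient soupparser tree from the answer above so the broken markup doesn't cut the document short:

import urllib.request
from lxml.html import soupparser

raw = urllib.request.urlopen(
    'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
).read()
root = soupparser.fromstring(raw)

# Variant 1: anchor on structure (an h3 followed by a br) instead of the
# exact @style attribute string.
unis = root.xpath('//tr/td/h3[following-sibling::br]/text()')

# Variant 2: keep the loose path but drop whitespace-only text nodes,
# mimicking the filtering in the BeautifulSoup answer.
unis = root.xpath('//tr/td/h3/text()[string-length(normalize-space()) > 0]')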