Web scraping with Scrapy: getting table tr rows with a CSS selector is not working
Here, the output for title and text is None and []. I also get the error shown after the code.
def parse_courseTimings(self, response):
    sub_courses_tables = response.css('table.datadisplaytable tr')
    flag2 = 0
    for sub_course in sub_courses_tables:
        flag2 = flag2 + 1
        if flag2 == 1:
            title = sub_course.css('th.ddttitle a::text').extract_first()
            print(title)
        else:
            text = sub_course.css('td.dddefault :: text').extract()
            # while "\n" in text: text.remove("\n")
            print(text)
        if flag2 == 2:
            flag2 = 0
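The flag2 counter above treats the rows as alternating title and detail rows. A minimal sketch of the same pairing without a counter follows, assuming the rows really do alternate; pair_rows is a hypothetical helper name, and plain strings stand in for the tr selectors.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def pair_rows(rows: Iterable[T]) -> Iterator[Tuple[T, T]]:
    """Yield (title_row, detail_row) pairs from an alternating sequence of rows."""
    it = iter(rows)
    # zip pulls one item for the title and the next one for the details;
    # a trailing row without a partner is dropped.
    return zip(it, it)


# Plain strings stand in for the <tr> selectors here (illustrative data only).
rows = ["title 1", "details 1", "title 2", "details 2"]
pairs = list(pair_rows(rows))
for title_row, detail_row in pairs:
    print(title_row, "->", detail_row)
```

In the spider, the same helper could be called as pair_rows(response.css(...)), since a SelectorList is iterable.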
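I am not sure what the problem is here.
I am trying to get all the details of the course content. However, when I use a for loop to fetch the information for each course, it raises an error. Update: the problem is solved. I just had to add the summary attribute to the table selector when fetching the tr rows.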
<Selector xpath="descendant-or-self::table[@class and contains(concat(' ', normalize-space(@class), ' '), ' datadisplaytable ')]/descendant-or-self::*/tr" data='<tr>\n<th class="ddtitle" scope="colgr...'>
None
None
<Selector xpath="descendant-or-self::table[@class and contains(concat(' ', normalize-space(@class), ' '), ' datadisplaytable ')]/descendant-or-self::*/tr" data='<tr>\n<td class="dddefault">\nPlus one ...'>
None
2021-03-31 12:34:59 [scrapy.core.scraper] ERROR: Spider error processing <GET https://banner.uregina.ca:17023/ssbprod/bwckctlg.p_disp_listcrse?term_in=202130&subj_in=CS&crse_in=330&schd_in=A> (referer: https://banner.uregina.ca:17023/ssbprod/bwckctlg.p_display_courses)
Traceback (most recent call last):
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\twisted\internet\defer.py", line 662, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\UPosia\PycharmProjects\ScheduleScraper\schedule_crawler\schedule_crawler\spiders\schedule_spider.py", line 144, in parse_courseTimings
    text = sub_course.css('td.dddefault :: text').extract()
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\parsel\selector.py", line 282, in css
    return self.xpath(self._css2xpath(query))
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\parsel\selector.py", line 285, in _css2xpath
    return self._csstranslator.css_to_xpath(query)
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\parsel\csstranslator.py", line 107, in css_to_xpath
    return super(HTMLTranslator, self).css_to_xpath(css, prefix)
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\xpath.py", line 192, in css_to_xpath
    for selector in parse(css))
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 415, in parse
    return list(parse_selector_group(stream))
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 428, in parse_selector_group
    yield Selector(*parse_selector(stream))
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 454, in parse_selector
    next_selector, pseudo_element = parse_simple_selector(stream)
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 503, in parse_simple_selector
    pseudo_element = stream.next_ident()
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 819, in next_ident
    raise SelectorSyntaxError('Expected ident, got %s' % (next,))
  File "<string>", line None
cssselect.parser.SelectorSyntaxError: Expected ident, got <S ' ' at 15>
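What output do you want?
Does this answer your question?
No, that is a deleted question. I already have the answer.
And did that fix the syntax error?
Yes, it fixed all the errors. In fact, the problem was that, in the page's HTML, the table I was scraping contains another table with the same class name. The only thing that makes the first table unique is its summary attribute. @QHarr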
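# original code: matches the rows of every table with this class
sub_courses_tables = response.css('table.datadisplaytable tr')
# correct code: restrict the match to the table with this summary attribute
sub_courses_tables = response.css('table.datadisplaytable[summary="This layout table is used to present the sections found"] tr')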