Web scraping with Scrapy: selecting table tr rows with a CSS selector does not work

Tags: web-scraping, scrapy, web-crawler

With the code below, the title and text print as None or []. I also get a SelectorSyntaxError.

    def parse_courseTimings(self, response):
        sub_courses_tables = response.css('table.datadisplaytable tr')

        # Rows alternate: first a title row, then a details row.
        flag2 = 0
        for sub_course in sub_courses_tables:
            flag2 = flag2 + 1

            if flag2 == 1:
                title = sub_course.css('th.ddttitle a::text').extract_first()
                print(title)
            else:
                # This is the line the traceback points at (line 144).
                text = sub_course.css('td.dddefault :: text').extract()
                # while "\n" in text: text.remove("\n")
                print(text)
            if flag2 == 2:
                flag2 = 0

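I'm not sure what is wrong here.
I am trying to collect all the details of each course. However, when I use a for loop to read the information for each course, it raises this error.

Update: this is solved. I only needed to add the summary attribute to the table selector when selecting the tr rows.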

<Selector xpath="descendant-or-self::table[@class and contains(concat(' ', normalize-space(@class), ' '), ' datadisplaytable ')]/descendant-or-self::*/tr" data='<tr>\n<th class="ddtitle" scope="colgr...'>
None
None
<Selector xpath="descendant-or-self::table[@class and contains(concat(' ', normalize-space(@class), ' '), ' datadisplaytable ')]/descendant-or-self::*/tr" data='<tr>\n<td class="dddefault">\nPlus one ...'>
None
2021-03-31 12:34:59 [scrapy.core.scraper] ERROR: Spider error processing <GET https://banner.uregina.ca:17023/ssbprod/bwckctlg.p_disp_listcrse?term_in=202130&subj_in=CS&crse_in=330&schd_in=A> (referer: https://banner.uregina.ca:17023/ssbprod/bwckctlg.p_display_courses)
Traceback (most recent call last):
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\twisted\internet\defer.py", line 662, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\UPosia\PycharmProjects\ScheduleScraper\schedule_crawler\schedule_crawler\spiders\schedule_spider.py", line 144, in parse_courseTimings
    text = sub_course.css('td.dddefault :: text').extract()
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\parsel\selector.py", line 282, in css
    return self.xpath(self._css2xpath(query))
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\parsel\selector.py", line 285, in _css2xpath
    return self._csstranslator.css_to_xpath(query)
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\parsel\csstranslator.py", line 107, in css_to_xpath
    return super(HTMLTranslator, self).css_to_xpath(css, prefix)
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\xpath.py", line 192, in css_to_xpath
    for selector in parse(css))
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 415, in parse
    return list(parse_selector_group(stream))
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 428, in parse_selector_group
    yield Selector(*parse_selector(stream))
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 454, in parse_selector
    next_selector, pseudo_element = parse_simple_selector(stream)
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 503, in parse_simple_selector
    pseudo_element = stream.next_ident()
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 819, in next_ident
    raise SelectorSyntaxError('Expected ident, got %s' % (next,))
  File "<string>", line None
cssselect.parser.SelectorSyntaxError: Expected ident, got <S ' ' at 15>


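What output do you want?
Does this answer your question?
No, that is a deleted question. I already got the answer.
And that corrected the syntax error?
Yes, that corrected all the errors. In fact, the problem was that, in the page's HTML, the table I am scraping contains another table with the same name attribute. The only thing that makes the first table unique is its summary attribute. @QHarr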

# original code
sub_courses_tables = response.css('table.datadisplaytable tr')

# corrected code: only match the table that carries this summary attribute
sub_courses_tables = response.css('table.datadisplaytable[summary="This layout table is used to present the sections found"] tr')