Python: getting the `<h2>` tag associated with `<li>` tags

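I am using BeautifulSoup to scrape a website. I can get all of the data in the `<li>` tags, but I need to get the date in the `<h2>` tag that is associated with the respective `<li>` tags.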

    import datetime
    import locale
    import re

    # Attempt: find the <h2> tags whose text contains a month name.
    # Note: find_all() returns a list of tags, so date.text below fails.
    date = results_table.find_all('h2', string=re.compile('January|February|March|April|May|June|July|August|September|October|November|December'))
    locale.setlocale(locale.LC_ALL, 'en_US')
    changeDateFormat = date.text.strip()
    datePublished = datetime.datetime.strptime(changeDateFormat, '%B %d, %Y').strftime('%m%d%Y')
    ul = results_table.find('ul')

    for item in results_table.find_all('li', {'class': 'level-item'}):
        # try to obtain the correct date
        print(ul.previous_element)
        for nextLink in item.find_all('a'):
            for ad_id in nextLink.find_all('span'):
                print(ad_id.text.strip())
    
Desired output:

    05182018,/somedirectoryname/anothername/009,sometext,another value,long description 
    05182018,/somedirectoryname/anothername/008,sometext,another value,long description 
    03092018,/somedirectoryname/anothername/007,sometext,another value,long description 
    03092018,/somedirectoryname/anothername/006,sometext,another value,long description 
    03092018,/somedirectoryname/anothername/005,sometext,another value,long description 
    03092018,/somedirectoryname/anothername/004,sometext,another value,long description 
    
Page structure:

    <h2>May 18, 2018</h2>
    <ul>
     <li class="level-item"><a href="/somedirectoryname/anothername/009"><span class="some text">another value</span> long description </a></li>
     <li class="level-item"><a href="/somedirectoryname/anothername/008"><span class="some text">another value</span> long description </a></li>
    </ul>

    <h2>March 9, 2018</h2>
    <ul>
     <li class="level-item"><a href="/somedirectoryname/anothername/007"><span class="some text">another value</span> long description </a></li>
     <li class="level-item"><a href="/somedirectoryname/anothername/006"><span class="some text">another value</span> long description </a></li>
     <li class="level-item"><a href="/somedirectoryname/anothername/005"><span class="some text">another value</span> long description </a></li>
     <li class="level-item"><a href="/somedirectoryname/anothername/004"><span class="some text">another value</span> long description </a></li>
    </ul>

    <h2>December 1, 2017</h2>
    <ul>
     <li class="level-item"><a href="/somedirectoryname/anothername/003"><span class="some text">another value</span> long description </a></li>
     <li class="level-item"><a href="/somedirectoryname/anothername/002"><span class="some text">another value</span> long description </a></li>
     <li class="level-item"><a href="/somedirectoryname/anothername/001"><span class="some text">another value</span> long description </a></li>
    </ul>
    
    
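After finding all the `<h2>` tags as you did, you can use `find_next()` or `find_next_sibling()` to get the corresponding `<ul>` tag. Then simply iterate over all of its `<li>` tags.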

Code:

    for date_tag in results_table.find_all('h2'):
        date = date_tag.text
        for item in date_tag.find_next('ul').find_all('li'):
            print(date, item.a['href'], item.span['class'][0], item.get_text(',', strip=True), sep=',')
    
Output:

    May 18, 2018,/somedirectoryname/anothername/009,some,another value,long description
    May 18, 2018,/somedirectoryname/anothername/008,some,another value,long description
    March 9, 2018,/somedirectoryname/anothername/007,some,another value,long description
    March 9, 2018,/somedirectoryname/anothername/006,some,another value,long description
    March 9, 2018,/somedirectoryname/anothername/005,some,another value,long description
    March 9, 2018,/somedirectoryname/anothername/004,some,another value,long description
    December 1, 2017,/somedirectoryname/anothername/003,some,another value,long description
    December 1, 2017,/somedirectoryname/anothername/002,some,another value,long description
    December 1, 2017,/somedirectoryname/anothername/001,some,another value,long description
    

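Note that I haven't converted the date to your desired format, since you have already done that. If the HTML contains other `<h2>` tags, you can use `find_all('h2', string=re.compile('January|…|December'))` instead of `find_all('h2')`.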
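
For completeness, here is a minimal sketch that combines the month filter, the date conversion from the question, and the `find_next('ul')` lookup to produce rows in the desired format. The sample HTML (including the extra "Archive" heading), the `rows` helper, and the `MONTHS` name are assumptions made for illustration. It also assumes the default C locale, where `%B` matches English month names, so `locale.setlocale` is left out. The two class values on the `<span>` are joined to get `sometext`, as in the desired output.

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical sample mirroring the page structure from the question,
# plus one unrelated <h2> to show the effect of the month filter.
HTML = """
<div>
<h2>Archive</h2>
<h2>May 18, 2018</h2>
<ul>
 <li class="level-item"><a href="/somedirectoryname/anothername/009"><span class="some text">another value</span> long description </a></li>
 <li class="level-item"><a href="/somedirectoryname/anothername/008"><span class="some text">another value</span> long description </a></li>
</ul>
<h2>March 9, 2018</h2>
<ul>
 <li class="level-item"><a href="/somedirectoryname/anothername/007"><span class="some text">another value</span> long description </a></li>
</ul>
</div>
"""

MONTHS = re.compile(
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)


def rows(results_table):
    """Return one comma-separated row per <li>, prefixed with its date."""
    out = []
    # Only <h2> tags whose text contains a month name are treated as dates.
    for date_tag in results_table.find_all("h2", string=MONTHS):
        raw = date_tag.get_text(strip=True)
        # "May 18, 2018" -> "05182018"
        date_published = datetime.strptime(raw, "%B %d, %Y").strftime("%m%d%Y")
        ul = date_tag.find_next("ul")
        if ul is None:
            continue  # a date heading with no list after it
        for item in ul.find_all("li", class_="level-item"):
            out.append(",".join([
                date_published,
                item.a["href"],
                "".join(item.span["class"]),      # ["some", "text"] -> "sometext"
                item.get_text(",", strip=True),   # span text, then trailing text
            ]))
    return out


result = rows(BeautifulSoup(HTML, "html.parser"))
for row in result:
    print(row)
```

Building the rows as a list rather than printing inside the loop makes it easy to write them to a file or check them later.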