Python: web scraping with BeautifulSoup and collecting table text values

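I have the code below, which collects data from the NSE website. Basically, I want to collect two pieces of information:

  • What the announcement subject is
  • Whether any pdf file is available, and if so, print its link

I am able to get the pdf link, but I am unable to read the announcement subject, which is:

    MIC Electronics Limited has informed the Exchange regarding 'Resolution Plan of M/s. Cosyn Consortium in the matter of M/s. MIC Electronics Limited has been approved by Hon'ble NCLT, Hyderabad Bench

Any help is appreciated.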

    import requests
    import json
    import bs4
    
    base_url = 'https://www.nseindia.com'
    url = 'https://www.nseindia.com/corporates/directLink/latestAnnouncementsCorpHome.jsp'
    
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
    
    response = requests.get(url, headers=headers)
    jsonStr = response.text.strip()
    keys_needing_quotes = ['company:','date:','desc:','link:','symbol:']
    
    for key in keys_needing_quotes:
        jsonStr = jsonStr.replace(key, '"%s":' %(key[:-1]))
    
    data = json.loads(jsonStr)
    data = data['rows']
    # print(data)
    
    symbol_list = ['MIC']
    for x in range(0, len(data)):
        if data[x]['symbol'] in symbol_list:
            response = requests.get(base_url + data[x]['link'], headers=headers)
            soup = bs4.BeautifulSoup(response.text, 'html.parser')
            print(soup)
    
            try:
                pdf_file = base_url + soup.find_all('a', href=True)[0]['href']
                print("File_Link:", pdf_file)
            except:
                print('PDF not found')
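
A side note on the key-quoting loop above: `str.replace` rewrites every occurrence of `date:` or `link:`, including any that happen to appear inside an announcement's text. Below is a minimal sketch of a stricter variant that skips over string literals. The payload is made up for illustration (the real NSE response is not shown here), and `PATTERN` and `quote_bare_keys` are hypothetical names.

```python
import json
import re

# Hypothetical payload: bare (unquoted) keys inside the rows, and a value
# that itself contains the text "date:".
raw = ('{"rows":[{company:"Example Ltd",symbol:"EX",'
       'desc:"Update, date: 5 Jan",link:"/x.jsp",date:"01-Jan-2019 10:00"}]}')

# Alternative 1 matches a whole double-quoted string so it can be skipped.
# Alternative 2 matches a bare identifier used as a key after "{" or ",".
PATTERN = re.compile(r'("(?:\\.|[^"\\])*")|([{,]\s*)([A-Za-z_]\w*)(\s*:)')


def quote_bare_keys(text):
    """Wrap bare object keys in double quotes, leaving string literals alone."""
    def repl(match):
        if match.group(1) is not None:
            return match.group(1)  # string literal: return unchanged
        return '%s"%s"%s' % (match.group(2), match.group(3), match.group(4))
    return PATTERN.sub(repl, text)


rows = json.loads(quote_bare_keys(raw))['rows']
print(rows[0]['symbol'], '|', rows[0]['desc'])
```

On this sample, a plain `raw.replace('date:', '"date":')` would also alter the text inside `desc` and leave invalid JSON, which is the case the pattern is meant to avoid.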
    

This is because your code contains nothing that even attempts to fetch the Announcement cell. Adding it is easy; take a look below:

    # Reuses data, base_url and headers from the question's code.
    symbol_list = ['MIC']
    for row in data:
        if row['symbol'] in symbol_list:
            response = requests.get(base_url + row['link'], headers=headers)
            soup = bs4.BeautifulSoup(response.text, 'html.parser')

            try:
                # Announcement is the 6th element of class t1.
                announce = soup.find_all(class_='t1')[5].get_text()
                print("Announcement: ", announce)
            except IndexError:
                print("Announcement not found")

            try:
                # The first link on the page is taken to be the PDF attachment.
                pdf_file = base_url + soup.find_all('a', href=True)[0]['href']
                print("File_Link: ", pdf_file)
            except IndexError:
                print('PDF not found')
    
This prints the expected result.

Alternatively, you can use:

    for s in soup.find_all('td', 'tablehead'):
        if 'Announcement' in s.text:
            break
    
    print(s.find_next_sibling().text)
    # output:
    # MIC Electronics Limited has informed the Exchange regarding 'Resolution Plan of M/s. Cosyn Consortium in the matter of M/s. MIC Electronics Limited has been approved by Hon'ble NCLT, Hyderabad Bench
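
To try the label-based lookup without requesting the site, here is a minimal sketch run against a made-up HTML fragment. The markup, the sample text, and the helper name `value_for_label` are assumptions for illustration; the real page layout is not shown in this thread and may differ.

```python
import bs4

# Hypothetical fragment imitating a label/value table; not the real NSE markup.
SAMPLE_HTML = """
<table>
  <tr><td class="tablehead">Company</td><td class="t1">MIC Electronics Limited</td></tr>
  <tr><td class="tablehead">Announcement</td><td class="t1">Sample announcement text</td></tr>
  <tr><td class="tablehead">Attachment</td><td class="t1"><a href="/sample.pdf">sample.pdf</a></td></tr>
</table>
"""


def value_for_label(soup, label):
    """Return the text of the cell following the header cell that contains
    `label`, or None if no such header or value cell exists."""
    for cell in soup.find_all('td', class_='tablehead'):
        if label in cell.get_text():
            value_cell = cell.find_next_sibling('td')
            return value_cell.get_text(strip=True) if value_cell else None
    return None


soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')
announcement = value_for_label(soup, 'Announcement')
missing = value_for_label(soup, 'Subject')
print(announcement)
print(missing)
```

Unlike the bare loop, the helper returns `None` when no header matches, instead of silently using whichever cell the loop visited last. Compared with indexing the 6th `t1` element, looking up by label should also keep working if cells are added or reordered, as long as the label text stays the same.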