How to scrape website content using Python


How can I get content from a website using Python?

import re    
import time    
import requests    
from bs4 import BeautifulSoup    
import pandas as pd

def main():

    html = requests.get("https://economictimes.indiatimes.com/marketstats/pageno-1,pid-58,sortby-CurrentYearRank,sortorder-asc,year-2017.cms")
    soup = BeautifulSoup(html.text, 'html.parser')
    jstr = {}
    lis = []
    code = ''
    comp = ''
    for link in soup.find_all(class_='w170 alignL'):

        print(link.get('href'))
        Name1 = link
        Name11 = str(Name1)
        Name2 = Name11.lstrip('</b>')
        Name = Name2.rstrip('</b>')
        print(Name)
        input()

        try:
            data = {'Name': Name}
            print('\n \n')
            lis.append(data)
            li = []
            p = re.compile('\w+')
            processed_texts = []
            processed_texts = p.findall(str(data))
            print("processed_texts",processed_texts)

        except:
            pass    

if __name__ == '__main__':    
    main()

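If you inspect the table in the browser, you can see that the data sits inside a tag of the rendered page. However, if you view the page source, you will not find that content there as static HTML.

The content is generated dynamically with JavaScript. The requests module only downloads the HTML that the server returns; it cannot execute JS. To get the rendered page you have to use Selenium, which drives a real browser. See the Selenium documentation for installation and a demo.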
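One way to confirm this kind of problem is to check whether the target class appears at all in the HTML that the server sends. The sketch below uses only the standard library; the helper names `ClassFinder` and `count_class_in_html` and both sample snippets are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts start tags whose class attribute contains every wanted class name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted.split())
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.count += 1


def count_class_in_html(html, class_names):
    """Return how many tags in `html` carry all of `class_names`."""
    finder = ClassFinder(class_names)
    finder.feed(html)
    return finder.count


# Hypothetical snippets: what a server might send vs. what a browser renders.
static_html = '<div id="table"></div><script>loadTable()</script>'
rendered_html = (
    '<div id="table"><li class="w170 alignL">'
    '<a href="/x.cms">X Ltd.</a></li></div>'
)

print(count_class_in_html(static_html, "w170 alignL"))    # 0
print(count_class_in_html(rendered_html, "w170 alignL"))  # 1
```

If the count is 0 for `requests.get(...).text` but the elements are visible in the browser's inspector, the content is most likely filled in by JavaScript after the page loads.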

You can use Selenium like this:

from bs4 import BeautifulSoup
from selenium import webdriver

URL = 'https://economictimes.indiatimes.com/marketstats/pageno-1,pid-58,sortby-CurrentYearRank,sortorder-asc,year-2017.cms'
driver = webdriver.Chrome()
driver.get(URL)
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
for li in soup.find_all('li', class_='w170 alignL'):
    a = li.find('a')
    company_name = a.text
    company_url = a['href']  # This is the link that you were looking for.
    # You can save or print these values however you want.
    print(company_name, company_url)
Output:

Indian Oil Corporation Ltd. /indian-oil-corporation-ltd/stocks/companyid-11924.cms
Reliance Industries Ltd. /reliance-industries-ltd/stocks/companyid-13215.cms
State Bank of India /state-bank-of-india/stocks/companyid-11984.cms
Tata Motors Ltd. /tata-motors-ltd/stocks/companyid-12934.cms
Rajesh Exports Ltd. /rajesh-exports-ltd/stocks/companyid-6650.cms
Bharat Petroleum Corporation Ltd. /bharat-petroleum-corporation-ltd/stocks/companyid-11941.cms
Hindustan Petroleum Corporation Ltd. /hindustan-petroleum-corporation-ltd/stocks/companyid-12078.cms
Oil And Natural Gas Corporation Ltd. /oil-and-natural-gas-corporation-ltd/stocks/companyid-11599.cms
Coal India Ltd. /coal-india-ltd/stocks/companyid-11822.cms
Tata Consultancy Services Ltd. /tata-consultancy-services-ltd/stocks/companyid-8345.cms
ICICI Bank Ltd. /icici-bank-ltd/stocks/companyid-9194.cms
Tata Steel Ltd. /tata-steel-ltd/stocks/companyid-12902.cms
Larsen & Toubro Ltd. /larsen-&-toubro-ltd/stocks/companyid-13447.cms
Hindalco Industries Ltd. /hindalco-industries-ltd/stocks/companyid-13637.cms
Bharti Airtel Ltd. /bharti-airtel-ltd/stocks/companyid-2718.cms
HDFC Bank Ltd. /hdfc-bank-ltd/stocks/companyid-9195.cms
Mahindra & Mahindra Ltd. /mahindra-&-mahindra-ltd/stocks/companyid-11898.cms
NTPC Ltd. /ntpc-ltd/stocks/companyid-12316.cms
Vedanta Ltd. /vedanta-ltd/stocks/companyid-13111.cms
Infosys Ltd. /infosys-ltd/stocks/companyid-10960.cms
Maruti Suzuki India Ltd. /maruti-suzuki-india-ltd/stocks/companyid-11890.cms
Housing Development Finance Corporation Ltd. /housing-development-finance-corporation-ltd/stocks/companyid-13640.cms
Wipro Ltd. /wipro-ltd/stocks/companyid-12799.cms
Axis Bank Ltd. /axis-bank-ltd/stocks/companyid-9175.cms
Punjab National Bank /punjab-national-bank/stocks/companyid-11585.cms
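The links in this output are relative to the site root. If you need full URLs, a minimal sketch using the standard library's `urljoin` could look like this; the helper name `to_absolute` is an illustrative assumption, and the base URL is the one used in the code above.

```python
from urllib.parse import urljoin

BASE_URL = (
    "https://economictimes.indiatimes.com/marketstats/"
    "pageno-1,pid-58,sortby-CurrentYearRank,sortorder-asc,year-2017.cms"
)


def to_absolute(base_url, href):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base_url, href)


print(to_absolute(BASE_URL, "/wipro-ltd/stocks/companyid-12799.cms"))
# https://economictimes.indiatimes.com/wipro-ltd/stocks/companyid-12799.cms
```

Inside the loop, this would be called as `to_absolute(URL, a['href'])` before saving or printing the link.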

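I don't understand your question. What exactly are you trying to do?
@VivekKalyanarangan: I mentioned a URL in the code, and I want to scrape the content of that class from this URL.
What problem are you currently facing?
@VivekKalyanarangan: I used the same code to scrape content from other websites, but it does not work for this URL.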