
Getting details from a set of URLs using bs4 in Python

Tags: python, pandas, web-scraping, beautifulsoup

I am an absolute beginner at web scraping with Python and know very little about Python programming. I am simply trying to extract information on lawyers in Tennessee. The page contains multiple links, which lead to further links for the categories of lawyers, and those in turn contain the lawyers' details.

I have already extracted the links for the various cities into a list, and also the links for each category of lawyer available under every city link. The profile links have been fetched as well and stored as a set. Now I am trying to get each lawyer's name, address, firm name, and practice area, and to store the result as an .xls file.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

final=[]
records=[]
with requests.Session() as s:
    res = s.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
    soup = bs(res.content, 'lxml')

    cities = [item['href'] for item in soup.select('#browse_view a')]
    for c in cities:
        r=s.get(c)
        s1=bs(r.content,'lxml')
        categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')]
        for c1 in categories:
            r1=s.get(c1)
            s2=bs(r1.content,'lxml')
            lawyers = [item['href'].split('*')[1] if '*' in item['href'] else item['href'] for item in
                       s2.select('.indigo_text .directory_profile')]
            final.append(lawyers)
final_list={item for sublist in final for item in sublist}
for i in final_list:
    r2 = s.get(i)
    s3 = bs(r2.content, 'lxml')
    name = s3.find('h2').text.strip()
    add = s3.find("div").text.strip()
    f_name = s3.find("a").text.strip()
    p_area = s3.find('ul',{"class":"basic_profile aag_data_value"}).find('li').text.strip()
    records.append({'Names': name, 'Address': add, 'Firm Name': f_name,'Practice Area':p_area})
df = pd.DataFrame(records,columns=['Names','Address','Firm Name','Practice Areas'])
df=df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\lawyers.xls', sheet_name='MyData2', index = False, header=True)

I expected an .xls file, but nothing was returned during execution. The script does not terminate until I force-stop it, and no .xls file is generated.

You need to visit each lawyer's page and use the appropriate selectors to extract those details. Something like:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

records = []
final = []

with requests.Session() as s:
    res = s.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
    soup = bs(res.content, 'lxml')
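    # gather the links to every city page listed for Tennessee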
    cities = [item['href'] for item in soup.select('#browse_view a')]
    for c in cities:
        r = s.get(c)
        s1 = bs(r.content,'lxml')
        categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')]
        for c1 in categories:
            r1 = s.get(c1)
            s2 = bs(r1.content,'lxml')
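            # some directory links wrap the real profile URL after a '*'; keep the part after it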
            lawyers = [item['href'].split('*')[1] if '*' in item['href'] else item['href'] for item in s2.select('.indigo_text .directory_profile')]
            final.append(lawyers)
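    # flatten the nested per-category lists and de-duplicate the profile links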
    final_list = {item for sublist in final for item in sublist}
    for link in final_list:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        name = soup.select_one('#lawyer_name').text
        firm = soup.select_one('#firm_profile_page').text
        address = ' '.join([string for string in soup.select_one('#poap_postal_addr_block').stripped_strings][1:])
        practices = ' '.join([item.text for item in soup.select('#pa_list li')])
        row = [name, firm, address, practices]
        records.append(row)

df = pd.DataFrame(records, columns = ['Name', 'Firm', 'Address', 'Practices'])
print(df)
df.to_csv(r'C:\Users\User\Desktop\Lawyers.csv', sep=',', encoding='utf-8-sig',index = False )
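
If you specifically want an Excel file rather than CSV, the same DataFrame can be written with to_excel. A minimal sketch, assuming an Excel writer engine such as openpyxl is installed; the output path is just an example:

# Optional: write the records to an Excel workbook instead of CSV.
# Requires an Excel writer engine such as openpyxl (pip install openpyxl).
df.to_excel(r'C:\Users\User\Desktop\Lawyers.xlsx', sheet_name='Lawyers', index=False)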

By "farm" I meant the legal firm where the lawyer practices. Sorry, that was a spelling mistake; it should actually be the law firm.

OK, but there are a huge number of lawyers in total; does every lawyer really have to be fetched individually?

Yes, unless there is another source? You could try parallelizing the requests or using asyncio to speed things up (a rough sketch follows below).

I may be asking the wrong question, but if the final_list set contains the links we are iterating over and whose content we are storing, why doesn't it let us pick out the name, address, firm name, and practice area from that content on each iteration? I assume the practice areas are listed at the profile level, which is what my code looks at in the last loop at the bottom.

Here are all the results:
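
For illustration, here is a minimal sketch of the parallelization idea mentioned in the comments, using concurrent.futures rather than asyncio. It assumes the same final_list of profile URLs built above and reuses the id-based selectors from the answer; parse_profile and scrape_profiles are hypothetical helper names, not part of the original code.

from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from bs4 import BeautifulSoup as bs


def parse_profile(url):
    # Fetch a single profile page and extract the same four fields as in the answer.
    r = requests.get(url, headers={'User-agent': 'Super Bot 9000'})
    soup = bs(r.content, 'lxml')
    name = soup.select_one('#lawyer_name').text
    firm = soup.select_one('#firm_profile_page').text
    address = ' '.join(list(soup.select_one('#poap_postal_addr_block').stripped_strings)[1:])
    practices = ' '.join(item.text for item in soup.select('#pa_list li'))
    return [name, firm, address, practices]


def scrape_profiles(final_list, max_workers=10):
    # Run the per-profile requests in a thread pool so the network waits overlap.
    records = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(parse_profile, url): url for url in final_list}
        for future in as_completed(futures):
            try:
                records.append(future.result())
            except Exception as exc:
                # A profile page with a different layout would raise here; skip it.
                print(f'{futures[future]} failed: {exc}')
    return records

The resulting list of rows can then be fed into pd.DataFrame exactly as in the answer above.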