Python: scraping data from dt/dd tags that contain links

Tags: python, python-3.x, web-scraping, beautifulsoup

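I want to scrape data from a website, "". The data I need sits in dt and dd tags. The site does not allow bots, so I saved the page locally and ran the BeautifulSoup module on the saved copy as shown below, although my code still mentions the actual URL.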

soup = BeautifulSoup(open(r"C:\Users\acer\Desktop\pythonbooks\tam.html").read())
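As a hedged aside on loading a saved page: the sketch below passes an open file handle to BeautifulSoup and names the parser explicitly, which avoids the "no parser was explicitly specified" warning and closes the file afterwards. The temporary file and its markup are stand-ins invented for illustration, and the UTF-8 encoding is an assumption about how the page was saved.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Stand-in for the saved page (hypothetical markup, not the real site).
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<dl><dt>Headquarters:</dt><dd>Bengaluru, Karnataka</dd></dl>")
    saved_page = tmp.name

# Hand the open file to BeautifulSoup and state which parser to use.
with open(saved_page, encoding="utf-8") as fh:
    soup = BeautifulSoup(fh, "html.parser")

os.remove(saved_page)  # clean up the stand-in file
print(soup.dd.get_text())
```

For the real page, only the `with open(...)` block is needed, pointed at the saved file.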

Actual output:

0 3 Acquisitions
1 None
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 None
5 None
6 olacab link
7 None
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 media@olacabs.com
11 None
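In several places I get None, because the content there is wrapped in hyperlinks. For example, the "Categories" entry on the page lists 5 categories: E-Commerce, Internet, Transportation, Apps and Mobile. Each one is a hyperlink, so I cannot get the text I want, namely those 5 category names.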
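The None entries are consistent with reading each dd through `.string`, which returns None whenever a tag has more than one child. The extraction loop is not shown in the question, so that is an assumption. A minimal sketch on made-up markup shows the difference between `.string` and `get_text()`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the saved page; not the real HTML.
html = (
    "<dl>"
    "<dt>Headquarters:</dt><dd>Bengaluru, Karnataka</dd>"
    "<dt>Categories:</dt>"
    '<dd><a href="/c/1">E-Commerce</a>, <a href="/c/2">Internet</a></dd>'
    "</dl>"
)
soup = BeautifulSoup(html, "html.parser")
plain_dd, linked_dd = soup.find_all("dd")

print(plain_dd.string)       # one text child: the text itself
print(linked_dd.string)      # several children (links plus a comma): None
print(linked_dd.get_text())  # concatenates every nested text node
```

A dd that holds exactly one link still yields text through `.string`, which would explain why "olacab link" appears in the output while the multi-link entries do not.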

The output I want is:

0 3 Acquisitions
1 (All that text (though not important to me))
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 (all that text(though not important to me))
==>5 (E-Commerce, Internet, Transportation, Apps, Mobile)(Extremely important)
6 olacab link
7 (all that text(though not important to me))
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 media@olacabs.com
11 (all that text(though not important to me))
It would be very helpful if I could get a dictionary like this:

{"Headquarters":["Bengaluru,Karnataka"],
 "Description":["Ola is a mobile app for cab booking in India."],
 "Category": ["E-Commerce", "Internet", "Transportation", "Apps", "Mobile"]}
Question: ... I am not able to get the text I want ... it would help if I could get a dictionary

From find_all:

Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
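To make the signature concrete, here is a small sketch that exercises each parameter in turn. The markup is invented for illustration and only borrows names from the output shown in this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the structure matters here.
html = (
    '<div class="details"><dl>'
    "<dt>Founders:</dt>"
    '<dd><a href="/p/1">Bhavish Aggarwal</a>, <a href="/p/2">Ankit Bhati</a></dd>'
    "</dl></div>"
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                          # name: match on tag name
by_class = soup.find_all("div", class_="details")     # **kwargs: class_ filters on CSS class
first_only = soup.find_all("a", limit=1)              # limit: stop after one match
by_string = soup.find_all("a", string="Ankit Bhati")  # string: match on the tag's text
direct = soup.div.find_all("a", recursive=False)      # recursive=False: direct children only

print(len(by_name), len(by_class), len(first_only), len(by_string), len(direct))
```

The last call returns an empty list because the links are nested inside dl and dd rather than being direct children of the div.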


Tested with Python: 3.4.2 - bs4: 4.6.0

No, actually I am not getting any text there: because some tags are nested (due to the hyperlinks), I am unable to extract that text.
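# Answer code: build an ordered dict that maps each dt label to its dd value(s).
# `soup` is the BeautifulSoup object created from the saved page in the question.
from collections import OrderedDict
os_dict = OrderedDict()

for div_class in ['definition-list-container', 'details definition-list']:
    divs = soup.find_all("div", class_=div_class)
    key = '?'  # fallback label in case a dd shows up before any dt
    for div in divs:
        for child in div.findChildren():
            if child.name == 'dt':
                # Drop the last character of the label (the trailing ':').
                # Labels without a colon lose a letter instead, as the output shows.
                key = child.text[:-1]
            if child.name == 'dd':
                if child.select('a[href]'):
                    # The dd wraps its content in one or more hyperlinks.
                    a_list = child.find_all("a")
                    if key in ['Social:']:
                        # Social entries: keep the URLs rather than the link text.
                        os_dict[key] = [a['href'] for a in a_list]
                    elif len(a_list) == 1:
                        # A single link: store its text as a plain string.
                        os_dict[key] = a_list[0].text
                    else:
                        # Several links: store the text of each one in a list.
                        os_dict[key] = [a.text for a in a_list]
                else:
                    # No links: the dd text can be used directly.
                    os_dict[key] = child.text

# Print the entries numbered from 1, with the labels right-aligned.
for n, key in enumerate(os_dict, 1):
    print('{:>2}: {:>20}:\t{}'.format(n, key, os_dict[key]))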
Output:

 1:          Acquisition:   3 Acquisitions
 2:  Total Equity Fundin:   ['11 Rounds', '24 Investors']
 3:         Headquarters:   Bengaluru, Karnataka
 4:          Description:   Ola is a mobile app for cab booking in India.
 5:             Founders:   ['Bhavish Aggarwal', 'Ankit Bhati']
 6:           Categories:   ['E-Commerce', 'Internet', 'Transportation', 'Apps', 'Mobile']
 7:              Website:   http://www.olacabs.com
 8:              Social::   ['http://www.facebook.com/olacabs', 'http://twitter.com/olacabs', 'http://www.linkedin.com/company/olacabs-com']
 9:              Founded:   December 3, 2010
10:              Aliases:   ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
11:              Contact:   media@olacabs.com
12:            Employees:   8 in Crunchbase
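# Simpler variant: print every dd, using the link texts when the dd contains links.
dl_data = soup.find_all("dd")
for n, dlitem in enumerate(dl_data, 1):
    if dlitem.select('a[href]'):
        # Collect the text of each nested link instead of the dd's own text.
        a_text = [a.text for a in dlitem.find_all("a")]
        print('{}: {}'.format(n, a_text))
    else:
        print('{}: {}'.format(n, dlitem.text))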
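As an alternative to walking all children, each dt can be paired with the dd that follows it through `find_next_sibling`, which gives the list-valued dictionary asked for in the question. This is a sketch on invented markup, not the answer author's method; it assumes every dt is followed by a sibling dd and that labels end in an optional colon.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the saved page; not the real HTML.
html = (
    "<dl>"
    "<dt>Headquarters:</dt><dd>Bengaluru, Karnataka</dd>"
    "<dt>Description:</dt><dd>Ola is a mobile app for cab booking in India.</dd>"
    "<dt>Categories:</dt>"
    '<dd><a href="/c/1">E-Commerce</a>, <a href="/c/2">Internet</a>, '
    '<a href="/c/3">Transportation</a>, <a href="/c/4">Apps</a>, '
    '<a href="/c/5">Mobile</a></dd>'
    "</dl>"
)
soup = BeautifulSoup(html, "html.parser")

result = {}
for dt in soup.find_all("dt"):
    dd = dt.find_next_sibling("dd")  # the value that belongs to this label
    if dd is None:
        continue  # label without a value: skip it
    key = dt.get_text(strip=True).rstrip(":")  # remove only a trailing colon
    links = dd.find_all("a")
    if links:
        # Linked values: one list item per hyperlink text.
        result[key] = [a.get_text(strip=True) for a in links]
    else:
        # Plain values: wrap the single string in a list for a uniform shape.
        result[key] = [dd.get_text(strip=True)]

print(result)
```

Using `rstrip(":")` instead of slicing off the last character leaves labels without a colon intact, and keeping every value as a list means callers never have to check whether they got a string or a list.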