Python BeautifulSoup - How to extract this text

Current Python script:

import win_unicode_console
win_unicode_console.enable()

import requests
from bs4 import BeautifulSoup

data = '''
<div class="info">
    <h1>Company Title</h1>
    <p class="type">Company type</p>
    <p class="address"><strong>ZIP, City</strong></p>
    <p class="address"><strong>Street 123</strong></p>
    <p style="margin-top:10px;"> Phone: <strong>(111) 123-456-78</strong><br />
        Fax: <strong>(222) 321-654-87</strong><br />
        Phone: <strong>(333) 87-654-321</strong><br />
        Fax: <strong>(444) 000-1111-2222</strong><br />
    </p>
    <p style="margin-top:10px;"> E-mail: <a href="mailto:mail@domain.com">mail@domain.com</a><br />
    E-mail: <a href="mailto:mail2@domain.com">mail2@domain.com</a><br />
    </p>
    <p> Web: <a href="http://www.domain.com" target="_blank">www.domain.com</a><br />
    </p>
    <p style="margin-top:10px;"> ID: <strong>123456789</strong><br />
        VAT: <strong>987654321</strong> </p>
    <p class="del" style="margin-top:10px;">Some info:</p>
    <ul>
        <li><a href="#category">&raquo; Category</a></li>
    </ul>
</div>
'''

html = BeautifulSoup(data, "html.parser")

p = html.find_all('p', attrs={'class': None})

for pp in p:
    print(pp.contents)
Output:

[' Phone: ', <strong>(111) 123-456-78</strong>, <br/>, '\n\t\tFax: ', <strong>(222) 321-654-87</strong>, <br/>, '\n\t\tPhone: ', <strong>(333) 87-654-321</strong>, <br/>, '\n\t\tFax: ', <strong>(444) 000-1111-2222</strong>, <br/>, '\n']
[' E-mail: ', <a href="mailto:mail@domain.com">mail@domain.com</a>, <br/>, '\n\tE-mail: ', <a href="mailto:mail2@domain.com">mail2@domain.com</a>, <br/>, '\n']
[' Web: ', <a href="http://www.domain.com" target="_blank">www.domain.com</a>, <br/>, '\n']
[' ID: ', <strong>123456789</strong>, <br/>, '\n\t\tVAT: ', <strong>987654321</strong>, ' ']

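After splitting the text, you can group the data with a defaultdict. This version pairs up whitespace-separated tokens, so it assumes the values themselves contain no spaces; the output shown is for numbers without area codes:

from collections import defaultdict

html = BeautifulSoup(data, "html.parser")

p = html.find_all('p', attrs={'class': None})

d = defaultdict(list)
for pp in p:
    # Tokens alternate between a "Key:" label and its value.
    spl = iter(pp.text.split())
    for ele in spl:
        # next(spl) consumes the value that follows the key.
        d[ele.rstrip(":")].append(next(spl).rstrip())

print(d)

Output:

defaultdict(<class 'list'>, {'Phone': ['123-456-78', '87-654-321'],
'Fax': ['321-654-87', '000-1111-2222'], 'E-mail': ['mail@domain.com',
'mail2@domain.com'], 'VAT': ['987654321'], 'Web': ['www.domain.com'], 
'ID': ['123456789']})

So every two tokens form a key/value pair, and values for a repeated key are appended to the same list.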
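
The pairing idiom can be tried on a plain string, without BeautifulSoup. The following is a minimal sketch: the function name pair_tokens and the sample text are made up for illustration, and the sample stands in for what pp.text would return for one paragraph. Because the for loop and next() share one iterator, each pass consumes a key and then its value.

```python
from collections import defaultdict

# Hypothetical sample: the text of one <p>, with values that contain no spaces.
text = " Phone: 123-456-78\n\t\tFax: 321-654-87\n\t\tPhone: 87-654-321\n"

def pair_tokens(text):
    """Group alternating 'Key:' / value tokens into a dict of lists."""
    grouped = defaultdict(list)
    tokens = iter(text.split())      # split on any run of whitespace
    for key in tokens:               # the loop takes the key...
        value = next(tokens, None)   # ...and next() takes the token after it
        if value is None:            # odd number of tokens: no value for this key
            break
        grouped[key.rstrip(":")].append(value)
    return dict(grouped)

result = pair_tokens(text)
print(result)
# {'Phone': ['123-456-78', '87-654-321'], 'Fax': ['321-654-87']}
```

A value that contains whitespace, such as "(111) 123-456-78", is cut into two tokens and the pairing drifts out of step, which is why this approach only suits values without spaces.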

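Edit: to capture the spaces in the fax and phone numbers, split the text into lines with splitlines and split each line once on whitespace:

from collections import defaultdict

d = defaultdict(list)
for pp in p:
    for ele in pp.text.splitlines():
        ele = ele.strip()
        if not ele:
            # Skip whitespace-only lines, such as the indentation before </p>.
            continue
        k, v = ele.split(None, 1)
        d[k.rstrip(":")].append(v)

Output: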

defaultdict(<class 'list'>, {'Fax': ['(222) 321-654-87', '(444) 000-1111-2222'],
 'Web': ['www.domain.com'], 'ID': ['123456789'], 'E-mail': ['mail@domain.com', 'mail2@domain.com'],
 'VAT': ['987654321'], 'Phone': ['(111) 123-456-78', '(333) 87-654-321']})
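
If labels might themselves contain spaces, or some lines carry no value at all, splitting on the first colon is one possible variation. This is only a sketch under those assumptions, not part of the answer above: group_labelled_lines is a hypothetical helper, and the sample string imitates what pp.text returns for a paragraph.

```python
from collections import defaultdict

def group_labelled_lines(text):
    """Turn 'Label: value' lines into {label: [values, ...]}."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        # partition splits at the first colon only, so later colons
        # (for example in a URL) stay inside the value.
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            continue  # blank line, or not of the form 'Label: value'
        grouped[key].append(value)
    return dict(grouped)

# Hypothetical sample, including a trailing whitespace-only line.
sample = " Phone: (111) 123-456-78\n        Fax: (222) 321-654-87\n    "
print(group_labelled_lines(sample))
# {'Phone': ['(111) 123-456-78'], 'Fax': ['(222) 321-654-87']}
```

With BeautifulSoup, the same helper could be called on each pp.text and the per-paragraph results merged.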

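Comments:

Sorry, I get an error when the phone number looks like (111) 222-333-4444.

Thanks for the update! Now I have another problem. Phone contains:
'Phone': ['(111) 123-456-78\n\t\tFax: (222) 321-654-87\n\t\tPhone: (333) 87-654-321\n\t\tFax: (444) 000-1111-222']
I mean it is not split into Phone and Fax. I use:
for pp in p:
    spl = iter(pp.text.splitlines())
    for ele in spl:
        for spl2 in ele.splitlines():
            spl3 = iter(spl2.split(None, 1))
            for ele2 in spl3:
                d[ele2.rstrip(":")].append(next(spl3).rstrip())
It's not elegant, but it works :)

@rhymguy, the edit should do what you need with less code ;)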