Python: converting a large string output into a dictionary


I have a function like this, which looks up a domain on who.is when given a URL:

import whois    

def who_is(url):
    w = whois.whois(url)
    return w.text

It returns the following as one huge string:

Domain name:
    amazon.co.uk

Registrant:
    Amazon Europe Holding Technologies SCS

Registrant type:
    Unknown

Registrant's address:
    65 boulevard G-D. Charlotte
    Luxembourg City
    Luxembourg
    LU-1311
    Luxembourg

Data validation:
    Nominet was able to match the registrant's name and address against a 3rd party data source on 10-Dec-2012

Registrar:
    Amazon.com, Inc. t/a Amazon.com, Inc. [Tag = AMAZON-COM]
    URL: http://www.amazon.com

Relevant dates:
    Registered on: before Aug-1996
    Expiry date:  05-Dec-2020
    Last updated:  23-Oct-2013

Registration status:
    Registered until expiry date.

Name servers:
    ns1.p31.dynect.net
    ns2.p31.dynect.net
    ns3.p31.dynect.net
    ns4.p31.dynect.net
    pdns1.ultradns.net
    pdns2.ultradns.net
    pdns3.ultradns.org
    pdns4.ultradns.org
    pdns5.ultradns.info
    pdns6.ultradns.co.uk      204.74.115.1  2610:00a1:1017:0000:0000:0000:0000:0001

WHOIS lookup made at 21:09:42 10-May-2017

 -- 
   This WHOIS information is provided for free by Nominet UK the central registry
for .uk domain names. This information and the .uk WHOIS are:

Copyright Nominet UK 1996 - 2017.

You may not access the .uk WHOIS or use any data from it except as permitted
by the terms of use available in full at http://www.nominet.uk/whoisterms,
 which includes restrictions on: (A) use of the data for advertising, or its
 repackaging, recompilation, redistribution or reuse (B) obscuring, removing
 or hiding any or all of this notice and (C) exceeding query rate or volume
limits. The data is provided on an 'as-is' basis and may lag behind the
register. Access may be withdrawn or restricted at any time.

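Just from looking at it, I can see that it is laid out in a way that lends itself to a dictionary, but I don't know how to convert it in the most efficient way. I need to remove the unwanted text at the bottom, and strip all the line breaks and indentation. Doing these things one at a time is not very efficient. I want to be able to pass any URL to the function and get a dictionary back. Any help would be appreciated.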

The expected output would be:

dict = {
'Domain name': 'amazon.co.uk',
'Registrant': 'Amazon Europe Holding Technologies',
'Registrant type': 'Unknown',
# and so on for all the available fields
}

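So far I have tried removing all the newlines and \r with the remove function, then replacing all the indentation with the replace function. However, I have no idea how to remove the bulk of the text at the bottom.

The python-whois documentation tells you to just print w, but doing so returns the following: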

{
  "domain_name": null,
  "registrar": null,
  "registrar_url": "http://www.amazon.com",
  "status": null,
  "registrant_name": null,
  "creation_date": "before Aug-1996",
  "expiration_date": "2020-12-05 00:00:00",
  "updated_date": "2013-10-23 00:00:00",
  "name_servers": null
 }

As you can see, most of these values are null, yet they do have values when w.text is returned.

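Apparently, you are using python-whois.

Take a look at its example usage. You can get all the data in structured form, instead of as text that needs to be parsed: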

import whois
w = whois.whois('webscraping.com')
w.expiration_date  # dates converted to datetime object
# datetime.datetime(2013, 6, 26, 0, 0)
w.text  # the content downloaded from whois server
# u'\nWhois Server Version 2.0\n\nDomain names in the .com and .net ...'

print(w)  # print values of all found attributes
# creation_date: 2004-06-26 00:00:00
# domain_name: [u'WEBSCRAPING.COM', u'WEBSCRAPING.COM']
# emails: [u'WEBSCRAPING.COM@domainsbyproxy.com', u'WEBSCRAPING.COM@domainsbyproxy.com']
# expiration_date: 2013-06-26 00:00:00

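You can read all the attributes you need from the whois object w one by one and store them in a dict, or pass the object itself to whichever function needs the information.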
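A minimal sketch of the "store them in a dict" idea follows. The helper name `attributes_to_dict` and the stand-in object are hypothetical; the stand-in replaces the object returned by `whois.whois(...)` so the example runs without a network lookup, and it relies only on plain attribute access as shown above.

```python
from types import SimpleNamespace


def attributes_to_dict(entry, names):
    """Collect the named attributes of `entry` into a plain dict, skipping empty ones."""
    result = {}
    for name in names:
        # getattr with a default avoids an AttributeError for unknown names.
        value = getattr(entry, name, None)
        if value is not None:
            result[name] = value
    return result


# Stand-in for the object returned by whois.whois(...); values taken from the question.
fake_entry = SimpleNamespace(
    domain_name=None,
    registrar_url="http://www.amazon.com",
    creation_date="before Aug-1996",
)
fields = ["domain_name", "registrar_url", "creation_date"]
collected = attributes_to_dict(fake_entry, fields)
print(collected)
```

Fields that the library could not parse are simply left out of the result, which makes it easy to see what is missing.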

Is there any information in w.text that is not accessible as an attribute of w?

Edit: it works for me, using the same example URL as yours:

pip install python-whois
pip freeze |grep python-whois
# python-whois==0.6.5

import whois
w = whois.whois("amazon.co.uk")
w
# {'updated_date': datetime.datetime(2013, 10, 23, 0, 0), 'creation_date': 'before Aug-1996', 'registrar': None, 'registrar_url': 'http://www.amazon.com', 'domain_name': None, 'expiration_date': datetime.datetime(2020, 12, 5, 0, 0), 'name_servers': None, 'status': None, 'registrant_name': None}

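Edit 2: I think I found the problem in the parser.

The regular expression should not be

'Registrant:\n\s*(.*)'

but rather a variant that also accounts for \r.

You could try cloning whois locally and modifying it accordingly, for example by adding \r; if that works, propose the change as a patch, or at least mention it to the maintainers.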
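To illustrate why a missing \r could matter, here is a small sketch. It assumes that w.text uses \r\n line endings (the question mentions \r characters, but this is not confirmed), and the `tolerant` pattern is a hypothetical variant written for this example, not necessarily the fix applied in the library.

```python
import re

# Assumed sample: part of w.text with Windows-style (\r\n) line endings.
sample = (
    "Registrant:\r\n"
    "    Amazon Europe Holding Technologies SCS\r\n"
    "\r\n"
    "Registrant type:\r\n"
    "    Unknown\r\n"
)

# Pattern quoted in the answer: requires "\n" directly after the colon.
strict = re.compile(r"Registrant:\n\s*(.*)")
# Hypothetical variant: allows an optional "\r" before the "\n".
tolerant = re.compile(r"Registrant:\r?\n\s*([^\r\n]*)")

strict_match = strict.search(sample)
tolerant_match = tolerant.search(sample)

print(strict_match)             # no match, because "\r" follows the colon
print(tolerant_match.group(1))  # the registrant name
```

The capture group uses `[^\r\n]*` rather than `.*` because `.` would also capture the trailing \r.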

Try the following:

from collections import OrderedDict

# textstring holds the string returned by w.text
key_value = OrderedDict()  # use dict() if the order of keys is not important

for block in textstring.split("\n\n"):
    # Each block looks like "Key:\n    value line\n    value line"
    parts = block.split(":\n")
    try:
        key_value[parts[0].strip()] = "\n".join(line.strip() for line in parts[1].split("\n"))
    except IndexError:
        pass  # block has no "Key:" header (e.g. the footer notice), so skip it

# print the result
for key in key_value:
    print(key)
    print(key_value[key])
    print("\n")
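The same block-splitting idea can be wrapped in a function that also drops the footer, which the question asked about. This is only a sketch under stated assumptions: the function name `parse_whois_text` and the `footer_marker` parameter are hypothetical, the marker "WHOIS lookup made at" is taken from the Nominet output shown in the question and may not apply to other registries, and the sample is a shortened copy of that output.

```python
import re


def parse_whois_text(text, footer_marker="WHOIS lookup made at"):
    """Parse 'Key:' blocks with indented values into a dict (Nominet-style text)."""
    # Normalize line endings so the splitting below only sees "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop the footer: keep only what precedes the marker, if present.
    text = text.split(footer_marker, 1)[0]
    result = {}
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        key, sep, value = block.partition(":\n")
        if not sep:
            continue  # not a "Key:" block
        lines = [line.strip() for line in value.split("\n") if line.strip()]
        # Single-line values become a string, multi-line values a list.
        result[key.strip()] = lines[0] if len(lines) == 1 else lines
    return result


# Shortened sample based on the output in the question.
sample = (
    "Domain name:\r\n"
    "    amazon.co.uk\r\n"
    "\r\n"
    "Registrant type:\r\n"
    "    Unknown\r\n"
    "\r\n"
    "Name servers:\r\n"
    "    ns1.p31.dynect.net\r\n"
    "    ns2.p31.dynect.net\r\n"
    "\r\n"
    "WHOIS lookup made at 21:09:42 10-May-2017\r\n"
    "\r\n"
    " -- \r\n"
    "   This WHOIS information is provided for free by Nominet UK\r\n"
)

parsed = parse_whois_text(sample)
print(parsed)
```

Returning a list for multi-line values (such as name servers) is a design choice for this sketch; joining them with "\n", as the code above does, works just as well.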

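When you access it as w, any field that could not be parsed comes back as unknown, whereas w.text shows you the actual data. Looking at your output, I can see that fields such as registrant name, name servers, and domain name are null, yet they have values in w.text. It looks like something is wrong with the parsing of those fields. Maybe you could open a bug on Bitbucket.

Thank you. I have filed a bug report, and I will message them to let them know that you found the problem and about the fix you proposed. Thanks again, this helped me a lot.

@abccd I have added some clarification. I was not asking for someone to do it for me, and I apologize if it came across that way; I was looking for a more efficient approach. I could do it by converting the string to a list, using replace to remove all the indentation, removing all the \n and \r, converting it back to a string, splitting on ":", and then turning every even index into a key and every odd index into a value. But that seems inefficient, and it looks like bad practice.

These posts may help you:

Seriously, python-whois looks like a good library, and any attempt to parse w.text yourself is likely to fail badly. Getting it fixed for your use case should be the real way forward. Unfortunately, it relies on regular expressions, which can be painful if you are not familiar with them. But if you open a ticket with all the required information (not much, just the URL and your output), the developers may well fix the problem for you...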
