Python 2.7: Python code to remove vCard duplicates in vcf files works with vobject, but only for "exact duplicates"


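The code below works and creates a new file without exact duplicates (cards whose serialized form is identical). I know the code has some efficiency issues: it is n squared when it could be n*log n; each vCard could be serialized only once; the for loops are used inefficiently, and so on. Here I wanted to provide a short piece of code that illustrates one of the problems I don't know how to solve.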

#!/usr/bin/env python2.7

import vobject

abinfile = '/foo/bar/dir/infile.vcf'  # ab stands for address book
aboutfile = '/foo/bar/dir/outfile.vcf'

def eliminate_vcard_duplicates(abinfile, aboutfile):

    # first convert the address book input file into a list of vCards
    with open(abinfile) as source_file:
        ablist = list(vobject.readComponents(source_file))

    # then copy each vCard into a new list unless it is already there
    ablist_norepeats = []
    ablist_norepeats.append(ablist[0])

    for i in range(1, len(ablist)):
        jay = len(ablist_norepeats)
        # search backwards, because duplicates usually sit close together
        for j in reversed(range(0, jay)):
            if ablist_norepeats[j].serialize() == ablist[i].serialize():
                break
            else:
                jay += -1
        # jay reaches 0 only if none of the cards kept so far matched
        if jay == 0:
            ablist_norepeats.append(ablist[i])

    # finally write the de-duplicated list to the address book output file
    with open(aboutfile, 'w') as destination_file:
        for j in range(0, len(ablist_norepeats)):
            # serialize() has to be called; writing the bound method itself raises a TypeError
            destination_file.write(ablist_norepeats[j].serialize())

eliminate_vcard_duplicates(abinfile, aboutfile)
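
On the efficiency points (quadratic comparison, repeated serialization), here is a minimal sketch, assuming every card has already been serialized to a string exactly once. A set of already-seen strings replaces the inner loop, so the pass is expected to be linear in the number of cards while keeping the original order. The name unique_in_order is hypothetical, and like the code above it still removes only exact duplicates.

```python
def unique_in_order(serialized_cards):
    """Return the cards with exact duplicates removed, keeping first occurrences."""
    seen = set()      # serialized texts already emitted
    unique = []       # result, in original order
    for text in serialized_cards:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


# Plain strings stand in for the output of card.serialize().
cards = [
    "BEGIN:VCARD\nFN:A\nEND:VCARD\n",
    "BEGIN:VCARD\nFN:B\nEND:VCARD\n",
    "BEGIN:VCARD\nFN:A\nEND:VCARD\n",
]
result = unique_in_order(cards)
```

With vobject, the input list would presumably be built as [card.serialize() for card in vobject.readComponents(source_file)], as in the code above.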
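What I don't know how to solve elegantly is this: if some fields in a card are scrambled, the code will not detect that the cards are equal. Is there a way to detect such duplicates with vobject, re, or some other approach?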

The contents of the file used in the test are listed here. It holds four equal vCards; a moved EMAIL line does not trip up the code, but scrambled phone numbers do:

BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
EMAIL;TYPE=INTERNET:foobar1@foo.bar.com
TEL;TYPE=CELL:123456789
TEL;TYPE=CELL:987654321
END:VCARD
BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
EMAIL;TYPE=INTERNET:foobar1@foo.bar.com
TEL;TYPE=CELL:123456789
TEL;TYPE=CELL:987654321
END:VCARD
BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
TEL;TYPE=CELL:123456789
TEL;TYPE=CELL:987654321
EMAIL;TYPE=INTERNET:foobar1@foo.bar.com
END:VCARD
BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
TEL;TYPE=CELL:987654321
TEL;TYPE=CELL:123456789
EMAIL;TYPE=INTERNET:foobar1@foo.bar.com
END:VCARD

The code above does not detect that all four cards are the same, because the last one has its phone numbers in a different order.

As a bonus, if someone has a faster algorithm, it would be great if it could be shared. The one above takes days on a file with 30,000 vCards...

Update: here is a faster version (by about three orders of magnitude), but it still removes only exact duplicates:

#!/usr/bin/env python2.7

import vobject
import datetime

abinfile = '/foo/bar/dir/infile.vcf'  # ab stands for address book
aboutfile = '/foo/bar/dir/outfile.vcf'

def eliminate_vcard_duplicatesv2(abinfile, aboutfile):

    # first convert the address book input file into a list of vCards
    with open(abinfile) as source_file:
        ablist = list(vobject.readComponents(source_file))

    # serialize every card exactly once to speed up the comparisons
    ablist_serial = [card.serialize() for card in ablist]

    # then record the position of each card unless an equal card is already recorded
    ablist_singletons = []
    duplicates = 0
    # start at 0 so that the first card is kept as well
    for i in range(0, len(ablist_serial)):
        if i % 1000 == 0:
            print "COMPUTED CARD:", i, "Number of duplicates: ", duplicates, "Current time:", datetime.datetime.now().time()
        jay = len(ablist_singletons)
        # search backwards, because duplicates usually sit close together
        for j in reversed(range(0, jay)):
            if ablist_serial[ablist_singletons[j]] == ablist_serial[i]:
                duplicates += 1
                break
            else:
                jay += -1
        if jay == 0:
            ablist_singletons.append(i)

    print "Length of Original Vcard File: ", len(ablist)
    print "Length of Singleton Vcard File: ", len(ablist_singletons)
    print "Generating Singleton Vcard file and storing it in: ", aboutfile

    # finally write the de-duplicated cards to the address book output file
    with open(aboutfile, 'w') as destination_file:
        for k in range(0, len(ablist_singletons)):
            destination_file.write(ablist_serial[ablist_singletons[k]])

eliminate_vcard_duplicatesv2(abinfile, aboutfile)

One thing you might have noticed is that when you call the .serialize() method, EMAIL is sorted before FN. Unfortunately, the telephone numbers are not sorted. If they were, you could add the serialized individual components to a set and let the unique hashes sort out the multiple occurrences.

If you investigate what you get from the generator vobject.readComponents() (e.g. using type()), you will see that it is a Component from the module vobject.base, and calling dir() on an instance shows a method getSortedChildren(). If you look that up in the source, you will find:

def getSortedChildren(self):
    return [obj for k in self.sortChildKeys() for obj in self.contents[k]]

and sortChildKeys() directly above it:

def sortChildKeys(self):
    try:
        first = [s for s in self.behavior.sortFirst if s in self.contents]
    except Exception:
        first = []
    return first + sorted(k for k in self.contents.keys() if k not in first)

Calling sortChildKeys() on your example instances gives ['version', 'email', 'fn', 'n', 'tel'], which leads to two conclusions:

  • sortFirst causes version to be at the front
  • for obj in self.contents[k] is not sorted, therefore your TEL entries are not sorted

The solution seems to be to redefine getSortedChildren() as:

def getSortedChildren(self):
    return [obj for k in self.sortChildKeys() for obj in sorted(self.contents[k])]

But that leads to a TypeError, apparently because sorted() has to compare the ContentLine objects with one another and they do not define an ordering. There are variations on how to fix this; the following uses class decorators:

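import vobject
from vobject.base import Component, ContentLine

def sort_contents(cls):
    # replace getSortedChildren() with a version that also sorts within each key
    def getSortedChildren(self):
        return [obj for k in self.sortChildKeys() for obj in sorted(self.contents[k])]
    cls.getSortedChildren = getSortedChildren
    return cls

def sortable_content(cls):
    # assumption: the source is cut off after "return str(self)"; the method name,
    # the comparison and the two closing lines are reconstructed from context
    def __lt__(self, other):
        return str(self) < str(other)
    cls.__lt__ = __lt__
    return cls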
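
The underlying idea of this answer, making the order of the content lines irrelevant before comparing cards, can also be sketched without vobject. The following is a simplified illustration and not the answer's own method: it works on the raw .vcf text, treats each card as a sorted tuple of its lines, and uses a set for the lookup. The function names are hypothetical, and the sketch assumes unfolded content lines and ignores differences in letter case or parameter order.

```python
def split_vcards(text):
    """Split raw .vcf text into lists of content lines, one list per card."""
    cards, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.upper() == "BEGIN:VCARD":
            current = []                  # start collecting a new card
        elif line.upper() == "END:VCARD":
            if current is not None:
                cards.append(current)
            current = None
        elif current is not None:
            current.append(line)
    return cards


def canonical_key(lines):
    """Order-insensitive key: the card's content lines, sorted, as a tuple."""
    return tuple(sorted(lines))


def dedupe(text):
    """Keep the first card for every distinct canonical key."""
    seen = set()
    kept = []
    for lines in split_vcards(text):
        key = canonical_key(lines)
        if key not in seen:
            seen.add(key)
            kept.append(lines)
    return kept


# First and last card of the test file: same data, TEL and EMAIL lines reordered.
SAMPLE = """\
BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
EMAIL;TYPE=INTERNET:foobar1@foo.bar.com
TEL;TYPE=CELL:123456789
TEL;TYPE=CELL:987654321
END:VCARD
BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
TEL;TYPE=CELL:987654321
TEL;TYPE=CELL:123456789
EMAIL;TYPE=INTERNET:foobar1@foo.bar.com
END:VCARD
"""
unique_cards = dedupe(SAMPLE)
```

Because the key is a hashable tuple, the set lookup avoids the pairwise comparison of the original loops. Writing the kept cards back out would still need the BEGIN:VCARD and END:VCARD lines added around each list.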