Concatenating two Python Bunches with the same structure


python merge scikit-learn


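What is the correct way to concatenate two Bunch objects that have exactly the same structure in Python?

(Originally I asked about merging, but "merge" is hard to define.)

Sklearn uses two Bunch objects to store the train and test data separately. I need to concatenate them, so I want a new Bunch that contains all the data from both the test and the train Bunch. Duplicate entries are acceptable.

Below is a sample structure of a Bunch. I think there should be a generic method that does not depend on the Bunch content, as long as the two Bunches have exactly the same structure.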

Key             Type        Size
DESCR           NoneType    1
data            list        2
target_names    list        6212
filenames       int32       (6212L,)
target          string744   (6212L,)
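
The precondition stated above (both Bunches having exactly the same structure) could be checked before concatenating. The following is only a minimal sketch: `same_structure` is a hypothetical helper name, and it assumes "same structure" means the same set of keys with the same value type under each key. It relies only on the Bunch being dict-like, so a plain dict works as well.

```python
def same_structure(bunch_a, bunch_b):
    """Return True if both dict-like objects have the same keys
    and the same value type under every key."""
    if set(bunch_a) != set(bunch_b):
        return False
    return all(type(bunch_a[key]) is type(bunch_b[key]) for key in bunch_a)


# Usage with plain dicts standing in for Bunch objects
a = {"DESCR": None, "data": ["x"], "target_names": ["c1", "c2"]}
b = {"DESCR": None, "data": ["y", "z"], "target_names": ["c1", "c2"]}
print(same_structure(a, b))
```

The check deliberately ignores sizes, since the train and test sets will usually contain a different number of samples.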

Here is an excerpt from the documentation of the function that creates these Bunches:

def load_files(container_path, description=None, categories=None,
           load_content=True, shuffle=True, encoding=None,
           decode_error='strict', random_state=0):
"""Load text files with categories as subfolder names.

Individual samples are assumed to be files stored a two levels folder
structure such as the following:

    container_folder/
        category_1_folder/
            file_1.txt
            file_2.txt
            ...
            file_42.txt
        category_2_folder/
            file_43.txt
            file_44.txt
            ...

The folder names are used as supervised signal label names. The
individual file names are not important.

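Although the description says the file names are not important, they are important to me, so the order of the elements in the concatenated object must stay in sync.

Here is an example: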

import string,os,shutil
from random import randint,choice

############## data preparation ##############################################
def id_generator(size, chars=string.ascii_uppercase + string.digits):
    return ''.join(choice(chars) for _ in range(size))
# utility, copies content of one directory into another with same structure
def merge_directories(root_src_dir,root_dst_dir):
    for src_dir, dirs, files in os.walk(root_src_dir):
        dst_dir = src_dir.replace(root_src_dir, root_dst_dir, 1)
        if not os.path.exists(dst_dir):
            os.makedirs(dst_dir)
        for file_ in files:
            src_file = os.path.join(src_dir, file_)
            shutil.copy(src_file, dst_dir)
# create the directory structures with some random content       
d1='bunch_concatenation'
try:
    shutil.rmtree(d1)
except:
    pass
import itertools as itt
for d2,d3 in itt.product(('test_folder','train_folder')
                         ,('category_1_folder'
                          ,'category_2_folder'
                          # any number of categories really
                          )):
    directory=os.path.join(d1,d2,d3)
    os.makedirs(directory)
    for n_files in range(randint(2,5)):
        f=open(os.path.join(directory,id_generator(size=6))+'.txt',"w")
        f.writelines( "%s\n" %  
            "\n".join([str(id_generator(size=randint(5,8))) 
            for n_lines in range(randint(4,7))])
            )
        f.close()
test_folder=os.path.join(d1,'test_folder')    
train_folder=os.path.join(d1,'train_folder')
both_folder=os.path.join(d1,'both_folder') 
############ end data preparation ###########################################

# This is what I do in my actual code to get bunch_A and bunch_B:
from sklearn.datasets import load_files
bunch_A=load_files(test_folder)
bunch_B=load_files(train_folder)

# Now we have bunch_A and bunch_B which I need to concatenate into bunch_C

############ a workaround to serve as an example. ############################
# Create a both_folder that has all content from test_folder and train_folder.
shutil.copytree(test_folder, both_folder)
merge_directories(train_folder, both_folder)
# And load bunch_C from disk from both_folder
bunch_C=load_files(both_folder)
############# end workaround #################################################

print(bunch_A)
print(bunch_B)
print(bunch_C)

#this is the function I want you to help me to write
def concatenate_bunches(bunch_A,bunch_B):
    """ What should go here?  
    How to create such bunch_C without actually copying the above directories 
    and loading the result from the disk?
    """
    return bunch_C 

# and this would be the usage example :-)
bunch_C=concatenate_bunches(bunch_A,bunch_B)


"""
Here is the directory structure again for your reference
please note that the actual filenames will be random strings

bunch_concatenation
    test_folder/
        category_1_folder/
            file_1.txt
            file_2.txt
            ...
            file_42.txt
        category_2_folder/
            file_43.txt
            file_44.txt
            ...
            file_99.txt
    train_folder/
        category_1_folder/
            file_111.txt
            file_112.txt
            ...
            file_142.txt
        category_2_folder/
            file_143.txt
            file_144.txt
            ...        
            file_150.txt
    both_folder/
        category_1_folder/
            file_1.txt
            file_2.txt
            ...
            file_42.txt
            file_111.txt
            ...
            file_142.txt
        category_2_folder/
            file_43.txt
            file_44.txt
            ...
            file_99.txt
            file_143.txt
            ...
            file_150.txt

"""

Printed output of the Bunches:

Bunch_A
{'target_names': ['category_1_folder', 'category_2_folder'],
 'data': [...],
 'target': array([1, 1, 0, 0]),
 'DESCR': None,
 'filenames': array([
 'bunch_concatenation\\test_folder\\category_2_folder\\KJO1WG.txt',
 'bunch_concatenation\\test_folder\\category_2_folder\\UI0203.txt',
 'bunch_concatenation\\test_folder\\category_1_folder\\FYFAEX.txt',
 'bunch_concatenation\\test_folder\\category_1_folder\\9AP5UJ.txt'],
 dtype='|S60')}

Bunch_B
{'target_names': ['category_1_folder', 'category_2_folder'],
 'data': [..., '0DSW2E5M\r\nO4TPUZUL\r\nDQZYQ08\r\nHNE945\r\n', '275XS4VW\r\nPTDVG17W\r\nVELJHWFB\r\nTP62OLYE\r\nQ49OZG\r\n'],
 'target': array([1, 0, 0, 0, 0, 1, 0]),
 'DESCR': None,
 'filenames': array([
 'bunch_concatenation\\train_folder\\category_2_folder\\S05CO4.txt',
 'bunch_concatenation\\train_folder\\category_1_folder\\R0IYPO.txt',
 'bunch_concatenation\\train_folder\\category_1_folder\\FIS7PE.txt',
 'bunch_concatenation\\train_folder\\category_1_folder\\TRX13N.txt',
 'bunch_concatenation\\train_folder\\category_1_folder\\7V3ELC.txt',
 'bunch_concatenation\\train_folder\\category_2_folder\\FAL8AK.txt',
 'bunch_concatenation\\train_folder\\category_1_folder\\WDGLVH.txt'],
 dtype='|S61')}

Bunch_C
{'target_names': ['category_1_folder', 'category_2_folder'],
 'data': [...],
 'target': array([0, 1, 0, 1, 0, 0, 1, 1, 0, 0]),
 'DESCR': None,
 'filenames': array([
 'bunch_concatenation\\both_folder\\category_1_folder\\R0IYPO.txt',
 'bunch_concatenation\\both_folder\\category_2_folder\\S05CO4.txt',
 'bunch_concatenation\\both_folder\\category_1_folder\\FIS7PE.txt',
 'bunch_concatenation\\both_folder\\category_2_folder\\UI0203.txt',
 'bunch_concatenation\\both_folder\\category_1_folder\\WDGLVH.txt',
 'bunch_concatenation\\both_folder\\category_1_folder\\9AP5UJ.txt',
 'bunch_concatenation\\both_folder\\category_2_folder\\FAL8AK.txt',
 'bunch_concatenation\\both_folder\\category_2_folder\\KJO1WG.txt',
 'bunch_concatenation\\both_folder\\category_1_folder\\FYFAEX.txt',
 'bunch_concatenation\\both_folder\\category_1_folder\\7V3ELC.txt',
 'bunch_concatenation\\both_folder\\category_1_folder\\TRX13N.txt'],
 dtype='|S60')}

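I just noticed one thing that the workaround does not reproduce: the file names in bunch_C contain 'both_folder', whereas they should contain 'test_folder' or 'train_folder', depending on the source Bunch. This is a side effect of the workaround, but it does not really matter for what I need to do with the resulting Bunch.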
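
One possible shape for such a function is sketched below. It is not an established scikit-learn API, only an illustration under stated assumptions: the parameter `per_sample_keys` and the helper `_concat` are hypothetical names, the fields `data`, `filenames` and `target` are assumed to hold one entry per sample, and every other field (such as `DESCR` and `target_names`) is assumed to be a plain Python value that is equal in both inputs. Because each per-sample field is concatenated in the same order (first A, then B), the entries stay in sync, and the file names keep their original 'test_folder' or 'train_folder' paths.

```python
def _concat(a, b):
    """Concatenate two per-sample fields of the same kind."""
    if isinstance(a, list) and isinstance(b, list):
        return a + b
    # Array-valued fields (e.g. target, filenames) need NumPy.
    import numpy as np
    return np.concatenate([a, b])


def concatenate_bunches(bunch_a, bunch_b,
                        per_sample_keys=("data", "filenames", "target")):
    """Concatenate two dict-like bunches that share the same keys.

    per_sample_keys is an assumption of this sketch: it names the fields
    that hold one entry per sample. All other fields must be equal in
    both inputs and are copied unchanged.
    """
    if set(bunch_a) != set(bunch_b):
        raise ValueError("bunches do not have the same keys")

    merged = {}
    for key in bunch_a:
        a, b = bunch_a[key], bunch_b[key]
        if key in per_sample_keys:
            merged[key] = _concat(a, b)
        elif a == b:  # assumes plain Python values such as None or a list
            merged[key] = a
        else:
            raise ValueError("field %r differs between the bunches" % key)
    # Build the same container type as the input (a Bunch is dict-like).
    return type(bunch_a)(**merged)


# Usage with plain dicts standing in for Bunch objects
small_a = {"DESCR": None, "target_names": ["c1", "c2"],
           "data": ["x", "y"], "filenames": ["f1", "f2"], "target": [1, 0]}
small_b = {"DESCR": None, "target_names": ["c1", "c2"],
           "data": ["z"], "filenames": ["f3"], "target": [0]}
small_c = concatenate_bunches(small_a, small_b)
print(small_c["filenames"])
```

Requiring `target_names` to be identical is a deliberate choice in this sketch: the integers in `target` index into `target_names`, so concatenating Bunches with different category lists would need the targets to be remapped first.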

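Can you clarify exactly what you mean by "merge"? Suppose I want to perform A+B, where A and B are both Bunch objects. If both have a data entry containing different lists, what should happen? Should they be concatenated, or should only the list from A or from B be used in the output? What about other types? What if A["data"] and B["data"] contain objects of different types?

I understand the point about different structures and the complexity they bring. That is why I am asking here about exactly identical structures. In my specific case, I load two Bunches from two directories and want another Bunch that looks as if the two directories were one and the same directory, following the same logic as copying files (with overwriting). As a workaround, I can do the copy through the OS and then load the resulting directory into a new Bunch. So A and B should not contain objects of different types.

By "overwrite", I assume you don't need any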