Python: How to combine a new sklearn DataFrame with the original DataFrame


I am importing a .tsv file and using sklearn to create a feature matrix. This works fine. The code is as follows:

import nltk, string, csv, operator, re, collections, sys, struct, zlib, ast, io, math, time
from nltk.corpus import stopwords
import pandas as pd

# This function removes numbers from an array
def remove_nums(arr): 
    # Declare a regular expression
    pattern = '[0-9]'  
    # Remove the pattern, which is a number
    arr = [re.sub(pattern, '', i) for i in arr]    
    # Return the array with numbers removed
    return arr

# This function cleans the passed in paragraph and parses it
def get_words(para):   
    # Create a set of stop words
    stop_words = set(stopwords.words('english'))
    # Split it into lower case    
    lower = para.lower().split()
    # Remove punctuation
    no_punctuation = (nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower)
    # Remove integers
    no_integers = remove_nums(no_punctuation)
    # Remove stop words
    dirty_tokens = (data for data in no_integers if data not in stop_words)
    # Ensure it is not empty
    tokens = [data for data in dirty_tokens if data.strip()]
    # Ensure there is more than 1 character to make up the word
    tokens = [data for data in tokens if len(data) > 1]

    # Return the tokens
    return tokens 

def main():

    tsv_file = "C:\\Users\\Kelly\\Desktop\\Programming Assignment 4\\train.tsv"
    print(tsv_file)
    csv_table=pd.read_csv(tsv_file, sep='\t')
    csv_table.columns = ['rating', 'ID', 'text']

    s = pd.Series(csv_table['text'])
    new = s.str.cat(sep=' ')
    vocab = get_words(new)

    from sklearn.feature_extraction.text import TfidfVectorizer
    s = pd.Series(csv_table['text'])
    corpus = s.apply(lambda s: ' '.join(get_words(s)))

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)

    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    df = pd.DataFrame(data=X.toarray(), columns=vectorizer.get_feature_names_out())

    dfshape = df.shape
    csvshape = csv_table.shape
    print("SHAPE OF DF: {}".format(dfshape))
    print("SHAPE OF CSV_TABLE: {}".format(csvshape))

    print(df)
    print(csv_table)



main()

This code creates two DataFrames, csv_table and df, which have the following shapes:

SHAPE OF DF: (1999, 12287)
SHAPE OF CSV_TABLE: (1999, 3)

A sample of the .tsv file looks like this:

0   abch7619    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 42Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat…..........
1   uewl0928    Duis aute irure d21olor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excep3teur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
0   ahwb3612    Sed ut perspiciatis unde omnis iste natus  error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem                            quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur
1   llll2019    adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et                                     dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur???? Quis autem                                                                               vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?
0   jdne2319    At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga. 
1   asbq0918    Et harum quidem rerum facilis est et expedita distinctio................................ Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere possimus, omnis voluptas assumenda est, omnis dolor repellendus. Temporibus autem quibusdam et               aut

A sample of csv_table looks like this:

      rating                      ID                                               text
0          2  BIeDBg4MrEd1NwWRlFHLQQ  Decent but terribly inconsistent food. I've ha...
1          4  NJHPiW30SKhItD5E2jqpHw  Looks aren't everything.......  This little di...
2          2  nnS89FMpIHz7NPjkvYHmug  Being a creature of habit anytime I want good ...

A sample of df looks like this:

      aaargh  aah  aaron  aback  abacus  abandon  abandoned  abc  ability  ablaze  able  aboard  abode  ...  zippys  ziti  zitti  zoes  zombified  zomg  zoo  zoom  zsa  zsu  ztejas  zucchini  zuppa
0        0.0  0.0    0.0    0.0     0.0      0.0        0.0  0.0      0.0     0.0   0.0     0.0    0.0  ...     0.0   0.0    0.0   0.0        0.0   0.0  0.0   0.0  0.0  0.0     0.0       0.0    0.0
1        0.0  0.0    0.0    0.0     0.0      0.0        0.0  0.0      0.0     0.0   0.0     0.0    0.0  ...     0.0   0.0    0.0   0.0        0.0   0.0  0.0   0.0  0.0  0.0     0.0       0.0    0.0
2        0.0  0.0    0.0    0.0     0.0      0.0        0.0  0.0      0.0     0.0   0.0     0.0    0.0  ...     0.0   0.0    0.0   0.0        0.0   0.0  0.0   0.0  0.0  0.0     0.0       0.0    0.0
3        0.0  0.0    0.0    0.0     0.0      0.0        0.0  0.0      0.0     0.0   0.0     0.0    0.0  ...     0.0   0.0    0.0   0.0        0.0   0.0  0.0   0.0  0.0  0.0     0.0       0.0    0.0
4        0.0  0.0    0.0    0.0     0.0      0.0        0.0  0.0      0.0     0.0   0.0     0.0    0.0  ...     0.0   0.0    0.0   0.0        0.0   0.0  0.0   0.0  0.0  0.0     0.0       0.0    0.0
5        0.0  0.0    0.0    0.0     0.0      0.0        0.0  0.0      0.0     0.0   0.0     0.0    0.0  ... 
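
However, what I now need to do is merge df and csv_table to create a real dataset that holds the correct class, ID, and feature matrix for each class/ID combination that was just created.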

I tried looking at this, but it didn't give me anything useful. I also looked at this, but I don't have an index column (at least I don't think I do).


Since I don't have a key or an index, how can I combine the two without a join?

The two DataFrames have the same number of rows. No shuffling was applied, so the row order never changes.
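
If you want to confirm that assumption before combining, a quick check along these lines may help. This is only a sketch: the toy frames stand in for the real csv_table and df, and the helper name rows_align is made up for illustration.

```python
import pandas as pd

def rows_align(left, right):
    """Return True when both frames have the same length and identical index labels."""
    return len(left) == len(right) and left.index.equals(right.index)

# Toy stand-ins for the real frames (hypothetical data, not from the post)
csv_table = pd.DataFrame(
    {"rating": [2, 4, 2], "ID": ["id_a", "id_b", "id_c"], "text": ["t1", "t2", "t3"]}
)
df = pd.DataFrame({"aah": [0.0, 0.1, 0.0], "zoo": [0.3, 0.0, 0.0]})

aligned = rows_align(csv_table, df)
```

Both frames here carry the default RangeIndex that pandas assigns when no index is given, which is why they line up even though no explicit key column exists.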

So all that is needed is:

result = pd.concat([csv_table, df], axis=1, sort=False)
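
One thing to keep in mind: with axis=1, pd.concat pairs rows by index label rather than by position. With the default RangeIndex on both frames the two are the same thing, but if either frame had been filtered or re-indexed, the labels would need to be reset first. A small sketch with made-up data (the toy values and the names shifted, misaligned and fixed are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical data, not from the post)
csv_table = pd.DataFrame(
    {"rating": [2, 4, 2], "ID": ["id_a", "id_b", "id_c"], "text": ["t1", "t2", "t3"]}
)
df = pd.DataFrame({"aah": [0.0, 0.1, 0.0], "zoo": [0.3, 0.0, 0.0]})

# Default RangeIndex on both sides: rows are paired one-to-one
result = pd.concat([csv_table, df], axis=1)

# Simulate a frame whose index labels no longer start at 0
shifted = df.copy()
shifted.index = [10, 11, 12]

# Mismatched labels: concat takes the union of the indexes and fills gaps with NaN
misaligned = pd.concat([csv_table, shifted], axis=1)

# Dropping the old labels restores positional pairing
fixed = pd.concat([csv_table, shifted.reset_index(drop=True)], axis=1)
```

In the question's case both frames come straight from read_csv and the DataFrame constructor, so the plain concat should be enough.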

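What are the columns in df and csv_table? And what are their dimensions?
So the code that actually builds the two DataFrames from the original .tsv doesn't help? That seems odd. I have flagged your comment for moderator intervention. Editing the post now.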