How to check the list values of a pandas column in Python

python pandas dataframe

I have a pandas DataFrame in which one column holds lists as values. I am feeding one column into KFold:

# Imports assumed; they were not shown in the original post
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

filtered_labels = filtered_df['labels']
filtered_sentences = filtered_df.drop('labels', axis=1)

kf = KFold(n_splits=5)  # define the split: 5 folds
kf.get_n_splits(filtered_sentences)

# split() takes the data itself, not the number of rows
for train_index, test_index in kf.split(filtered_sentences):
    X_train, X_test = filtered_sentences.loc[train_index,filtered_sentences.columns], filtered_sentences.loc[test_index,filtered_sentences.columns]
    y_train, y_test = filtered_labels[train_index], filtered_labels[test_index]

    tdif_vectorizer = TfidfVectorizer(max_df=5, norm='l2', smooth_idf=True, use_idf=True, ngram_range=(1, 1))

    # TfidfVectorizer expects strings, so each token list is joined first
    train_corpus_as_string = [get_string_representation_from_tokens(sentence_tokens)
                              for sentence_tokens in X_train['setenceTokens']]
    test_corpus_as_string = [get_string_representation_from_tokens(sentence_tokens)
                             for sentence_tokens in X_test['setenceTokens']]

    tdif_train_features = tdif_vectorizer.fit_transform(train_corpus_as_string)
    tdif_test_features = tdif_vectorizer.transform(test_corpus_as_string)

    vModel = LogisticRegression()
    vModel.fit(tdif_train_features, y_train)
    tdif_predicted_data_set = vModel.predict(tdif_test_features)

When I print the contents, I see the following:

X_train, X_test = filtered_sentences.loc[train_index,filtered_sentences.columns], filtered_sentences.loc[test_index,filtered_sentences.columns]

X_train['setenceTokens']

Out[642]: 
2171     [catastrophic, effect, hiroshima, nagasaki, at...
2172     [iraq, catastrophic, need, replace, constant, ...
2173          [learn, legacy, catastrophic, eruption, via]
2174     [catastrophic, effect, hiroshima, nagasaki, at...
2175              [wish, go, custom, werent, catastrophic]
2176     [best, part, old, baseball, manager, wear, uni...
2177               [learn, event, u, history, year, later]
2178     [catastrophic, effect, hiroshima, nagasaki, at...
2179     [catastrophic, effect, hiroshima, nagasaki, at...
2180              [society, respond, crisis, catastrophic]
2181     [british, upper, class, cause, catastrophic, s...
2182                   [dear, anyone, family, alive, 2040]
2183     [scientist, believe, catastrophic, manmade, gl...
2184     [everything, seem, catastrophic, feel, bad, hi...
2185     [jim, blog, catastrophic, outcome, may, come, ...
2186     [u, want, lead, united, state, catastrophic, w...
2187                  [stop, extreme, hurt, middle, class]
2188     [learn, legacy, catastrophic, eruption, new, y...
2189          [learn, legacy, catastrophic, eruption, via]
2190     [catastrophic, effect, hiroshima, nagasaki, at...
2191            [good, look, catastrophic, rain, flooding]
...

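Since these values are lists inside a list, I want to convert them into an array of the following form, ['society, respond, crisis, catastrophic', 'something, catastrophic, come, tune', ...], so that I can pass it to tdif_vectorizer.fit_transform(array_of_strings).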

When I iterate over the tokens with the following:

train_corpus_as_string = [get_string_representation_from_tokens(sentence_tokens)
                        for sentence_tokens in X_train['setenceTokens']]

Inside the function I print the list I receive, and one of the values turns out to be nan. See below:

....
['escape', 'place', 'hide', 'time', 'space', 'collide']
['niggra', 'first', 'time', 'hear', 'song', 'sky', 'collide']
['even', 'star', 'moon', 'collide', 'oh', 'oh', 'never', 'want', 'back', 'life', 'take', 'word']
nan

and the error: TypeError: 'float' object is not iterable

filtered_sentences.isnull().sum()
Out[652]: 
setenceTokens    0
dtype: int64

Below is my get_string_representation_from_tokens method:

def get_string_representation_from_tokens(tokens):
    string_tokens = ""
    print(tokens)
    for token in tokens:
        string_tokens += str(token) + " "
    return string_tokens

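My end goal is to run KFold 5 times, take the training data of each fold, vectorize it with TfidfVectorizer, fit a logistic regression model, and predict values. TfidfVectorizer expects its input as an array of strings, which is why I iterate over the list of lists above to build the array I described.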
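The goal described above can be sketched end to end on toy data. This is a minimal sketch, not the poster's code: the frame toy_df, its token lists, and its labels are invented for illustration, TfidfVectorizer is left at its defaults, and rows are selected by position with iloc.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Hypothetical toy data; the column names follow the post, the values are made up
toy_df = pd.DataFrame({
    "setenceTokens": [
        ["catastrophic", "effect", "hiroshima"],
        ["learn", "legacy", "catastrophic", "eruption"],
        ["wish", "go", "custom"],
        ["society", "respond", "crisis"],
        ["stop", "extreme", "hurt"],
        ["good", "look", "rain", "flooding"],
        ["dear", "anyone", "family", "alive"],
        ["learn", "event", "history", "year"],
        ["best", "part", "old", "baseball"],
        ["everything", "seem", "feel", "bad"],
    ],
    "labels": [1, 1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# One space-separated string per row, the input shape TfidfVectorizer expects
corpus = toy_df["setenceTokens"].map(" ".join)
labels = toy_df["labels"]

kf = KFold(n_splits=5)
fold_predictions = []
for train_pos, test_pos in kf.split(corpus):
    # KFold yields integer positions, so select rows with iloc
    train_text, test_text = corpus.iloc[train_pos], corpus.iloc[test_pos]
    y_train = labels.iloc[train_pos]

    # Fit the vocabulary on the training fold only, then reuse it for the test fold
    vectorizer = TfidfVectorizer()
    train_features = vectorizer.fit_transform(train_text)
    test_features = vectorizer.transform(test_text)

    model = LogisticRegression()
    model.fit(train_features, y_train)
    fold_predictions.append(model.predict(test_features))
```

Each entry of fold_predictions holds the predicted labels for one test fold.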

How can I check whether a value is nan and assign an empty string instead? I have tried many approaches without success.
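One way to guard against the nan is to test the type of the value before iterating over it. The helper below is a sketch under that assumption; tokens_to_string is a hypothetical name, and unlike the original function it does not leave a trailing space.

```python
def tokens_to_string(tokens):
    """Join a list of tokens into one space-separated string.

    Anything that is not a list or tuple (for example a float nan
    or None) is mapped to an empty string instead of being iterated.
    """
    if isinstance(tokens, (list, tuple)):
        return " ".join(str(token) for token in tokens)
    return ""


# A float nan stands in for the missing row seen in the printout
sample_rows = [["society", "respond", "crisis"], float("nan"), ["learn", "event"]]
corpus_as_strings = [tokens_to_string(row) for row in sample_rows]
```

Checking for a list rather than for nan also covers None and other non-iterable values.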

Question 2

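I am trying to build an example so that the idea is easy to run, but I have a separate problem (forgive me for raising it at the end). The problem is that splitting the data introduces nan values.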

The column in my original dataframe has no null/nan values, as shown below:

filtered_sentences.isnull().sum()
Out[652]: 
setenceTokens    0
dtype: int64

But when I split it with the line below:

X_train, X_test = filtered_sentences.loc[train_index,filtered_sentences.columns], filtered_sentences.loc[test_index,filtered_sentences.columns]

X_train contains null/nan values, see below:

X_train.isnull().sum()
Out[653]: 
setenceTokens    21
dtype: int64

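There are 21 such values. I saw a similar question and applied the same approach, but I still get nan values. If I can get past this, I will not need to check the values at all. Sorry for making this post so long.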
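One possible source of these nan values is the mix of positional and label-based indexing: KFold.split yields integer positions, while .loc looks rows up by index label. Whether that is what happened here is an assumption, since the construction of filtered_df is not shown. The sketch below uses a hypothetical frame whose index has gaps, as row filtering leaves behind, and uses reindex to stand in for a label lookup that fills missing labels with NaN (older pandas versions did this for .loc with a list of labels; newer ones raise KeyError).

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical frame whose index has gaps (labels 1, 4, 6, 7 are missing)
toy = pd.DataFrame(
    {"setenceTokens": [["a", "b"], ["c"], ["d", "e"], ["f"], ["g", "h"], ["i"]]},
    index=[0, 2, 3, 5, 8, 9],
)

kf = KFold(n_splits=3)
train_pos, test_pos = next(kf.split(toy))  # positions 2..5 train, 0..1 test

# Treating positions as labels: label 4 does not exist, so a NaN row appears
by_label = toy.reindex(train_pos)

# Treating positions as positions: every selected row exists
by_position = toy.iloc[train_pos]

nan_by_label = int(by_label["setenceTokens"].isnull().sum())
nan_by_position = int(by_position["setenceTokens"].isnull().sum())
```

If this is the cause, selecting with iloc, or calling reset_index(drop=True) on the frame before splitting, keeps positions and labels aligned.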

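I found the problem. With this solution I no longer get nan values. The problem was the way I created the dataframe. Previously, my dataframe held the column values as arrays, like this: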

['feel','bad','literally','feel']
['feeling','heart','sinking']

But it should hold the values as plain strings:

feel bad literally feel
feeling heart sinking

After that, splitting with KFold no longer gave me nan values. I hope this saves someone some time.

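This is probably a simple fix. However, please provide a minimal example so that it is easier to help you. Provide the get_string_representation_from_tokens() code, sample data, and your expected output.

@andrew I have edited my question and shown the output I get. I hope this helps in understanding the problem.

That is a good start, but your input is not really usable: it is just a truncated printout of your Series. Can you provide the complete first few rows, a case with NA, and the exact expected output? Part of creating an MCVE is that you do not need to provide the exact data, only a toy use case that captures the problem you are having. (You can also use real data if it is simple enough, but a printout containing … does not let anyone else use your code to reproduce the problem.)

I added the dots because the output is a huge list, since I call print(tokens) inside get_string_representation_from_tokens. What I have is a pandas dataframe; I give KFold one column to split the data, and for each split I need to perform the steps above inside the for loop. Please look at what is in the for loop. I have already stated my expected output, which is an array of strings. I will try to simplify the question.

Please do as Andrew suggested and give us a complete toy example that we can run on our own computers and that shows the same problem. Otherwise it is hard to guess where the problem lies.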

df.col_name.str.join("")
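The str.join suggestion above can be tried on a small frame. This is a sketch with invented data: the single-space separator is an assumption chosen to match the space-separated strings the poster wants, and fillna("") is added so that a missing row becomes an empty string rather than NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two token lists from the post plus one missing row
df = pd.DataFrame({
    "setenceTokens": [
        ["feel", "bad", "literally", "feel"],
        ["feeling", "heart", "sinking"],
        np.nan,
    ]
})

# Series.str.join joins each list element-wise; NaN rows stay NaN until filled
corpus = df["setenceTokens"].str.join(" ").fillna("").tolist()
```

The resulting list of plain strings can be passed directly to TfidfVectorizer.fit_transform.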