Python: Error when trying to tokenize text in Keras

python numpy keras

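I have a DataFrame with two columns. The first column (content_cleaned) contains rows of sentences. The second column (meaningful) contains the associated binary labels.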

The problem is that when I try to tokenize the text in the content_cleaned column, I get an error. Here is my code so far:

df = pd.read_csv(pathname, encoding = "ISO-8859-1")
df = df[['content_cleaned', 'meaningful']]
df = df.sample(frac=1)

#Transposed columns into numpy arrays 
X = np.asarray(df[['content_cleaned']])
y = np.asarray(df[['meaningful']])

#Split into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21) 

# Create tokenizer
tokenizer = Tokenizer(num_words=100) #No row has more than 100 words.

#Tokenize the predictors (text)
X_train = tokenizer.sequences_to_matrix(X_train.astype(np.int32), mode="binary")
X_test = tokenizer.sequences_to_matrix(X_test.astype(np.int32), mode="binary")

#Convert the labels to the binary
encoder = LabelBinarizer()
encoder.fit(y_train) 
y_train = encoder.transform(y_train)
y_test = encoder.transform(y_test)

The line of code highlighted by the error is:

X_train = tokenizer.sequences_to_matrix(X_train.astype(np.int32), mode="binary")

The error message is:

invalid literal for int() with base 10: "STX's better than reported quarter is likely to bode well for WDC results."

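The sentence after "base 10:" is one row from the text column. It is an example of the sentences I am trying to tokenize.

I believe this is a NumPy issue, but it could just as well be a mistake in the way I am tokenizing this array of text.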

Any help would be appreciated.
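For context, the message itself can be reproduced without Keras or NumPy. Calling astype(np.int32) on an array of strings tries to parse every element as an integer, and a sentence is not a valid integer literal. The sketch below uses only the built-in int(); try_int is a hypothetical helper added for illustration.

```python
# Reproduce the error message with the built-in int(); no Keras or NumPy needed.
sentence = "STX's better than reported quarter is likely to bode well for WDC results."

def try_int(text):
    """Return the parsed integer, or the ValueError message if parsing fails."""
    try:
        return int(text)
    except ValueError as exc:
        return str(exc)

message = try_int(sentence)
print(message)        # the "invalid literal for int() with base 10" message
print(try_int("42"))  # a string of digits parses without error
```

So the failure happens before any Keras code runs: the cast to integers is applied to raw sentences that have not been turned into word indices yet.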

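You are not tokenizing the text. The sequences_to_matrix method does not tokenize text; it converts a list of sequences (lists of integer word indices) into a NumPy matrix. There are many ways to tokenize text data, so if you want to use the Keras tokenizer, you can do it as follows: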

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Note: num_words is not the maximum length of a sentence.
# It is the maximum number of words to keep in the vocabulary.
tokenizer = Tokenizer(num_words=100)

# Build the word index from the training text only.
# Fitting on the test data would leak information and inflate your score.
tokenizer.fit_on_texts(X_train)

# Convert the texts to sequences of integer word indices
X_train_encoded = tokenizer.texts_to_sequences(X_train)
X_test_encoded = tokenizer.texts_to_sequences(X_test)

# Pad the sequences so that they all have the same length
max_len = 100
X_train = pad_sequences(X_train_encoded, maxlen=max_len)
X_test = pad_sequences(X_test_encoded, maxlen=max_len)
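To make the three steps above more concrete, here is a rough pure-Python sketch of what they do conceptually. The helper names (toy_fit, toy_texts_to_sequences, toy_pad) are hypothetical and this is not the Keras implementation: it only lowercases and splits on whitespace, whereas the real Tokenizer also filters punctuation by default. It does try to mirror three Keras behaviours: indices start at 1 with frequent words first, only words with an index below num_words are kept, and padding and truncation happen at the front by default.

```python
from collections import Counter

def toy_fit(texts):
    """Build a word -> index map; more frequent words get smaller indices (from 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def toy_texts_to_sequences(texts, word_index, num_words):
    """Replace each known word by its index; drop unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def toy_pad(sequences, maxlen):
    """Left-pad with 0, or keep only the last maxlen items (assumes maxlen > 0)."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

train = ["good quarter for the stock", "bad quarter"]
test = ["good stock unknown"]

word_index = toy_fit(train)  # fit on the training text only
train_seq = toy_texts_to_sequences(train, word_index, num_words=100)
test_seq = toy_texts_to_sequences(test, word_index, num_words=100)

print(toy_pad(train_seq, maxlen=5))  # [[4, 1, 3, 6, 5], [0, 0, 0, 2, 1]]
print(toy_pad(test_seq, maxlen=5))   # [[0, 0, 0, 4, 5]]
```

Note how "unknown" disappears from the test sentence: it was never seen during fitting, so it has no index. This is also why num_words and max_len are independent settings, one limits the vocabulary and the other the row length.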

Hope this helps. There is also a great tutorial on preprocessing text with Keras that is worth a look.

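Thank you so much for getting back to me! When I try to run the code, it highlights "tokenizer.fit_on_texts(X_train)" with the error: "AttributeError: 'numpy.ndarray' object has no attribute 'lower'". Also, once all the padding is done and LabelBinarizer has been applied to the target variable, can I create my layers and fit my model?

Don't convert the pandas Series to a NumPy array; you can pass it directly to the Keras tokenizer or convert it to a list. You can create the model after padding. I think you should follow an NLP tutorial, such as sentiment analysis with Keras; most NLP use cases share the same preprocessing steps.

That seems to do it, thanks! Before I close the question, I have one last question. After the preprocessing steps you provided, can the preprocessed data be fed into the network?

Yes, you can feed it into the network, but when building the model make sure your input layer's shape matches the input length (in this case, the maximum sentence length).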
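The AttributeError mentioned in the comments comes from the shape of the data. np.asarray(df[['content_cleaned']]) selects with double brackets, so the result has one single-item row per sentence, and the tokenizer ends up calling .lower() on a row instead of on a string. The sketch below uses plain nested lists as a stand-in for that (n, 1) array, an assumption made so that it runs without NumPy or Keras; lower_all and flatten_rows are hypothetical helpers.

```python
# Stand-in for np.asarray(df[['content_cleaned']]): one single-item row per sentence.
rows = [["First example sentence."], ["Second example sentence."]]

def lower_all(texts):
    """Mimic an early step of tokenization: lowercase every text."""
    return [text.lower() for text in texts]

try:
    lower_all(rows)
    error = None
except AttributeError as exc:
    error = str(exc)  # each row is a list, not a str, so it has no .lower()
print(error)

def flatten_rows(rows):
    """Turn [[s1], [s2], ...] into [s1, s2, ...] so every item is a plain string."""
    return [row[0] for row in rows]

texts = flatten_rows(rows)
print(lower_all(texts))  # works once every item is a string
```

With the real data, the equivalent would be to select a single column and turn it into a list, for example df['content_cleaned'].tolist(), or to flatten the existing array with X_train.ravel().tolist() before calling fit_on_texts.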