Python TensorFlow TextVectorization layer: how to define a custom standardization function?

python numpy tensorflow keras text

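I am trying to create a custom standardization function for the TextVectorization layer, but I seem to be making some fundamental mistake.

I have the following text data: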

import numpy as np

my_array = np.array([
    "I am a sentence.",
    "I am another sentence!"
])

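My goal: I basically want to lowercase the text, strip punctuation, and remove some words. The default standardization of the TextVectorization layer (LOWER_AND_STRIP_PUNCTUATION) lowercases the text and strips punctuation, but as far as I can tell it cannot remove whole words.

(If you know a way to do that, an alternative to the approach described below is of course also very welcome.)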

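A working example: first, here is an example of a custom standardization function that works.

A custom function that does not work: however, my own custom standardization keeps raising errors. Here is my code: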

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.preprocessing.text import text_to_word_sequence

my_array = np.array([
    "I am a sentence",
    "I am another sentence"
])

# these words should be removed
bad_words = ["i", "am"]

def remove_words(tokens):
    return [word for word in tokens if word not in bad_words]

# this is the normalization function I want to apply
def my_custom_normalize(my_array):
    tokenized = [text_to_word_sequence(str(sentence)) for sentence in my_array]
    clean_texts = [" ".join(remove_words(tokenized_string))
                   for tokenized_string in tokenized]
    clean_tensor = tf.convert_to_tensor(clean_texts)
    return clean_tensor

my_vectorize_layer = TextVectorization(
    output_mode='int',
    standardize=my_custom_normalize,
)

However, as soon as I try to adapt the layer, I keep running into an error:

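my_vectorize_layer.adapt(my_array)  # raises error

I really do not understand why. The documentation says:

"When using a custom callable for standardize, the data received by the callable will be exactly as passed to this layer. The callable should return a tensor of the same shape as the input."

I thought that might be what causes the error. But when I look at the shapes, everything seems correct: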

my_result = my_custom_normalize(my_array)
my_result.shape  # returns TensorShape([2])

working_result = custom_standardization(my_array)
working_result.shape  # returns TensorShape([2])

I am really lost here. What am I doing wrong? Am I not supposed to use list comprehensions?
import re
import string

import tensorflow as tf

# stopwords_eng is assumed to be a list of lowercase stop words defined elsewhere
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    # remove numbers, including decimals and exponents
    stripped_html = tf.strings.regex_replace(
        stripped_html, r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', ' ')
    # remove @mentions
    stripped_html = tf.strings.regex_replace(stripped_html, r'@([A-Za-z0-9_]+)', ' ')
    # remove stop words; this pattern only matches a word with a space on both sides
    for i in stopwords_eng:
        stripped_html = tf.strings.regex_replace(stripped_html, f' {i} ', " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(string.punctuation), ""
    )
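One possible reading of the "Tried to squeeze dim index 1" error: the layer seems to hand the callable a tensor with an extra trailing dimension, shape (batch, 1), and to squeeze that dimension afterwards, so a callable that returns shape (batch,) fails. That is an inference from the error message, not something confirmed here. The sketch below uses only tf.strings ops, which keep the input shape, and removes whole words with a word-boundary pattern rather than a space-delimited one. The name standardize_tf_ops, the two-word pattern, and the (batch, 1) test input are illustrative assumptions.

```python
import re
import string

import tensorflow as tf

# Hypothetical pattern for the two words from the question; \b keeps the
# match to whole words, including at the start or end of a string.
BAD_WORDS_PATTERN = r"\b(?:i|am)\b"


def standardize_tf_ops(input_data):
    """Lowercase, strip punctuation and drop whole words with tf.strings ops only."""
    lowercase = tf.strings.lower(input_data)
    no_punct = tf.strings.regex_replace(
        lowercase, "[%s]" % re.escape(string.punctuation), ""
    )
    no_bad_words = tf.strings.regex_replace(no_punct, BAD_WORDS_PATTERN, "")
    # Collapse the whitespace left behind by the removals and trim the ends.
    collapsed = tf.strings.regex_replace(no_bad_words, r"\s+", " ")
    return tf.strings.strip(collapsed)


# Assumed input shape (batch, 1); every op above is element-wise,
# so the output has the same shape as the input.
batch = tf.constant([["I am a sentence."], ["I am another sentence!"]])
result = standardize_tf_ops(batch)
print(result.shape)
```

If that reading is right for your TensorFlow version, a function like this could be passed as standardize=standardize_tf_ops when constructing the TextVectorization layer.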
I think it would be best to use a regex with tf.strings.regex_replace to remove those words. I have never used TextVectorization in Keras, but judging from the source code, this seems to be what causes the error. Just a thought: try replacing the body of my_custom_normalize with the following:

return tf.strings.regex_replace(my_array, "(?i)i|am", "")

In practice there will be many more words to remove (>200), so putting them all into a regex would be very cumbersome...

For reference, the error raised by my_vectorize_layer.adapt(my_array) in the question is:

InvalidArgumentError: Tried to squeeze dim index 1 for tensor with 1 dimensions. [Op:Squeeze]
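On the concern about more than 200 words: the pattern does not have to be written by hand, it can be assembled from the list. A minimal sketch, assuming the words are plain lowercase strings; build_word_removal_pattern and the sample list are hypothetical names. It is checked here with Python's re module. tf.strings.regex_replace uses RE2 syntax, which also supports \b and (?:...), so the same pattern string should be usable there.

```python
import re


def build_word_removal_pattern(words):
    """Return a regex string that matches any of `words` as a whole word."""
    # Escape each word so characters such as "." or "+" are taken literally,
    # and sort the de-duplicated list so the resulting pattern is deterministic.
    escaped = [re.escape(word) for word in sorted(set(words))]
    return r"\b(?:" + "|".join(escaped) + r")\b"


# Hypothetical word list; the real one would hold the 200+ words.
bad_words = ["i", "am", "the", "a"]
pattern = build_word_removal_pattern(bad_words)

# Remove the words, then collapse the whitespace they leave behind.
removed = re.sub(pattern, "", "i am the author of a sentence")
cleaned = " ".join(removed.split())
print(pattern)
print(cleaned)
```

Because of the word boundaries, "a" is removed as a standalone word while "author" is left alone, which the unanchored pattern "(?i)i|am" would not guarantee.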