Python: Preprocessing CSV data using the TensorFlow Dataset API




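I'm playing around with TensorFlow, but I'm a bit confused about the input pipeline. The data I'm working with is in a large CSV file with 307 columns: the first is a string representing a date, and the rest are floats.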

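I'm running into some issues with preprocessing the data. I want to add a couple of features that replace the datetime string but are derived from it (specifically, a sine and a cosine representing the time). I also want to group the next 120 values in each CSV row into one feature, the 96 values after that into another feature, and set the label based on the remaining values in the row.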
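
To make the column layout concrete, here is a minimal plain-Python sketch of the grouping described above, independent of TensorFlow. The helper name `split_row` and the constants are illustrative assumptions, not part of any library. It mainly confirms that 1 + 120 + 96 + 90 columns add up to 307, so 90 values remain for the label.

```python
import csv
import io

# Column counts taken from the description; the names are hypothetical.
N_COLUMNS = 307    # total columns per row
N_MINUTE = 120     # columns 1..120
N_HALF_HOUR = 96   # columns 121..216


def split_row(line):
    """Split one CSV line into (datetime_string, minute, half_hour, rest)."""
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) != N_COLUMNS:
        raise ValueError("expected %d columns, got %d" % (N_COLUMNS, len(fields)))
    datetime_string = fields[0]
    values = [float(x) for x in fields[1:]]
    minute = values[:N_MINUTE]
    half_hour = values[N_MINUTE:N_MINUTE + N_HALF_HOUR]
    rest = values[N_MINUTE + N_HALF_HOUR:]  # the 90 values left for the label
    return datetime_string, minute, half_hour, rest
```

The slice boundaries here match the `items[1:121]`, `items[121:217]` and `items[217:]` indices used in the TensorFlow code.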

This is my code for generating the datasets so far:

import tensorflow as tf

defaults = []
defaults.append([""])
for i in range(0,306):
  defaults.append([1.0])

def dataset(train_fraction=0.8):
  path = "training_examples_shuffled.csv"

  # Define how the lines of the file should be parsed
  def decode_line(line):
    items = tf.decode_csv(line, record_defaults=defaults)

    datetimeString = items[0]
    minuteFeatures = items[1:121]
    halfHourFeatures = items[121:217]
    labelFeatures = items[217:]

    ## Do something to convert datetimeString to timeSine and timeCosine

    features_dict = {
      'timeSine': timeSine,
      'timeCosine': timeCosine,
      'minuteFeatures': minuteFeatures,
      'halfHourFeatures': halfHourFeatures
    }

    # Placeholder. I seem to need some Python logic here, but I'm not sure
    # how to apply that to data in tensor format.
    label = [1]

    return features_dict, label

  def in_training_set(line):
    """Returns a boolean tensor, true if the line is in the training set."""
    num_buckets = 1000000
    bucket_id = tf.string_to_hash_bucket_fast(line, num_buckets)
    # Use the hash bucket id as a random number that's deterministic per example
    return bucket_id < int(train_fraction * num_buckets)

  def in_test_set(line):
    """Returns a boolean tensor, true if the line is in the test set."""
    return ~in_training_set(line)

  base_dataset = (tf.data
                  # Get the lines from the file.
                  .TextLineDataset(path))

  train = (base_dataset
           # Take only the training-set lines.
           .filter(in_training_set)
           # Decode each line into a (features_dict, label) pair.
           .map(decode_line))

  # Do the same for the test-set.
  test = (base_dataset.filter(in_test_set).map(decode_line))

  return train, test
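
The train/test split in the code above relies on hashing each line into a bucket, so the same line always lands in the same set. A rough plain-Python analogue is sketched below. It uses `zlib.crc32` as a stand-in hash, which is an assumption made for illustration and not the hash that `tf.string_to_hash_bucket_fast` uses, so the bucket ids will differ from TensorFlow's.

```python
import zlib


def in_training_set(line, train_fraction=0.8, num_buckets=1000000):
    """Return True if the line falls into the training share of the buckets."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    bucket_id = zlib.crc32(line.encode("utf-8")) % num_buckets
    return bucket_id < int(train_fraction * num_buckets)


def split_lines(lines, train_fraction=0.8):
    """Partition lines into (train, test) without storing any random state."""
    train = [l for l in lines if in_training_set(l, train_fraction)]
    test = [l for l in lines if not in_training_set(l, train_fraction)]
    return train, test
```

Because membership depends only on the line's content, every line ends up in exactly one of the two sets, and rerunning the split reproduces it exactly. The resulting proportion is only approximately `train_fraction`.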



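My question now is: how can I access the string inside the datetimeString tensor to convert it to a datetime object? Or is this the wrong place to do that? I'd like to use the time of day and the day of the week as input features.
Second: much the same goes for the label, which is based on the remaining values of the CSV row. Can I use standard Python code for this in some way, or should I use basic TensorFlow ops to achieve what I want, if that is possible?
Finally, any comments on whether this is a decent way of handling my inputs? TensorFlow is a bit confusing, and the old tutorials floating around the internet use deprecated ways of handling inputs.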
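
For reference, the time-of-day and day-of-week features asked about in the first question could be computed in plain Python roughly as follows. This is only a sketch under stated assumptions: the timestamp format `%Y-%m-%d %H:%M:%S` is a guess, since the actual format in the CSV is not shown, and the function name and the `weekdaySine`/`weekdayCosine` keys are hypothetical.

```python
import math
from datetime import datetime

# Assumed timestamp layout; the real format in the CSV file is not shown.
ASSUMED_FORMAT = "%Y-%m-%d %H:%M:%S"


def cyclic_time_features(datetime_string, fmt=ASSUMED_FORMAT):
    """Encode time of day and day of week as points on a unit circle."""
    dt = datetime.strptime(datetime_string, fmt)
    seconds = dt.hour * 3600 + dt.minute * 60 + dt.second
    day_angle = 2.0 * math.pi * seconds / 86400.0     # one turn per day
    week_angle = 2.0 * math.pi * dt.weekday() / 7.0   # Monday is 0
    return {
        "timeSine": math.sin(day_angle),
        "timeCosine": math.cos(day_angle),
        "weekdaySine": math.sin(week_angle),
        "weekdayCosine": math.cos(week_angle),
    }
```

A function like this works on Python strings, not tensors, so it cannot be called directly on `datetimeString` inside `Dataset.map`. Depending on the TensorFlow version, it would likely need to be wrapped (for example with `tf.py_function`) or rewritten with TensorFlow string ops such as `tf.strings.substr` and `tf.strings.to_number`.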