
Hadoop: how to correctly import a CSV dataset using a Kite dataset partition strategy?


I am working with the publicly available CSV dataset from MovieLens. I have created a partitioned dataset for ratings.csv:

kite-dataset create ratings --schema rating.avsc --partition-by year-month.json --format parquet
Here is my year-month.json:

[ {
  "name" : "year",
  "source" : "timestamp",
  "type" : "year"
}, {
  "name" : "month",
  "source" : "timestamp",
  "type" : "month"
} ]
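For this strategy to work, rating.avsc has to expose a long field named timestamp. The schema itself is not shown here, but for the MovieLens ratings columns it would look roughly like the sketch below (field names assumed from the ratings.csv header userId,movieId,rating,timestamp):

# sketch only: write a minimal rating.avsc matching the assumed CSV header
cat > rating.avsc << 'EOF'
{
  "type" : "record",
  "name" : "Rating",
  "fields" : [
    { "name" : "userId",    "type" : "int" },
    { "name" : "movieId",   "type" : "int" },
    { "name" : "rating",    "type" : "double" },
    { "name" : "timestamp", "type" : "long" }
  ]
}
EOF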
Here is my CSV import command:

kite-dataset csv-import ratings.csv ratings
After the import finished, I ran this command to see where the year and month partitions were actually created:

hadoop fs -ls /user/hive/warehouse/ratings/
I noticed that only a single year partition was created, with a single month partition inside it:

[cloudera@quickstart ml-20m]$ hadoop fs -ls /user/hive/warehouse/ratings/
Found 3 items
drwxr-xr-x   - cloudera supergroup          0 2016-06-12 18:49 /user/hive/warehouse/ratings/.metadata
drwxr-xr-x   - cloudera supergroup          0 2016-06-12 18:59 /user/hive/warehouse/ratings/.signals
drwxrwxrwx   - cloudera supergroup          0 2016-06-12 18:59 /user/hive/warehouse/ratings/year=1970

[cloudera@quickstart ml-20m]$ hadoop fs -ls /user/hive/warehouse/ratings/year=1970/
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2016-06-12 18:59 /user/hive/warehouse/ratings/year=1970/month=01

What is the correct way to perform this partitioned import, so that partitions for all years and all months are created?

Add three zeros to the end of each timestamp.
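The underlying issue appears to be that Kite's year and month partitioners read the source timestamp field as milliseconds since the Unix epoch, while MovieLens timestamps are in seconds; appending three zeros effectively multiplies them by 1000. A quick sanity check with GNU date, using 1112486027 as a sample value of the magnitude found in ratings.csv:

# read as seconds, 1112486027 is early April 2005
date -u -d @1112486027
# read as milliseconds it is only about 1112486 seconds, i.e. 13 January 1970,
# which is why every row ended up under year=1970/month=01
date -u -d @$((1112486027 / 1000))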

Use the shell script below to do this:

#!/bin/bash

# add the CSV header to both files
head -n 1 ratings.csv > ratings_1.csv
head -n 1 ratings.csv > ratings_2.csv

# output the first 10,000,000 data rows to ratings_1.csv
# (head includes the header line, tail -n +2 strips it again)
# and append three zeros to the timestamp column (4th field)
head -n 10000001 ratings.csv | tail -n +2 | awk 'BEGIN{FS=OFS=","} {$4=$4"000"; print}' >> ratings_1.csv

# output the rest of the file to ratings_2.csv, again appending three zeros
# to the timestamp column; this starts at the line after the ratings_1 slice stopped
tail -n +10000002 ratings.csv | awk 'BEGIN{FS=OFS=","} {$4=$4"000"; print}' >> ratings_2.csv
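Once both halves are prepared, they can be imported into the existing dataset with the same csv-import command and the partition layout re-checked (reusing the commands shown above):

# import both fixed halves into the ratings dataset
kite-dataset csv-import ratings_1.csv ratings
kite-dataset csv-import ratings_2.csv ratings

# the listing should now show year=.../month=... directories for the full range
hadoop fs -ls /user/hive/warehouse/ratings/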

I ran into this problem as well, and it was solved after adding the three zeros.
