How is data split into part files in Sqoop?

Tags: hadoop, sqoop, hadoop-partitioning


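I am unsure how Sqoop splits data into part files when the data is skewed. Could someone please clarify?

Suppose this is my departments table, where department_id is the primary key: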

mysql> select * from departments;
2 Fitness
3 Footwear
4 Apparel
5 Golf
6 Outdoors
7 Fan Shop

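If I run sqoop import with -m 1 in the import command, I know that only one part file is generated, containing all the records.

Now I ran the command without specifying the number of mappers. By default it should use 4 mappers, and it did create 4 part files in HDFS. This is how the records were distributed across the part files: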

[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00000
2,Fitness
3,Footwear
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00001
4,Apparel
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00002
5,Golf
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00003
6,Outdoors
7,Fan Shop

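According to the BoundingValsQuery, Min(department_id)=2 and Max(department_id)=8 are used, with 4 mappers by default.

By that calculation, each mapper should get (8-2)/4 = 1.5 records.

What I don't understand is how the data is actually distributed. Why did two records end up in part-m-00000, only one record each in part-m-00001 and part-m-00002, and two records in part-m-00003?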

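If you get a chance, take a look at the library source. A series of steps is involved.

A Sqoop job reads records through DBRecordReader.

Two methods do the work here.

Method 1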

protected ResultSet executeQuery(String query) throws SQLException {
  Integer fetchSize = dbConf.getFetchSize();
  /* The fetch size applies to the query for this split. The splits come from
     the getSplits() method of org.apache.sqoop.mapreduce.db.DBInputFormat,
     which divides the row count of the table by the number of mappers. */
  // ... (rest of the method omitted in this excerpt)
}

Split calculation:

// org.apache.sqoop.mapreduce.db.DBInputFormat
public List<InputSplit> getSplits(JobContext job) throws IOException {
  // ... splits are calculated according to the row count of the source table
  // ...
  query.append("SELECT COUNT(*) FROM " + tableName);
  // ...
}

Method 2

protected String getSelectQuery() {
  StringBuilder query = new StringBuilder();

  if (dbConf.getInputQuery() == null) {
    // No prebuilt query: assemble SELECT <fields> FROM <table> ...
    query.append("SELECT ");

    for (int i = 0; i < fieldNames.length; i++) {
      query.append(fieldNames[i]);
      if (i != fieldNames.length - 1) {
        query.append(", ");
      }
    }

    query.append(" FROM ").append(tableName);
    query.append(" AS ").append(tableName);
    if (conditions != null && conditions.length() > 0) {
      query.append(" WHERE (").append(conditions).append(")");
    }

    String orderBy = dbConf.getInputOrderBy();
    if (orderBy != null && orderBy.length() > 0) {
      query.append(" ORDER BY ").append(orderBy);
    }
  } else {
    // Prebuilt query supplied by the caller.
    query.append(dbConf.getInputQuery());
  }

  try { // main logic to decide division of records between mappers.
    query.append(" LIMIT ").append(split.getLength());
    query.append(" OFFSET ").append(split.getStart());
  } catch (IOException ex) {
    // Ignore, will not throw.
  }

  return query.toString();
}

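Look at the part of the code under the comment "main logic to decide division of records between mappers". Here the records are divided using LIMIT and OFFSET. This logic is implemented differently for each RDBMS. For example, org.apache.sqoop.mapreduce.db.OracleDBRecordReader has a slightly different implementation of the getSelectQuery() method.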
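To make the LIMIT/OFFSET idea concrete, here is a minimal standalone sketch. It is not the library code: the class name LimitOffsetSplitSketch, the helper names, and the choice to give any leftover rows to the last mapper are all assumptions made for illustration. It only shows how a row count divided by the number of mappers can be turned into one LIMIT/OFFSET query per mapper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Sqoop's implementation.
public class LimitOffsetSplitSketch {

    /** Returns one {offset, limit} pair per mapper for a table of rowCount rows. */
    public static List<long[]> computeSplits(long rowCount, int numMappers) {
        List<long[]> splits = new ArrayList<>();
        long chunkSize = rowCount / numMappers; // integer division
        for (int i = 0; i < numMappers; i++) {
            long start = i * chunkSize;
            // Assumption: the last mapper also takes the leftover rows.
            long end = (i + 1 == numMappers) ? rowCount : start + chunkSize;
            splits.add(new long[] {start, end - start});
        }
        return splits;
    }

    /** Builds the per-mapper query from an {offset, limit} pair. */
    public static String toQuery(String table, long[] split) {
        return "SELECT * FROM " + table
            + " LIMIT " + split[1] + " OFFSET " + split[0];
    }

    public static void main(String[] args) {
        // Hypothetical table of 10 rows read by 4 mappers.
        for (long[] split : computeSplits(10, 4)) {
            System.out.println(toQuery("departments", split));
        }
    }
}
```

With 10 rows and 4 mappers the chunk size is 2, so the first three mappers read 2 rows each and the last one reads the remaining 4. Without an ORDER BY, a database does not guarantee which rows fall into each LIMIT/OFFSET window, so treat this purely as a picture of the arithmetic.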

Hope this gives you a quick idea of how records are divided among the mappers.

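Sqoop finds the minimum and maximum values in the primary key column (or the --split-by column) and then tries to divide that range among the given number of mappers.

For example, if a table has a primary key column id with a minimum value of 0 and a maximum value of 1000, and Sqoop is directed to use 4 tasks, Sqoop runs four processes, each selecting the rows where id >= lo AND id < hi, with lo and hi being the bounds of that task's share of the range.

Here the minimum is 2 and the maximum is 7, so Sqoop runs four processes with the ranges (2-4), (4-5), (5-6) and (6-7), which means:

part-m-00000: records 2 and 3 together
part-m-00001: record 4
part-m-00002: record 5
part-m-00003: records 6 and 7 fall in this range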
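The range arithmetic can be sketched in a few lines of Java. This is an illustration under stated assumptions, not Sqoop's actual splitter: the class name RangeSplitSketch is made up, and the rule for spreading the remainder (one extra unit for each of the first ranges) is an assumption chosen because it reproduces the ranges (2-4), (4-5), (5-6), (6-7) quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of min/max range splitting.
public class RangeSplitSketch {

    /** Returns the boundaries of the ranges; range k is [points[k], points[k+1]). */
    public static List<Long> splitPoints(long numSplits, long minVal, long maxVal) {
        List<Long> points = new ArrayList<>();
        long splitSize = (maxVal - minVal) / numSplits;  // integer division
        long remainder = (maxVal - minVal) % numSplits;  // units left over
        long cur = minVal;
        for (long i = 0; i <= numSplits; i++) {
            points.add(cur);
            if (cur >= maxVal) {
                break;
            }
            cur += splitSize;
            // Assumption: the first 'remainder' ranges are one unit wider.
            if (i < remainder) {
                cur += 1;
            }
        }
        return points;
    }

    public static void main(String[] args) {
        List<Long> points = splitPoints(4, 2, 7); // [2, 4, 5, 6, 7]
        for (int k = 0; k + 1 < points.size(); k++) {
            boolean last = (k + 2 == points.size());
            // The final range includes the maximum value itself.
            System.out.println("department_id >= " + points.get(k)
                + " AND department_id " + (last ? "<= " : "< ") + points.get(k + 1));
        }
    }
}
```

For min 2, max 7 and 4 mappers the range width is (7-2)/4 = 1 with a remainder of 1, so the first range is widened to cover ids 2 and 3 and the final range includes both 6 and 7. That matches the 2/1/1/2 distribution of records seen in the part files.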
