Hadoop: how to load files with a different delimiter each time in Pig Latin

The data coming from the input source has different delimiters, such as , or ;. Sometimes it may be , and sometimes ;. But the PigStorage function accepts only a single argument as the delimiter. How can I load such data [with delimiter , or ;]?

Could you check whether this works for you?

  • It will handle all input files with different delimiters
  • It will also handle the same file containing different delimiters
  • You can add as many delimiters as you want inside the character class
    [,:-]

    Example:

    input1.txt
    1,2,3,4
    
    input2.txt
    a-b-c-d
    
    input3.txt
    100:200:300:400
    
    input4.txt
    100,aaa-200:b
    
    PigScript:
    A = LOAD 'input*' AS line;
    B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,:-](.*)[,:-](.*)[,:-](.*)'))  AS (f1,f2,f3,f4);
    DUMP B;
    
    Output:
    (1,2,3,4)
    (a,b,c,d)
    (100,200,300,400)
    (100,aaa,200,b)
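    
    For reference, since one of the comments below asks: the character class [,:-] matches any single occurrence of , or : or -, and it is repeated three times because four fields are separated by three delimiters. A minimal sketch of the same idea extended to five fields (input5.txt and its field names are made up for illustration):
    
    A = LOAD 'input5.txt' AS line;
    B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,:-](.*)[,:-](.*)[,:-](.*)[,:-](.*)')) AS (f1,f2,f3,f4,f5);
    DUMP B;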
    
    Comments:
    
    How do you know what the delimiter is? Can two rows in one file have different delimiters, e.g. , and ; as delimiters?
    
    No, two rows in one file have the same delimiter. All records/rows within a file will have the same delimiter. You can pass the delimiter as a parameter to the pig script and invoke it with the specific delimiter.
    
    If it is , or ;, how do I pass the delimiter? PigStorage only accepts a single delimiter. How do you tell whether it is , or ;? Separate folders? File names?
    
    Can you explain how the regex works? Why is [,:-] repeated three times?
    
    Thanks, but my question is about how to load files with different delimiters dynamically. For example, :: could be : or ; or , or -.
    
    When loading with a Pig loader you have to specify the delimiter; whether it is one character or two, use the default PigStorage or a custom loader. Any one file will be separated by a single delimiter, so if you can tell the sets of files apart (for example by a name prefix), load each set with its own PigStorage and take the UNION:
    A = LOAD '/some/path/COMMA-DELIM-PREFIX*' USING PigStorage(',') AS (f1:chararray, ...);
    B = LOAD '/some/path/SEMICOLON-DELIM-PREFIX*' USING PigStorage(';') AS (f1:chararray, ...);
    
    C = UNION A,B;
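    
    The comments above also suggest passing the delimiter as a parameter when it is known at invocation time. A minimal sketch of that, assuming the script is run as pig -param DELIM=';' script.pig (the script name and input path are illustrative):
    
    -- $DELIM is substituted textually before the script is parsed
    A = LOAD '/some/path/input' USING PigStorage('$DELIM') AS (f1:chararray, f2:chararray);
    DUMP A;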
    
    You need to write your own custom loader to handle such delimiters.
    
    Steps for writing a custom loader:
    
    As of 0.7.0, Pig loaders extend the LoadFunc abstract class. This means they need to override 4 methods:
    
        getInputFormat() returns to the caller an instance of the InputFormat that the loader supports. The actual load process needs an instance to use at load time, and doesn't want to place any constraints on how that instance is created.
        prepareToRead() is called prior to reading a split. It passes in the reader used during the reads of the split, as well as the actual split. The loader implementation usually keeps the reader, and may access the actual split if needed.
        setLocation() is called by Pig to communicate the load location to the loader, which is responsible for passing that information to the underlying InputFormat object. This method can be called multiple times, so there should be no state associated with it (unless that state gets reset when the method is called).
        getNext() is called by Pig to get the next tuple from the loader once all setup has been done. If this method returns NULL, Pig assumes that all information in the split passed via the prepareToRead() method has been processed.
    
    
    Please find the code below:
    
    
    package Pig;
    
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.PigException;
    import org.apache.pig.backend.executionengine.ExecException;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;
    
    public class CustomLoader extends LoadFunc {
    
        // upper bound on the number of fields accepted per record
        private static final int DEFAULT_LIMIT = 226;
    
        private String delim = ",";
        private int limit = DEFAULT_LIMIT;
        private RecordReader reader;
        private TupleFactory tupleFactory = TupleFactory.getInstance();
    
        public CustomLoader(String delimiter) {
            this.delim = delimiter;
        }
    
        @Override
        public InputFormat getInputFormat() throws IOException {
            return new TextInputFormat();
        }
    
        @Override
        public Tuple getNext() throws IOException {
            Tuple tuple = null;
            List<Object> values = new ArrayList<Object>();
            try {
                if (!reader.nextKeyValue()) {
                    return null; // no more records in this split
                }
                Text value = (Text) reader.getCurrentValue();
                if (value != null) {
                    // note: String.split() treats the delimiter as a regular expression
                    String[] parts = value.toString().split(delim);
                    for (int index = 0; index < parts.length; index++) {
                        if (index > limit) {
                            throw new IOException("index " + index
                                    + " is out of bounds: max index = " + limit);
                        }
                        values.add(parts[index]);
                    }
                    tuple = tupleFactory.newTuple(values);
                }
            } catch (InterruptedException e) {
                // add more information to the runtime exception condition
                int errCode = 6018;
                String errMsg = "Error while reading input";
                throw new ExecException(errMsg, errCode,
                        PigException.REMOTE_ENVIRONMENT, e);
            }
            return tuple;
        }
    
        @Override
        public void prepareToRead(RecordReader reader, PigSplit pigSplit)
                throws IOException {
            this.reader = reader; // this loader does not need the PigSplit
        }
    
        @Override
        public void setLocation(String location, Job job) throws IOException {
            // the location is assumed to be comma-separated paths
            FileInputFormat.setInputPaths(job, location);
        }
    }
    
    
    
    Create a jar file containing this class, then register it and load with your delimiter:
    
    register '/home/impadmin/customloader.jar';
    
    A = load '/pig/u.data' using Pig.CustomLoader('::') as (id, mov_id, rat, timestamp);
    
    Data set (/pig/u.data):
    
    196::242::3::881250949
    186::302::3::891717742
    22::377::1::878887116
    244::51::2::880606923
    166::346::1::886397596
    298::474::4::884182806
    115::265::2::881171488
    253::465::5::891628467
    305::451::::886324817
    6::86::3::883603013
    
    
    Now you can specify any delimiter you want.
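    
    For example, a sketch reusing the same loader with other delimiters (file paths and field names are made up). One caveat: the loader calls String.split(), which treats the delimiter as a regular expression, so a regex metacharacter such as | must be escaped, e.g. Pig.CustomLoader('\\|'):
    
    register '/home/impadmin/customloader.jar';
    
    B = load '/pig/comma_file' using Pig.CustomLoader(',') as (f1, f2, f3, f4);
    C = load '/pig/colon_file' using Pig.CustomLoader(':') as (f1, f2, f3, f4);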