Java: Specifying additional jars in an AWS EMR custom jar application

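I am trying to run a Hadoop job on an EMR cluster. It is run as a Java command, for which I use a jar-with-dependencies. The job pulls data from Teradata, and I assumed that the Teradata-related jars are also packed into the jar-with-dependencies. However, I still get an exception: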

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver
at org.apache.hadoop.mapreduce.lib.db.DBInputFormat.setConf(DBInputFormat.java:171)
My pom has the following relevant dependencies:

<dependency>
  <groupId>teradata</groupId>
  <artifactId>terajdbc4</artifactId>
  <version>14.10.00.17</version>
</dependency>

<dependency>
  <groupId>teradata</groupId>
  <artifactId>tdgssconfig</artifactId>
  <version>14.10.00.17</version>
</dependency>
I am running the EMR command as follows:

aws emr create-cluster --release-label emr-5.3.1 \
--instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=5,BidPrice=0.1,InstanceType=m3.xlarge \
--service-role EMR_DefaultRole --log-uri s3://my-bucket/logs \
--applications Name=Hadoop --name TeradataPullerTest \
--ec2-attributes <ec2-attributes> \
--steps Type=CUSTOM_JAR,Name=EventsPuller,Jar=s3://path-to-jar-with-dependencies.jar,\
Args=[com.my.package.EventsPullerMR],ActionOnFailure=TERMINATE_CLUSTER \
--auto-terminate

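I have not completely solved this yet, but I found a way to make it work. The ideal solution would be to pack the Teradata jars into the uber jar. That is still happening, but somehow those jars are not added to the classpath. I am not sure why that is the case.

I worked around it by creating two separate jars: one containing just my code package, and another containing all the required dependencies. I uploaded both jars to S3 and then wrote a script that does the following (pseudocode):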

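# download the main jar
aws s3 cp <s3-path-to-myjar.jar> .

# download the dependencies jar into a temp directory
aws s3 cp <s3-path-to-dependency-jar> temp/dependencies.jar

# unzip the dependency jars into another directory (say `jars`)
unzip -j temp/dependencies.jar <path-within-jar-to-unzip>/* -d jars

# comma-separated list of jars for -libjars
LIBJARS=`find jars/*.jar | tr -s '\n' ','`

# the same list, colon-separated, for the client-side classpath
HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
CLASSPATH=$HADOOP_CLASSPATH
export CLASSPATH HADOOP_CLASSPATH

# run via the hadoop command
hadoop jar myjar.jar com.my.package.EventsPullerMR -libjars ${LIBJARS} <arguments to the job>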
With this, the job started working.
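
One detail worth noting in the script: joining the jar paths with `tr -s '\n' ','` leaves a trailing comma on `LIBJARS`, and so a trailing colon on `HADOOP_CLASSPATH`. Below is a minimal sketch that builds both lists without the trailing separator. The helper name `join_jars` and the `jars` directory are assumptions for illustration and are not part of the original script.

```shell
#!/usr/bin/env bash
# join_jars DIR SEP: print every *.jar directly under DIR, joined by SEP,
# with no trailing separator. (Hypothetical helper, not from the original script.)
join_jars() {
  local dir=$1 sep=$2 out="" jar
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue      # skip the unexpanded glob when DIR has no jars
    out="${out:+$out$sep}$jar"     # add SEP only between entries
  done
  printf '%s\n' "$out"
}

# Example (assumes the dependency jars were extracted into ./jars):
# LIBJARS=$(join_jars jars ,)
# HADOOP_CLASSPATH=$(join_jars jars :)
# export HADOOP_CLASSPATH
```

The comma-separated form is what `-libjars` expects, while the colon-separated form is for the client-side classpath, which matches how the script uses the two variables.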

Assembly descriptor used to package the jar:

<assembly>
    <id>aws-emr</id>
    <formats>
        <format>jar</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <unpack>false</unpack>
            <includes>
            </includes>
            <scope>runtime</scope>
            <outputDirectory>lib</outputDirectory>
        </dependencySet>
        <dependencySet>
            <unpack>true</unpack>
            <includes>
                <include>${groupId}:${artifactId}</include>
            </includes>
        </dependencySet>
    </dependencySets>
</assembly>

Listing the jar contents confirms that the TeraDriver class is packaged:

aws-emr$ jar tf target/aws-emr-0.0.1-SNAPSHOT-jar-with-dependencies.jar | grep TeraDriver
com/ncr/teradata/TeraDriver.class
com/teradata/jdbc/TeraDriver.class
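
The same check can be scripted for any fully qualified class name instead of grepping for a substring. This is a small sketch that assumes the listing produced by `jar tf` arrives on standard input; `has_class` is a hypothetical helper name.

```shell
# has_class FQCN: read a jar listing on stdin and succeed if it contains
# the .class entry for the fully qualified class name FQCN.
has_class() {
  local path
  path="$(printf '%s' "$1" | tr '.' '/').class"   # com.foo.Bar -> com/foo/Bar.class
  grep -qxF -- "$path"                            # whole-line, fixed-string match
}

# Example (jar path taken from the listing above):
# jar tf target/aws-emr-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
#   | has_class com.teradata.jdbc.TeraDriver && echo "found"
```

A match only shows that the class is inside the archive. As the answer describes, it can still be missing from the classpath when the job runs.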