Encoding: handling cp1252 character codes in Apache Drill


The data we query as part of a CSV file contains cp1252 character codes, and Apache Drill gives the following error:

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: MalformedInputException: Input length = 1

Fragment 0:0

[Error Id: 53bc07e3-a6e4-4301-a858-205be382275e on 172.16.243.116:31010]

  (java.lang.RuntimeException) java.nio.charset.MalformedInputException: Input length = 1
    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.decodeUT8():185
    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.setBuffer():119
    org.apache.drill.exec.test.generated.FiltererGen174.doEval():50
    org.apache.drill.exec.test.generated.FiltererGen174.filterBatchNoSV():100
    org.apache.drill.exec.test.generated.FiltererGen174.filterBatch():73
    org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.doWork():81
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745
  Caused By (java.nio.charset.MalformedInputException) Input length = 1
    java.nio.charset.CoderResult.throwException():281
    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.decodeUT8():183
    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.setBuffer():119
    org.apache.drill.exec.test.generated.FiltererGen174.doEval():50
    org.apache.drill.exec.test.generated.FiltererGen174.filterBatchNoSV():100
    org.apache.drill.exec.test.generated.FiltererGen174.filterBatch():73
    org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.doWork():81
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745

Is there any way to handle this kind of data in Apache Drill?

@OP I know this is an old post, but I ran into the same challenge last week with a new data source.

Directly in Apache Drill (MapR distribution), I used STRING_BINARY() to convert the cp1252 characters. It is not an elegant or efficient solution, but it works.
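To illustrate why the error occurs and what the workaround does at the byte level, here is a minimal Python sketch (not Drill code). It takes one value from this answer's sample data, shows that the cp1252 byte 0xA0 (a no-break space) is not valid as a standalone UTF-8 byte, then escapes non-printable bytes as \xNN and strips the escape. The helper `escape_non_printable` is a hypothetical stand-in modeled only on the STRING_BINARY() output shown in this answer; it is an assumption, not Drill's actual implementation.

```python
import re

# One sample value as raw bytes; 0xA0 is a no-break space in cp1252.
raw = b"1573256-1_306774_SeattleTheatreGroup\xa0_201901_ISV_SEA_Z_SSEA"


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def escape_non_printable(data: bytes) -> str:
    # Hypothetical stand-in for STRING_BINARY(): keep printable ASCII,
    # write every other byte as a literal backslash-x-hex escape.
    return "".join(
        chr(b) if 0x20 <= b <= 0x7E else f"\\x{b:02X}" for b in data
    )


# The lone 0xA0 byte is what a UTF-8 decoder rejects.
bad_position = raw.index(b"\xa0")

# Escape first, then remove the escaped byte with a regex,
# mirroring regexp_replace(STRING_BINARY(col), '\\xA0', '').
escaped = escape_non_printable(raw)
cleaned = re.sub(r"\\xA0", "", escaped)
```

Note that stripping the escape discards the character. If the no-break space should be kept, replacing it with an ordinary space is an alternative.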


apache drill 1.10.0
"drill baby drill"
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> use sys;
+-------+----------------------------------+
|  ok   |             summary              |
+-------+----------------------------------+
| true  | Default schema changed to [sys]  |
+-------+----------------------------------+
1 row selected (0.975 seconds)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select version from version;
+----------+
| version  |
+----------+
| 1.10.0   |
+----------+
1 row selected (0.409 seconds)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select * from users.`sbalas002c`.drill_spl_char;
+------------------------+--------------------------------------------------------------+
| ORIG_CAMPAIGN_LINE_ID  |                   ORIG_CAMPAIGN_LINE_NAME                    |
+------------------------+--------------------------------------------------------------+
| 30092278               | 1573256-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_SSEA  |
| 30092282               | 1573257-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_WORD  |
| 30092286               | 1573254-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_BLIS  |
| 30092290               | 1573255-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_NSEA  |
+------------------------+--------------------------------------------------------------+
4 rows selected (0.445 seconds)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select ORIG_CAMPAIGN_LINE_NAME,
. . . . . . . . . . . . . . . . . . . . . . .> substr(ORIG_CAMPAIGN_LINE_NAME,1,4) sub_CAMPAIGN_LINE_NAME
. . . . . . . . . . . . . . . . . . . . . . .> from users.`sbalas002c`.drill_spl_char;
Error: SYSTEM ERROR: DrillRuntimeException: Unexpected byte 0xa0 at position 36 encountered while decoding UTF8 string.

Fragment 0:0

[Error Id: 1889163a-f847-48ad-a7a9-bbe4284e112c on titand-ch2-p20.cable.comcast.com:31010] (state=,code=0)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select ORIG_CAMPAIGN_LINE_NAME,
. . . . . . . . . . . . . . . . . . . . . . .> STRING_BINARY(ORIG_CAMPAIGN_LINE_NAME) SB_CAMPAIGN_LINE_NAME,
. . . . . . . . . . . . . . . . . . . . . . .> regexp_replace(STRING_BINARY(ORIG_CAMPAIGN_LINE_NAME),'\\xA0','') Good_CAMPAIGN_LINE_NAME
. . . . . . . . . . . . . . . . . . . . . . .> from users.`sbalas002c`.drill_spl_char;
+--------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------+
|                   ORIG_CAMPAIGN_LINE_NAME                    |                      SB_CAMPAIGN_LINE_NAME                      |                   Good_CAMPAIGN_LINE_NAME                   |
+--------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------+
| 1573256-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_SSEA  | 1573256-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_SSEA  | 1573256-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_SSEA  |
| 1573257-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_WORD  | 1573257-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_WORD  | 1573257-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_WORD  |
| 1573254-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_BLIS  | 1573254-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_BLIS  | 1573254-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_BLIS  |
| 1573255-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_NSEA  | 1573255-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_NSEA  | 1573255-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_NSEA  |
+--------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------+
4 rows selected (0.64 seconds)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>

Hope this helps others.
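If the source files can be rewritten before Drill reads them, an alternative to cleaning values at query time is to transcode the CSV from cp1252 to UTF-8 once. The following is a minimal Python sketch under that assumption; the function name `transcode_to_utf8` and the file paths passed to it are hypothetical, and the sketch assumes the whole file really is cp1252.

```python
from pathlib import Path


def transcode_to_utf8(src: Path, dst: Path, source_encoding: str = "cp1252") -> None:
    """Rewrite src as UTF-8 at dst, decoding the input as source_encoding."""
    # newline="" passes line endings through unchanged, which matters for CSV.
    with src.open("r", encoding=source_encoding, newline="") as fin, \
            dst.open("w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
```

After transcoding, byte 0xA0 becomes the two-byte UTF-8 sequence 0xC2 0xA0, so string functions such as substr() should no longer hit an invalid byte. Python's cp1252 codec raises an error on the few byte values that cp1252 leaves undefined, which surfaces files that are not actually cp1252.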