Encoding: handling cp1252 character codes in Apache Drill
The data we query as part of a CSV file contains cp1252 character codes, and Apache Drill gives the following error:

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: MalformedInputException: Input length = 1
Fragment 0:0
[Error Id: 53bc07e3-a6e4-4301-a858-205be382275e on 172.16.243.116:31010]
(java.lang.RuntimeException) java.nio.charset.MalformedInputException: Input length = 1
    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.decodeUT8():185
    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.setBuffer():119
    org.apache.drill.exec.test.generated.FiltererGen174.doEval():50
    org.apache.drill.exec.test.generated.FiltererGen174.filterBatchNoSV():100
    org.apache.drill.exec.test.generated.FiltererGen174.filterBatch():73
    org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.doWork():81
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745
Caused By (java.nio.charset.MalformedInputException) Input length = 1
    java.nio.charset.CoderResult.throwException():281
    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.decodeUT8():183
    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.setBuffer():119
    org.apache.drill.exec.test.generated.FiltererGen174.doEval():50
    org.apache.drill.exec.test.generated.FiltererGen174.filterBatchNoSV():100
    org.apache.drill.exec.test.generated.FiltererGen174.filterBatch():73
    org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.doWork():81
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745

Is there a way to handle this kind of data in Apache Drill?

@OP I know this is an old post, but I ran into the same challenge last week with a new data source.
Working directly in Apache Drill (the MapR version), I used STRING_BINARY() to convert the cp1252 characters. It is not an elegant or efficient solution, but it works.
apache drill 1.10.0
"drill baby drill"
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> use sys;
+-------+----------------------------------+
| ok | summary |
+-------+----------------------------------+
| true | Default schema changed to [sys] |
+-------+----------------------------------+
1 row selected (0.975 seconds)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select version from version;
+----------+
| version |
+----------+
| 1.10.0 |
+----------+
1 row selected (0.409 seconds)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select * from users.`sbalas002c`.drill_spl_char;
+------------------------+--------------------------------------------------------------+
| ORIG_CAMPAIGN_LINE_ID | ORIG_CAMPAIGN_LINE_NAME |
+------------------------+--------------------------------------------------------------+
| 30092278 | 1573256-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_SSEA |
| 30092282 | 1573257-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_WORD |
| 30092286 | 1573254-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_BLIS |
| 30092290 | 1573255-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_NSEA |
+------------------------+--------------------------------------------------------------+
4 rows selected (0.445 seconds)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select ORIG_CAMPAIGN_LINE_NAME,
. . . . . . . . . . . . . . . . . . . . . . .> substr(ORIG_CAMPAIGN_LINE_NAME,1,4) sub_CAMPAIGN_LINE_NAME
. . . . . . . . . . . . . . . . . . . . . . .> from users.`sbalas002c`.drill_spl_char;
Error: SYSTEM ERROR: DrillRuntimeException: Unexpected byte 0xa0 at position 36 encountered while decoding UTF8 string.
Fragment 0:0
[Error Id: 1889163a-f847-48ad-a7a9-bbe4284e112c on titand-ch2-p20.cable.comcast.com:31010] (state=,code=0)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select ORIG_CAMPAIGN_LINE_NAME,
. . . . . . . . . . . . . . . . . . . . . . .> STRING_BINARY(ORIG_CAMPAIGN_LINE_NAME) SB_CAMPAIGN_LINE_NAME,
. . . . . . . . . . . . . . . . . . . . . . .> regexp_replace(STRING_BINARY(ORIG_CAMPAIGN_LINE_NAME),'\\xA0','') Good_CAMPAIGN_LINE_NAME
. . . . . . . . . . . . . . . . . . . . . . .> from users.`sbalas002c`.drill_spl_char;
+--------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------+
| ORIG_CAMPAIGN_LINE_NAME | SB_CAMPAIGN_LINE_NAME | Good_CAMPAIGN_LINE_NAME |
+--------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------+
| 1573256-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_SSEA | 1573256-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_SSEA | 1573256-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_SSEA |
| 1573257-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_WORD | 1573257-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_WORD | 1573257-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_WORD |
| 1573254-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_BLIS | 1573254-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_BLIS | 1573254-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_BLIS |
| 1573255-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_NSEA | 1573255-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_NSEA | 1573255-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_NSEA |
+--------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------+
4 rows selected (0.64 seconds)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
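What the session above does at the byte level can be sketched outside Drill. The Python below is only an illustration: `string_binary_like` is a hypothetical helper that roughly approximates how STRING_BINARY() rendered the value in the output above (printable ASCII kept, other bytes shown as `\xNN`), not Drill's actual implementation, and the sample bytes are reconstructed from the first row of the table on the assumption that the odd character is the single byte 0xA0 reported in the error.

```python
import re

# First row of ORIG_CAMPAIGN_LINE_NAME, assuming the odd character is the
# single cp1252 byte 0xA0 (non-breaking space) named in the Drill error.
raw = b"1573256-1_306774_SeattleTheatreGroup\xa0_201901_ISV_SEA_Z_SSEA"


def utf8_error_position(data: bytes):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def string_binary_like(data: bytes) -> str:
    """Hypothetical approximation of STRING_BINARY(): keep printable ASCII
    (except the backslash itself) and show every other byte as \\xNN."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F and b != 0x5C else f"\\x{b:02X}"
        for b in data
    )


def strip_escaped_byte(text: str, hex_code: str) -> str:
    """Remove the literal escape text \\x<hex_code>, like the regexp_replace call."""
    return re.sub(r"\\x" + re.escape(hex_code), "", text)


print(utf8_error_position(raw))          # offset of the offending byte
escaped = string_binary_like(raw)
print(escaped)                           # value with the byte shown as \xA0
print(strip_escaped_byte(escaped, "A0")) # cleaned value
```

Note that this workaround deletes the non-breaking space instead of converting it, which is fine for the campaign names here but would lose information if the cp1252 characters carried meaning.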
Hope this helps others.
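If the CSV files can be rewritten before Drill reads them, another option is to re-encode them from cp1252 to UTF-8 so that string functions such as substr() never see an invalid byte. The following is a minimal sketch under that assumption, not something taken from the session above; the function name and file names are hypothetical, and it assumes the file really is cp1252 (Python's cp1252 codec raises an error on the few byte values that cp1252 leaves undefined).

```python
import tempfile
from pathlib import Path


def reencode_cp1252_to_utf8(src: Path, dst: Path) -> None:
    """Copy src to dst, decoding it as cp1252 and writing it back as UTF-8."""
    # newline="" leaves the CSV line endings exactly as they were.
    with src.open("r", encoding="cp1252", newline="") as fin, \
         dst.open("w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)


# Demo on a throwaway file; the file names and sample row are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "campaign_cp1252.csv"
    dst = Path(tmp) / "campaign_utf8.csv"
    src.write_bytes(b"30092278,SeattleTheatreGroup\xa0_201901\r\n")
    reencode_cp1252_to_utf8(src, dst)
    converted = dst.read_bytes()

# The lone 0xA0 byte becomes the two-byte UTF-8 sequence C2 A0,
# so a strict UTF-8 decoder now accepts the whole line.
print(converted)
```

Unlike the STRING_BINARY() approach, this keeps the non-breaking space in the data, at the cost of an extra preprocessing step for every incoming file.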