Apache Pig SUBSTRING not working in a JOIN operation


I have a column col1 in file 1:

00SPY58KHT5
00SPXB2BD0J
00SPXB2DXH6
00SPXDQ02S1
00SPXDY91JI
00SPXFG88L6
00SPXF1AQ4Z
00SPXF5UKS3
00SPXGL9IV6
And a column col2 in file 2:

0SPY58KHT5
0SPXB2BD0J
0SPXB2DXH6
0SPXDQ02S1
0SPXDY91JI
0SPXFG88L6
0SPXF1AQ4Z
0SPXF5UKS3
0SPXGL9IV6
As you can see, the values in the first file have an extra leading 0.

I need to join the two files on these columns, so I need a substring, like this:

 JOIN_FILE1_FILE2 = JOIN FILE1 BY TRIM(SUBSTRING(col1,1,10)), FILE2 BY TRIM(col2);

DUMP JOIN_FILE1_FILE2;
But the result I get is empty:
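One thing worth double-checking here is the indexing of Pig's built-in SUBSTRING, which follows Java's `String.substring`: the start index is zero-based and the stop index is exclusive. On an 11-character key, `SUBSTRING(col1,1,10)` therefore yields only 9 characters and can never equal the 10-character col2, which would explain an empty join. A minimal Python sketch of the equivalent slicing, using sample IDs from the question:

```python
# Pig's SUBSTRING(s, start, stop) behaves like Python's s[start:stop]:
# zero-based start, end-exclusive stop.
col1 = "00SPY58KHT5"   # 11 characters, from file 1
col2 = "0SPY58KHT5"    # 10 characters, from file 2

key = col1[1:10]           # what SUBSTRING(col1,1,10) produces
print(key)                 # "0SPY58KHT" -- only 9 characters
print(key == col2)         # False: the join keys never match
print(col1[1:11] == col2)  # True: SUBSTRING(col1,1,11) aligns the keys
```

If the IDs really are 11 characters wide, raising the stop index to 11 (or trimming the leading zero another way) should make the keys comparable.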

Input(s):
Successfully read 914493 records from: "/hdfs/data/adhoc/PR/02/RDO0/GUIDES/GUIDE_CONTRAT_USINE.csv"
Successfully read 102851809 records from: "/hdfs/data/adhoc/PR/02/RDO0/BB0/MGM7X007-2019-09-11.csv"

Output(s):
Successfully stored 0 records in: "hdfs://ha-manny/hdfs/hadoop/pig/tmp/temp964914764/tmp1220183619"

How can I make this join work?

As a workaround, I first GENERATE the data, applying the SUBSTRING function to col1. Then I filter using TRIM, and finally restore the key with CONCAT('0', col1) in another GENERATE.

In other words:

DATA1 = FOREACH DATA_SOURCE GENERATE
        SUBSTRING(col1,1,10) AS col1;

JOINED_DATA = JOIN DATA1 BY col1, ...

FINAL_DATA = FOREACH JOINED_DATA GENERATE
           CONCAT('0',col1) AS col1,
...
This works without any problem.
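The three-step workaround above (normalize col1, join, then restore the leading zero) can be sketched end-to-end. A minimal Python analogue with hypothetical sample data, using the end-exclusive stop index 11 so the full 10-character key survives (an assumption; adjust to your actual key width):

```python
# Python analogue of the Pig workaround: normalize col1 first,
# join on the normalized key, then restore the leading '0'.
file1 = ["00SPY58KHT5", "00SPXB2BD0J"]          # col1 values from file 1
file2 = {"0SPY58KHT5": "a", "0SPXB2BD0J": "b"}  # col2 -> payload from file 2

# Step 1: GENERATE SUBSTRING(col1, 1, 11) -- drop the extra leading zero
data1 = [c[1:11].strip() for c in file1]

# Step 2: JOIN DATA1 BY col1, FILE2 BY col2 (inner join semantics)
joined = [(k, file2[k]) for k in data1 if k in file2]

# Step 3: GENERATE CONCAT('0', col1) -- restore the original key format
final = [("0" + k, v) for k, v in joined]
print(final)  # [('00SPY58KHT5', 'a'), ('00SPXB2BD0J', 'b')]
```

The key design point is the same as in the Pig script: do the normalization in a separate FOREACH ... GENERATE pass rather than inside the JOIN expression, and undo it afterwards so downstream consumers still see the original key format.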