CSV: loading a large amount of data into Pig


I am using this query in Pig to load data from a CSV file that contains 50000 records:

Q2 = LOAD '/home/user/q2.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE') as (Id:chararray,
PostTypeId:chararray, 
AcceptedAnswerId:chararray, 
ParentId:chararray, 
CreationDate:chararray, 
DeletionDate:chararray, 
Score:chararray, 
ViewCount:chararray, 
Body:chararray, 
OwnerUserId:chararray, 
OwnerDisplayName:chararray, 
LastEditorUserId:chararray, 
LastEditorDisplayName:chararray, 
LastEditDate:chararray, 
LastActivityDate:chararray, 
Title:chararray, 
Tags:chararray, 
AnswerCount:chararray, 
CommentCount:chararray, 
FavoriteCount:chararray, 
ClosedDate:chararray, 
CommunityOwnedDate:chararray);
The following query strips \n and some other characters from the Body field and from a few other fields:

Q2Clean = FOREACH Q2 GENERATE
Id as Id, 
PostTypeId as PostTypeId, 
AcceptedAnswerId as AcceptedAnswerId, 
(chararray)REPLACE(ParentId,'"','')  as ParentId, 
CreationDate as CreationDate, 
(chararray)REPLACE(DeletionDate,'"','') as DeletionDate, 
Score as Score, 
ViewCount as ViewCount,  
(chararray)REPLACE(REPLACE(Body,'\n',''),',','')as Body, 
OwnerUserId as OwnerUserId, 
(chararray)REPLACE(OwnerDisplayName,'"','') as OwnerDisplayName, 
LastEditorUserId as LastEditorUserId, 
(chararray)REPLACE(LastEditorDisplayName,'"','') as LastEditorDisplayName, 
LastEditDate as LastEditDate, 
LastActivityDate as LastActivityDate, 
(chararray)REPLACE(Title,',','') as Title, 
(chararray)REPLACE(Tags,',','') as Tags, 
AnswerCount as AnswerCount, 
CommentCount as CommentCount, 
FavoriteCount as FavoriteCount, 
(chararray)REPLACE(ClosedDate,'"','') as ClosedDate, 
(chararray)REPLACE(CommunityOwnedDate,'"','') as CommunityOwnedDate;
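The problem is that when I store the output, Pig reports 617538 rows written. It creates two files. The first file contains 27000 correctly formatted records, but the second file is not stored correctly: it contains about 610000 rows, many of which contain only … How can I load the data correctly so that the output shows 50000 rows instead of 617538?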
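
One way to narrow this down is to count the records right after the load, before any cleaning or storing. The sketch below assumes the loaded relation is the one referred to as Q2 in the cleaning statement; the aliases Q2All and Q2Count are made up for illustration. If the count is already far above 50000, the loader is probably splitting multi-line Body values into separate records. If it is 50000, the extra rows are more likely introduced when the output is written.

```pig
-- Sketch only: Q2 is assumed to be the relation produced by the LOAD statement.
-- Q2All and Q2Count are hypothetical aliases used just for this check.
Q2All = GROUP Q2 ALL;

-- COUNT_STAR counts every tuple, including those whose first field is null.
Q2Count = FOREACH Q2All GENERATE COUNT_STAR(Q2) AS n;

-- Prints a single tuple holding the number of records Pig parsed from the file.
DUMP Q2Count;
```

Running the same check on Q2Clean would show whether the cleaning step changes the number of records.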


I think the problem is in the following part of the script:

(chararray)REPLACE(REPLACE(Body,'\n',''),',','')as Body, 
You have to add another backslash so that "\n" is replaced:

(chararray)REPLACE(REPLACE(Body,'\\n',''),',','')as Body, 
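
If the Body values may also contain carriage returns, a character class that covers both \r and \n is one variation worth trying. The sketch below is an assumption-laden illustration, not a confirmed fix: it shows only Id and Body, uses the hypothetical alias Q2Body and the hypothetical output path '/home/user/q2_clean', and assumes piggybank.jar is already registered, as the original LOAD requires. Writing the result with CSVExcelStorage instead of the default storage means that any field still containing a comma, quote, or line break is quoted on output.

```pig
-- Sketch only: Q2 is assumed to be the relation loaded with CSVExcelStorage.
-- Only Id and Body are shown; the remaining fields would be carried over
-- exactly as in the original cleaning query.
Q2Body = FOREACH Q2 GENERATE
    Id AS Id,
    -- '[\\r\\n]+' reaches the regex engine as [\r\n]+ and matches runs of CR/LF.
    REPLACE(REPLACE(Body, '[\\r\\n]+', ' '), ',', '') AS Body;

-- '/home/user/q2_clean' is a hypothetical output directory.
STORE Q2Body INTO '/home/user/q2_clean'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE');
```

Replacing line breaks with a space rather than an empty string keeps words on adjacent lines from running together; use '' instead if that is not wanted.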

I tried replacing \n using another backslash, but it still shows the same number of records.

@user6118910 Can you post some sample data?