How to get the entire content of a page from multiple pages by keyword matching in Hadoop

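I am trying out a small example: if a keyword matches on one particular page among multiple pages, I need to get the entire content of that page. The pages look like this: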

98339-93-05-1,PROD,2,288.000,40.800,34.500,"Slate_Pro_Light",9.0,8,"981-2535"
98339-93-05-1,PROD,2,324.240,40.800,7.485,"Slate_Pro_Light",9.0,2,"or"
98339-93-05-1,PROD,2,333.360,40.800,19.473,"Slate_Pro_Light",9.0,5,"email"
98339-93-05-1,PROD,2,288.000,31.440,104.442,"Slate_Pro_Light",9.0,24,"jmcgaha@farmersagent.com"
98339-93-05-1,PROD,2,63.120,14.160,22.312,"Slate_Pro_Condensed",8.0,7,"56-6177"
98339-93-05-1,PROD,2,91.920,14.160,7.880,"Slate_Pro_Condensed",8.0,3,"1st"
98339-93-05-1,PROD,3,011.280,14.160,19.160,"Slate_Pro_Bk_Condensed",8.0,7,"Edition"
98339-93-05-1,PROD,3,127.920,14.160,12.232,"Slate_Pro_Bk_Condensed",8.0,4,"4-14"
98339-93-05-1,PROD,3,45.120,704.160,66.239,"Slate_Pro_Medium",13.5,11,"Declaration"
98339-93-05-1,PROD,3,113.760,704.160,28.350,"Slate_Pro_Medium",13.5,4,"Page"
98339-93-05-1,PROD,3,144.480,704.160,61.890,"Slate_Pro_Light",13.5,11,"(continued)"
98339-93-05-1,PROD,3,45.120,661.200,60.491,"Slate_Pro_MediumIta",13.5,9,"Mortgagee"
98339-93-05-1,PROD,3,107.760,661.200,6.142,"Slate_Pro_MediumIta",13.5,1,"/"
98339-93-05-1,PROD,3,115.920,661.200,31.138,"Slate_Pro_MediumIta",13.5,5,"Other"
98339-93-05-1,PROD,3,149.280,661.200,42.081,"Slate_Pro_MediumIta",13.5,8,"Interest"
98339-93-05-1,PROD,3,45.120,645.600,11.720,"Slate_ProIta",10.0,3,"1st"
98339-93-05-1,PROD,3,58.560,645.600,43.320,"Slate_ProIta",10.0,9,"Mortgagee"
98339-93-05-1,PROD,3,244.080,645.600,19.150,"Slate_ProIta",10.0,4,"Loan"
98339-93-05-1,PROD,3,264.960,645.600,32.100,"Slate_ProIta",10.0,6,"Number"
98339-93-05-1,PROD,3,45.120,631.680,26.040,"Slate_Pro_Light",10.0,6,"Bryant"
98339-93-05-1,PROD,3,72.960,631.680,19.910,"Slate_Pro_Light",10.0,4,"Bank"
98339-93-05-1,PROD,3,45.120,619.680,12.230,"Slate_Pro_Light",10.0,2,"PO"
98339-93-05-1,PROD,3,59.040,619.680,14.710,"Slate_Pro_Light",10.0,3,"Box"
98339-93-05-1,PROD,3,75.360,619.680,10.040,"Slate_Pro_Light",10.0,2,"46"
98339-93-05-1,PROD,3,45.120,607.680,42.100,"Slate_Pro_Light",10.0,11,"Huntsville,"
98339-93-05-1,PROD,3,89.040,607.680,9.770,"Slate_Pro_Light",10.0,2,"AL"

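So if a column matches the keyword Slate_Pro_Bk_Condensed, I need to get the whole page's data. In the example above, the keyword matches on page 3, so I now need to get all the data on page 3.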

So please help me solve this problem with a MapReduce program.

Thanks in advance.

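A possible solution is to split the pages into separate files and process them in MR using FileInputFormat.
Then use a Java regular expression to check whether a given page contains something like "Slate_Pro_Bk_Condensed".
You can iterate over the lines of each page to improve performance slightly: once the string is found, you can skip to the next page.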
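
The per-page check can be sketched in plain Java, outside Hadoop, so that it runs standalone. This is only an illustration under stated assumptions: the third comma-separated field is taken to be the page number (as the sample rows suggest), and the class and method names are hypothetical. Instead of splitting pages into files, the sketch groups rows by page in memory; in a real job that would roughly correspond to a mapper emitting the page number as the key and a reducer doing the scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the Hadoop API.
final class PageKeywordFilter {

    // Assumption: field 3 (index 2) of each row is the page number.
    // Rows with fewer than three fields would throw here.
    static String pageOf(String row) {
        return row.split(",", 4)[2].trim();
    }

    // Group rows by page number, keeping the input order of pages and rows.
    static Map<String, List<String>> groupByPage(List<String> rows) {
        Map<String, List<String>> pages = new LinkedHashMap<>();
        for (String row : rows) {
            pages.computeIfAbsent(pageOf(row), k -> new ArrayList<>()).add(row);
        }
        return pages;
    }

    // Scan the rows of one page and stop at the first hit,
    // so the rest of the page is not searched.
    static boolean pageContains(List<String> pageRows, Pattern keyword) {
        for (String row : pageRows) {
            if (keyword.matcher(row).find()) {
                return true;
            }
        }
        return false;
    }

    // Return every row of every page on which the keyword occurs at least once.
    static List<String> rowsOfMatchingPages(List<String> rows, String keyword) {
        // Pattern.quote treats the keyword as a literal string, not a regex.
        Pattern pattern = Pattern.compile(Pattern.quote(keyword));
        List<String> result = new ArrayList<>();
        for (List<String> pageRows : groupByPage(rows).values()) {
            if (pageContains(pageRows, pattern)) {
                result.addAll(pageRows);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Three rows taken from the sample data: one on page 2, two on page 3.
        List<String> rows = Arrays.asList(
            "98339-93-05-1,PROD,2,288.000,40.800,34.500,\"Slate_Pro_Light\",9.0,8,\"981-2535\"",
            "98339-93-05-1,PROD,3,127.920,14.160,12.232,\"Slate_Pro_Bk_Condensed\",8.0,4,\"4-14\"",
            "98339-93-05-1,PROD,3,45.120,619.680,12.230,\"Slate_Pro_Light\",10.0,2,\"PO\"");
        rowsOfMatchingPages(rows, "Slate_Pro_Bk_Condensed").forEach(System.out::println);
    }
}
```

Run as is, main prints the two page-3 rows and skips the page-2 row, because only page 3 contains the keyword. Quoting the keyword keeps characters such as "." or "+" in a search term from being read as regex syntax; drop Pattern.quote if a real pattern is wanted.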