Nutch 预处理爬网文本时不要删除额外的行

Nutch 预处理爬网文本时不要删除额外的行,nutch,Nutch,在使用nutch进行爬网时,它会从爬网的文本中删除所有多余的行。我想保留文本和网站上出现的任何新行。例如:对此页面进行爬网时,预期输出为: \n\n\n\nSan Francisco, CA Dentist\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWould you like to switch to the accessible version of this site?\nGo to accessible site\n\nClose modal wi

在使用nutch进行爬网时,它会从爬网的文本中删除所有多余的行。我想保留文本和网站上出现的任何新行。例如:对此页面进行爬网时,预期输出为:

\n\n\n\nSan Francisco, CA Dentist\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWould you like to switch to the accessible version of this site?\nGo to accessible site\n\nClose modal window\n\n\n\n\n\nDon\'t need the accessible version of this site?\nHide the accessibility button\n\nClose modal window\n\n\n\n\n\n\nAccessibility View\n\n\nClose toolbar\n\n\n\n\nJavascript must be enabled for the correct page display\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModern Family Dental Hao Tran, DMD\nDentist located in Laurel Heights, San Francisco, CA\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n(415) 752-5244\n\n\n \n\n\n\n\n\n\n\n\n\nMenu\n\n\n\n\nHome\n\n\nServices\n \nLatest Equipment\n\n\nInsurance\n\n\nTeeth Whitening\n\n\nCrowns & Bridges\n\n\nSmile Makeovers\n\n\nResin Composite Bonding\n\n\nVeneers\n\n\nImplant Retained Dentures\n\n\nNight Guards\n\n\nMetal-Free Restoration\n\n\nInvisalign\n\n\nDental Examination
但nutch的输出是:

San Francisco, CA Dentist\nWould you like to switch to the accessible version of this site?\nGo to accessible site\nClose modal window\nDon\'t need the accessible version of this site?\nHide the accessibility button\n\nClose modal window\nAccessibility View\n\n\nClose toolbar\n\n\n\n\nJavascript must be enabled for the correct page display\nModern Family Dental Hao Tran, DMD\nDentist located in Laurel Heights, San Francisco, CA\n(415) 752-5244\nMenu\nHome\nServices\nLatest Equipment\nInsurance\nTeeth Whitening\nCrowns & Bridges\nSmile Makeovers\n\n\nResin Composite Bonding\nVeneers\nImplant Retained Dentures\nNight Guards\nMetal-Free Restoration\nInvisalign\nDental Examination
我需要在配置文件中进行哪些配置更改


p、 美国。如果我的问题很傻,请原谅,因为我是nutch的新手。

您可以继续从存储rawHtml的Content文件夹中阅读,而不是阅读form parseText文件夹。