Extracting specific URLs from a text file using Java

java regex web-crawler


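I have a text document containing a bunch of URLs of the form

/courses/../../../../

From these URLs, I only want to extract the ones of the form

/courses/../lecture-notes

that is, URLs that start with

/courses

and end with

/lecture-notes

Does anyone know a good way to do this, either with a regular expression or with plain string matching?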

Here is one alternative:

Scanner s = new Scanner(new FileReader("filename.txt"));

String str;
while (null != (str = s.findWithinHorizon("/courses/\\S*/lecture-notes", 0)))
    System.out.println(str);

Given a filename.txt with the following content:

Here /courses/lorem/lecture-notes and
here /courses/ipsum/dolor/lecture-notes perhaps.

the snippet above prints:

/courses/lorem/lecture-notes
/courses/ipsum/dolor/lecture-notes
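The same search can also be sketched with Pattern and Matcher on text that is already in memory, which may be easier to test than a Scanner over a file. The class name LectureNoteUrls and the method extract are hypothetical names chosen for illustration; the regular expression is the one used in the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class LectureNoteUrls {
    // Same pattern as the Scanner snippet: "/courses/", any run of
    // non-whitespace characters, then "/lecture-notes".
    private static final Pattern URL =
            Pattern.compile("/courses/\\S*/lecture-notes");

    // Returns every substring of text that matches the pattern, in order.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "Here /courses/lorem/lecture-notes and\n"
                + "here /courses/ipsum/dolor/lecture-notes perhaps.";
        // Prints the two URLs, one per line.
        extract(text).forEach(System.out::println);
    }
}
```

Because \S* is greedy, a token such as /courses/a/lecture-notes/b/lecture-notes is returned as one match that extends to the last /lecture-notes in that token.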


Assuming there is one URL per line, you can use:

    BufferedReader br = new BufferedReader(new FileReader("urls.txt"));
    String urlLine;
    while ((urlLine = br.readLine()) != null) {
        if (urlLine.matches("/courses/.*/lecture-notes")) {
            // use url
        }
    }
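The question also asks about plain string matching. Under the same one-URL-per-line assumption, a minimal sketch without regular expressions might look like this. The class PlainStringFilter and its method names are hypothetical. The length check is there so that the prefix and suffix cannot overlap, which mirrors what the regular expression above requires.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class illustrating the non-regex approach.
class PlainStringFilter {
    private static final String PREFIX = "/courses/";
    private static final String SUFFIX = "/lecture-notes";

    // True when the line starts with PREFIX, ends with SUFFIX, and is long
    // enough that the two do not share a character. Without the length check,
    // "/courses/lecture-notes" would pass, although the regex
    // "/courses/.*/lecture-notes" rejects it.
    static boolean isLectureNotesUrl(String line) {
        return line.length() >= PREFIX.length() + SUFFIX.length()
                && line.startsWith(PREFIX)
                && line.endsWith(SUFFIX);
    }

    // Keeps only the lines that pass the check, preserving their order.
    static List<String> filter(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (isLectureNotesUrl(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "/courses/lorem/lecture-notes",
                "/courses/lorem/syllabus",
                "/courses/ipsum/dolor/lecture-notes");
        filter(lines).forEach(System.out::println);
    }
}
```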


The following returns only the middle part (that is, it excludes /courses/ and /lecture-notes):

Pattern p = Pattern.compile("/courses/(.*)/lecture-notes");
Matcher m = p.matcher(yourString);

if (m.find()) {
    // group(1) is the text captured by the first pair of parentheses in the regex.
    return m.group(1);
}
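To collect the middle part of every URL in a longer text, the find call can be placed in a loop. The sketch below is one possible way to do that. It swaps (.*) for (\S*) so that a match cannot run across whitespace from one URL into the next; that swap, the class name MiddleParts and the method middleParts are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class MiddleParts {
    // The parentheses capture whatever sits between the fixed prefix and suffix.
    private static final Pattern P =
            Pattern.compile("/courses/(\\S*)/lecture-notes");

    // Returns the captured middle part of each matching URL, in order.
    static List<String> middleParts(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            parts.add(m.group(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "Here /courses/lorem/lecture-notes and\n"
                + "here /courses/ipsum/dolor/lecture-notes perhaps.";
        // Prints "lorem" and then "ipsum/dolor".
        middleParts(text).forEach(System.out::println);
    }
}
```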


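Nothing in the description prevents the URLs from being processed. This check belongs in a loop. Unless you explain how to walk through the text token by token (or at least line by line), this answer is incomplete. (Also, ^ and $ are not needed when using matches.)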
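The remark about ^ and $ can be checked with a small experiment: String.matches succeeds only if the whole string matches the pattern, so explicit anchors change nothing, whereas Matcher.find looks for the pattern anywhere in the input. The class and method names below are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing matches() with find().
class MatchesVsFind {
    static final String REGEX = "/courses/.*/lecture-notes";

    // String.matches requires the entire string to match the regex.
    static boolean wholeLine(String s) {
        return s.matches(REGEX);
    }

    // Adding ^ and $ is redundant with matches(): same result as wholeLine.
    static boolean anchoredWholeLine(String s) {
        return s.matches("^" + REGEX + "$");
    }

    // Matcher.find succeeds if the pattern occurs anywhere in the string.
    static boolean anywhere(String s) {
        return Pattern.compile(REGEX).matcher(s).find();
    }

    public static void main(String[] args) {
        String bare = "/courses/a/lecture-notes";
        String embedded = "see /courses/a/lecture-notes here";
        System.out.println(wholeLine(bare));             // true
        System.out.println(anchoredWholeLine(bare));     // true
        System.out.println(wholeLine(embedded));         // false
        System.out.println(anchoredWholeLine(embedded)); // false
        System.out.println(anywhere(embedded));          // true
    }
}
```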