Is there a way in Java to recognize tokens in a string while also matching by the longest substring?


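I am trying to figure out how to correctly recognize the tokens in an input file and report the type of each one, with spaces and newlines acting as delimiters. The lexer should recognize four token types: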

Identifiers = ([a-z] | [A-Z])([a-z] | [A-Z] | [0-9])* 
Numbers = [0-9]+ 
Punctuation = \+ | \- | \* | / | \( | \) | := | ;
Keywords = if | then | else | endif | while | do | endwhile | skip

For example, if the file contains the line:

tcu else i34 2983 ( + +eqdQ

it should tokenize it and print:

identifier: tcu
keyword: else
identifier: i34
number: 2983
punctuation: (
punctuation: +
punctuation: +
identifier: eqdQ

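I don't know how to make the lexer pick the longest substring when two tokens of different types sit next to each other with no whitespace between them.

Here is my attempt: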

//start
public static void main(String[] args) throws IOException {

//input file//
File file = new File("input.txt");
//output file//
FileWriter writer = new FileWriter("output.txt");

//instance variables
String sortedOutput = "";
String current = "";
Scanner scan = new Scanner(file);
String delimiter = "\\s+ | \\s*| \\s |\\n|$ |\\b\\B|\\r|\\B\\b|\\t";
String[] analyze;
BufferedReader read = new BufferedReader(new FileReader(file));

//lines get read here from the .txt file
while(scan.hasNextLine()){
sortedOutput = sortedOutput.concat(scan.nextLine() + System.lineSeparator());
}
//lines are tokenized here
analyze = sortedOutput.split(delimiter);

//first line is printed here through a separate reader
current = read.readLine();
System.out.println("Current Line: " + current + System.lineSeparator());
writer.write("Current Line: " + current + System.lineSeparator() +"\n");

//string matching starts here
for(String a: analyze) 
    {
        //matches identifiers if it doesn't match with a keyword
        if(a.matches(patternAlpha))
        {
            if(a.matches(one))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(two))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(three))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(four))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(five))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(six))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(seven))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(eight))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else
            {
                System.out.println("Identifier: " + a);
                writer.write("Identifier: "+ a + System.lineSeparator());
            }
        }
        //number check
        else if(a.matches(patternNumber))
        {
            System.out.println("Number: " + a);
            writer.write("Number: "+ a + System.lineSeparator());
        }
        //punctuation check
        else if(a.matches(patternPunctuation))
        {
            System.out.println("Punctuation: " + a);
            writer.write("Punctuation: "+ a + System.lineSeparator());
        }
        //this special case here updates the current line with the next line
        else if(a.matches(nihil)) 
        {
            System.out.println();
            current = read.readLine();
            System.out.println("\nCurrent Line: " + current + System.lineSeparator());
            writer.write("\nCurrent Line: " + current + System.lineSeparator() + "\n");
        }
        //everything not listed in regex is read as an error
        else 
        {
            System.out.println("Error reading: " + a);
            writer.write("Error reading: "+ a + System.lineSeparator());
        }
    }
//everything closes here to avoid errors
scan.close();
read.close();
writer.close();
    }
}

Any suggestions would be appreciated. Thanks in advance.

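This can definitely be done without a parser, because the tokens that are fed to a parser can almost always be defined by a regular language (the Unix tools Lex and Flex have been doing this for years). I didn't want to take the time to hand-translate some Python code into Java, but I spent a few minutes adapting it to your example. I did make a few changes that I thought were appropriate. As input to a parser, you would typically want to treat the (, ) and ; characters as distinct tokens. You would also want to treat each reserved word as its own token class rather than lumping them all together as keywords (or a single KEYWORD class), as I have done.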
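The remark about giving each reserved word its own token class could be sketched roughly as follows. This is only an illustration: the TokenType enum, the RESERVED table and the classify method are hypothetical names that do not appear in the implementation in this answer, and the sketch assumes the word has already been matched in full by the identifier pattern before it is looked up.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ReservedWords {

    // Hypothetical token classes: one per reserved word, plus IDENTIFIER.
    public enum TokenType { IF, THEN, ELSE, ENDIF, WHILE, DO, ENDWHILE, SKIP, IDENTIFIER }

    // Lookup table from the reserved word's text to its own token class.
    private static final Map<String, TokenType> RESERVED = new HashMap<>();
    static {
        for (TokenType t : TokenType.values()) {
            if (t != TokenType.IDENTIFIER) {
                RESERVED.put(t.name().toLowerCase(Locale.ROOT), t);
            }
        }
    }

    // Classify a word that was already matched as [a-zA-Z][0-9a-zA-Z]*.
    public static TokenType classify(String word) {
        return RESERVED.getOrDefault(word, TokenType.IDENTIFIER);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"while", "whileX", "endif", "tcu"}) {
            System.out.println(w + " -> " + classify(w));
        }
    }
}
```

Because the lookup runs on the complete word, a name such as whileX stays an identifier, while while itself maps to its own class that a parser can switch on directly.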

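Methodology

1. Define the tokens using regular expressions with named capture groups. Make sure there is a token for whitespace and one for comments (if your language defines comments).

2. Include an error token that matches any single character (using the regex "."), to ensure that find() always returns a match until the input is exhausted. This error regex must be the last alternative; if it matches, the input contains an unrecognized token.

3. Place these in a list, making sure that the regexes for all reserved words come before the regex for identifiers.

4. Create a single regex from the list in step 3 by joining the items with the "|" operator.

5. Search for the next match. If the match found is whitespace or a comment, and those tokens have no semantic meaning for the parser, keep matching. If it is an error token, return it to the parser, but do not return consecutive error tokens. Once the input is exhausted, return an end-of-file token.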
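As a rough illustration of steps 1 to 5 with named capture groups, here is a minimal sketch. The class name NamedGroupLexer, the SPEC table and the tokenize method are assumptions made for this example. Two details differ from a literal reading of the steps and are also assumptions: the punctuation characters are folded into one class, and the keyword pattern ends with \b so that a keyword cannot match the front of a longer identifier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupLexer {

    // Steps 1-3: token name -> regex, kept in priority order
    // (keywords before identifiers, ERROR last).
    private static final Map<String, String> SPEC = new LinkedHashMap<>();
    static {
        SPEC.put("WHITESPACE", "\\s+");
        SPEC.put("KEYWORD", "(?:if|then|else|endif|while|do|endwhile|skip)\\b");
        SPEC.put("IDENTIFIER", "[a-zA-Z][0-9a-zA-Z]*");
        SPEC.put("NUMBER", "[0-9]+");
        SPEC.put("PUNCTUATION", ":=|[-+*/();]");
        SPEC.put("ERROR", "."); // any single character nothing else accepted
    }

    // Step 4: wrap each regex in a named group and join them with "|".
    private static final Pattern MASTER = Pattern.compile(buildRegex());

    private static String buildRegex() {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, String> e : SPEC.entrySet()) {
            parts.add("(?<" + e.getKey() + ">" + e.getValue() + ")");
        }
        return String.join("|", parts);
    }

    // Step 5: repeatedly find the next match, skip whitespace,
    // report only the first of a run of consecutive error characters.
    public static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = MASTER.matcher(input);
        boolean lastWasError = false;
        while (m.find()) {
            for (String name : SPEC.keySet()) {
                String value = m.group(name);
                if (value == null) {
                    continue; // this alternative did not participate in the match
                }
                boolean isError = name.equals("ERROR");
                if (!name.equals("WHITESPACE") && !(isError && lastWasError)) {
                    out.add(name + ": " + value);
                }
                lastWasError = isError;
                break;
            }
        }
        out.add("EOF");
        return out;
    }

    public static void main(String[] args) {
        for (String t : tokenize("tcu else i34 2983 ( + +eqdQ")) {
            System.out.println(t);
        }
    }
}
```

Looking groups up by name means the order of the table can change without renumbering any constants. The cost is one group lookup per token class on every match, which is fine for a sketch of this size.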

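Quick Java Implementation

This version is structured so that the next method can be called to return a Token object. It is also usually more convenient to represent the token type as an integer, since it will ultimately be used to index into the parse tables: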

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Lexer {

    public static class Token
    {
        public int tokenNumber;
        public String tokenValue;

        public Token(int tokenNumber, String tokenValue)
        {
            this.tokenNumber = tokenNumber;
            this.tokenValue = tokenValue;
        }
    }

    public static int WHITESPACE = 1; // group 1
    public static int PUNCTUATION = 2; // group 2 etc.
    public static int LPAREN = 3;
    public static int RPAREN = 4;
    public static int KEYWORD = 5;
    public static int IDENTIFIER = 6;
    public static int NUMBER = 7;
    public static int SEMICOLON = 8;
    public static int ERROR = 9;
    public static int EOF = 10;

    Matcher m;
    String text;
    boolean skipError;


    public static void main(String[] args) {
        Lexer lexer = new Lexer("tcu else i34 !!!! 2983 ( + +eqdQ!!!!"); // With some error characters "!" thrown in the middle and at the end
        for(;;) {
            Token token = lexer.next();
            System.out.println(token.tokenNumber + ": " + token.tokenValue);
            if (token.tokenNumber == EOF)
                break;
        }
    }

    public Lexer(String text)
    {

        String _WHITESPACE = "(\\s+)";
        String _PUNCTUATION = "((?:[+*/-]|:=))";
        String _LPAREN = "(\\()";
        String _RPAREN = "(\\))";
        String _KEYWORD = "((?:if|then|else|endif|while|do|endwhile|skip)\\b)"; // \b keeps a keyword from matching the front of a longer identifier such as "ifx"
        String _IDENTIFIER = "([a-zA-Z][0-9a-zA-Z]*)";
        String _NUMBER = "([0-9]+)";
        String _SEMICOLON = "(;)";
        String _ERROR = "(.)"; // must be last and able to capture one character

        String regex = String.join("|", _WHITESPACE, _PUNCTUATION, _LPAREN, _RPAREN, _KEYWORD, _IDENTIFIER, _NUMBER, _SEMICOLON, _ERROR);

        Pattern p = Pattern.compile(regex);
        this.text = text;
        m = p.matcher(this.text);
        skipError = false;
    }

    public Token next()
    {
        Token token = null;
        for(;;) {
            if (!m.find())
                return new Token(EOF, "<EOF>");
            for (int tokenNumber = 1; tokenNumber <= 9; tokenNumber++) {
                String tokenValue = m.group(tokenNumber);
                if (tokenValue != null) {
                    token = new Token(tokenNumber, tokenValue);
                    break;
                }
            }
            if (token.tokenNumber == ERROR) {
                if (!skipError) {
                    skipError = true; // we don't want successive errors
                    return token;
                }
            }
            else {
                skipError = false;
                if (token.tokenNumber != WHITESPACE)
                    return token;
            }
        }
    }

}
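One caveat worth checking in any lexer built from ordered alternatives: Java's regex alternation takes the first alternative that succeeds, not the longest one. Listing keywords before identifiers is therefore not enough to get the longest match when an identifier merely starts with a keyword. The sketch below compares a keyword alternative with and without a trailing word boundary. The class name KeywordBoundaryDemo and the scan helper are made up for this demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordBoundaryDemo {

    static final String KEYWORDS = "if|then|else|endif|while|do|endwhile|skip";
    static final String IDENTIFIER = "[a-zA-Z][0-9a-zA-Z]*";

    // Keyword alternative with no boundary check: it accepts the "if" in "ifx".
    public static final Pattern NAIVE =
            Pattern.compile("(" + KEYWORDS + ")|(" + IDENTIFIER + ")");

    // Same alternatives, but a keyword must end at a word boundary.
    public static final Pattern BOUNDED =
            Pattern.compile("((?:" + KEYWORDS + ")\\b)|(" + IDENTIFIER + ")");

    // Label each match by the group that produced it (group 1 = keyword).
    public static List<String> scan(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add((m.group(1) != null ? "keyword:" : "identifier:") + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "ifx done endwhile";
        // [keyword:if, identifier:x, keyword:do, identifier:ne, keyword:endwhile]
        System.out.println(scan(NAIVE, input));
        // [identifier:ifx, identifier:done, keyword:endwhile]
        System.out.println(scan(BOUNDED, input));
    }
}
```

The \b check works here because identifiers in this grammar consist only of letters and digits. For a grammar with other identifier characters, a negative lookahead on the identifier character class would be the closer equivalent.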

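The main method of the Lexer class prints:

6: tcu
5: else
6: i34
9: !
7: 2983
3: (
2: +
2: +
6: eqdQ
9: !
10: <EOF>

Pure regex may not be the best approach here. You need to write a parser.

@Tim To recognize tokens? That is the canonical use of regular languages. Regular expressions are usually enough for this, and that is certainly true in the OP's case.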