Is there a way in Java to recognize tokens in a string while also matching by the longest substring?
I'm trying to figure out how to correctly recognize the tokens in an input file and report what type each one is, using whitespace and newlines as delimiters. The four types the lexer should recognize are:
Identifiers = ([a-z] | [A-Z])([a-z] | [A-Z] | [0-9])*
Numbers = [0-9]+
Punctuation = \+ | \- | \* | / | \( | \) | := | ;
Keywords = if | then | else | endif | while | do | endwhile | skip
For example, if the file contains a line that reads:
tcu else i34 2983 ( + +eqdQ
it should be tokenized and printed as:
identifier: tcu
keyword: else
identifier: i34
number: 2983
punctuation: (
punctuation: +
punctuation: +
identifier: eqdQ
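I don't know how to make the lexer take the longest substring when two tokens of different types are adjacent (as in +eqdQ above).
This is my attempt: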
    //start
    public static void main(String[] args) throws IOException {
        //input file//
        File file = new File("input.txt");
        //output file//
        FileWriter writer = new FileWriter("output.txt");
        //instance variables
        String sortedOutput = "";
        String current = "";
        Scanner scan = new Scanner(file);
        String delimiter = "\\s+ | \\s*| \\s |\\n|$ |\\b\\B|\\r|\\B\\b|\\t";
        String[] analyze;
        BufferedReader read = new BufferedReader(new FileReader(file));
        //lines get read here from the .txt file
        while(scan.hasNextLine()){
            sortedOutput = sortedOutput.concat(scan.nextLine() + System.lineSeparator());
        }
        //lines are tokenized here
        analyze = sortedOutput.split(delimiter);
        //first line is printed here through a separate reader
        current = read.readLine();
        System.out.println("Current Line: " + current + System.lineSeparator());
        writer.write("Current Line: " + current + System.lineSeparator() +"\n");
        //string matching starts here
        for(String a: analyze)
        {
            //matches identifiers if it doesn't match with a keyword
            if(a.matches(patternAlpha))
            {
                if(a.matches(one))
                {
                    System.out.println("Keyword: " + a);
                    writer.write("Keyword: "+ a + System.lineSeparator());
                }
                else if(a.matches(two))
                {
                    System.out.println("Keyword: " + a);
                    writer.write("Keyword: "+ a + System.lineSeparator());
                }
                else if(a.matches(three))
                {
                    System.out.println("Keyword: " + a);
                    writer.write("Keyword: "+ a + System.lineSeparator());
                }
                else if(a.matches(four))
                {
                    System.out.println("Keyword: " + a);
                    writer.write("Keyword: "+ a + System.lineSeparator());
                }
                else if(a.matches(five))
                {
                    System.out.println("Keyword: " + a);
                    writer.write("Keyword: "+ a + System.lineSeparator());
                }
                else if(a.matches(six))
                {
                    System.out.println("Keyword: " + a);
                    writer.write("Keyword: "+ a + System.lineSeparator());
                }
                else if(a.matches(seven))
                {
                    System.out.println("Keyword: " + a);
                    writer.write("Keyword: "+ a + System.lineSeparator());
                }
                else if(a.matches(eight))
                {
                    System.out.println("Keyword: " + a);
                    writer.write("Keyword: "+ a + System.lineSeparator());
                }
                else
                {
                    System.out.println("Identifier: " + a);
                    writer.write("Identifier: "+ a + System.lineSeparator());
                }
            }
            //number check
            else if(a.matches(patternNumber))
            {
                System.out.println("Number: " + a);
                writer.write("Number: "+ a + System.lineSeparator());
            }
            //punctuation check
            else if(a.matches(patternPunctuation))
            {
                System.out.println("Punctuation: " + a);
                writer.write("Punctuation: "+ a + System.lineSeparator());
            }
            //this special case here updates the current line with the next line
            else if(a.matches(nihil))
            {
                System.out.println();
                current = read.readLine();
                System.out.println("\nCurrent Line: " + current + System.lineSeparator());
                writer.write("\nCurrent Line: " + current + System.lineSeparator() + "\n");
            }
            //everything not listed in regex is read as an error
            else
            {
                System.out.println("Error reading: " + a);
                writer.write("Error reading: "+ a + System.lineSeparator());
            }
        }
        //everything closes here to avoid errors
        scan.close();
        read.close();
        writer.close();
    }
}
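Any suggestions would be appreciated. Thanks in advance.

This can absolutely be done without a parser, because the tokens fed to a parser can almost always be defined by a regular language (the Unix tools Lex and Flex have been doing this for years). I didn't want to spend the time hand-translating some Python code into Java, but I took a few minutes to adapt it to your example. I did make a few changes I thought appropriate. As input to a parser, you would normally want to treat the (, ) and ; characters as distinct tokens. You would also want each reserved word to be its own token class rather than lumping them all together as keywords (or a single KEYWORD class), as I have done here.

Methodology

The token patterns are combined into a single regex of alternatives, one capture group per token class, ending with an error pattern, (.), to ensure that find() always returns a match until the input is exhausted. This error regex must be the last alternative, and if it matches, it indicates an unrecognized token.
A next method returns Token objects. It is also usually more convenient to represent the token type as an integer, since it will eventually be used to index into a parse table: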
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Lexer {
    public static class Token
    {
        public int tokenNumber;
        public String tokenValue;

        public Token(int tokenNumber, String tokenValue)
        {
            this.tokenNumber = tokenNumber;
            this.tokenValue = tokenValue;
        }
    }

    // Each token number is also the index of its capture group in the combined regex.
    public static final int WHITESPACE = 1; // group 1
    public static final int PUNCTUATION = 2; // group 2 etc.
    public static final int LPAREN = 3;
    public static final int RPAREN = 4;
    public static final int KEYWORD = 5;
    public static final int IDENTIFIER = 6;
    public static final int NUMBER = 7;
    public static final int SEMICOLON = 8;
    public static final int ERROR = 9;
    public static final int EOF = 10;

    Matcher m;
    String text;
    boolean skipError;

    public static void main(String[] args) {
        Lexer lexer = new Lexer("tcu else i34 !!!! 2983 ( + +eqdQ!!!!"); // With some error characters "!" thrown in the middle and at the end
        for(;;) {
            Token token = lexer.next();
            System.out.println(token.tokenNumber + ": " + token.tokenValue);
            if (token.tokenNumber == EOF)
                break;
        }
    }

    public Lexer(String text)
    {
        String _WHITESPACE = "(\\s+)";
        String _PUNCTUATION = "((?:[+*/-]|:=))";
        String _LPAREN = "(\\()";
        String _RPAREN = "(\\))";
        // Java alternation takes the first alternative that matches, not the longest one.
        // The lookahead stops a keyword from matching as a prefix of an identifier such as "elsewhere".
        String _KEYWORD = "((?:if|then|else|endif|while|do|endwhile|skip)(?![0-9a-zA-Z]))";
        String _IDENTIFIER = "([a-zA-Z][0-9a-zA-Z]*)";
        String _NUMBER = "([0-9]+)"; // digits only (a stray ')' in the class would swallow a closing parenthesis)
        String _SEMICOLON = "(;)";
        String _ERROR = "(.)"; // must be last and able to capture one character
        String regex = String.join("|", _WHITESPACE, _PUNCTUATION, _LPAREN, _RPAREN, _KEYWORD, _IDENTIFIER, _NUMBER, _SEMICOLON, _ERROR);
        Pattern p = Pattern.compile(regex);
        this.text = text;
        m = p.matcher(this.text);
        skipError = false;
    }

    public Token next()
    {
        Token token = null;
        for(;;) {
            if (!m.find())
                return new Token(EOF, "<EOF>");
            // The first non-null group identifies which alternative matched.
            for (int tokenNumber = 1; tokenNumber <= 9; tokenNumber++) {
                String tokenValue = m.group(tokenNumber);
                if (tokenValue != null) {
                    token = new Token(tokenNumber, tokenValue);
                    break;
                }
            }
            if (token.tokenNumber == ERROR) {
                if (!skipError) {
                    skipError = true; // we don't want successive errors
                    return token;
                }
            }
            else {
                skipError = false;
                if (token.tokenNumber != WHITESPACE)
                    return token;
            }
        }
    }
}
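Since the question asks specifically about the longest match, here is a minimal sketch of the "maximal munch" rule that Lex and Flex apply: at each position every pattern is tried, and the longest match wins, with rule order breaking ties. The class name MaximalMunch, the tokenize method and the string labels are made up for this illustration and are not part of the Lexer class above. It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunch {
    // Hypothetical rule table. Insertion order is the priority used to break ties:
    // "else" matches both keyword and identifier with the same length, so keyword wins.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("keyword", Pattern.compile("if|then|else|endif|while|do|endwhile|skip"));
        RULES.put("identifier", Pattern.compile("[a-zA-Z][0-9a-zA-Z]*"));
        RULES.put("number", Pattern.compile("[0-9]+"));
        RULES.put("punctuation", Pattern.compile("[+\\-*/();]|:="));
    }

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Whitespace only separates tokens; it is never reported.
            if (Character.isWhitespace(text.charAt(pos))) {
                pos++;
                continue;
            }
            String bestType = "error";
            int bestEnd = pos; // pos means "nothing matched yet"
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                // lookingAt() anchors the match at the start of the region, i.e. at pos.
                Matcher m = rule.getValue().matcher(text).region(pos, text.length());
                // Strictly longer only, so an earlier rule keeps a tie.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestEnd = m.end();
                    bestType = rule.getKey();
                }
            }
            if (bestEnd == pos) {
                bestEnd = pos + 1; // no rule matched: report one character as an error
            }
            tokens.add(bestType + ": " + text.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("tcu else i34 2983 ( + +eqdQ elsewhere")) {
            System.out.println(token);
        }
    }
}
```

With this rule, elsewhere comes out as a single identifier because the identifier pattern matches more characters than the keyword pattern does, while else on its own is still a keyword. The single-regex approach is faster, but it relies on the order of the alternatives rather than on match length.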
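Running the program prints:
6: tcu
5: else
6: i34
9: !
7: 2983
3: (
2: +
2: +
6: eqdQ
9: !
10: <EOF>

Pure regex might not be the best approach here. You need to write a parser.
@Tim To recognize tokens? That is the canonical use of regular languages. Regular expressions are generally sufficient for this, and that is certainly the case for the OP.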
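On the answer's remark that each reserved word would normally be its own token class: one common way to get that, sketched here under my own assumptions, is to match every word with the identifier pattern first and then look the matched text up in a keyword table. The class name KeywordTable and the token numbers 11 to 18 are hypothetical; only IDENTIFIER = 6 is taken from the answer's code. Map.of requires Java 9 or later.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordTable {
    public static final int IDENTIFIER = 6; // same number as in the answer's Lexer
    // Hypothetical token numbers, one per reserved word.
    public static final int IF = 11;
    public static final int THEN = 12;
    public static final int ELSE = 13;
    public static final int ENDIF = 14;
    public static final int WHILE = 15;
    public static final int DO = 16;
    public static final int ENDWHILE = 17;
    public static final int SKIP = 18;

    private static final Map<String, Integer> KEYWORDS = Map.of(
            "if", IF, "then", THEN, "else", ELSE, "endif", ENDIF,
            "while", WHILE, "do", DO, "endwhile", ENDWHILE, "skip", SKIP);

    private static final Pattern WORD = Pattern.compile("[a-zA-Z][0-9a-zA-Z]*");

    // The whole word has already been matched, so the lookup is an exact comparison
    // and a keyword can never be taken from the front of a longer identifier.
    public static int classify(String word) {
        return KEYWORDS.getOrDefault(word, IDENTIFIER);
    }

    public static void main(String[] args) {
        Matcher m = WORD.matcher("if x then skip else ifx endif");
        while (m.find()) {
            System.out.println(classify(m.group()) + ": " + m.group());
        }
    }
}
```

Because the identifier pattern is greedy, ifx is matched whole and classified as an identifier, while if is looked up and gets its own token number.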