Java: a fast way to extract data from a string
I receive a response from OkHttpClient like:
{"CUSTOMER_ID":"928941293291"}
{"CUSTOMER_ID":"291389218398"}
{"CUSTOMER_ID":"1C4DC4FC-02Q9-4130-S12B-762D97FS43C"}
{"CUSTOMER_ID":"219382198"}
{"CUSTOMER_ID":"282828"}
{"CUSTOMER_ID":"21268239813"}
{"CUSTOMER_ID":"1114445184"}
{"CUSTOMER_ID":"2222222222"}
{"CUSTOMER_ID":"99218492183921"}
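I want to extract every customerId that is a valid Long and lies between minId and maxId (which means skipping values such as 1C4DC4FC-02Q9-4130-S12B-762D97FS43C).
This is my implementation: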
final List<String> customerIds = Arrays.asList(response.body().string()
.replace("CUSTOMER_ID", "")
.replace("\"", "")
.replace("{", "").replace(":", "")
.replace("}", ",").split("\\s*,\\s*"));
for (final String id : customerIds) {
try {
final Long idParsed = Long.valueOf(id);
if (idParsed > minId && idParsed < maxId) {
ids.add(idParsed);
}
} catch (final NumberFormatException e) {
logger.debug("NumberFormatException", e);
}
}
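I have a very long list of customerIds (about 1M), so performance really matters. Is this the best implementation for what I am doing?
Since you have a large file, reading the content line by line may be one way to go. And instead of replacing CUSTOMER_ID piece by piece, define a better regex pattern.
Following your approach, replace CUSTOMER_ID and then use a regex: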
String x = "{\"CUSTOMER_ID\":\"928941293291\"}{\"CUSTOMER_ID\":\"291389218398\"}{\"CUSTOMER_ID\":\"1C4DC4FC-02Q9-4130-S12B-762D97FS43C\"}"
+ "{\"CUSTOMER_ID\":\"99218492183921\"}";
x = x.replaceAll("\"CUSTOMER_ID\"", "");
Pattern p = Pattern.compile("\"([^\"]*)\"");
Matcher m = p.matcher(x);
while (m.find()) {
System.out.println(m.group(1));
}
Or write a regex that matches whatever lies between :" and "}
String x = "{\"CUSTOMER_ID\":\"928941293291\"}{\"CUSTOMER_ID\":\"291389218398\"}{\"CUSTOMER_ID\":\"1C4DC4FC-02Q9-4130-S12B-762D97FS43C\"}"
+ "{\"CUSTOMER_ID\":\"99218492183921\"}";
Pattern p = Pattern.compile(":\"([^\"]*)\"}");
Matcher m = p.matcher(x);
while (m.find()) {
System.out.println(m.group(1));
}
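So there is no need to replace CUSTOMER_ID separately. I would read the string line by line with a BufferedReader and reduce the number of replace calls per line:
String id = line.replace("{\"CUSTOMER_ID\":\"", "");
id = id.substring(0, id.length() - 2); // drop the trailing "} to avoid one more replace
Then apply your try-to-parse-as-long logic and add the successful results to the list.
You can simply ignore all non-numeric fields: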
long[] ids =
Stream.of(response.body().string().split("\""))
.mapToLong(s -> parseLong(s))
.filter(l -> l > minId && l < maxId)
.toArray();
static long parseLong(String s) {
try {
if (!s.isEmpty() && Character.isDigit(s.charAt(0)))
return Long.parseLong(s);
} catch (NumberFormatException expected) {
}
return Long.MIN_VALUE; // sentinel for "not a number"; the range filter drops it
}
Or, if you are using Java 7:
List<Long> ids = new ArrayList<>();
for (String s : response.body().string().split("\"")) {
long id = parseLong(s);
if (id > minId && id < maxId)
ids.add(id);
}
You can stream the data from the file. Here I demonstrate the idea with a stream over a list:
List<String> sample = Arrays.asList(
"{\"CUSTOMER_ID\":\"928941293291\"}",
"{\"CUSTOMER_ID\":\"291389218398\"}",
"{\"CUSTOMER_ID\":\"1C4DC4FC-02Q9-4130-S12B-762D97FS43C\"}",
"{\"CUSTOMER_ID\":\"219382198\"}",
"{\"CUSTOMER_ID\":\"282828\"}",
"{\"CUSTOMER_ID\":\"21268239813\"}",
"{\"CUSTOMER_ID\":\"1114445184\"}",
"{\"CUSTOMER_ID\":\"2222222222\"}",
"{\"CUSTOMER_ID\":\"99218492183921\"}"
);
static final long MIN_ID = 1000000L;
static final long MAX_ID = 1000000000000000000L;
public void test() {
sample.stream()
// Extract CustomerID
.map(s -> s.substring("{\"CUSTOMER_ID\":\"".length(), s.length() - 2))
// Remove any bad ones - such as UUID.
.filter(s -> s.matches("[0-9]+"))
// Convert to long - assumes no number too big, add a further filter for that.
.map(s -> Long.valueOf(s))
// Apply limits.
.filter(l -> MIN_ID <= l && l <= MAX_ID)
// For now - just print them.
.forEach(s -> System.out.println(s));
}
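To show the same pipeline running against a file rather than a list, here is a minimal sketch. It assumes `Files.lines` as the way to stream the file (the answer above does not say which API it had in mind), and the class name `FileStreamExample` and the file name `customer_ids.txt` are hypothetical. It also assumes ids have at most 18 digits, so that `Long.parseLong` can never overflow.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class FileStreamExample {
    static final long MIN_ID = 1000000L;
    static final long MAX_ID = 1000000000000000000L;

    // Length of the fixed prefix {"CUSTOMER_ID":" and of the suffix "}
    private static final int PREFIX_LENGTH = "{\"CUSTOMER_ID\":\"".length();
    private static final int SUFFIX_LENGTH = "\"}".length();

    // Compiled once; 1 to 18 digits always fits into a long (assumption about the ids).
    private static final Pattern DIGITS = Pattern.compile("[0-9]{1,18}");

    static long[] readIds(Path file) throws IOException {
        // try-with-resources closes the underlying file when the stream is done
        try (Stream<String> lines = Files.lines(file)) {
            return lines
                    // Skip lines too short to contain an id
                    .filter(s -> s.length() > PREFIX_LENGTH + SUFFIX_LENGTH)
                    // Extract the raw id between prefix and suffix
                    .map(s -> s.substring(PREFIX_LENGTH, s.length() - SUFFIX_LENGTH))
                    // Drop non-numeric ids such as the UUID-like value
                    .filter(s -> DIGITS.matcher(s).matches())
                    .mapToLong(Long::parseLong)
                    // Apply limits
                    .filter(l -> MIN_ID <= l && l <= MAX_ID)
                    .toArray();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default file name; pass a real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "customer_ids.txt");
        System.out.println(readIds(file).length + " ids in range");
    }
}
```

Using `mapToLong` keeps the values as primitives, which avoids boxing a million `Long` objects.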
Try to avoid exceptions! When 10%-20% of the number parses fail, execution takes roughly 10 times longer (you can write a small test to check this).
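If your input looks exactly like what you showed, you should use cheap operations:
Read the file line by line with a BufferedReader (as already mentioned) or, if you have the whole data as a single String, use a StringTokenizer to split on the line separators.
Every line starts with {"CUSTOMER_ID":" and ends with "}. Don't use replace or a regex (which is even worse) to remove them! Just use a simple substring:
String input = line.substring(16, line.length() - 2);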
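To avoid exceptions, you need to find a criterion that distinguishes an id from a UUID(?), so that parsing never throws. For example, your ids are positive numbers while the UUID contains minus signs; or a long has at most 19 digits while the UUID has 35 characters. That makes it a simple if-else instead of a try-catch.
For those who think it is bad not to catch NumberFormatException when parsing numbers: if there is an id that cannot be parsed, the whole file is corrupt, which means you should not try to continue but fail hard.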
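The substring-plus-check idea can be sketched as follows. This is only an illustration under stated assumptions: the class name `DigitCheckParser` and the helper `isSafeLong` are hypothetical, and the check accepts ids of 1 to 18 ASCII digits, which always fit into a `long` (a valid 19-digit id would be skipped by this simplified check).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DigitCheckParser {
    // Every line is assumed to look like {"CUSTOMER_ID":"282828"}
    private static final int PREFIX_LENGTH = "{\"CUSTOMER_ID\":\"".length(); // 16
    private static final int SUFFIX_LENGTH = "\"}".length();                 // 2

    /** True if s consists of 1 to 18 ASCII digits, so Long.parseLong cannot throw. */
    static boolean isSafeLong(String s) {
        int n = s.length();
        if (n == 0 || n > 18) {
            return false;
        }
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    /** Collects all numeric ids strictly between minId and maxId. */
    static List<Long> extract(Iterable<String> lines, long minId, long maxId) {
        List<Long> ids = new ArrayList<>();
        for (String line : lines) {
            // Guard against lines too short to hold prefix and suffix
            if (line.length() <= PREFIX_LENGTH + SUFFIX_LENGTH) {
                continue;
            }
            String raw = line.substring(PREFIX_LENGTH, line.length() - SUFFIX_LENGTH);
            // Plain if instead of try/catch: UUID-like values fail the digit check
            if (!isSafeLong(raw)) {
                continue;
            }
            long id = Long.parseLong(raw);
            if (id > minId && id < maxId) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "{\"CUSTOMER_ID\":\"928941293291\"}",
                "{\"CUSTOMER_ID\":\"1C4DC4FC-02Q9-4130-S12B-762D97FS43C\"}",
                "{\"CUSTOMER_ID\":\"282828\"}");
        System.out.println(extract(sample, 1000000L, 1000000000000L));
    }
}
```

Whether to skip a non-numeric id silently, as this sketch does, or to fail hard is a separate design decision.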
Here is a small test to see the performance difference between catching the exception and testing the input first:
long REPEATS = 1_000_000, startTime;
final String[] inputs = new String[]{"0", "1", "42", "84", "168", "336", "672", "a-b", "1-2"};
for (int r = 0; r < 1000; r++) {
startTime = System.currentTimeMillis();
for (int i = 0; i < REPEATS; i++) {
try {
Integer.parseInt(inputs[i % inputs.length]);
} catch (NumberFormatException e) { /* ignore */ }
}
System.out.println("Try: " + (System.currentTimeMillis() - startTime) + " ms");
startTime = System.currentTimeMillis();
for (int i = 0; i < REPEATS; i++) {
final String input = inputs[i % inputs.length];
if (input.indexOf('-') == -1)
Integer.parseInt(inputs[i % inputs.length]);
}
System.out.println("If: " + (System.currentTimeMillis() - startTime) + " ms");
}
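My results are:
- about 20 ms (testing) and about 200 ms (catching) with 20% invalid input
- about 22 ms (testing) and about 130 ms (catching) with 10% invalid input
Performance tests of this kind are easy to get wrong because of the JIT or other optimizations, but I think you can see the direction.
First, you should try to read the file line by line. Then extract from each line the id that matches the pattern and collect the ids into an array. Here is a similar solution in Python: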
import re

customer_ids = []
# Compile the pattern once, outside the loop
pattern = re.compile(r'^{[\w\W]+:"([A-Z\d]+-[A-Z\d]+-[A-Z\d]+-[A-Z\d]+-[A-Z\d]+)"}')
# Open the file
with open('cids.json') as f:
    # Read line by line
    for line in f:
        # Try to extract the matching id with the regex pattern
        match = pattern.search(line)
        if match:
            customer_ids.append(match.group(1))
        else:
            print('No match')
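Since the question is about Java, the same line-by-line idea might look like the sketch below. Note that it is adapted rather than a direct port: the Python pattern above captures the dash-separated ids, while this version captures the purely numeric ones that the question asks for. The class name `LineRegexExtractor` is hypothetical, and the pattern assumes ids of at most 18 digits so that `Long.parseLong` cannot overflow.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineRegexExtractor {
    // Compiled once; matches a whole line and captures a 1 to 18 digit id.
    private static final Pattern NUMERIC_ID =
            Pattern.compile("^\\{\"CUSTOMER_ID\":\"(\\d{1,18})\"\\}$");

    /** Reads the source line by line and collects every numeric id. */
    static List<Long> extract(Reader source) throws IOException {
        List<Long> ids = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = NUMERIC_ID.matcher(line.trim());
                // Lines that do not match (for example UUID-like ids) are skipped
                if (m.matches()) {
                    ids.add(Long.parseLong(m.group(1)));
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) throws IOException {
        String data = "{\"CUSTOMER_ID\":\"219382198\"}\n"
                + "{\"CUSTOMER_ID\":\"1C4DC4FC-02Q9-4130-S12B-762D97FS43C\"}\n"
                + "{\"CUSTOMER_ID\":\"2222222222\"}\n";
        System.out.println(extract(new StringReader(data)));
    }
}
```

Passing a `Reader` keeps the method independent of where the data comes from, for example a `FileReader` or the character stream of an HTTP response body.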