How to check the language of parsed data in Java


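I am parsing data from a Google service. It returns data in many languages, but I only need the English data. How can I make sure an item is in English? Please advise.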

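// Requires: java.io.BufferedReader, java.io.InputStreamReader, java.net.URL,
// java.net.URLConnection, java.nio.charset.StandardCharsets, org.json.JSONArray, org.json.JSONObject
String url = "https://newsapi.org/v2/top-headlines?sources=google-news&apiKey=89c8009165774e0fad3742f78b50c6da";

URL url1 = new URL(url);
URLConnection uc = url1.openConnection();

// Decode explicitly as UTF-8; the platform default charset can garble non-Latin titles.
StringBuilder fullline = new StringBuilder();
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(uc.getInputStream(), StandardCharsets.UTF_8))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        fullline.append(inputLine);
    }
}

JSONObject rootObject = new JSONObject(fullline.toString());

JSONArray rows1 = rootObject.getJSONArray("articles");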

The sample data is:

    {
  "status": "ok",
  "totalResults": 1100,
  "articles": [
    {
      "source": {
        "id": null,
        "name": "Ua-football.com"
      },
      "author": "Спорт.ua",
      "title": "Шаран може покинути Олександрію після закінчення цього сезону",
      "description": "Контракт наставника закінчується влітку",
      "url": "https://www.ua-football.com/ua/ukrainian/high/1521800049-sharan-mozhe-pokinuti-oleksandriyu-pislya-zakinchennya-cogo-sezonu.html",
      "urlToImage": "https://static.ua-football.com/img/upload/18/24f537.jpeg",
      "publishedAt": "2018-03-23T10:22:21Z"
    },
    {
      "source": {
        "id": null,
        "name": "Nikkansports.com"
      },
      "author": null,
      "title": "西武菊池雄星、開幕へ万全 ＯＰ戦ラス投５回無失点",
      "description": null,
      "url": "https://www.nikkansports.com/baseball/news/201803230000734.html",
      "urlToImage": null,
      "publishedAt": "2018-03-23T10:20:46Z"
    },
    {
      "source": {
        "id": null,
        "name": "Siol.net"
      },
      "author": null,
      "title": "Picomat na Koroškem je postal prava atrakcija #video",
      "description": "Slovenj Gradec se je pred kratkim obogatil s pridobitvijo, s katero se lahko pohvalita tudi Dubaj in Dunaj. Na Koroškem je za pravo revolucijo poskrbel picomat, ki je postal pravi magnet za odrasle in mladino. Uporabniki morajo samo pritisniti na gumb in svež…",
      "url": "https://siol.net/trendi/kulinarika/picomat-na-koroskem-je-postal-prava-atrakcija-video-463111",
      "urlToImage": "https://siol.net/media/img/9a/c1/8b62129ba4efcbf0faf9-picomat.jpeg",
      "publishedAt": "2018-03-23T10:19:56Z"
    },
    {
      "source": {
        "id": null,
        "name": "Nikkansports.com"
      },
      "author": null,
      "title": "明秀日立・金沢監督「勝ちに不思議な勝ちあり」"
    }
  ]
}

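You are looking for a way to identify the language of a piece of text, which is a hard problem to solve.

You will most likely need to integrate a library or rely on a third-party API.

Here are some useful links. You could also use an API.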

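Work with word frequencies. Take the most common English words, ideally knowing what percentage of normal text they make up, and check the text against that.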

public boolean isEnglish(String text) {
    Set<String> mostFrequentWords = new HashSet<>();
    Collections.addAll(mostFrequentWords,
        "the", "of", "and", "a", "to", "in", "is", "be", "that", "was", "he", "for",
        "it", "with", "as", "his", "i", "on", "have", "at", "by", "not", "they",
        "this", "had", "are", "but", "from", "or", "she", "an", "which", "you", "one",
        "we", "all", "were", "her", "would", "there", "their", "will", "when", "who",
        "him", "been", "has", "more", "if", "no", "out", "do", "so", "can", "what",
        "up", "said", "about", "other", "into", "than", "its", "time", "only", "could",
        "new", "them", "man", "some", "these", "then", "two", "first", "may", "any",
        "like", "now", "my");

    int wordCount = 0;
    int hits = 0;

    Pattern wordPattern = Pattern.compile("\\b\\p{L}+\\b");
    Matcher m = wordPattern.matcher(text);
    while (m.find() && wordCount < 100) {
        String word = m.group().toLowerCase(Locale.ENGLISH);
        ++wordCount;
        if (mostFrequentWords.contains(word)) {
           ++hits;
        }
    }
    // Guard against input without any words, which would otherwise divide by zero.
    return wordCount > 0 && hits * 100 / wordCount >= 30; // At least 30 percent
}
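
As a usage sketch of the word-frequency idea, the class below applies the same kind of test to a list of titles and keeps only those that pass. The class name TitleFilter, the shortened word list, the minPercent parameter, and the keepEnglish helper are assumptions made for illustration, not part of the answer. It matches words with a plain run of letters (\p{L}+) instead of relying on \b, and it treats text without any words as not English.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names and the shortened word list are illustrative.
class TitleFilter {
    private static final Set<String> COMMON_WORDS = new HashSet<>(Arrays.asList(
        "the", "of", "and", "a", "to", "in", "is", "be", "that", "was", "he", "for",
        "it", "with", "as", "his", "i", "on", "have", "at", "by", "not", "they"));

    // A maximal run of Unicode letters counts as one word.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // True when at least minPercent of the first 100 words are common English words.
    static boolean isEnglish(String text, int minPercent) {
        int wordCount = 0;
        int hits = 0;
        Matcher m = WORD.matcher(text);
        while (m.find() && wordCount < 100) {
            ++wordCount;
            if (COMMON_WORDS.contains(m.group().toLowerCase(Locale.ENGLISH))) {
                ++hits;
            }
        }
        // No words at all: report "not English" instead of dividing by zero.
        return wordCount > 0 && hits * 100 / wordCount >= minPercent;
    }

    // Returns the titles that pass the frequency test, in their original order.
    static List<String> keepEnglish(List<String> titles, int minPercent) {
        List<String> result = new ArrayList<>();
        for (String title : titles) {
            if (isEnglish(title, minPercent)) {
                result.add(title);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
            "The manager is expected to leave the club at the end of the season",
            "Picomat na Koroškem je postal prava atrakcija #video",
            "明秀日立・金沢監督「勝ちに不思議な勝ちあり」");
        System.out.println(keepEnglish(titles, 30));
    }
}
```

Keep in mind that headlines are short and often drop articles and prepositions, so a 30 percent threshold may reject genuine English titles. Running the check on the title and description together, or lowering the threshold, are options worth trying.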

Note that some punctuation, such as curly (comma-like) quotes, bullets, and dashes, is not ASCII. Neither are loanwords such as mañana and façade.

Found a solution. It works as expected.

// Requires: import java.nio.charset.StandardCharsets;
private static boolean isEnglish(String text) {
    // This checks the character repertoire, not the language: text in French,
    // German, or Spanish also fits in ISO-8859-1 and will pass.
    // ISO-8859-1 is a superset of US-ASCII, so a single encoder check is enough.
    return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
}
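
To see what this check accepts and rejects, here is a small sketch that runs the same ISO-8859-1 test over a few strings. The class name CharsetCheckDemo, the method name fitsLatin1, and the sample sentences are made up for illustration. Since the encoder only looks at which characters occur, French text passes, while English text containing a typographic apostrophe (U+2019) is rejected.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the sample strings are illustrative.
class CharsetCheckDemo {
    // True when every character of the text can be encoded in ISO-8859-1.
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String[] samples = {
            "Plain English headline",              // ASCII only
            "Un café au lait, s'il vous plaît",    // French, still within Latin-1
            "It\u2019s a plain English headline",  // English with a curly apostrophe
            "Picomat na Koroškem",                 // Slovenian, š is outside Latin-1
            "西武菊池雄星"                           // Japanese
        };
        for (String s : samples) {
            System.out.println(fitsLatin1(s) + "  " + s);
        }
    }
}
```

So the check is a reasonable way to drop Cyrillic or CJK items, but it cannot tell English apart from other Western European languages.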

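You could do String#matches(".*\\b\\w+\\b.*") and filter out the results where the regex does not match. That would at least get rid of non-Latin strings such as Chinese and Japanese.

Thanks for the reply. But I don't want any language other than English. With this regex, French or other English-like languages would not be told apart from English.

Maybe this might interest you:

This might not work, because English can be written in UTF-8, which tends to be a standard code page.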
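
A quick sketch of the regex filter suggested in the comments, assuming the intended pattern is .*\\b\\w+\\b.* (the class and method names are hypothetical). In Java, \w matches only [a-zA-Z_0-9] unless UNICODE_CHARACTER_CLASS is set, so the pattern effectively asks whether the text contains at least one ASCII word.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the regex filter from the comments.
class AsciiWordFilterDemo {
    // Assumed pattern: matches when the text contains at least one ASCII word.
    private static final Pattern HAS_ASCII_WORD = Pattern.compile(".*\\b\\w+\\b.*");

    static boolean hasAsciiWord(String text) {
        return HAS_ASCII_WORD.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Top stories of the day",       // English
            "Le président a démissionné",   // French, contains ASCII words too
            "Шаран може покинути",          // Ukrainian, no ASCII word
            "明秀日立"                       // Japanese, no ASCII word
        };
        for (String s : samples) {
            System.out.println(hasAsciiWord(s) + "  " + s);
        }
    }
}
```

This confirms the objection raised in the reply: Cyrillic and CJK titles are dropped, but any Latin-script language gets through.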
