Tesseract is giving junk data as output for the Japanese language


I'm trying to build a sample application in Java for the Japanese language that reads an image file and outputs the text extracted from the image. I found a sample application online that runs perfectly for English, but for Japanese it produces unidentifiable text. Here is my code:

    BytePointer outText;

    TessBaseAPI api = new TessBaseAPI();
    // Initialize tesseract-ocr with Japanese, using the current directory as the data path
    if (api.Init(".", "jpn") != 0) {
        System.err.println("Could not initialize tesseract.");
        System.exit(1);
    }

    // Open input image with leptonica library
    PIX image = pixRead("test.png");
    // Hand the image to the OCR engine before requesting text
    api.SetImage(image);
    // Get OCR result
    outText = api.GetUTF8Text();
    String string = outText.getString();
    System.out.println("OCR output:\n" + string);

    // Destroy used object and release memory
    api.End();
    outText.deallocate();
    pixDestroy(image);

My output is: OCR output: ETCカー-ード申 込書 �申�込�日 09/02/2017 ETC FeatureID ETCFFL ー申込枚輩交 画 枚

I have used jpn.tessdata, and my application is reading the tessdata file as well. Is any more configuration needed? I'm using Tesseract version 3.02 with a very clean image.


Yes, I found the solution. What we need to do is set the locale in our Java code as follows: olocale = new Locale.Builder().setLanguage("ja").setRegion("JP").build(); We can also set a locale for English in order to extract both Japanese and English text from the image.

Now it is working like a charm for me!
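
For anyone following along, here is a minimal sketch of what setting the locale might look like in a standalone program. The class name LocaleSetup and the helper japaneseLocale() are hypothetical, and the call to Locale.setDefault is an assumption: the answer only shows how the Locale is built, not how it is applied.

```java
import java.util.Locale;

// Hypothetical helper class; the name LocaleSetup is not from the original post.
class LocaleSetup {

    // Builds the Japanese locale with the same Locale.Builder chain as the answer.
    static Locale japaneseLocale() {
        return new Locale.Builder().setLanguage("ja").setRegion("JP").build();
    }

    public static void main(String[] args) {
        Locale olocale = japaneseLocale();

        // Assumption: the locale is applied JVM-wide through Locale.setDefault.
        // The original answer does not say how the locale was applied.
        Locale.setDefault(olocale);

        // Print the language tag of the default locale now in effect.
        System.out.println("Default locale: " + Locale.getDefault().toLanguageTag());
    }
}
```

If this matches what the answer did, the call would presumably go before the Tesseract initialization, so that the locale is in place when the OCR result is converted to a Java String.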

Tags: tesseract python-tesseract tess4j