Result of program using pdfbox built with maven-shade-plugin is different than normal NetBeans Run


I have a Java program that uses PDFBox 1.7.1 and is built with maven-shade-plugin 2.0.

Here is the code that uses the PDFBox API:

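import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfFile {
    protected PDDocument document = null;
    // Parses the PDF from an in-memory byte array.
    public boolean load(byte[] bytes) throws IOException {
        InputStream is = new ByteArrayInputStream(bytes);
        PDFParser parser = new PDFParser(is);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        this.document = new PDDocument(cosDoc);
        return true;
    }
    // Returns the extracted text; getBytes() here uses the platform default charset.
    public byte[] extractText() throws IOException {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        byte[] text = pdfStripper.getText(this.document).getBytes();
        return text;
    }
    public void close() throws IOException {
        if (this.document != null) {
            this.document.close();
        }
    }
}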

So basically, load() loads a PDF document from a byte array, and extractText() returns the text extracted from the PDF as a byte array. It works when I run the program with the NetBeans Run button, but when I run it from the single jar built with maven-shade-plugin, the returned text is in the wrong character encoding. For example, the word:

odpowiadająca (normal Polish characters)
odpowiadajšca (netbeans run)
odpowiadajÄca (single shade jar)

I know it's exactly the same file (byte array) that is passed to PdfFile.load() in both runs. So the problem seems to be that PDFBox returns the text in two different encodings...

I have 3 questions:

  • Why is the encoding different in the jar built with the shade plugin?
  • How can I control/set the encoding used by the jar built with the shade plugin?
  • How can I force PDFBox to return text in the correct encoding?

I know that the command-line PDFBox has an option to set the encoding:

    java -jar {$jar_path} ExtractText -encoding UTF-8

But I can't find it in the PDFBox API...

Solved: I had to change

    pdfStripper.getText(this.document).getBytes();

to

    pdfStripper.getText(this.document).getBytes("UTF8");
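To illustrate why naming the charset fixes the output, here is a minimal sketch that uses only the Java standard library, not PDFBox. The class `EncodingDemo` and its `encode`/`decode` helpers are hypothetical names made up for this illustration. It encodes the example word as UTF-8 and then decodes the bytes once with the matching charset and once with a mismatched one (ISO-8859-1 is picked purely as an example of a mismatch, not as the charset actually involved in either run).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (not part of the original program or of PDFBox).
class EncodingDemo {

    // Encodes text with an explicit charset, so the bytes do not depend on the JVM default.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Decodes bytes with an explicit charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String word = "odpowiadaj\u0105ca"; // "odpowiadająca", with ą written as a Unicode escape
        byte[] utf8 = encode(word, StandardCharsets.UTF_8);

        // Same charset on both sides: the word survives the round trip.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(word)); // prints true

        // Mismatched charsets: the two UTF-8 bytes of ą (0xC4 0x85) turn into two characters.
        String garbled = decode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length() - word.length()); // prints 1
    }
}
```

The string name "UTF8" used in the fix is an alias that Java accepts for UTF-8; passing `StandardCharsets.UTF_8` (Java 7 and later) avoids the checked `UnsupportedEncodingException` that the string overload declares.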
    
  • According to this code, the default output encoding is UTF-8.
  • There is a PDFTextStripper constructor that takes the output encoding as an argument.
  • For questions 1 and 3:

    I think your problem is more related to the way you transform the byte[] returned by extractText() into a String.

    new String(byte[]) uses the platform encoding. So running this within NetBeans or in a shell can give different results, since I expect the platform encoding to differ when running within NetBeans.
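One way to check whether the two environments really differ is to print the JVM's default charset in each of them. The following is a small diagnostic sketch; `DefaultCharsetCheck` and `defaultCharsetName` are hypothetical names used only for illustration. Running it once from NetBeans and once from the shaded jar should show whether the platform encoding differs between the two.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class (not from the original post).
class DefaultCharsetCheck {

    // Name of the charset used by String.getBytes() and new String(byte[])
    // when no charset is passed explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        // The system property may be absent on some JVMs; it is printed only for comparison.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

The default is commonly changed by starting the JVM with `-Dfile.encoding=...`, but passing an explicit charset in the code is likely the more robust choice, because the result then no longer depends on how the program is launched.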

    Posting the code handling the result of your extractText() can be helpful.

    Thanks, you were right about the string. In the code above I use pdfStripper.getText(this.document).getBytes(); which is String.getBytes(). I had to change this line to pdfStripper.getText(this.document).getBytes("UTF8"); and that solved the problem, thanks! – user606521 Feb 3, 2013 at 10:38
