Result of program using pdfbox built with maven-shade-plugin is different than normal NetBeans Run

I have a Java program that uses PDFBox 1.7.1 and is built with maven-shade-plugin 2.0.

Here is the code that uses the PDFBox API:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfFile {
    protected PDDocument document = null;
    public boolean load(byte[] bytes) throws IOException {
        InputStream is = new ByteArrayInputStream(bytes);
        PDFParser parser = new PDFParser(is);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        this.document = new PDDocument(cosDoc);
        return true;
    }
    public byte[] extractText() throws IOException {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        byte[] text = pdfStripper.getText(this.document).getBytes();
        return text;
    }
    public void close() throws IOException {
        if(this.document != null) {
            this.document.close();
        }
    }
}
So basically, load() loads a PDF document from a byte array and extractText() returns the text extracted from the PDF as a byte array. It works when I run the program with the NetBeans Run button, but when I run it from the single jar built with maven-shade-plugin, the returned text is in the wrong character encoding. For example, the word:
odpowiadająca (correct Polish characters)
odpowiadajšca (netbeans run)
odpowiadajÄca (single shade jar)
I know it's exactly the same file (byte array) that is passed to PdfFile.load() in both runs. So the problem seems to be that PDFBox returns the text in two different encodings...
I have 3 questions:
1. Why is the encoding different in the jar built with the shade plugin?
2. How can I control or set the encoding used by the jar built with the shade plugin?
3. How can I force PDFBox to return text in the correct encoding?
I know that the PDFBox command-line tool has an option to set the encoding:
java -jar {$jar_path} ExtractText -encoding UTF-8
But I can't find it in the PDFBox API...
Solved: I had to change
pdfStripper.getText(this.document).getBytes();
to
pdfStripper.getText(this.document).getBytes("UTF8");
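As a sketch of the same fix, assuming Java 7 or later (where java.nio.charset.StandardCharsets is available), the charset can be passed as a constant rather than by the name "UTF8". That avoids the checked UnsupportedEncodingException and any typo in the charset name. TextEncoding, toUtf8 and fromUtf8 are hypothetical names used for illustration only; they are not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of PDFBox): converts text to and from bytes
// with a fixed charset instead of the JVM's platform default.
final class TextEncoding {
    private TextEncoding() {}

    // Always encodes as UTF-8, whatever the default charset of the JVM is.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes with the same charset that was used for encoding.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u0105 is the Polish letter "a with ogonek"; the escape keeps this
        // source file independent of the compiler's source encoding.
        String word = "odpowiadaj\u0105ca";
        byte[] encoded = toUtf8(word);
        // 13 characters, but \u0105 takes two bytes in UTF-8.
        System.out.println(encoded.length);                 // prints 14
        System.out.println(fromUtf8(encoded).equals(word)); // prints true
    }
}
```

In extractText() this would correspond to returning TextEncoding.toUtf8(pdfStripper.getText(this.document)), with the caller decoding through the same charset.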
According to this code, the default output encoding is UTF-8.
There is a PDFTextStripper constructor taking the output encoding as an argument.
For questions 1 and 3:
I think your problem is more related to the way you transform the byte[] returned by extractText() into a String.
new String(byte[]) uses the platform default encoding, and so does String.getBytes() when called without an argument. The platform encoding can differ between a run inside NetBeans and a run from the shell, so the same code can give different results in the two environments.
Posting the code that handles the result of your extractText() would be helpful.
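To illustrate the point about the platform encoding, here is a minimal sketch that uses only the JDK. The charset pair (UTF-8 for encoding, ISO-8859-1 for decoding) is an assumption chosen to make the effect visible; the charsets actually in play on the asker's machine are not known. MojibakeDemo and recode are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how a charset mismatch garbles text.
final class MojibakeDemo {
    private MojibakeDemo() {}

    // Encodes text with one charset and decodes the resulting bytes with another.
    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String word = "odpowiadaj\u0105ca"; // \u0105 = a with ogonek

        // This is what String.getBytes() and new String(byte[]) use silently.
        // It can differ between an IDE run and a plain "java -jar" run.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Same charset on both sides: the text survives the round trip.
        String ok = recode(word, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(word)); // prints true

        // Mismatch: the two UTF-8 bytes of \u0105 (0xC4 0x85) are decoded as
        // two separate ISO-8859-1 characters, so the word grows by one char.
        String garbled = recode(word, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(word)); // prints false
        System.out.println(garbled.length());     // prints 14
    }
}
```

The default charset is typically derived from the OS locale or from the file.encoding system property, so starting the JVM with something like java -Dfile.encoding=UTF-8 -jar app.jar (app.jar being a placeholder) is a common way to make both runs agree. Passing the charset explicitly in code is the more robust option, since it does not depend on how the program is launched.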
Thanks, you were right about the string. In the code above I used pdfStripper.getText(this.document).getBytes(); which is String.getBytes(). Changing that line to pdfStripper.getText(this.document).getBytes("UTF8"); solved the problem. Thanks!
– user606521
Feb 3, 2013 at 10:38