I have a Java program that uses PDFBox 1.7.1 and is built with maven-shade-plugin 2.0.
Here is the code that uses the PDFBox API:
public class PdfFile {
    protected PDDocument document = null;

    public boolean load(byte[] bytes) throws IOException {
        InputStream is = new ByteArrayInputStream(bytes);
        PDFParser parser = new PDFParser(is);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        this.document = new PDDocument(cosDoc);
        return true;
    }

    public byte[] extractText() throws IOException {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        byte[] text = pdfStripper.getText(this.document).getBytes();
        return text;
    }

    public void close() throws IOException {
        if (this.document != null) {
            this.document.close();
        }
    }
}
So basically, the method load() loads a PDF document from a byte array, and the method extractText() returns the text extracted from the PDF as a byte array. It works when I run the program with the NetBeans Run button, but when I run it from the single jar built with maven-shade-plugin, the returned text is in the wrong character encoding. For example, the word:
odpowiadająca (normal Polish characters)
odpowiadajšca (NetBeans run)
odpowiadajÄca (single shade jar)
I know it is exactly the same file (byte array) that is passed as the argument to PdfFile.load() in both runs. So the problem is that PDFBox returns the text in two different encodings...
I have 3 questions:
1. Why is the encoding different in the jar built with the shade plugin?
2. How can I control or set the encoding used by the jar built with the shade plugin?
3. How can I force PDFBox to return the text in the correct encoding?
I know that the PDFBox command-line tool has an option to set the encoding:
java -jar {$jar_path} ExtractText -encoding UTF-8
But I can't find it in the PDFBox API...
Solved: I had to change
pdfStripper.getText(this.document).getBytes();
to
pdfStripper.getText(this.document).getBytes("UTF8");
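The effect of naming the charset can be seen with plain JDK classes, without PDFBox. The following is only a minimal sketch: the class ExplicitEncoding and the method toUtf8 are made-up names for illustration, and StandardCharsets assumes Java 7 or later (on older versions the getBytes("UTF-8") string form does the same job).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox or of the original program.
final class ExplicitEncoding {

    // Encodes the text as UTF-8 regardless of the JVM's default charset.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u0105 is the Polish letter "a with ogonek"; the escape keeps this
        // source file independent of the compiler's source encoding.
        String word = "odpowiadaj\u0105ca";
        byte[] bytes = toUtf8(word);

        // 13 characters, but 14 bytes: the one non-ASCII letter takes two bytes in UTF-8.
        System.out.println(bytes.length); // prints 14
    }
}
```

Because the charset is named explicitly, the byte array should be the same whether the code is started from an IDE or from a standalone jar.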
According to this code, the default output encoding is UTF-8.
There is also a PDFTextStripper constructor that takes the output encoding as an argument.
For questions 1 and 3:
I think your problem is more related to the way you transform the byte[] returned by extractText() into a String. new String(byte[]) uses the platform encoding, so running within NetBeans or in a shell can give different results, since I expect the platform encoding to differ when running within NetBeans.
Posting the code that handles the result of your extractText() would be helpful.
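To make the decoding side concrete, here is a small sketch using only JDK classes. DecodingDemo and decode are hypothetical names chosen for this example, and StandardCharsets assumes Java 7 or later. It prints the JVM's default charset (the one that new String(byte[]) would use) and then decodes the same UTF-8 bytes with a matching and a mismatching charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original program.
final class DecodingDemo {

    // Decodes bytes with an explicitly named charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // \u0105 is the Polish letter "a with ogonek".
        byte[] utf8 = "odpowiadaj\u0105ca".getBytes(StandardCharsets.UTF_8);

        // The charset that new String(byte[]) and getBytes() fall back to.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Matching charset: the original word is recovered.
        System.out.println(decode(utf8, StandardCharsets.UTF_8));

        // Mismatching charset: the two UTF-8 bytes of the Polish letter are
        // read as two separate ISO-8859-1 characters, which garbles the word.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

The garbled second result resembles the output reported for the shaded jar, although this sketch cannot confirm what happened on the asker's machine. If the first line prints different charsets in the two environments, that would support the platform-encoding explanation.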