I am exploring tools to convert PDF documents to PDF/A. Ghostscript seems to support this conversion out of the box. One issue is that some TrueType fonts in the original PDF document are not converted correctly: if I copy text from the converted PDF/A document and paste it into Notepad, the pasted text is garbled.

The original document text can be copied to notepad just fine.

I am using the following script:

gswin64 -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=FilteredOutput.pdf Filtered1Page.pdf

I have uploaded a sample 1 page source PDF in Google Drive: SampleInput

A sample output PDF/A document generated from the command is in Google drive here: SampleOutput

Running the above command on this PDF on a Windows machine will reproduce the issue.

Are there any settings or commands that make Ghostscript handle the PDF/A conversion properly?

Copy and paste from a PDF is not guaranteed to work. Subset fonts will not have a usable Encoding (such as ASCII or UTF-8), in which case they are only amenable to cut/paste/search if they have an associated ToUnicode CMap. Many PDF files do not contain ToUnicode CMaps.
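As a rough way to see whether a file carries any ToUnicode CMaps at all, something like the sketch below can help. It is only a heuristic, not a PDF parser: `count_tounicode` is a hypothetical helper that looks for the /ToUnicode key outside streams and inside streams it can inflate with zlib, and it will miss encrypted files or streams using other filters.

```python
import re
import zlib

KEY = b"/ToUnicode"
# Rough match for "stream ... endstream" bodies; this is not a real PDF parser.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def count_tounicode(pdf_bytes: bytes) -> int:
    """Count /ToUnicode keys found outside streams and inside stream bodies."""
    # Dictionaries written directly in the file, outside any stream.
    total = STREAM_RE.sub(b"", pdf_bytes).count(KEY)
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        try:
            # Object streams are usually Flate-compressed; try to inflate them.
            body = zlib.decompressobj().decompress(body)
        except zlib.error:
            pass  # not Flate data, so scan the raw bytes instead
        total += body.count(KEY)
    return total


# Example use (the file name is the one from the question):
# with open("Filtered1Page.pdf", "rb") as handle:
#     print(count_tounicode(handle.read()))
```

A result of zero suggests that none of the fonts has a ToUnicode CMap, in which case copy/paste depends entirely on the fonts' Encodings.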

Of course, the PDF/A specification states (oddly in my opinion) that you should not use subset fonts, but it's not always possible to tell whether a font is subset (not all creators follow the XXXXX+ convention), and even if the font isn't subset there still isn't any guarantee that its Encoding is one that is usable.

Looking at the file you have posted, it does not contain one of the fonts it uses (Arial,Bold) and so Ghostscript substitutes with DroidSansFallback, and the font it does contain (FreeSansBold) is a subset (FWIW this font doesn't actually seem to be used....). The fallback font is a CIDFont, so there is no real prospect of the text being 'correct'.

I believe that if you make a real font available to Ghostscript to replace Arial,Bold then it will probably work correctly. This would also fix the rather more obvious problem of the spacing of the characters being incorrect (in one place, wildly incorrect), which is caused by the fallback font having different widths to the original.
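Since the substitution here happens for a CIDFont, one place to point Ghostscript at a real font is its cidfmap file. The sketch below only formats an entry in the dictionary form cidfmap uses. `cidfmap_entry` is a hypothetical helper, the arialbd.ttf path is an assumption about a typical Windows install, and whether the entry should be keyed as Arial or Arial,Bold depends on the name the PDF actually requests.

```python
def cidfmap_entry(font_name: str, ttf_path: str) -> str:
    """Format one cidfmap line that maps a CIDFont name to a TrueType file."""
    # PostScript strings treat backslash as an escape, so use forward slashes.
    path = ttf_path.replace("\\", "/")
    return (
        f"/{font_name} << /FileType /TrueType /Path ({path}) "
        "/SubfontID 0 /CSI [(Identity) 0] >> ;"
    )


# Assumed location of Arial Bold; adjust to the fonts actually installed.
print(cidfmap_entry("Arial,Bold", r"c:\windows\fonts\arialbd.ttf"))
```

The printed line would be added to the cidfmap file in Ghostscript's Resource/Init directory. This addresses the substitution and the character widths; it does not by itself give the output a ToUnicode CMap.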

NB: as the warning messages have already told you, don't use -dUseCIEColor.

The fact that you cannot copy/paste/search a PDF does not mean that it is not a valid PDF/A-1b file, though, so this does not mean that the creation (NOT conversion) of the PDF/A-1b is not 'proper'.

a. I updated the cidfmap file to have this line: /Arial << /FileType /TrueType /Path (c:/windows/fonts/arial.ttf) /SubfontID 0 /CSI [(Identity) 0] >> ; The text looks better, as you said, but I still can't copy text correctly - drive.google.com/file/d/0B3Aklxzb8KfcX2VReXROUm5NQnc/…
b. Without -dUseCIEColor, the converted PDF/A fails validation: "A device-specific color space (DeviceGray) without an appropriate output intent is used". Is there a way to correct this?
c. Is there a way to get the TrueType font embedded and copyable when converting to PDF/A using Ghostscript?
– Praveen Nayak Feb 1, 2016 at 16:13

Your PDF file is using a CIDFont which is not embedded (bad practice) and has no ToUnicode CMap. Realistically, once you pass that through pdfwrite you aren't going to get a PDF with a ToUnicode CMap, because we don't invent one if there wasn't one to start with. So no, that's not going to work. You should be using -sColorConversionStrategy=CMYK for CMYK output; you can drop the ProcessColorModel if you do that, it's set auto-magically. – KenS Feb 1, 2016 at 19:45

On using your suggestion for CMYK output, gswin64 -dPDFA -dBATCH -dNOPAUSE -sColorConversionStrategy=CMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=FilteredOutput2.pdf Filtered1Page.pdf, I get the error "Unrecoverable error: rangecheck in .putdeviceprops". Anything I am missing? Also, can I assume there is no way that the document I shared can be converted to PDF/A with copying text still working? – Praveen Nayak Feb 1, 2016 at 22:16

Possibly you are using an old version of Ghostscript; the current version is 9.18. You are not (apparently) sending a pdfa_def.ps file, so the output file will not be a valid PDF/A file anyway. And finally, no, not using Ghostscript. – KenS Feb 2, 2016 at 7:58
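Pulling the suggestions from these comments together, a conversion command could be assembled roughly as follows. This is only a sketch under a few assumptions: `pdfa_command` is a hypothetical helper, the PDFA_def.ps path is a placeholder (a copy ships in Ghostscript's lib directory and needs editing to point at an ICC profile), and the policy switch is written with -d as in the Ghostscript documentation rather than -s as in the commands above.

```python
import subprocess


def pdfa_command(src, dst, pdfa_def="PDFA_def.ps", gs="gswin64"):
    """Build a Ghostscript argument list for creating a PDF/A file via pdfwrite."""
    return [
        gs,
        "-dPDFA",
        "-dBATCH",
        "-dNOPAUSE",
        # Replaces -dUseCIEColor and -sProcessColorModel from the question.
        "-sColorConversionStrategy=CMYK",
        "-sDEVICE=pdfwrite",
        "-dPDFACompatibilityPolicy=1",
        f"-sOutputFile={dst}",
        pdfa_def,  # the definition file must come before the input PDF
        src,
    ]


def convert(src, dst, **options):
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(pdfa_command(src, dst, **options), check=True)


# Example call (file names are the ones used in the question):
# convert("Filtered1Page.pdf", "FilteredOutput2.pdf")
```

Even with a valid output intent, the resulting file will still not be searchable or copyable if the source fonts have no ToUnicode CMap, as noted above.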
