multipage PDFs not OCRing


Is there a delay in the OCR of multipage PDFs when ingesting into Preservica? My single-page PDFs appear to be OCR'd, but my multipage PDFs are not. I noticed this from the search results.

Hi Abra,

We do not actually perform any OCR on uploaded files, as this is not available in any Starter version, only in the Enterprise editions of the full Preservica system. However, Starter does perform full-text indexing of all incoming files that are identified as text. This includes text-based PDFs but not image-based PDFs.

This indexing is performed during the upload and ingest stage, and to avoid bloating the index it covers only the first 50 MB of each file. So it may be that your multipage files contain images that we are not able to index, or that the files exceed the 50 MB limit after the first couple of pages.

Indexing runs as a background task, so your upload will finish and the indexing will carry on, but I wouldn't expect much of a delay before the search index is available.
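
If it helps to check files locally before uploading, here is a rough sketch that lists the PDFs in a folder that are larger than the 50 MB indexing cutoff mentioned above. It assumes the cutoff applies to the file size and that 1 MB means 1024 * 1024 bytes, neither of which is confirmed here. The function name `pdfs_over_limit` and the folder path are made up for illustration, and nothing in it talks to Preservica itself. It also cannot tell whether a PDF is text based or image based.

```python
from pathlib import Path

# Assumption: the 50 MB cutoff from the reply above, read as 50 * 1024 * 1024 bytes.
INDEX_LIMIT_BYTES = 50 * 1024 * 1024


def pdfs_over_limit(folder, limit=INDEX_LIMIT_BYTES):
    """Return (file name, size in bytes) for each PDF in folder larger than limit."""
    results = []
    # Sort so the output order is stable from run to run.
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() == ".pdf":
            size = path.stat().st_size
            if size > limit:
                results.append((path.name, size))
    return results


if __name__ == "__main__":
    # "." is a placeholder; point this at the folder you plan to upload.
    for name, size in pdfs_over_limit("."):
        print(f"{name}: {size / (1024 * 1024):.1f} MB, may be only partly indexed")
```

Any file it lists would be a candidate for splitting into smaller PDFs if text near the end is not showing up in search.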

Thank you for the clarification on OCR and indexing, Steve. I think it just took some time to index the larger PDF file, as it is now returning keyword results.
