Want to know how to automatically OCR PDFs as they are added to SharePoint?
The Encodian connector provides an OCR action named ‘OCR a PDF Document’, which checks a PDF document for the presence of a text layer. If one isn’t present, it performs OCR, adds the text layer to the PDF document and returns the newly OCR’d PDF document.
The ‘OCR a PDF Document’ action can also perform a wide array of clean-up operations such as auto-rotate, deskew, despeckle, etc. Please review the documentation for further details.
Scenario
This article details a simple Flow for automatically performing OCR on PDF documents added to a SharePoint library to ensure the contents of the files can be indexed by SharePoint and can be more easily found by users.
Guide
1. Create a new Flow using the ‘Automated cloud flow’ option

2. Enter a name for the Flow, select the ‘When a file is created (properties only)’ SharePoint trigger action, then click ‘Create’
3. Configure the ‘When a file is created (properties only)’ SharePoint trigger action
3.a. Site Address: Enter the location of the SharePoint site where the target library / folder is held
3.b. Library Name: Select the SharePoint document library
3.c. Folder: (Optional) Select the SharePoint folder which will be monitored for new PDF documents

4. Add the SharePoint ‘Get file content’ action
4.a. Site Address: Set as per the trigger action’s value
4.b. File Identifier: Select the ‘Identifier’ property provided by the ‘When a file is created (properties only)’ SharePoint trigger action
5. Add a ‘Condition’ action
5.a. Configure the condition action to test the ‘File name with extension’ property provided by the ‘When a file is created (properties only)’ SharePoint trigger action (for example, that it ends with ‘.pdf’). This will ensure that the Flow only attempts to OCR PDF documents

6. Add a ‘Terminate’ action within the ‘No’ branch of the condition added in step #5
6.a. Status: Select ‘Succeeded’

7. Add an ‘OCR a PDF Document’ action within the ‘Yes’ branch of the condition added in step #5
7.a. Filename: Select the ‘File name with extension’ property provided by the ‘When a file is created (properties only)’ SharePoint trigger action
7.b. File Content: Select the ‘File Content’ property from the ‘Get file content’ SharePoint action

OPTIONAL SETTINGS
Please review and change the following advanced options as required:
Language: Select the preferred language; the default is ‘English’

Clean Operations: When set to ‘Default’, the OCR action will perform a default collection of clean-up operations including auto-rotate, auto deskew and auto despeckle. To select a specific set of clean-up operations, select ‘Specific’ and then enable the required clean-up operations.

Guide – Continued
8. Add a SharePoint ‘Update file’ action
8.a. Site Address: Set to the value of the SharePoint site set in step #3.a
8.b. File Identifier: Select the ‘Identifier’ property provided by the ‘When a file is created (properties only)’ SharePoint trigger action
8.c. File Content: Select the ‘File Content’ property provided by the ‘OCR a PDF Document’ Encodian action

The completed flow should follow this construct:
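As a rough illustration of that construct, the logic of steps 4 to 8 can be sketched in Python. This is not how Power Automate or the Encodian connector is implemented; the function name process_new_file and its three callable parameters are hypothetical stand-ins for the ‘Get file content’, ‘OCR a PDF Document’ and ‘Update file’ actions, and the case-insensitive ‘.pdf’ check is an assumption about how the condition is configured.

```python
from typing import Callable


def process_new_file(
    filename: str,
    get_file_content: Callable[[], bytes],
    ocr_pdf: Callable[[str, bytes], bytes],
    update_file: Callable[[bytes], None],
) -> str:
    """Mirror steps 4 to 8 of the guide for one newly created file."""
    # Step 4: read the content of the file that fired the trigger.
    content = get_file_content()

    # Step 5: the condition. Only PDF documents continue to OCR.
    if not filename.lower().endswith(".pdf"):
        # Step 6: 'No' branch, terminate with a 'Succeeded' status.
        return "Succeeded (skipped, not a PDF)"

    # Step 7: 'Yes' branch, pass the filename and content to the OCR action.
    ocr_content = ocr_pdf(filename, content)

    # Step 8: overwrite the source file with the OCR'd content.
    update_file(ocr_content)
    return "Succeeded (OCR applied)"


if __name__ == "__main__":
    saved = []
    status = process_new_file(
        "scan.pdf",
        lambda: b"raw-bytes",               # stand-in for 'Get file content'
        lambda name, data: data + b"+ocr",  # stand-in for the OCR action
        saved.append,                       # stand-in for 'Update file'
    )
    print(status, saved)
```

The point of the sketch is the ordering: the output of the OCR step, not the original content, is what gets written back to SharePoint in step 8.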

9. Now let’s test the flow!

10. Select ‘I’ll perform the trigger action’ and click ‘Save & Test’
11. Add a PDF document to the SharePoint library or folder set in steps #3.b and #3.c

12. Validate a text layer has been added to the PDF document
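One way to do this check outside a PDF viewer is to download the updated file and test whether any text can be extracted from it. The sketch below is a minimal example under stated assumptions: it uses the third-party pypdf library (which must be installed separately), the function names are hypothetical, and "some extractable text on at least one page" is only a heuristic for "a text layer is present".

```python
from typing import Iterable, List


def has_text_layer(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """Return True if any page yielded at least min_chars non-whitespace characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def page_texts_from_pdf(path: str) -> List[str]:
    """Extract the text of each page with pypdf (assumed installed: pip install pypdf)."""
    # Imported lazily so has_text_layer can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__":
    # Page texts are supplied directly here; for a real file, call
    # has_text_layer(page_texts_from_pdf("downloaded.pdf")) with your own path.
    print(has_text_layer(["", "Invoice 42"]))
    print(has_text_layer(["", "   "]))
```

A scanned PDF that has not been OCR’d will typically yield empty strings for every page, so has_text_layer returns False for it.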

Finally…
We hope this post provides a good guide for ensuring PDF documents in your SharePoint libraries have been correctly OCR’d.
As ever, please share any feedback or comments; all are welcome!
6 Comments
Hi, I’ve just tested this and it works but my PDFs are ending up roughly 10x larger than they started.
Is there a setting for quality anywhere? Or do I have to disable the processing items to keep file size comparable?
Hi Max,
Yes, this happens when ‘Clean Operations’ are specified. When these are enabled, each page of the PDF document is converted into a 300 DPI image (500 KB to 1 MB in size) before the selected image optimisation operations are applied. On completion, the PDF document is regenerated from the new images, resulting in larger file sizes. To disable this, make sure you have set the ‘Clean Operations’ parameter to ‘none’; the generated text layer is then attached to the original document rather than a new document being built from the enhanced images.
Hope this helps
Jay
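To put rough numbers on the reply above, the per-page figure it quotes (500 KB to 1 MB per 300 DPI page image) can be turned into a back-of-the-envelope estimate. This is only arithmetic on those quoted figures, treating 500 KB as 0.5 MB; the function name is hypothetical and real output sizes will depend on the document.

```python
from typing import Tuple


def estimate_cleaned_pdf_size_mb(
    page_count: int,
    min_mb_per_page: float = 0.5,  # lower bound quoted in the reply (500 KB)
    max_mb_per_page: float = 1.0,  # upper bound quoted in the reply (1 MB)
) -> Tuple[float, float]:
    """Estimate the (low, high) size in MB of a PDF rebuilt from page images."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    return (page_count * min_mb_per_page, page_count * max_mb_per_page)


if __name__ == "__main__":
    low, high = estimate_cleaned_pdf_size_mb(20)
    print(f"20 pages: roughly {low} to {high} MB")  # prints 10.0 to 20.0 MB
```

For a hypothetical 20-page scan this gives roughly 10 to 20 MB, which is why a small source file can grow several times over when clean-up operations are enabled.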
Hi,
I’ve added a PDF as per the instructions, but unfortunately the PDF is still not OCR searchable. Is there a setting I need to enable for it to be more sensitive? The PDF I used contains a table with words and numbers. I used the OneDrive Scan option (on my mobile) to generate the original PDF.
Hi Tita, can you please email your flow configuration and document to support@encodian.com? Typically, we see this reported where a document is provided to the Encodian action, but the output (the OCR’d PDF) hasn’t been used… i.e. you might use the SharePoint ‘Update File’ action to overwrite the source PDF document.
HTH
Jay
Once this flow is built, will I be able to use the search function in SharePoint to search text in the documents, or will I only be able to search on a selected document? Would it make more sense to use a PowerApp on top of SharePoint with Encodian running in the background for this functionality?
Yes, when you add a text layer to a PDF document through OCR, SharePoint will index the new text layer, so the document will appear in M365 search.