Automatically OCR PDF Documents added to a SharePoint Library

October 10th 2019

Want to know how to automatically OCR PDFs as they are added to SharePoint?

The Encodian connector provides an OCR action named ‘OCR a PDF Document’, which checks a PDF document for the presence of a text layer. If one isn’t present, the action performs OCR and adds a text layer before returning the newly OCR’d PDF document.

The ‘OCR a PDF Document’ action can also perform a wide array of clean-up operations such as auto-rotate, deskew, despeckle, etc. Please review the documentation for further details.
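The behaviour described above can be summarised in a few lines of Python. This is only a sketch of the decision logic, not Encodian's implementation: `ocr_if_needed`, `has_text_layer` and `perform_ocr` are hypothetical names, and the two callables stand in for whatever text-layer detection and OCR engine is actually used.

```python
from typing import Callable


def ocr_if_needed(
    pdf: bytes,
    has_text_layer: Callable[[bytes], bool],
    perform_ocr: Callable[[bytes], bytes],
) -> bytes:
    """Return the PDF unchanged if it already has a text layer, else OCR it.

    Both callables are hypothetical placeholders, not part of any real API.
    """
    if has_text_layer(pdf):
        # Already searchable: nothing to do.
        return pdf
    # No text layer found: run OCR and return the new document.
    return perform_ocr(pdf)
```

Because a document that already has a text layer is returned untouched, running the action twice on the same file should be harmless.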

Scenario

This article details a simple Flow for automatically performing OCR on PDF documents added to a SharePoint library to ensure the contents of the files can be indexed by SharePoint and can be more easily found by users.

Guide

1. Create a new Flow using the ‘Automated — from blank‘ option

2. Enter a name for the Flow, select the ‘When a file is created in a folder’ SharePoint trigger action, then click ‘Create’

3. Configure the ‘When a file is created in a folder’ SharePoint trigger action

3.a. Site Address: Enter the location of the SharePoint site where the target library / folder is held

3.b. Folder Id: Select the SharePoint folder which will be monitored for new PDF documents

4. Add a ‘Condition’ action

4.a. Configure the condition action as per the image below, which will ensure that the Flow only attempts to apply the OCR action to PDF documents

5. Add a ‘Terminate’ action within the ‘No’ branch of the condition added in step #4

5.a. Status: Select ‘Succeeded’

6. Add an ‘OCR a PDF Document’ action within the ‘Yes’ branch of the condition added in step #4

6.a. Filename: Select the ‘Filename’ property from the ‘When a file is created in a folder’ SharePoint trigger action

6.b. File Content: Select the ‘File Content’ property from the ‘When a file is created in a folder’ SharePoint trigger action

OPTIONAL SETTINGS

Please review and change the following advanced options as required:

Language: Select the preferred language; the default is ‘English’

Clean Operations: When set to ‘Default’, the OCR action performs a default collection of clean-up operations including auto-rotate, auto deskew and auto despeckle. To choose a specific set of clean-up operations, select ‘Specific’ and then enable the required clean-up operations.
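To make the branching in steps 4 to 6 and the optional settings concrete, here is a minimal Python sketch. It assumes the condition tests whether the file name ends with `.pdf` (the original screenshot is not reproduced here, so the exact expression is an assumption). The names `OcrSettings`, `is_pdf` and `handle_new_file` are illustrative only and are not part of the Encodian connector.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OcrSettings:
    """Illustrative stand-in for the action's optional settings."""
    language: str = "English"          # default language per the guide
    clean_operations: str = "Default"  # "Default" or "Specific"
    # Only consulted when clean_operations == "Specific".
    specific_operations: List[str] = field(default_factory=list)

    def effective_operations(self) -> List[str]:
        if self.clean_operations == "Default":
            # The guide names these three; the real default set may hold more.
            return ["auto-rotate", "deskew", "despeckle"]
        return list(self.specific_operations)


def is_pdf(filename: str) -> bool:
    """Assumed condition: the file name ends with '.pdf' (case-insensitive)."""
    return filename.lower().endswith(".pdf")


def handle_new_file(filename: str, settings: Optional[OcrSettings] = None) -> str:
    """Mirror the Condition: 'No' branch terminates, 'Yes' branch runs OCR."""
    settings = settings or OcrSettings()
    if not is_pdf(filename):
        return "Terminated: Succeeded"  # step 5
    ops = ", ".join(settings.effective_operations())
    return f"OCR {filename} ({settings.language}; clean-up: {ops})"  # step 6


print(handle_new_file("notes.docx"))
print(handle_new_file("Scan.PDF"))
```

Terminating the ‘No’ branch with a ‘Succeeded’ status keeps non-PDF uploads from showing up as failed runs in the Flow history.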

Guide – Continued

7. Add a SharePoint ‘Create file’ action

7.a. Site Address: Set to the value of the SharePoint site set in step #3.a

7.b. Folder Path: Set to the same value as the ‘Folder Id’ property set in step #3.b

7.c. File Name: Select the ‘Filename’ property from the ‘OCR a PDF Document’ Encodian action

7.d. File Content: Select the ‘File Content’ property from the ‘OCR a PDF Document’ Encodian action

8. The completed flow should follow this construct:

9. Now let’s test the flow!

10. Select ‘I’ll perform the trigger action’ and click ‘Save & Test’

NOTE: You can ignore the recursive event warning, as the Flow is configured to overwrite an existing document, which will not re-fire the event. If you rename the file, thus creating a brand new file, the Flow will re-run. To avoid recursive event triggers, review our post on the Power Automate Community Blog: SharePoint – Managing Recursive Events in Flow

11. Add a PDF document to the SharePoint folder set in step #3.b

12. Validate a text layer has been added to the PDF document

13. Repeat the test with a non-PDF document
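The whole flow, including the reason the recursive event warning can be ignored, can be modelled in plain Python. This is a toy simulation under stated assumptions, not SharePoint or Power Automate code: `Library` is a hypothetical in-memory folder whose ‘created’ trigger fires only when a file name is new, the PDF check is assumed to be a `.pdf` extension test, and `fake_ocr` simply appends a marker in place of real OCR.

```python
from typing import Callable, Dict, List


class Library:
    """Hypothetical in-memory folder with a 'file created' trigger."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self.on_created: List[Callable[[str, bytes], None]] = []

    def create_file(self, name: str, content: bytes) -> None:
        is_new = name not in self.files
        self.files[name] = content
        # Overwriting an existing file does not count as a creation.
        if is_new:
            for handler in list(self.on_created):
                handler(name, content)


def fake_ocr(content: bytes) -> bytes:
    """Placeholder for the OCR action: adds a marker if none is present."""
    marker = b"[text-layer]"
    return content if content.endswith(marker) else content + marker


def make_flow(library: Library, runs: List[str]) -> Callable[[str, bytes], None]:
    def flow(name: str, content: bytes) -> None:
        runs.append(name)                      # record each flow run
        if not name.lower().endswith(".pdf"):  # Condition, 'No' branch
            return                             # Terminate (Succeeded)
        # 'Yes' branch: OCR, then create the file under the same name.
        library.create_file(name, fake_ocr(content))
    return flow


library = Library()
runs: List[str] = []
library.on_created.append(make_flow(library, runs))

library.create_file("scan.pdf", b"image-only")  # step 11: upload a PDF
library.create_file("notes.docx", b"word")      # step 13: upload a non-PDF

print(runs)                       # one run per upload, no recursion
print(library.files["scan.pdf"])  # OCR'd content replaced the original
```

In this model the write-back reuses the original file name, so it is treated as an overwrite and the flow does not trigger itself. Writing the result under a different name would register as a new file and start another run.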

Finally…

Hopefully this post provides a good guide for ensuring PDF documents in your SharePoint libraries have been correctly OCR’d.

As ever, please share any feedback or comments – all are welcome!

Author
Jay Goodison

Managing Director
