Analyzing Office 365 OCR Data using Power BI
I’m so excited to see Optical Character Recognition (OCR) working natively now in Office 365! I got my start in government where we scanned a lot of documents. There was a lot of data locked away in those images but no easy way to mine it. OCR was explored as a possible solution, but it was still in its infancy and not very accurate or scalable.
Fast forward to today and now OCR is simply part of our cloud infrastructure and with the assistance of Machine Learning, very accurate. The dream of finally unlocking that data is here! Or is it? Read on to find out more.
Intelligent Search and Unlocking Your Image Data
SharePoint’s ability to do native OCR processes was first shown at Ignite 2017. There, Naomi Moneypenny did a presentation on Personalized intelligent search across Microsoft 365, where you saw Office 365 OCR in action. https://techcommunity.microsoft.com/t5/Microsoft-Search-Blog/Find-what-you-want-discover-what-you-need-with-personalized/ba-p/109590
She uploaded an image of a receipt and was able to search for it, based on the contents of the receipt image. It was a seamless demo of how Office 365 can intelligently mine the data in image files.
Now, let’s see if we can access the OCR data across multiple files and do something with it.
In the following procedure, I’ll show you how to connect to Office 365 and get to the data that the OCR process returns. In the following post, I’ll show you how to process receipt data to get the total amount spent, solely from the OCR data.
There are three steps to the process to start mining your OCR data. First, you have to add image content that contains text to a SharePoint folder.
Finding OCR Data
The OCR process that runs against the files in a SharePoint document folder are called Media Services. All derived data is stored in columns that contain Media Services in them.
Unfortunately, I’ve discovered that this feature has not been implemented consistently across the Shared Documents folder, custom folders and OneDrive. There is good news in that there’s a less obvious way to get to the data consistently across all three, using Properties. As shown below, you see the normal column names and where they appear. Only the ones in Properties appear consistently across all. We are only going to cover the basic information but the Properties collection has a lot more data in which to consume.
Adding Image Content to a SharePoint Document Folder
When you upload an image to a SharePoint document folder in Office 365, the OCR process kicks off automatically. I’ve had it take up to 15 minutes but the OCR process will analyze the image for text and return the text in a field called MediaServiceOCR if present and always in Properties.vti_mediaserviceocr.
These columns contain any text that was recognized in the graphics file. The current structure of the returned data is a bit different that what is in the source image. Each instance of the discovered text is returned on a separate line, using a Line Feed character as a delimiter. For example, if you had a two-column table of Term and Meaning, it would return the data like this:
While it’s great you can get to the data, the current returned format makes it exceptionally complex to reconstitute the context of the data. Also, the more complex your layout, the more “interesting” your transformations may need to be. I’d strongly recommend this post (https://eriksvensen.wordpress.com/2018/03/06/extraction-of-number-or-text-from-a-column-with-both-text-and-number-powerquery-powerbi/) and this post (https://blog.crossjoin.co.uk/2017/12/14/removing-punctuation-from-text-with-the-text-select-m-function-in-power-bi-power-query-excel-gettransform/ ) to give you the basics of text parsing in Power Query M.
Accessing the OCR Data in Power BI
The OCR columns are item level columns. The normal tendency would be to connect to your SharePoint site using the Power BI SharePoint Folder connector. You’ll be disappointed to find that the Media Services columns aren’t there.
Instead, connect to the document folder using the SharePoint Online List connector. By doing so, you’ll get access to the Media Services columns. Once in the dataset, you can use Power Query M to parse the data and start analyzing.
Let’s walk through how to access the data and manipulate it using Power BI. In this scenario, I have two receipts that have been uploaded in a document folder and I’m going to get the total spent on these receipts by analyzing the OCR data.
What about OneDrive for Business?
Yes, it works there too! The Media Service property fields are here as well. In fact, you get more information in an additional column called MediaServicesLocation. Based on my usage, it seems to be specifically populated for image files. If the image contains EXIF data, the MediaServicesLocation will contain the Country, State/Province, and City information of where it was created. Within the Properties collection, you can actually get more detailed information about the photo, like the type of camera that took it and more.
To connect to OneDrive where this will work, you need your OneDrive URL. I normally right-click on the OneDrive base folder in File Explorer and select View Online, as shown below.
Potential for GDPR Issues
One aspect to consider if you look to do this is a production manner in Europe is that you will likely encounter information that falls under GDPR regulation. Consider this your prompt to think about how this capability would fit into your overall GDPR strategy.
Want a copy of the Power BI model?
Fill out the form below and it will emailed to you automatically, by virtue of the magic of Office Forms, Flow, and Exchange!
I hope you liked this post. If you have comments or questions, post them below in the comments.