Data Extraction in the Cloud – GOOG, MSFT, AMZN – 1 of 4


The three largest providers of cloud infrastructure and platform services now offer OCR-based data extraction solutions in the cloud. This article provides an overview of the three vendors, and I will follow up with separate articles providing more detailed analysis of each offering.

Alphabet Inc. (Google) reported Cloud revenues of $7.2B out of a total of $282.2B in 2022. The Google Cloud Vision GA release on May 18, 2017, included Document Text Detection, which returns full text annotations for dense OCR text.

Microsoft Corp. reported Intelligent Cloud revenues of $75.25B out of a total of $198.3B in 2022. Azure Form Recognizer was announced on May 2, 2019.

Amazon Web Services revenues were $80.1B out of Amazon's total of $514B in 2022. Amazon Textract became generally available on May 29, 2019, as a fully managed web service.

The top-line descriptions on each vendor's website summarize the services available. Google Document AI: “Extract structured data from documents and analyze, search and store this information.” Azure Form Recognizer: “Form Recognizer is an AI service that applies advanced machine learning to extract text, key-value pairs, tables and structures from documents automatically and accurately.” Amazon Textract: “… is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.”

Each of the vendors offers a set of pre-built components designed both to perform general tasks and to extract structured data from specific document types. Google Document AI calls its component functions Processors, Azure Form Recognizer calls them Models, and Amazon Textract calls them APIs.

In terms of OCR, both Google and Microsoft offer separate options for computer vision applications, such as recognizing street signs and words in video, versus data extraction for document processing. Google provides a TEXT_DETECTION call for computer vision and a DOCUMENT_TEXT_DETECTION call for conventional OCR applications such as forms processing. Microsoft's Read (OCR) capability is available in both Computer Vision and Form Recognizer versions designed for the two different purposes.
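To illustrate the Google distinction, here is a minimal sketch using the Google Cloud Vision Python client; it assumes the google-cloud-vision package and application credentials are already set up, and the file name is a placeholder.

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    # Read a local scan; "invoice.png" is a placeholder file name.
    with open("invoice.png", "rb") as f:
        image = vision.Image(content=f.read())

    # TEXT_DETECTION: tuned for sparse text in natural scenes (signs, video frames).
    scene_response = client.text_detection(image=image)

    # DOCUMENT_TEXT_DETECTION: tuned for dense document text (forms, invoices).
    doc_response = client.document_text_detection(image=image)

    # full_text_annotation holds the page/block/paragraph/word hierarchy for dense OCR.
    print(doc_response.full_text_annotation.text)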

Processing is priced according to the functions and features the customer employs. All are designed to handle very high document volumes, and all offer integrations with other related web services from the same provider.

Each of the vendors offers both Synchronous processing, which returns results immediately for smaller documents, and Asynchronous processing for large input files.
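As a sketch of the difference, here is how the two modes look with Amazon Textract via boto3; the bucket and file names are placeholders, and error handling and polling backoff are omitted.

    import time
    import boto3

    textract = boto3.client("textract", region_name="us-east-1")

    # Synchronous call: send the image bytes and receive results immediately.
    with open("page.png", "rb") as f:
        sync_result = textract.detect_document_text(Document={"Bytes": f.read()})
    print(len(sync_result["Blocks"]), "blocks returned")

    # Asynchronous call: start a job against a file in S3, then poll for completion.
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "big-file.pdf"}}
    )
    job_id = job["JobId"]

    while True:
        status = textract.get_document_text_detection(JobId=job_id)
        if status["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)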

All vendors process standard image input files as well as PDF Image and PDF Normal files. Microsoft also accepts .docx, .xlsx, .pptx and HTML as input documents.

All three vendors deliver output from their OCR and data extraction services as JSON files. Google offers both Full and Shortened JSON files. Amazon outputs .ZIP files, which contain the JSON output plus .CSV files that vary with the API used in processing. All the JSON files include three major elements: recognized text, geometry (word location) and word confidence.
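For example, here is a minimal sketch of reading those three elements from a downloaded Amazon Textract JSON file; the file name is a placeholder, and Google and Microsoft use different field names for the same information.

    import json

    # "analyzeDocResponse.json" is a placeholder for a downloaded Textract result.
    with open("analyzeDocResponse.json") as f:
        result = json.load(f)

    for block in result["Blocks"]:
        if block["BlockType"] == "WORD":
            text = block["Text"]                    # recognized text
            confidence = block["Confidence"]        # word confidence (0-100)
            box = block["Geometry"]["BoundingBox"]  # word location on the page
            print(f"{text!r} conf={confidence:.1f} left={box['Left']:.3f} top={box['Top']:.3f}")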

All three vendors offer publicly accessible demo sites where the processing options can be evaluated on user samples across their various Processors, Models and APIs. The demos offer a variety of views of the results and provide the ability to download the JSON and ZIP output files.

I tested pure OCR accuracy using four sets of images, although, as is always the case, the only OCR accuracy that matters is accuracy on the customer's own documents. I used the historic UNLV test documents shared by the Tesseract community (in 1991 I personally sold and installed the Calera 9000 used in those original OCR comparisons), as well as three other sets of documents. In testing over 100 documents, I focused only on the poorest quality images, because these three engines rarely make recognition errors even on poor-quality scans, dark backgrounds, low-resolution fax images and other challenging samples. To assess the forms processing and field recognition functionality, I used a set of structured and semi-structured documents such as air waybills, statements, and invoices.

All three solutions investigated here are primarily designed for forms processing and data extraction applications. This functionality specifically requires the ability to associate Field Names with their corresponding Field Values, or as Azure Form Recognizer describes them: “Key-Value pairs.” But the AI and ML techniques go beyond pure forms to handle less structured documents such as invoices, ID documents and even lending packages.
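As a sketch of what key-value extraction looks like in practice, here is a minimal example using the Azure Form Recognizer Python SDK with its general-purpose prebuilt-document model; the endpoint, key and file name are placeholders.

    from azure.core.credentials import AzureKeyCredential
    from azure.ai.formrecognizer import DocumentAnalysisClient

    # Placeholder endpoint and key for an Azure Form Recognizer resource.
    client = DocumentAnalysisClient(
        endpoint="https://<your-resource>.cognitiveservices.azure.com/",
        credential=AzureKeyCredential("<your-key>"),
    )

    # "invoice.pdf" is a placeholder; prebuilt-document extracts general key-value pairs.
    with open("invoice.pdf", "rb") as f:
        poller = client.begin_analyze_document("prebuilt-document", f)
    result = poller.result()

    # Each pair holds the recognized Field Name (key) and Field Value (value).
    for pair in result.key_value_pairs:
        key = pair.key.content if pair.key else ""
        value = pair.value.content if pair.value else ""
        print(f"{key}: {value}")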

I will cover each vendor in detail in the next three articles. I will not offer any “OCR accuracy” comparison, primarily because, as mentioned above, the only accuracy that matters is how the products perform on the customer's own documents. More importantly for this comparison, recognition of data as accurate key-value pairs is the primary objective of any forms processing or data extraction application.

One more note: because these applications are focused on data extraction, none of them offer formatted output for typical desktop purposes such as creating well-formatted MS Office documents, so they do not compete with traditional desktop OCR.

PDF Expert – Master PDF and OCR

Copyright © 2023 Tony McKinley. All rights reserved.

Email: amckinley1@verizon.net