From Books to the Web
The On-line OCR Lab

Optical Character Recognition - Document Understanding - Text Searching - Digital Libraries

Intelligent Imaging
PO Box 537, Berwyn PA 19312
Ph: 610.647.5570 – Fax: 610.647.7354
Email: tonymck@imagebiz.com

Consulting & Analysis
To Build Digital Libraries: Capture & Recognition

We provide consulting and analysis for Document Imaging, Recognition and Digital Library creation projects. Based on over 15 years of dedicated experience in the field of scanning and OCR, and hands-on knowledge of hundreds of implementations, we provide uncommonly deep and broad insight on document digitization, high volume capture and conversion, and LAN/WAN and Intranet search and retrieval systems.

1996 Clients: Adobe, Bell Atlantic, Canon, Excalibur, Reed Technology & Information Systems, Silicon Biology, University of Pennsylvania, Xerox, Fujitsu, ZyLAB, … and you?

NEW: Published Analysis of OCR and Document Recognition available in HTML & Bookmarked PDF. Three new Classic OCR test images added in TIFF and PDF.

Remember to use the BACK Button on your Browser! You can easily open up these articles, read ’em or print ’em, and quickly pop right back to this page by clicking on the BACK Button on your Browser!

World’s First Review of Adobe Acrobat 3
Acrobat 3, an Electronic Publishing Milestone: Preview of the Cornucopia : HTML File
Acrobat 3, an Electronic Publishing Milestone: Preview of the Cornucopia : PDF file

Adobe Acrobat 3 offers advances in every aspect, from the newly ‘optimized’ PDF file format itself, to beefed up versions of Distiller, PDFWriter and the all new Capture Plug-in. Verity’s new SearchPDF makes Catalog collections searchable over the web, with highlighted words in the text – and it’s free. UNBELIEVABLE!

Web Searchers: Smart HTML vs. Spamdex – AltaVista, Excite, Infoseek, Open Text Reviewed
The Truth is out there: Beyond Lycos and Yahoo – the Full Text Retrieval Engines on the Web : HTML File
The Truth is out there: Beyond Lycos and Yahoo – the Full Text Retrieval Engines on the Web : PDF file

Full Text Retrieval Engines promise access to the vast and growing universe of information on the Web. Spamdex gunks up the system, and the big engines have already defeated spamdexing. Here’s a Seeker’s-eye-view of a few of the big engines.

Coming Soon! OCR Lab Tests Acrobat Capture Plug-In vs. TextBridge Pro 96 & OmniPage Pro 7 on Windows 95
OCR Lab 10 – Original RTF results of TextBridge 96 : PDF Normal File
OCR Lab 10 – Original RTF results of OmniPage 7.0 : PDF Normal File
OCR Lab 10 – Original PDF Normal results of Capture Plug-in : PDF Normal File
OCR Lab 10 – Original .tif images imported by Capture : PDF Image File
OCR Lab 10 – Original .tif images : Image Files

Page & Character Recognition comparison of leading OCR products and Acrobat Capture. Native RTF output of OCR printed to PDF by PDFWriter. All results are untouched. OCR results are designed to be edited; Capture results are designed to be published. The differences are dramatic!

OCR Lab Tests TextBridge Pro 96 vs. OmniPage Pro 7 on Windows 95
OCR Lab Word Accuracy Comparison, Summer ’96 : HTML File
OCR Lab Word Accuracy Comparison, Summer ’96 : PDF file

OCR Comparison: Word Accuracy on 10 Documents, from memos to magazines. Commentary on Format Recognition and HTML Output.

Adobe Acrobat Capture: White Paper on Business Applications vs. Alternatives for creating digital docs
Adobe Acrobat Capture: Product Overview : PDF file

Top level brief on Adobe Acrobat Capture and the role it may play in the new application of digital documents on Intranets, including comparisons to alternative means of document storage, access and distribution.

OCR Lab Tests TextBridge Pro 3 vs. OmniPage Pro 6 on both Windows and Mac
OCR Lab Word Accuracy Comparison, Spring ’96 : HTML File
OCR Lab Word Accuracy Comparison, Spring ’96 : PDF file

OCR Comparison: Word Accuracy on 27 Documents, from memos to magazines.

The Digital Document IS the NEWS at AIIM 96! Intranet Info Systems Need a New Document: HTML & PDF
PDF and Intranet News from AIIM ’96 : HTML File
PDF and Intranet News from AIIM ’96 : PDF file

There are four ways to VIEW DOCUMENTS: 1. Convert to HTML ‘on the fly’; 2. Use a Viewer to look at native files; 3. Download files for a special application; 4. Net-centric files.
Text retrieval, Web-enabled document imaging and Internet document management and philosophy from many vendors, including Adobe, Excalibur, ZyLAB, Verity, Open Text, Fulcrum and many others.

Adobe Acrobat Capture and Catalog
Adobe Acrobat: Convert Directly from Paper to Rich Electronic Documents : HTML File
Adobe Acrobat: Convert Directly from Paper to Rich Electronic Documents : PDF file

Beyond OCR: Direct Path from Paper to Rich Electronic Format – PDF. Portable Document Format is readable by Windows, Mac, Unix GUI and DOS users.

Kofax Ascent Capture: kofax.htm :HTML or PDF: kofax.pdf

Get Paper into OCR: Large volume scanning and indexing of paper documents. Kofax has always been a leader in scanning and imaging software performance.

Cornerstone Input/Accel: iwiptacw.htm :HTML or PDF: iwiptacw.pdf

Get Paper into OCR: Large volume scanning and indexing of paper documents. End-to-end document digitization requires systematic tools and procedures, and high end Image Processing.

Intrafed StageWorks: intrafed.htm :HTML or PDF: intrafed.pdf

Get Paper into OCR: Large volume scanning and indexing of paper documents. These guys have tons of experience, including DocEx, a document exploitation project conducted during the Gulf War.

The Birth of a Virtual Agency: Saving the Legacy of the Office of Technology Assessment iwotaweb.htm : HTML File vs. PDF file :iwotaweb.pdf

Beyond OCR: Digital Library creates a Virtual Agency

Digital Library Projects: Reality Now iwlibweb.htm : HTML File vs. PDF file : iwlibweb.pdf

More than OCR: Digital Library Success Stories. This document includes Web Links in the PDF file for instant access to the inspirational sources of the Digital Library story. This story is a guided tour of ongoing SGML, TEI, PDF, SUPRA and other efforts to create digital libraries.

The free Adobe Acrobat Reader latest versions for Windows, Mac, Unix and DOS are always available at http://www.adobe.com or ftp.adobe.com.

The above files were created in Microsoft Word for Windows and converted in two ways. The documents were “printed” via PDF Writer as Acrobat Portable Document Format. And, the documents were imported and saved through SoftQuad HoTMetaL Pro 2.0. The documents themselves are Test Articles for anybody thinking of building a Digital Library. The HTML versions of the OCR Test Reports include tables that collapsed during conversion – HoTMetaL Pro 3 is on the WAY!

Caere WordScan iwcaerew.htm

Evaluating OCR: WordScan 4.0, the first technology offspring of the new Caere. WordScan 4 is 43% more accurate than WordScan 3 and 58% more accurate than OmniPage Pro 5.

Xerox TextBridge iwxeroxw.htm

Evaluating OCR: Xerox TextBridge Pro 3, ExperVision TypeReader Pro 3 strong on formatting. Productive packages provide quick payoff by capture of format and content.

Project Diana: iwnickw.htm :HTML or PDF: iwnickw.pdf

Beyond OCR: SGML Encoding to Preserve and Provide Deep Access. “We need to think of pages in books on shelves in libraries, not pages in documents in folders on desktops!”

Automating the Global Law Firm: iwmlbw.htm :HTML or PDF: iwmlbw.pdf

Pioneering OCR: And other paths to electronic documents. One of the world’s largest law firms uses all forms of office automation and document processing strategies, and the NETWORK is the Primary Advantage technology.

Business Advantage on the Internet: iwccbi3w.htm :HTML or PDF: iwccbi3w.pdf

Early adopters of Internet technology enjoy outstanding business edge. Intelligent agents, video conferencing, and the Web can all confer substantial competitive leverage.

If you would like to know more about the author, see the resume of Tony McKinley: tmresume.htm

Introduction: On the Path to Digital Libraries

For the past 15 years I have been dedicated to the Noble Task of turning paper documents into digital form. My Hot Rods were the Compuscan, Hendrix and Dest, the Kurzweil Intelligent Scanning System and the Calera Compound Document Processor. They were all OCR monsters of their time. Today’s OCR software finally gets us there. And now that the machines have caught up to the task, the World Wide Web is here to publish all the world’s libraries on the Internet.

The Inspiration for this page comes from Buckminster Fuller, in his 1962 book Education Automation. In that typically freewheeling talk Bucky proposed a universally accessible digital library that would enable anyone, anywhere to study, learn and grow. Bucky figured that this intellectual freedom of the masses would bring humanity’s best ideas to Reality.

This page is full of my field notes, including test images and results from the on-line OCR Lab, as published in Imaging Magazine, Imaging World, and Work Process Improvement TODAY. The goal of all my work is to transform paper documents into digital documents, from paper to bitmaps to SGML immortality in future electronic libraries, universal fonts of accessible knowledge.

OCR Lab Resources

A suite of documents is available here for independent testing and review. These are the images that were used in testing for the above articles on Xerox TextBridge, WordScan, OmniPage, TypeReader and Acrobat Capture. These documents were selected to show relative text and page format recognition on a wide variety of applications, from simple fax memos to complicated lists, newspaper and magazine pages. These pages were chosen because they illustrate the particular challenges to any scanning and recognition system, and they provide a basis for comparison among the leading programs. The original images are here for any interested individual to re-create the experiments and compare the results.

The Test Images: Subjects of OCR and Document Understanding Analysis. Independent Testing and Feedback Requested. ocrimage.htm
The Test Results: Results of OCR and Document Understanding Tests. Independent Feedback Requested. ocresult.htm
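
For anyone re-creating the experiments, here is a minimal sketch of how a word accuracy score might be computed from a hand-checked ground-truth text and the raw OCR output. It assumes word accuracy means the fraction of ground-truth words that the OCR output reproduces exactly and in order; the published reports may define the metric differently. The function name `word_accuracy` is hypothetical, and the alignment uses Python's standard `difflib` rather than any tool used in the original tests.

```python
from difflib import SequenceMatcher


def word_accuracy(truth: str, ocr: str) -> float:
    """Fraction of ground-truth words reproduced exactly, in order, by the OCR text.

    Assumption: whitespace-separated tokens count as words, and a word is
    either fully right or fully wrong (no partial credit).
    """
    truth_words = truth.split()
    ocr_words = ocr.split()
    if not truth_words:
        # Nothing to recognize: perfect only if OCR also produced nothing.
        return 1.0 if not ocr_words else 0.0
    # Align the two word sequences and count the words that line up.
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr = "the qu1ck brown fox"  # one word damaged by a substitution error
    print(word_accuracy(truth, ocr))
```

A single substituted character spoils the whole word, which is why word accuracy is a harsher measure than character accuracy on the same page.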

Pioneer Stories

OCR 70,000 pages per week at 99.985% Accuracy. This article describes a high volume OCR scanning production system, including multiple scanners and ExperVision RTK engines on a network. invwdave.htm or invwdave.pdf

Scientific journals scanned and converted to highly structured citation and abstract database. A network based Calera M-Professional scanning and OCR system, and the Quality Management concerns in building scientific secondary publications. invwsam.htm or invwsam.pdf

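To put the 99.985% figure quoted for the high volume production system in perspective, the arithmetic below estimates how many errors still slip through. Two things here are assumptions and not from the article summary: that the accuracy is measured per character, and that a page holds about 2,000 characters (`CHARS_PER_PAGE` is a hypothetical round number).

```python
from fractions import Fraction

PAGES_PER_WEEK = 70_000              # from the article summary
ACCURACY = Fraction("99.985") / 100  # from the article summary
CHARS_PER_PAGE = 2_000               # assumption: not stated in the source

# Exact rational arithmetic avoids floating-point noise in the tiny error rate.
error_rate = 1 - ACCURACY
errors_per_page = CHARS_PER_PAGE * error_rate
errors_per_week = PAGES_PER_WEEK * errors_per_page

print("error rate:", error_rate)
print("errors per page:", float(errors_per_page))
print("errors per week:", int(errors_per_week))
```

Under those assumptions the system leaves about 0.3 errors per page, or roughly 21,000 errors per week, which is why Quality Management still matters at this accuracy level.
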
Linked Resources

On April 1, 1991, funnily enough, the author and John Solomon installed the initial suite of OCR engines on the Network at UNLV to evaluate ALL of the best OCR in the World. Besides the OCR Lab here, ISRI at UNLV is one of the only other organizations dedicated to the testing and evaluation of true OCR performance, and it offers a ton of research papers for FTP access at ftp.isri.unlv.edu

OCR Lab Results of Adobe Acrobat Capture

All of the test images were processed with Acrobat Capture, and are available here in uncorrected PDF output. In addition to demonstrating the performance of Acrobat Capture, the freely distributed Acrobat Reader viewer allows users to easily see what the pages look like.

3col.pdf

A laser printed (HP LJ/4P, 600 dpi) three column document, in a tiny Times Roman font. The one kind of document that some OCR programs actually recognize at 100% accuracy.

Of course, the output of computers should be recognizable by computers.

awst.pdf

Aviation Week & Space Technology, B-2 color photo. People often say: “Why did they pick this page as an example?” Part of the answer is the perfect example of magazine layout that this page offers, but a big part of the answer involves the way binary images of color magazine pages turn out, and how they make the B-2 virtually indistinguishable in the photo. Stealth via b/w scanning of high quality color image.
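
The stealth effect can be sketched in a few lines. This is an illustration only: it assumes the binary image comes from a simple fixed threshold on luminance, which may not be how the scanner used for these tests worked, and the pixel colors and the names `luminance` and `binarize` are made up for the example.

```python
def luminance(rgb):
    """Approximate perceived brightness (0-255) of an (r, g, b) pixel."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def binarize(pixels, threshold=128):
    """Map each color pixel to 1 (white) or 0 (black) with a fixed threshold.

    Assumption: a single global threshold; real scanners may use
    adaptive thresholding or dithering instead.
    """
    return [[1 if luminance(p) >= threshold else 0 for p in row] for row in pixels]


if __name__ == "__main__":
    dark_aircraft = (60, 60, 70)   # hypothetical dark gray airframe
    dark_sky = (30, 40, 90)        # hypothetical deep blue background
    white_paper = (255, 255, 255)  # page margin

    row = [white_paper, dark_sky, dark_aircraft, dark_sky, white_paper]
    print(binarize([row]))
```

Two colors that are easy to tell apart in the magazine both fall below the threshold, so the aircraft and its background come out as the same solid black.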

biol.pdf

A Scientific Journal article, complete with Citation, Title, Authors, Abstract and other special conventions. The text itself contains a lot of Latin and italic, which tests OCR in a linguistic light.

chrt.pdf

A simple page representative of financial reports and prospectuses. The top half of the page is simple text, the bottom half of the page is what “should be” a simple chart.

fax2_4.pdf

Three faxes: a quote, a price list, a memo.

nytl.pdf

Okay, here’s a full page span article from the New York Times. How does it look to you? This is a very important question.

pcmg.pdf

An example of very complex page layout. Illustrates the fact that info thrown in your face now is difficult to access on an ongoing basis.

stmt.pdf

A common Statement, a pre-printed form filled in by a high speed printer. It is possible to build systems that read these at close to 100% accuracy.

NEW: Published Analysis of OCR and Document Recognition

PDF “Portable Document Format” Versions of Published Articles

The Best Site on the Web for all things PDF is EMERGE, where you will find The PDF Zone, the PDF-L mailing list archive (a treasure trove of questions and answers on PDF), the Capture-L mailing list, and all of the powerful Plug-Ins for Acrobat.
If you are interested in learning more about PDF or even trying out the Capture Evaluation program, you should consider browsing to EMERGE.

iwcaerew.pdf

Originally appeared in Imaging World, 6/95

Caere and Calera, Software Offspring

iwxeroxw.pdf

Originally appeared in Imaging World, 7/95

From Character Recognition to Page Recognition

TextBridge from Xerox and TypeReader from Expervision

iwacrobw.pdf

Originally appeared in Imaging World, 10/95

End to End Document Digitization

Adobe Acrobat Capture

All feedback on this page should be directed to Tony McKinley via e-mail to tonymck@imagebiz.com. This page is continuously under construction. More previously published articles on recognition and e-docs will soon be added to this page. This page produced by Intelligent Imaging.

digilib1.pdf

Private, limited access, do not download w/out password.

Alternative site: http://sunsite.unc.edu/elvis/elvishom.html The Elvis Home Page.

Elvis Navigation: Coming Soon! tonymck.gif

There’s a very cool fake Elvis GIF attached here, and as soon as the Web gets faster, and we all get ISDN at home, Elvis will be an Image Map beyond compare.

The greatest recorded song ever: Elvis doing “Unchained Melody” live in Vegas.

The second greatest recorded song ever: Whitney Houston doing the Star Spangled Banner at the Super Bowl.

IMHO
https://web.archive.org/web/19961022180342/http://onix.com:80/tonymck/ocrlab.htm

1st – Online OCR Lab – 1996
