EPDM’s Content Search with OCR


SOLIDWORKS Enterprise PDM’s Content Search gives you the ability to search for files based on the content of a document, not just its properties or datacard values. It uses IFilters and Microsoft’s file indexing, so it is a great way to find Office or PDF documents knowing nothing more than a few keywords within the document.
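To make the idea of a content index concrete, here is a minimal sketch in Python. It is not how Windows indexing or EPDM is actually implemented; it is only a toy inverted index, and it assumes an IFilter (or any other extractor) has already turned each file into plain text. The file names and the functions `extract_words`, `build_index`, and `search` are made up for illustration.

```python
import re
from collections import defaultdict


def extract_words(text):
    """Lowercase the text and return the set of words it contains."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in extract_words(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = extract_words(query)
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


# Hypothetical extracted text, standing in for what an IFilter would return.
documents = {
    "weld_procedure.pdf": "Weld procedure for stainless steel bracket",
    "order_ack.docx": "Order acknowledgement for steel tubing",
}
index = build_index(documents)
print(search(index, "steel bracket"))
```

The point of the sketch is that once the text is extracted, finding a document by a few keywords is just a lookup, which is why the quality of the extraction step (here, the OCR) matters so much.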

Yesterday I noticed that Windows has a built-in TIFF IFilter that uses OCR (optical character recognition) to index documents! This means that once you install the TIFF IFilter, you can search the content of TIFF images.

How good is the OCR in this IFilter? Here are the results of my testing:

  • Text documents: Fantastic and fast. I scanned in five pages from five different books. They were indexed within a few seconds and I was able to find the documents by picking out any word on the page.
  • Handwritten documents: Terrible. On a clean white sheet of paper I wrote words and numbers, in both print and script. Not a single thing was indexed. Perhaps Great Aunt Eleanore is correct, maybe I do have bad handwriting?
  • Scanned drawings: I had mixed results. Words out away from the parts [notes or text on leaders] seemed to do okay, but text near the part or within the titleblock was not indexed.

The documentation on this IFilter does say it does not do well with documents that have a lot of graphics on them.

I was really hoping for good results reading the data inside a titleblock, since that would be very helpful when scanning in old paper drawings. I wonder if there is a good commercial TIFF OCR IFilter out there that would do a better job?

Regardless, the text documents did so well that quality documents, order acknowledgements, etc. may give good results if you scan them into your EPDM vault.