Demystifying DoxCycle's Optical Character Recognition (OCR)

DoxCycle uses Optical Character Recognition (OCR) to "read" the amounts on source documents you scan or import. We get lots of questions about what OCR is and does—and why DoxCycle's OCR is so special. We hope this blog post will help demystify the process for you.

What is OCR?

Optical Character Recognition (OCR) is the translation of an electronic image into machine-readable text. If you scan in a document and don't run it through a program that uses OCR, you can't select text to copy and paste.

Another way to think of it: a scan without OCR is like a document written in a language you don't know. It may be pretty to look at, but you can't actually read it. You'll need a translator to make it useful to you. For DoxCycle, OCR is that translator.

Why is OCR so difficult?

OCR is actually remarkably sophisticated technology. It's often hard for us to get our heads around how cool it is because we read text in an instant, at a glance. For instance, we don't usually have to think about the letters when we read a word—it just makes sense.

Software that uses OCR doesn't read text like we do. It has to interpret the document and figure out what it can read. It compares the patterns of light and dark on the page against the shapes of known characters, allowing for font (serif or sans-serif), case (upper or lower), punctuation and numbers, to extract what we call "text."

Think back to when you were learning to read and you had to make sense of the scribbles on the page. OCR does this every time it runs. Pretty cool, no?
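To make the idea of matching light and dark patterns a little more concrete, here is a toy sketch in Python. It is not how DoxCycle's OCR engine works internally. It only illustrates the general idea of comparing a scanned shape against known character shapes. The tiny 3x5 templates and the names `TEMPLATES`, `similarity` and `recognize` are made up for this example.

```python
# Toy character templates on a 3x5 grid: '#' is dark, '.' is light.
# These shapes are invented for illustration only.
TEMPLATES = {
    "1": (".#.", "##.", ".#.", ".#.", "###"),
    "4": ("#.#", "#.#", "###", "..#", "..#"),
    "7": ("###", "..#", ".#.", ".#.", ".#."),
}


def similarity(glyph, template):
    """Return the fraction of pixels where the two grids agree."""
    pairs = [
        (g, t)
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
    ]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph):
    """Return the best-matching character and its similarity score."""
    best = max(TEMPLATES, key=lambda ch: similarity(glyph, TEMPLATES[ch]))
    return best, similarity(glyph, TEMPLATES[best])


# A "7" with one stray dark pixel (think of a small stain on the page).
noisy_seven = ("###", "..#", ".#.", "##.", ".#.")

if __name__ == "__main__":
    character, score = recognize(noisy_seven)
    print(character, round(score, 2))
```

Even with the stray pixel, the closest template is still the "7". Real scans are much larger and messier than this grid, which is part of why OCR is hard.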

What makes DoxCycle's OCR even more amazing?

DoxCycle goes beyond ordinary OCR software. DoxCycle's OCR reads and understands. This is the difference between a human pronouncing a word and understanding what it means. Once you understand it, you can use it for other things.

[Image: DoxCycle categories. DoxCycle sorts and groups documents by taxpayer and type.]

We've taught DoxCycle to:

  • Determine the type of document. It does this by looking for key words and standard layouts on the page. DoxCycle recognizes over 60 types of documents. It then organizes them into categories.

  • Recognize the taxpayer's name. Ordinary OCR doesn't know the difference between a street name and a taxpayer's name. We've taught DoxCycle to look for the taxpayer's name in every document you scan in, so it knows in an instant to whom the document belongs and categorizes it accordingly.

  • Distinguish between box numbers and amounts on T4 slips. On slips, the text used to print box numbers is usually very small and close to the borders, making it difficult to recognize. So, we've built logic into DoxCycle to "guess" the box number based on what we expect to see in the layout and words on the slip.
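For the curious, here is a rough Python sketch of two of the ideas in the list above: picking a document type from key words, and guessing a box number from where an amount sits on the slip. This is an illustration under our own assumptions, not DoxCycle's actual logic. The key word lists, the box positions and the names `KEYWORDS`, `T4_BOX_POSITIONS`, `classify` and `guess_box` are all hypothetical.

```python
# Hypothetical key words per document type. A real product would cover far
# more document types and would also check the page layout.
KEYWORDS = {
    "T4": ["statement of remuneration paid", "employment income"],
    "T5": ["statement of investment income"],
    "Donation receipt": ["official receipt", "charitable"],
}

# Hypothetical positions of three T4 boxes, as (x, y) fractions of the page.
T4_BOX_POSITIONS = {
    "14": (0.55, 0.20),
    "22": (0.75, 0.20),
    "16": (0.55, 0.35),
}


def classify(ocr_text):
    """Return the type whose key words match most often, or None if none match."""
    text = ocr_text.lower()
    scores = {
        doc_type: sum(word in text for word in words)
        for doc_type, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def guess_box(x, y):
    """Guess the box number from an amount's position, not the tiny printed label."""
    def squared_distance(box):
        box_x, box_y = T4_BOX_POSITIONS[box]
        return (box_x - x) ** 2 + (box_y - y) ** 2

    return min(T4_BOX_POSITIONS, key=squared_distance)


if __name__ == "__main__":
    print(classify("T4 Statement of Remuneration Paid 52,000.00"))
    print(guess_box(0.57, 0.22))
```

The point of the second function is that an amount's position is often easier to pin down than a tiny printed box number, so the expected layout can fill in what the raw OCR misses.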

On the surface, the above things seem simple, obvious even. But when you take into account the variation that can occur in slips, forms and documents (such as color, shading, margins, font, wrinkles and stains), adapting to these nuances is pretty impressive. Whew!

Data Extraction vs. Data Entry

To prove out the process of taking a document from scanning to extracting the data for use in preparing a return, we've used the T4 slip. Scan it in and DoxCycle recognizes the box numbers and the associated amounts. Once you review the results, you can post them straight into ProFile® T1.

So, what about other slips and forms? This is where we need your help and feedback...

We know that OCR can never be 100% accurate. We've come a long way, but perfect recognition of every amount every time is unrealistic. Before we add other slips and forms, we need to know whether what we've done with the T4 slip data extraction is good enough to make your job easier.

Even if the extraction isn't perfect, is it still faster for you to correct the few amounts that are incorrectly interpreted than to manually type all the slip data into a return? Please let us know what you think.


