Many note-taking apps nowadays offer to take a picture of a document and turn it into text. I was curious how it works and decided to dig a little deeper to see what exactly was going on there.

Having done a little research I came across Optical Character Recognition - a field of research in pattern recognition and AI revolving around precisely what we are interested in, that is reading text from an image. There is a very promising JavaScript library implementing OCR called tesseract.js, which not only works in Node but also in a browser - no server needed!

Here I would like to work out how to add tesseract.js to an application and then check how well it does its job by writing a function that marks all the matched words in an image.

Here’s a link to the repository.

Tesseract.js

To add tesseract to a project we can simply type in the terminal:

npm install tesseract.js

After importing it to our codebase everything should work as expected. At least according to the package’s docs. In reality, though, I kept getting an error about a missing worker.js file, and since neither the docs nor some very thorough googling were of much help, I used a workaround. I copied a file called worker.min.js from node_modules/tesseract.js and pasted it into my public folder, from which I serve my static files. After that I changed the path to the worker inside tesseract like so:

tesseract.workerOptions.workerPath = 'http://localhost:8080/worker.min.js';

and everything worked correctly.
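
The hardcoded address above only works while the app is served from localhost:8080. Here is a minimal sketch of deriving the path from the page's origin instead, assuming worker.min.js sits at the root of the public folder. workerPathFor is a hypothetical helper, not part of tesseract.js:

```javascript
// Hypothetical helper (not part of tesseract.js): builds an absolute URL
// to the copied worker file from a given origin.
// Assumes worker.min.js is served from the root of the public folder.
function workerPathFor(origin) {
  return new URL('/worker.min.js', origin).href;
}

// In the browser this could replace the hardcoded address:
// tesseract.workerOptions.workerPath = workerPathFor(window.location.origin);
```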

Application

Let’s create a simple application to recognize text in an image. We would like it to render the image twice: once to show the user their original image of choice and once to highlight the words that were matched. Finally, we would like the app to keep the user informed of its progress at all times.

HTML Markup

<label for="recognition-image-input">Choose image</label>
<input type="file" accept="image/jpeg, image/png" id="recognition-image-input" /><br />

<label for="recognition-confidence-input">Confidence</label>
<input type="number" max="100" min="0" id="recognition-confidence-input" value="70" /><br />

<label for="recognition-progress">File recognition progress:</label>
<progress id="recognition-progress" max="100" value="0">0%</progress>

<div id="recognition-text"></div>

<div id="recognition-images">
  <div id="original-image"></div>
  <div id="labeled-image"></div>
</div>

<input type="file"> lets the user choose an image, and <input type="number"> sets the desired confidence, that is, how certain of a match the app has to be before showing it. Matches which do not meet the confidence requirement won’t show up in the result. <progress> informs the user how far along the recognition is, <div id="recognition-text"> shows the recognized text and <div id="recognition-images"> works as a placeholder for the images.

By listening on the change event of the <input type="file" /> we can get the user’s image of choice and render the results.

Before that, however, let’s save the references to the HTML elements in variables for the future code snippets to be more readable:

const recognitionImageInputElement = document.querySelector(
 '#recognition-image-input',
);
const recognitionConfidenceInputElement = document.querySelector(
 '#recognition-confidence-input',
);
const recognitionProgressElement = document.querySelector('#recognition-progress');
const recognitionTextElement = document.querySelector('#recognition-text');

const originalImageElement = document.querySelector('#original-image');
const labeledImageElement = document.querySelector('#labeled-image');

Listening on the change event

When the user selects an image on their computer the change event is fired.

The <input type="file"> element has a property called files which holds all the files the user has selected. We are not accepting multiple files, however, so there will always be just one file at the 0th index.

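// The handler is async because later steps await the image being loaded
recognitionImageInputElement.addEventListener('change', async () => {
  const { files } = recognitionImageInputElement;

  // Nothing to do when the dialog was dismissed without choosing a file
  if (!files || files.length === 0) {
    return;
  }

  const file = files[0];
});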

How to recognize an image

Tesseract has a method called recognize which accepts two arguments - an imageLike and options. An imageLike can be many things. In our case we are going to use a File object that will be available to us once a user chooses an image. options are only used to set the language of the image or (in some advanced cases) to change the defaults of tesseract. We won’t, however, be interested in that here.

Every text recognized by tesseract has a confidence value (from 0 to 100) that tells us how sure tesseract is of the result.

A note about confidence

Confidence can be tricky because of two things.

First, paragraphs have their own confidence, as do lines, words and symbols. The confidence of a line is equal to the lowest of the confidences of its constituent words. By the same principle, the confidence of a word is equal to the confidence of the symbol tesseract is least confident about.

A low confidence for a line therefore doesn’t necessarily mean that the whole line was misrecognized - it could be just one word causing trouble.

Secondly, confidence indicates how much an object resembles a certain character.

If the image is, for instance, somebody’s face then the iris of their eye might be mistaken for the letter ‘O’ with a fairly high confidence. This means that filtering out everything below a given confidence level won’t necessarily leave us with only good matches.
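
To make the first point concrete, here is a small sketch of the rule as described above, where a line's confidence is the lowest confidence among its words. The lowestConfidence helper and the sample numbers are made up for illustration; they are not tesseract's output.

```javascript
// Illustrative only: models the rule described above, where a line's
// confidence equals the lowest confidence among its words.
// The helper name and the sample data are assumptions for this example.
function lowestConfidence(items) {
  // Treating "nothing recognized" as zero confidence is an assumption
  if (items.length === 0) return 0;
  return Math.min(...items.map(({ confidence }) => confidence));
}

const sampleWords = [
  { text: 'Hello', confidence: 93 },
  { text: 'W0rld', confidence: 41 },
  { text: 'again', confidence: 90 },
];

console.log(lowestConfidence(sampleWords)); // 41: one weak word drags the whole line down
```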

Recognizing an image

Now that we have a file let’s extract text from it by calling the .recognize() method. Also, by adding a handler to the .progress() method we can update the <progress> element.

return tesseract
  .recognize(file, {
    lang: 'eng',
  })
  .progress(({ progress, status }) => {
    if (!progress || !status || status !== 'recognizing text') {
      return null;
    }

    const p = (progress * 100).toFixed(2);

    recognitionProgressElement.textContent = `${status}: ${p}%`;
    recognitionProgressElement.value = p;
  });

Inside the .progress() handler we are given the following information: progress (which is a number ranging from 0 to 1) tells us how far along the processing is, and status which is simply a message telling us what’s going on.

We multiply progress by a hundred so that the progress bar and its label show 50 instead of 0.50.
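
The conversion can be pulled out into a tiny function, which makes it easy to check on its own. This is only a sketch, and toPercent is a hypothetical name. Note that .toFixed() returns a string, so the sketch converts the result back to a number:

```javascript
// Hypothetical helper: converts a 0..1 progress value into a percentage
// rounded to two decimal places.
// toFixed() returns a string, so Number() turns it back into a number.
function toPercent(progress) {
  return Number((progress * 100).toFixed(2));
}

console.log(toPercent(0.5)); // 50
```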

Dealing with the result

The result of the .recognize() method is confusing to say the least. It is not well documented and so we have to deduce some things on our own.

{
    blocks: Array[1]
    confidence: 87
    html: "<div class='ocr_page' id='page_1' ..."
    lines: Array[3]
    oem: "DEFAULT"
    paragraphs: Array[1]
    psm: "SINGLE_BLOCK"
    symbols: Array[33]
    text: "Hello World↵from beyond↵the Cosmic Void↵↵"
    version: "3.04.00"
    words: Array[7]
}

html is the extracted text embedded into HTML tags, and text is the extracted text as a plain string.

paragraphs, words and symbols (the paragraphs, words and characters of the text, respectively) are arrays of objects that look something like this:

Tesseract's result

We are going to use the paragraphs property to show the extracted text to the user inside the <p> elements, and the words property to create black-bordered boxes and place them on the second picture to show the user exactly what the positions were of the matched words.
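
The snippets that follow refer to the recognition result as res. Here is a minimal sketch of one way to obtain it, assuming the job returned by .recognize() is promise-like and accepts success and failure callbacks through .then(). This post only shows .progress(), so treat .then() as an assumption about the library version used here. recognizeFile and updateProgress are hypothetical names, and tesseract is passed in as an argument so the function can be tried with a stand-in object:

```javascript
// Hypothetical helper: wraps the chained recognize().progress() call in a
// real Promise so that the result can be awaited.
// Assumption: the job object exposes then(onSuccess, onFailure).
function recognizeFile(tesseract, file, onProgress) {
  return new Promise((resolve, reject) => {
    tesseract
      .recognize(file, { lang: 'eng' })
      .progress(onProgress)
      .then(resolve, reject);
  });
}

// Possible usage inside an async change handler
// (updateProgress stands for the progress handler shown earlier):
// const res = await recognizeFile(tesseract, file, updateProgress);
```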

Showing extracted text to the user

We want to render the paragraphs to the user and the best way to do so is to create a <p> element for each paragraph. Each paragraph has a text property that can be set as the <p> element’s textContent.

Inside the previously created <div id="recognition-text"> element we can render the paragraphs with the .append() method:

const paragraphsElements = res.paragraphs.map(({ text }) => {
  const p = document.createElement('p');

  p.textContent = text;

  return p;
});

recognitionTextElement.append(...paragraphsElements);

Rendering images

To render the images we have to create them first because so far we only have the <div> elements that work as containers.

const originalImage = document.createElement('img');

const labeledImage = originalImage.cloneNode(true);

There is a little problem, however, with setting their src property as we don’t have the URL that points to the image - instead we have a File object.

To render a File object inside the <img> tag we have to use the FileReader constructor like so:

const setImageSrc = (image: HTMLImageElement, imageFile: File) => {
 return new Promise<void>((resolve, reject) => {
   const fr = new FileReader();

   fr.onload = function() {
     if (typeof fr.result !== 'string') {
       return reject(null);
     }

     image.src = fr.result;

     return resolve();
   };

   fr.onerror = reject;

   fr.readAsDataURL(imageFile);
 });
};

We pass the File object to the .readAsDataURL() method and then wait for the handler assigned to the onload property to fire with the result. The result can now be set as the src of the image.

The code will look like this:

const originalImage = document.createElement('img');

await setImageSrc(originalImage, file);

const labeledImage = originalImage.cloneNode(true);

Marking the matched words

To show the box on every matched word we have to first filter out every word whose confidence is below the value previously set (inside the <input id="recognition-confidence-input"> element).

const wordsElements = res.words
  .filter(({ confidence }) => {
    return confidence > parseInt(recognitionConfidenceInputElement.value, 10);
  });
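
One edge case is worth guarding against: if the confidence input is empty, parseInt returns NaN, and every comparison against NaN is false, so no word would pass the filter. Here is a sketch of a guard. parseThreshold, filterByConfidence and the fallback of 70 (the input's default value in the markup) are assumptions of this example:

```javascript
// Hypothetical helper: reads the confidence threshold from the input's
// string value, falling back to a default when the value is not a number.
function parseThreshold(value, fallback = 70) {
  const parsed = parseInt(value, 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Keeps only the words whose confidence is above the threshold.
function filterByConfidence(words, threshold) {
  return words.filter(({ confidence }) => confidence > threshold);
}
```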

Then, thanks to the bbox property that is available on each word object, we know the coordinates of every matched word. The coordinates are x0, x1, y0 and y1, where:

x0 - start of the word on the horizontal axis, it becomes the left CSS property

y0 - start of the word on the vertical axis, it becomes the top CSS property

x1 - end of the word on the horizontal axis (subtracting x0 from x1 gives the width property)

y1 - end of the word on the vertical axis (subtracting y0 from y1 gives the height property)

const wordsElements = res.words
  .filter(({ confidence }) => {
    return confidence > parseInt(recognitionConfidenceInputElement.value, 10);
  })
  .map((word) => {
    const div = document.createElement('div');
    const { x0, x1, y0, y1 } = word.bbox;

    div.classList.add('word-element');

    Object.assign(div.style, {
      top: `${y0}px`,
      left: `${x0}px`,
      width: `${x1 - x0}px`,
      height: `${y1 - y0}px`,
      border: '1px solid black',
      position: 'absolute',
    });

    return div;
  });
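
To check the arithmetic in isolation, the mapping from a bounding box to CSS values can be written as a pure function. bboxToStyle and the sample coordinates are made up for this sketch:

```javascript
// Hypothetical helper: turns a bbox ({ x0, y0, x1, y1 }) into the CSS
// values used for the absolutely positioned box.
function bboxToStyle({ x0, y0, x1, y1 }) {
  return {
    top: `${y0}px`,
    left: `${x0}px`,
    width: `${x1 - x0}px`,
    height: `${y1 - y0}px`,
  };
}

console.log(bboxToStyle({ x0: 10, y0: 20, x1: 110, y1: 50 }));
// { top: '20px', left: '10px', width: '100px', height: '30px' }
```

The coordinates are presumably expressed in the image's own pixels, so the boxes should line up only when the image is displayed at its natural size.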

The last thing to do is to append both the images and the words to their respective parents, which are <div id="original-image"> for the original image and <div id="labeled-image"> for the image with the marked matches.

originalImageElement.appendChild(originalImage);
labeledImageElement.appendChild(labeledImage);
labeledImageElement.append(...wordsElements);

To get the boxes with position: absolute; to be displayed on the image let’s add the required CSS:

#labeled-image {
 position: relative;
}

With this out of the way, let’s see the app in action!

Testing it out

I have taken a screenshot of my recent post to see how well it handles a well formatted text on a single-color background.

Original image:

Screenshot of the post

Labeled image:

Labeled screenshot of the post

Here is the extracted text:

Recently on Facebook David Smooke (the CEO of Hackernoon) posted an article in which he listed 2018’s Top Tech Stories. He also mentioned that if someone wished to make a similar list about say JavaScript he would be happy to feature it on the frontpage of Hackernoon.

In a constant struggle to get more people to read my work I could not miss this opportunity, sol immediately started to plan how to approach making such a list.


Extracting text out of documents seems to be working great. Let’s see how the app fares against something harder like a meme:

Original image:

Meme with Samwise Gamgee

Labeled image:

Labeled meme with Samwise Gamgee

Extracted text:

When you escape North Korea but get captured by the Chinese police. > A, ‘\ f”; . \ Well, I’m back. 7

Summary

The tesseract.js library provides us with a ready-to-use OCR implementation that is efficient and, for the most part, accurate. An additional advantage of the library is its flexibility: it works both in Node.js and in a browser. There is even an option to include custom training data, which could make it work better for your specific applications.

Also available on Medium.