Converting Documents To Web Pages for SEO Using OCR: Optical Character Recognition

Introduction

Websites with a lot of products tend to be filled with technical information. This type of information is typically stored inside a document, like a PDF or an image. One reason is that you can share a PDF even when the Internet is down, unlike a web page. Another is that the formatting in a PDF does not change, whereas a web page has to be designed for both desktop and mobile. These are a couple of advantages of using a document to store information. As I discuss later, documents also cause some problems for SEO, mainly around the search engines' ability to read and index them. We can resolve some of these issues with optical character recognition.

What is Optical Character Recognition (OCR)?

OCR, or optical character recognition, is a way to scan through documents and extract the text from them. Sometimes you can copy and paste the text out of a PDF, but depending on the formatting, it might be hard to get a clean text string that way. Depending on how many pages the document has, copying and pasting by hand might not be a realistic option either. This is why we should consider using OCR. OCR is capable of scanning a document, recognizing the words and letters inside it and giving you text that you can use on a web page.
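Raw OCR output usually needs a little cleanup before it can go on a page. The following is a minimal sketch, assuming the OCR step has already run and handed you a plain text string. The function name `ocr_text_to_html` and the sample text are made up for illustration, and the sketch does not call any particular OCR tool.

```python
import html
import re


def ocr_text_to_html(raw_text: str) -> str:
    """Turn raw OCR text into simple HTML paragraphs (illustrative helper)."""
    # Rejoin words that were split with a hyphen at a line break.
    # Caveat: this also joins words that are genuinely hyphenated,
    # so review the result by hand.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw_text)
    paragraphs = []
    # Treat blank lines as paragraph breaks.
    for block in re.split(r"\n\s*\n", text):
        # Collapse hard line breaks and repeated spaces inside a paragraph.
        cleaned = " ".join(block.split())
        if cleaned:
            # Escape characters such as & and < so they are safe in HTML.
            paragraphs.append(f"<p>{html.escape(cleaned)}</p>")
    return "\n".join(paragraphs)


# Hypothetical OCR output from a product datasheet.
sample = "Max load: 50 kg\nTested at 20 C & 50% humid-\nity.\n\nSee the full datasheet."
page_html = ocr_text_to_html(sample)
print(page_html)
```

The result is plain paragraph markup that you can paste into a page template and then style to match the rest of your site.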

Why is OCR Important for SEO?

OCR is important for SEO because it lets you get a lot of content out of your documents. This means that you will not have to rewrite all of your content. Instead, you can just convert your documents into something usable. This will not only save you time, but it will also free content that is “trapped” inside a document. Once you have that content as plain text, it is easier to repurpose it for a web page, image, audio podcast or even a video. Since search engines need content to read, converting your document is a good idea.

Can Search Engines Read Text Inside a Document?

Here is the thing about documents with embedded text: search engines might be able to read that text, but you cannot count on it. This is the same issue as with JavaScript (JS), and it all has to do with which content appears in the HTML on page load. With JS, you can load a lot of content after page load, but that often requires events like a click or a mouse over to fire first. Text embedded in a document does not appear in the HTML either. This poses a problem because even if search engines can read embedded text, you are making them do extra work. We want to limit the extra work that search engines have to do when they go through the content on a website.

Documents Hold a Wealth of Information

Documents tend to hold a lot of valuable technical information because they contain extra details that would not normally be on a web page. Think about how long some of these documents are. They can run to many pages, not just one or two. A document of that length has a lot of good content in it. Keeping all of this content in a document is logical because it might be too much for a web page. If you were to extract all of it and place it on a single web page, that page would be too long. As a recurring concept on this website, long scrolling pages are generally frowned upon because it takes users forever to find the information that they need. This is even more apparent on mobile devices, where we typically browse web pages in portrait mode, which is vertical rather than horizontal.

What Types of Documents Should You Convert to Web Pages?

Here is a list of the best documents to convert to web pages:

  • Images
  • PDFs
  • JSON
  • Spreadsheets
  • Word
  • XML

I will go through each one and explain why these are prime candidates to convert over to web pages. Some of these document types do not need OCR at all.

Images: Turn Your Infographics Into Web Pages

It should come as no surprise that images make great web pages when they have text embedded in them. The most common examples are infographics. An infographic is an image that shows data and information in a nice, readable format, using color and design to illustrate its point. If you do not have the original raw file, like a PSD for Photoshop, then try using OCR to extract the content. With images, you cannot select and copy the text. That text is stuck inside the image, so you either use OCR or retype the content by hand. Since we are trying to save time and extract as much content as possible from an image, OCR is the way to go. If your image is clean and has a high resolution, then the quality should be good enough for OCR. If the resolution is too low and the text is blurry and hard to read, then OCR might not be able to extract everything from it.

PDFs: Similar to Images But You Can Copy and Paste

PDFs are very popular because they keep their formatting, which also makes it easier to use OCR on them. Unlike with images, you can often copy and paste text from a PDF into another location. Just remember to remove formatting, like fonts or bolding, that might carry over to your destination. You just want the text, because the original formatting will most likely not match the web page that you are transferring the text to.

If you want to copy text from a PDF, paste it without the formatting using ctrl+shift+v, which works in many applications. This can be tricky because it takes three fingers instead of the two for a standard paste with ctrl+v. I actually do not like using three fingers; that is a little too much for me. Instead, I like to paste the copied text into something that removes the formatting. The address bar in a web browser works just fine. Once the formatting is removed, copy the plain text from there and paste it into your destination.

JSON

JSON is a data format that is popular for retrieving data from an API (Application Programming Interface). It is popular because it is usually smaller than XML. JSON uses a single label to designate a data field. XML uses two, and they are called tags because every field has an opening and a closing tag. JSON’s smaller size makes it better for transferring data via APIs. JSON documents do not need OCR, but I mention them because JSON is still a good source of data for creating web pages.
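To give an idea of how JSON data can become page content, here is a minimal sketch that turns one flat JSON object into an HTML definition list. The function name `json_to_definition_list` and the product record are hypothetical, and nested objects or arrays would need extra handling that is not shown here.

```python
import html
import json


def json_to_definition_list(json_text: str) -> str:
    """Render a flat JSON object as an HTML <dl> (illustrative helper)."""
    record = json.loads(json_text)
    rows = []
    for key, value in record.items():
        # Each JSON label becomes a term and its value becomes the description.
        rows.append(f"  <dt>{html.escape(str(key))}</dt>")
        rows.append(f"  <dd>{html.escape(str(value))}</dd>")
    return "<dl>\n" + "\n".join(rows) + "\n</dl>"


# Hypothetical product record, as it might come back from an API.
product_json = '{"name": "Pump X100", "voltage": "230 V", "weight_kg": 12.5}'
print(json_to_definition_list(product_json))
```

A definition list is only one option. The same loop could just as easily fill in table rows or headings, depending on how your page is laid out.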

Spreadsheets: Convert Them Into HTML Tables

You might have data in Microsoft Excel or Google Sheets. Spreadsheets are great for web pages because their content is already organized into tables. Every row in the table holds data for each of its columns. Columns are very important for organizing your data and ensuring that it is complete. What you can do is convert the spreadsheet into an HTML table and then paste that table into your web page. There are several methods online for doing this, and the right one depends on what you used to create your spreadsheet. The method that works best for you might not work for someone else. My main goal here is to give you the idea so that you can look for the solution that fits your setup.
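As one possible approach to the spreadsheet-to-table conversion, here is a minimal sketch that assumes you have exported the sheet as CSV, which both Excel and Google Sheets can do. The function name `csv_to_html_table` and the sample data are made up for illustration, and the first row is assumed to hold the column headers.

```python
import csv
import html
import io


def csv_to_html_table(csv_text: str) -> str:
    """Convert CSV text into an HTML table (illustrative helper)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return "<table></table>"
    # Assumption: the first row of the export contains the column headers.
    header, body = rows[0], rows[1:]
    head_cells = "".join(f"<th>{html.escape(cell)}</th>" for cell in header)
    lines = ["<table>", f"  <thead><tr>{head_cells}</tr></thead>", "  <tbody>"]
    for row in body:
        cells = "".join(f"<td>{html.escape(cell)}</td>" for cell in row)
        lines.append(f"    <tr>{cells}</tr>")
    lines += ["  </tbody>", "</table>"]
    return "\n".join(lines)


# Hypothetical export of a small product spreadsheet.
sheet_csv = "Model,Weight\nA1,2 kg\nB2,3.5 kg\n"
print(csv_to_html_table(sheet_csv))
```

Escaping every cell keeps characters such as < and & in your data from breaking the page markup.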

Word: These Are Basically Articles

If you have any documents in either Microsoft Word or Google Docs, then you should create web pages from them. These are some of the easiest documents to work with and they do not need OCR, because their text is already selectable and laid out much like an article. You should be able to copy and paste everything from your Word document into your web page. Depending on whether your document matches the formatting of your web page, you might need to adjust the formatting at the destination. Unlike images and PDFs, Word documents will give you very little hassle in creating new content. Even though this article is about OCR, I also want to cover other types of documents and how to create new web page content from them.

XML: Similar to JSON

XML documents hold the same kind of data as JSON, but in a slightly different format. JSON uses a single label to define each field. XML uses a tag pair, with an opening and a closing tag. XML makes good web pages too because its data usually comes from an API, just like JSON. You will not need OCR to create content pages from XML, but I did want to include it on the list.
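To show the tag pairs in practice, here is a minimal sketch that reads a small XML record and turns each field into an HTML list item. The function name `xml_to_html_list` and the product record are hypothetical, and the sketch assumes a single flat level of fields under the root element.

```python
import html
import xml.etree.ElementTree as ET


def xml_to_html_list(xml_text: str) -> str:
    """Render the child elements of an XML record as an HTML list (illustrative helper)."""
    root = ET.fromstring(xml_text)
    items = []
    for field in root:
        # The tag name is the label; the text between the tag pair is the value.
        label = html.escape(field.tag)
        value = html.escape((field.text or "").strip())
        items.append(f"  <li><strong>{label}:</strong> {value}</li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Hypothetical product record, as it might come back from an API.
product_xml = "<product><name>Pump X100</name><voltage>230 V</voltage></product>"
print(xml_to_html_list(product_xml))
```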

Converting Documents into Web Pages: Repurpose Content

Now that I have discussed some of the most common document types, I want to talk about repurposing content. If you need ideas on how to turn the content that you have extracted from these documents into new content, I wrote an article about this.

Repurpose Documents and Link to the Document

When you repurpose a document, you are not taking all of the content extracted from the document and creating a page from it. You are taking the best content, which is not only keyword-rich but useful for your visitors too. What you should do is link to the original document from the web page. This way people can get the full details from the document. Your web page is mostly for those who want to read a summary, or just enough detail to lead to a conversion. You should let your users decide whether they want a description or the whole thing.

Extract the Most Important Details from Your Document

Highlight the best details from your document when you create your web page from it. Think about what the search engines are looking for when you decide what content to use. If this is a web page that you want people to find through the search engines, then you will need enough good content on it to attract both people and the search engines. Just having a web page that tells users to click on a download link is a missed opportunity. As long as you clearly state that the full details are in the document, people will understand that they can learn more if they choose to. This is why content on a web page is so important, and if you do it right, you are helping yourself be found online.

Summary

Documents are a great way to get more content for your website, especially if you have tons of documents that were written before the launch of your website. This means that you do not have to recreate all of your content. Depending on what kinds of documents you have, you might need OCR to extract your content from them. This is definitely the case for images because their text is embedded in them. PDFs might need OCR if you cannot select and copy the text from them. Other document types, including spreadsheets, Word, JSON and XML, do not need OCR, and I explain why in their respective sections in this article.

The more content that you can actually get onto your page, the better. Search engines have an easier time reading content that is directly contained within the HTML of the page and not embedded and stuck inside a document. Some say that search engines can read text embedded in documents, but you should make that task as easy as possible for Google. Do not make Google go out of its way to read your website. This really could mean the difference between getting added to Google or not.

When repurposing your documents to create web pages, make sure to use your best content and just link to your original document. You should describe what your document is and what it offers to people. If people want to know more, then they can click on your document for full details.
