GenealogyBank Blog

Genealogy Search Engine Types & Tips: OCR vs. Indexed Databases

Introduction: Gena Philibert-Ortega is a genealogist and author of the book “From the Family Kitchen.” In this guest blog post, Gena explains the difference between searching for genealogy content in indexed databases, as opposed to genealogy content (such as historical newspapers) that is searched using OCR (Optical Character Recognition).

What’s the biggest benefit of being a family history researcher in 2014? Well, at the top of my list is the ability to access countless documents right from my home computer or mobile device. Modern-day genealogy researchers are lucky to have so many options at their fingertips, but just having access to information isn’t enough. One needs to be able to navigate various website search engines to find and sift through results.

The way you search for your ancestor online is going to differ depending on what type of information is hosted by the website. What’s one of the big differences between GenealogyBank’s content and some of the other websites you use to research your family history?

It’s all in the search.

Photo: magnifying glass. Credit: Wikipedia.

Indexed Database vs. OCR

Both indexed databases and optical character recognition search engines are essential to your genealogy research, but you do need to know the difference in order to conduct a thorough search.

While the search engine on GenealogyBank looks similar to the search engine you’ll find on other familiar websites, there is one important difference. GenealogyBank’s newspapers, documents and books are searchable via Optical Character Recognition (OCR). In many cases, genealogists are accustomed to content that is indexed.

On websites that house such content as vital records or the census, volunteers or paid staff go through the documents and choose certain keywords and dates to index. Keywords could be words such as a first and last name, a location, an age or an occupation. Once these keywords are indexed and the data is made available online, those fields and keywords become “searchable,” meaning that a person can insert those words into the search engine and get results based on those keywords. For example, if I enter the name “Oscar Philibert” in a census search on an indexed website, I would expect to see that name or perhaps versions of that name in my results list.

Caution: your ability to find results in indexed content can be hampered by such things as misspellings, name variations, the readability of the document, or an error on the part of the person indexing the document.
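
To make the idea of an index concrete, here is a minimal Python sketch. The records, the misspelled entry, and the `build_index` and `search_index` functions are hypothetical illustrations, not how any particular website stores its data. The sketch shows why a lookup can only find what the indexer actually typed.

```python
from collections import defaultdict

# Hypothetical indexed entries: each holds the fields an indexer typed in.
RECORDS = [
    {"first": "Oscar", "last": "Philibert"},
    {"first": "Oscar", "last": "Philbert"},  # hypothetical indexer misspelling
    {"first": "Mary", "last": "Philibert"},
]

def build_index(records):
    """Map a lowercased (first, last) pair to the positions of matching records."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        key = (record["first"].lower(), record["last"].lower())
        index[key].append(position)
    return index

def search_index(index, first, last):
    """Return the positions of records whose indexed name matches exactly."""
    return index.get((first.lower(), last.lower()), [])

INDEX = build_index(RECORDS)
# Finds only the correctly spelled entry; the misspelled one is skipped.
print(search_index(INDEX, "Oscar", "Philibert"))
```

In this sketch the misspelled entry turns up only if you search for the misspelling itself, which is why trying name variations matters on indexed websites.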

OCR Makes Newspapers Searchable

Indexing newspapers is too time-consuming a process, so it’s not practical to make the content available to genealogy researchers that way. You’d have to hire a huge team to read every word of every article and index millions of keywords. So instead, GenealogyBank and other similar newspaper websites use Optical Character Recognition or OCR.

What is OCR? It’s an abbreviation for Optical Character Recognition, a technology that allows a scanned document to be “read” by the computer. Websites that provide digitized books or historical newspapers use this technology to make their content searchable. The computer is programmed to recognize the shapes it “sees” as letters and to turn them into text. So when you type in a name or a keyword, the system looks for articles whose recognized text matches the characters you entered.

Caution: are there problems with OCR technology? Of course. The readability of a newspaper can cause the system to have difficulty matching characters. Original older newspapers and microfilmed copies can be prone to tears, ink blots and smudges. Newspapers contain various font types and sizes as well as pages that might be black type on white background or (in the case of an advertisement) white on black background. In some cases, letters can be mistaken for similar letters. These imperfections can cause you to receive false positive results in your search, or to miss articles that do mention your ancestor.

Knowing how a website’s content was made searchable can help you try different search strategies to get better results.

A Name Is a Name, or Is It?

When searching on websites that have indexed information, it’s important to mind how you enter a first and last name, because you are giving the search engine a specific command: find that exact name exactly as you have entered it. With OCR technology, you are actually telling the search engine to find two keywords (in the case of a first name and surname) that occur within two words of each other. The OCR-based search doesn’t know it’s looking for a name; instead, it is looking for the words you have entered or, more specifically, the characters you have entered. (This is not true for all of the content on GenealogyBank: its SSDI collection and recent obituary archives are indexed collections not reliant on OCR technology.)
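
The “within two words of each other” idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `within_two_words` function, its tokenizing rule, and the reading of “within two words” as “word positions at most two apart, in either order” are mine, not GenealogyBank’s actual search code.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def within_two_words(text, first, second, max_gap=2):
    """True if both keywords occur, in either order, at most max_gap word positions apart.

    Assumption: "within two words" means the positions differ by 2 or less.
    """
    words = tokenize(text)
    first_positions = [i for i, word in enumerate(words) if word == first.lower()]
    second_positions = [i for i, word in enumerate(words) if word == second.lower()]
    return any(abs(i - j) <= max_gap
               for i in first_positions
               for j in second_positions)

# A middle initial between the two names still counts as a match.
print(within_two_words("Mr. Oscar J. Philibert visited friends.", "Oscar", "Philibert"))
```

Under these assumptions, “Philibert, Oscar” would also match, while an article that mentions “Oscar” in one sentence and “Philibert” several words later would not.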

Your search strategy should take into consideration what type of data you are searching and what problems may exist. With a search on indexed data, you want to watch for data that was incorrectly transcribed. For example, the “Mc” in McDonald might have been indexed as a middle name, leaving “Donald” as the surname.

Making the Most of Your Search in GenealogyBank

Make sure to utilize all aspects of the GenealogyBank search tools. For your initial search, it’s usually best to start with a broad search using the basic search form.

If your initial search turned up too many results to look through practically, then it’s a good strategy to limit your search by a place or time period, especially when you know from other research the exact place or time you want. When a letter could be mistaken for another, like an “o” for an “e” or an “l” for an “I,” try varying your search terms to take that into consideration, or even use other search terms or additional words.
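
If you want a list of spellings to try, the letter swaps mentioned above can be generated with a short Python sketch. The `CONFUSABLE` table and the `ocr_variants` function are hypothetical; the table covers only the two letter pairs named in this post, and real OCR errors are more varied.

```python
from itertools import product

# Hypothetical confusion table: each letter maps to the letters OCR might
# have recorded in its place (including the letter itself).
CONFUSABLE = {"o": "oe", "e": "eo", "l": "lI", "I": "Il"}

def ocr_variants(term):
    """Return every spelling of term produced by swapping confusable letters."""
    choices = [CONFUSABLE.get(letter, letter) for letter in term]
    return sorted({"".join(combo) for combo in product(*choices)})

# Each variant is a candidate search term to try by hand.
for variant in ocr_variants("Cole"):
    print(variant)
```

For a short surname like “Cole” this yields eight spellings, including “Cele” and “CoIe.” The list grows quickly with longer names, so in practice you would pick the few most plausible variants to try.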

See the “Advanced Search” link on the basic search form? Clicking on that brings up a new search box with more options.

Sometimes You Need to Set Search Limits

Consider limiting your search in some cases. For example: once you conduct a broad newspaper search and have your list of results, you can limit your search to a state or a city. You can even search within a single newspaper title. If you are looking for a certain “type” of newspaper article like an obituary or advertisement, limit your search to that type of article.

Utilize the advanced search’s features by adding keywords to include and/or exclude. For example: with a surname that is also a common word, such as “Race,” you may want to type in keywords for the search engine to exclude, such as “car” or “track.” In other cases you might want to include keywords. If your ancestor was a railroad worker and you’re hoping to find mentions of that, include the word “railroad” or their job title. Also consider limiting your search by a date or date range.
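
The effect of include and exclude keywords can be pictured with a small Python sketch. The `keep_article` function and the sample sentences are hypothetical and only illustrate the filtering logic; they are not the website’s actual advanced search.

```python
import re

def words_in(text):
    """Return the set of lowercased words found in the text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def keep_article(text, surname, include=(), exclude=()):
    """Keep an article only if it has the surname, every include word, and no exclude word."""
    words = words_in(text)
    if surname.lower() not in words:
        return False
    if any(word.lower() in words for word in exclude):
        return False
    return all(word.lower() in words for word in include)

# Hypothetical article snippets for the surname "Race".
sports_story = "The race car sped around the track."
obituary = "Mr. John Race, a railroad conductor, died Tuesday."

print(keep_article(sports_story, "Race", exclude=("car", "track")))
print(keep_article(obituary, "Race", include=("railroad",), exclude=("car", "track")))
```

In this sketch the sports story is dropped because it contains excluded words, while the obituary is kept because it contains the surname and the included word “railroad.”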

Need more hints about using GenealogyBank? Watch these helpful YouTube videos.

It’s all in the search. Knowing what type of data you are looking for and how a search engine works can mean the difference between family history research frustration and success.
