Genealogy Search Engine Types & Tips: OCR vs. Indexed Databases

Introduction: Gena Philibert-Ortega is a genealogist and author of the book “From the Family Kitchen.” In this guest blog post, Gena explains the difference between searching for genealogy content in indexed databases, as opposed to genealogy content (such as historical newspapers) that is searched using OCR (Optical Character Recognition).

What’s the biggest benefit of being a family history researcher in 2014? Well, at the top of my list is the ability to access countless documents right from my home computer or mobile device. Modern-day genealogy researchers are lucky to have so many options at their fingertips, but just having access to information isn’t enough. One needs to be able to navigate various website search engines to find and sift through results.

The way you search for your ancestor online is going to differ depending on what type of information is hosted by the website. What’s one of the big differences between GenealogyBank’s content and some of the other websites you use to research your family history?

It’s all in the search.

Photo: magnifying glass. Credit: Wikipedia.

Indexed Database vs. OCR

Both indexed databases and optical character recognition search engines are essential to your genealogy research, but you do need to know the difference in order to conduct a thorough search.

While the search engine on GenealogyBank looks similar to the search engine you’ll find on other familiar websites, there is one important difference. GenealogyBank’s newspapers, documents and books are searchable via Optical Character Recognition (OCR). In many cases, genealogists are accustomed to content that is indexed.

On websites that house such content as vital records or the census, volunteers or paid staff go through the documents and choose certain keywords and dates to index. Keywords could be a first and last name, a location, an age or an occupation. Once these keywords are indexed and the data is made available online, those fields and keywords become “searchable,” meaning that a person can enter those words into the search engine and get results based on them. For example, if I enter the name “Oscar Philibert” in a census search on an indexed website, I would expect to see that name, or perhaps versions of that name, in my results list.

Caution: your ability to find results in indexed content can be hampered by such things as misspellings, name variations, the readability of the document, or an error on the part of the person indexing the document.
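
To make the idea of indexed fields concrete, here is a minimal sketch in Python. It is not how any particular genealogy website works: the records, the field names and the `search_name` helper are all invented for illustration, and it matches names exactly, whereas a real site may also return variant spellings.

```python
from collections import defaultdict

# Invented sample records: each holds only the fields an indexer transcribed.
RECORDS = [
    {"id": 1, "first": "Oscar", "last": "Philibert"},
    {"id": 2, "first": "Oscar", "last": "Philbert"},  # simulated indexing error
    {"id": 3, "first": "Mary", "last": "Philibert"},
]


def build_index(records, field):
    """Map each lowercased value of `field` to the ids of records containing it."""
    index = defaultdict(set)
    for record in records:
        index[record[field].lower()].add(record["id"])
    return index


FIRST_INDEX = build_index(RECORDS, "first")
LAST_INDEX = build_index(RECORDS, "last")


def search_name(first, last):
    """Return ids of records whose indexed first and last names both match exactly."""
    first_ids = FIRST_INDEX.get(first.lower(), set())
    last_ids = LAST_INDEX.get(last.lower(), set())
    return sorted(first_ids & last_ids)


# Record 2 is the same man, but the misspelled surname keeps it out of the results.
print(search_name("Oscar", "Philibert"))
```

Searching for “Oscar Philibert” finds only record 1. The misspelled entry turns up only if you think to search for “Philbert” as well, which is why trying name variations matters on indexed sites.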

OCR Makes Newspapers Searchable

Indexing newspapers is too time-consuming a process, so it’s not practical to make the content available to genealogy researchers that way. You’d have to hire a huge team to read every word of every article and index millions of keywords. So instead, GenealogyBank and other similar newspaper websites use Optical Character Recognition or OCR.

What is OCR? It’s an abbreviation for Optical Character Recognition, a technology that allows a scanned document to be “read” by the computer. Websites that provide digitized books or historical newspapers use this technology to make their content searchable. The computer is programmed to recognize the shapes it “sees” as letters. So when you type in a name or a keyword, the system looks for articles containing characters that match the ones you provided.

Caution: are there problems with OCR technology? Of course. The readability of a newspaper can cause the system to have difficulty matching characters. Original older newspapers and microfilmed copies can be prone to tears, ink blots and smudges. Newspapers contain various font types and sizes as well as pages that might be black type on white background or (in the case of an advertisement) white on black background. In some cases, letters can be mistaken for similar letters. These imperfections can cause you to receive false positive results in your search.

Knowing how a website’s content was made searchable can help you try different search strategies to get better results.

A Name Is a Name, or Is It?

When searching on websites that have indexed information, it’s important to mind how you enter a first and last name, because you are giving the search engine a specific command: find that exact name in the exact way you have entered it. With OCR technology, you are actually telling the search engine to find two keywords (in the case of a first name and surname) that occur within two words of each other. The OCR search doesn’t know it’s looking for a name; it is simply looking for the words, or more specifically the characters, that you have entered. (This is not true for all of the content on GenealogyBank: its SSDI collection and recent obituary archives are indexed collections not reliant on OCR technology.)
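
The “within two words of each other” rule can be sketched in a few lines of Python. This is only an illustration of the idea, not GenealogyBank’s actual search code: the function name `within_n_words` is made up, and treating “within two words” as a position difference of at most two is an assumption.

```python
import re


def within_n_words(text, first, second, max_gap=2):
    """Return True if `first` and `second` occur within `max_gap` word positions."""
    # Split the article text into lowercase words, dropping punctuation.
    words = re.findall(r"[a-z0-9']+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    # Order does not matter, so "Philibert, Oscar" matches as well.
    return any(
        abs(i - j) <= max_gap for i in first_positions for j in second_positions
    )


print(within_n_words("Mr. Oscar J. Philibert of this city", "Oscar", "Philibert"))
print(within_n_words("Oscar and Philibert Jones arrived", "Oscar", "Philibert"))
```

Both calls report a match. The second one is a false positive from a genealogist’s point of view, because the search sees two nearby words rather than one person’s name.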

Your search strategy should take into consideration what type of data you are searching and what problems may exist. With a search on indexed data, you want to be concerned about data that was incorrectly transcribed. For example, the “Mc” in McDonald might have been indexed as a middle name leaving the “Donald” as the surname.

Making the Most of Your Search in GenealogyBank

Make sure to utilize all aspects of the GenealogyBank search tools. For your initial search, it’s usually best to start with a broad search using the basic search form.

screenshot of the Simple Search search box on GenealogyBank

If your initial search turned up too many results to make it practical to look through them all, then it’s a good strategy to limit your search by a place or time period; do that especially in cases where you know from other research the exact place or time you want. In the case of a letter that could be confused for another, like an “o” for an “e” or an “l” for an “I,” try varying your search terms to take that into consideration—or even use other search terms or additional words.
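
If you want a quick list of alternate spellings to try, the letter swaps described above can be generated automatically. The following Python sketch is only an illustration: the look-alike groups are assumptions (the “o”/“e” and “l”/“I” pairs come from the paragraph above, and the digit “1” is added as another common look-alike), and it changes one character at a time to keep the list short.

```python
# Assumed groups of characters that OCR might mistake for one another.
LOOKALIKES = ["oe", "lI1"]


def ocr_variants(term):
    """Return the term plus every spelling that differs by one look-alike swap."""
    variants = {term}
    for position, char in enumerate(term):
        for group in LOOKALIKES:
            if char in group:
                for substitute in group:
                    if substitute != char:
                        variants.add(term[:position] + substitute + term[position + 1:])
    return variants


# "Cole" is just a sample surname for this demonstration.
for spelling in sorted(ocr_variants("Cole")):
    print(spelling)
```

Each spelling in the list is a candidate search term; most will return nothing, but one of them may be the way the software happened to read a smudged page.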

See the “Advanced Search” link on the basic search form? Clicking on that brings up a new search box with more options.

screenshot of the Advanced Search search box on GenealogyBank

Sometimes You Need to Set Search Limits

Consider limiting your search in some cases. For example: once you conduct a broad newspaper search and have your list of results, you can limit your research to a state or a city. You can even search just on a single newspaper title. If you are looking for a certain “type” of newspaper article like an obituary or advertisement, limit your search to that type of article.

screenshot of the Search Results page in GenealogyBank showing the different types of newspaper articles available

Utilize the advanced search’s features by adding keywords to include and/or exclude. For example: with a surname that is also a noun such as “Race,” you may want to type in keywords for the search engine to exclude such as “car” or “track.” In other cases you might want to include keywords. If your ancestor was a railroad worker and you’re hoping to find mentions of that, include the word “railroad” or their job title. Also consider limiting your search by a date or date range.
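
The include and exclude boxes behave like a simple filter. Here is a rough Python sketch of that logic; the `matches` function and the sample sentences are invented for illustration and are not taken from GenealogyBank.

```python
import re


def matches(article, surname, include=(), exclude=()):
    """Return True if the article mentions the surname and passes both keyword lists."""
    words = set(re.findall(r"[a-z]+", article.lower()))
    if surname.lower() not in words:
        return False
    # Any excluded keyword rules the article out.
    if any(keyword.lower() in words for keyword in exclude):
        return False
    # Every included keyword must be present.
    return all(keyword.lower() in words for keyword in include)


obituary = "Mr. Race, a railroad engineer, died Tuesday."
sports = "The car race at the track drew a large crowd."

print(matches(obituary, "Race", include=["railroad"], exclude=["car", "track"]))
print(matches(sports, "Race", exclude=["car", "track"]))
```

The obituary passes the filter and the sports story does not, even though both contain the word “race.”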

Need more hints about using GenealogyBank? Watch these helpful YouTube videos.

It’s all in the search. Knowing what type of data you are looking for and how a search engine works can mean the difference between family history research frustration and success.

Two more newspapers added to GenealogyBank

GenealogyBank adds more newspapers.

Gainesville, Florida
Gainesville Sun, The (Gainesville, FL)
Obituaries: 03/19/1995 – Current
Death Notices: 02/18/1995 – 08/31/2008

Newport, Vermont
Newport Daily Express, The (Newport, VT)
Obituaries: 07/24/2008 – Current
Death Notices: 07/25/2008 – Current

I can’t find my ancestor – what am I doing wrong?

For most searches on GenealogyBank it is easy to find your ancestor. You type in their name and in an instant you spot them in the search results list.

So - what do you do when your ancestor’s name doesn’t come right up in the search hits?
Just as with any other genealogical resource, you need to step back, see what your options are, and try various ways to search on the site.

Consider your search strategy.
1. Sometimes less is more.
Be careful how you type in your ancestor’s name.
His full name might have been: Willard Jacob Teskey …. but the newspaper article may have simply called him:

Willard Teskey
Willard J. Teskey
W.J. Teskey
Bill Teskey
or only: Teskey

Try typing in variations of the person’s name.
I have found that typing in only the surname can quickly get you the best results.

Tip: You almost never want to type in a person’s “middle” name. Newspapers rarely use a person’s full name.
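
If you like to work from a checklist, the name variations above can be produced from a full name with a short Python sketch. The function `name_variants` is a made-up helper, and nicknames have to be supplied by hand because they cannot be derived from the name itself.

```python
def name_variants(first, middle, last, nicknames=()):
    """Return common newspaper-style forms of a full name, broadest last."""
    variants = [
        f"{first} {last}",                    # first and last name only
        f"{first} {middle[0]}. {last}",       # middle initial
        f"{first[0]}.{middle[0]}. {last}",    # initials only
    ]
    variants += [f"{nickname} {last}" for nickname in nicknames]
    variants.append(last)                     # surname alone
    return variants


for variant in name_variants("Willard", "Jacob", "Teskey", nicknames=["Bill"]):
    print(variant)
```

This prints the same five forms listed above, ending with the surname alone.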

Be Careful How You Limit Your Search
It is tempting to limit your search to only one state or even to one newspaper. That can often be the most appropriate search strategy. However, if your searches did not locate the obituary or article about your ancestor – try your search again and this time do not limit your search geographically.

If that produces too many hits – then repeat your search and limit it by the likely starting and ending years when your ancestor lived. Be sure to add a few years in both directions so you will bring up the most possible hits.

Tip: Newspapers often published brief biographies and articles years after a person died. So be careful how you limit your search or you might miss the articles you are looking for.
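
Padding a date range is simple arithmetic, sketched here in Python. The five-year default is only an assumption standing in for “a few years,” and the years in the example are invented.

```python
def search_window(earliest_year, latest_year, padding=5):
    """Widen a year range by `padding` years in both directions."""
    return earliest_year - padding, latest_year + padding


def in_window(article_year, window):
    """Return True if the article's year falls inside the (start, end) window."""
    start, end = window
    return start <= article_year <= end


# Invented example: an ancestor who lived from 1850 to 1910,
# with a biographical sketch printed in 1913.
strict = (1850, 1910)
padded = search_window(1850, 1910)

print(in_window(1913, strict))
print(in_window(1913, padded))
```

The strict range misses the 1913 article, while the padded range catches it.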

GenealogyBank brings together newspapers, books, reports and documents from over 300 years. During that time printers had access to varying qualities of newsprint, pieces of type and printing presses.

1. Newspapers have been printed on newsprint paper of varying quality. Some are smooth and some pages are rough.

2. Printers had only so many pieces of type and the newspaper had a deadline. It would be easy when they set the type for the day’s newspaper to swap in an “m” for a “w” or switch a “d” and a “p” or a “1” and a “l”. The reader in 1843 would hardly notice the difference. But a modern computer might struggle to interpret each word if the piece of type was a different letter or had been damaged.

Let me give you a similar example that has circulated on the Internet for years:

Cna yuo raed tihs?
i cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it dseno’t mtaetr in waht oerdr the ltteres in a wrod are, the olny iproamtnt tihng is taht the frsit and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it whotui t a pboerlm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Azanmig huh? yaeh and I awlyas tghuhot slpeling was ipmorantt!

This is an extreme example, but it shows the kind of problem computers have reading old newspapers and documents. Individuals reading an old newspaper quickly adjust to its look and feel and learn how to read it. GenealogyBank has been working on these issues for years and has improved and enhanced our OCR capability.

GenealogyBank uses state-of-the-art OCR software, and we have teams of indexers that review and tag each item – focusing on names, obituaries, births, marriages and other data of high importance to genealogists.

3. Still can’t find your ancestor? Then it’s time to dig in and search the target newspapers, page by page. GenealogyBank makes it easy to bookmark a specific newspaper, combination of newspapers or locations. You could then go through the newspapers – month by month – clicking from page to page to quickly see if your ancestors were mentioned.

GenealogyBank.com: core military history books

Cadets graduating at West Point

All month GenealogyBank.com has been highlighting its extensive military resources.

GenealogyBank has the core military reference books that you will rely on in documenting your ancestors with military service.

For example, GenealogyBank has the two-volume set:

Heitman, Francis B. Historical Register and Dictionary of the United States Army, from its Organization, September 29, 1789, to March 2, 1903. Washington, DC: Government Printing Office, 1903. 2 volumes. (Serial Set Vol. No. 4535, Session Vol. No. 96; Report: H.Doc. 446 pt. 1 & 2).

This handbook has the military record of all Army officers from 1789 to 1903 and the details on all battles fought by the Army during that same period.

Another standard reference book for documenting US Army officers is:

The Centennial of the United States Military Academy at West Point, New York. Washington, DC: Government Printing Office, 1904. 2 volumes. (Serial Set Vol. No. 4751, Session Vol. No. 125; Report: H.Doc. 789 pt. 1 & 2).

GenealogyBank.com is packed with military information.

Books, newspapers and historical documents.

GenealogyBank has the resources genealogists actually use and rely on to document their family tree.

US Navy Register Online – A Genealogist Writes

Yesterday the GenealogyBank Blog wrote about the US Navy Register going online. It has been very popular. Today I received this note from a genealogist about what she found:

Tom:
I just spent a couple of hours pulling up and printing out [pages from the US Navy Register] from 1922 to 1947 for my father-in-law. What a treat, my husband will be thrilled.

Only missing one year, 1945, but that may be an OCR problem. I’ll work on it later.

Then, just for chuckles, I pulled up my husband’s first ten years — but the server’s timing out on me. Hmm. Guess this is really popular right now!

Thanks Tom!
Pat

In Tucson