Genealogy Search Engine Types & Tips: OCR vs. Indexed Databases

Introduction: Gena Philibert-Ortega is a genealogist and author of the book “From the Family Kitchen.” In this guest blog post, Gena explains the difference between searching for genealogy content in indexed databases, as opposed to genealogy content (such as historical newspapers) that is searched using OCR (Optical Character Recognition).

What’s the biggest benefit of being a family history researcher in 2014? Well, at the top of my list is the ability to access countless documents right from my home computer or mobile device. Modern-day genealogy researchers are lucky to have so many options at their fingertips, but just having access to information isn’t enough. One needs to be able to navigate various website search engines to find and sift through results.

The way you search for your ancestor online is going to differ depending on what type of information is hosted by the website. What’s one of the big differences between GenealogyBank’s content and some of the other websites you use to research your family history?

It’s all in the search.

Photo: magnifying glass. Credit: Wikipedia.

Indexed Database vs. OCR

Both indexed databases and optical character recognition search engines are essential to your genealogy research, but you do need to know the difference in order to conduct a thorough search.

While the search engine on GenealogyBank looks similar to the search engine you’ll find on other familiar websites, there is one important difference. GenealogyBank’s newspapers, documents and books are searchable via Optical Character Recognition (OCR). In many cases, genealogists are accustomed to content that is indexed.

On websites that house such content as vital records or the census, volunteers or paid staff go through the documents and choose certain keywords and dates to index. Keywords could be words such as a first and last name, a location, an age or an occupation. Once these keywords are indexed and the data is made available online, those fields and keywords become “searchable,” meaning that a person can insert those words into the search engine and get results based on those keywords. For example, if I enter the name “Oscar Philibert” in a census search on an indexed website, I would expect to see that name or perhaps versions of that name in my results list.

Caution: your ability to find results in indexed content can be hampered by such things as misspellings, name variations, the readability of the document, or an error on the part of the person indexing the document.
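To make the idea of an indexed search concrete, here is a minimal sketch in Python. The record fields, the sample entries and the search_index function are all made up for illustration; no genealogy website is known to work exactly this way. It shows how a search on indexed fields finds only what the indexer typed, so a misspelled entry is missed.

```python
from dataclasses import dataclass


@dataclass
class IndexedRecord:
    """One hypothetical census entry, as typed in by an indexer."""
    first_name: str
    last_name: str
    year: int
    place: str


# Made-up sample data; the second entry imitates an indexer's misspelling.
RECORDS = [
    IndexedRecord("Oscar", "Philibert", 1900, "Missouri"),
    IndexedRecord("Oscar", "Philbert", 1910, "Missouri"),
    IndexedRecord("Mary", "Philibert", 1900, "Illinois"),
]


def search_index(records, first_name, last_name):
    """Return records whose indexed name fields match exactly, ignoring case."""
    return [
        r for r in records
        if r.first_name.lower() == first_name.lower()
        and r.last_name.lower() == last_name.lower()
    ]


hits = search_index(RECORDS, "Oscar", "Philibert")
print(len(hits))  # the misspelled 1910 entry is not found
```

Searching again with the misspelled surname “Philbert” would turn up the 1910 entry, which is why trying name variations matters on indexed websites.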

OCR Makes Newspapers Searchable

Indexing newspapers is too time-consuming a process, so it’s not practical to make the content available to genealogy researchers that way. You’d have to hire a huge team to read every word of every article and index millions of keywords. So instead, GenealogyBank and other similar newspaper websites use Optical Character Recognition or OCR.

What is OCR? It’s a technology that allows a scanned document to be “read” by the computer. Websites that provide digitized books or historical newspapers use this technology to make their content searchable. The computer is programmed to recognize the shapes it “sees” as letters and turn them into text. So when you type in a name or a keyword, the system looks for articles whose recognized text matches the characters you entered.

Caution: are there problems with OCR technology? Of course. The readability of a newspaper can cause the system to have difficulty matching characters. Original older newspapers and microfilmed copies can be prone to tears, ink blots and smudges. Newspapers contain various font types and sizes as well as pages that might be black type on white background or (in the case of an advertisement) white on black background. In some cases, letters can be mistaken for similar letters. These imperfections can cause you to receive false positive results in your search.

Knowing how a website’s content was made searchable can help you try different search strategies to get better results.

A Name Is a Name, or Is It?

When searching on websites that have indexed information, it’s important to mind how you enter a first and last name, because you are giving the search engine a specific command: find that exact name in the exact way you have entered it. With OCR technology, you are actually telling the search engine to find two keywords (in the case of a first name and surname) that occur within two words of each other. The OCR-based search doesn’t know it’s looking for a name; instead, it is looking for the words you have entered or, more specifically, the characters you have entered. (This is not true for all of the content on GenealogyBank: its SSDI collection and recent obituary archives are indexed collections not reliant on OCR technology.)
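The “within two words of each other” idea can be sketched in a few lines of Python. This is only an illustration of a proximity search, not GenealogyBank’s actual code; the words_near function and the way it counts distance (positions in the list of words) are my own assumptions.

```python
import re


def words_near(text, first, second, max_distance=2):
    """Return True if the two keywords occur within max_distance words of each other.

    The keywords are treated as plain words: the function has no idea
    that they are a given name and a surname.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


print(words_near("Mr. Oscar J. Philibert died Tuesday", "Oscar", "Philibert"))
print(words_near("Philibert, Oscar, of this city", "Oscar", "Philibert"))
print(words_near("Oscar went to town with Mrs. Philibert", "Oscar", "Philibert"))
```

The first two lines match (a middle initial or a reversed name is no obstacle), while the third does not because the two words are too far apart.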

Your search strategy should take into consideration what type of data you are searching and what problems may exist. With a search on indexed data, you want to be concerned about data that was incorrectly transcribed. For example, the “Mc” in McDonald might have been indexed as a middle name leaving the “Donald” as the surname.

Making the Most of Your Search in GenealogyBank

Make sure to utilize all aspects of the GenealogyBank search tools. For your initial search, it’s usually best to start with a broad search using the basic search form.

screenshot of the Simple Search search box on GenealogyBank

If your initial search turned up too many results to make it practical to look through them all, then it’s a good strategy to limit your search by a place or time period; do that especially in cases where you know from other research the exact place or time you want. In the case of a letter that could be confused for another, like an “o” for an “e” or an “l” for an “I,” try varying your search terms to take that into consideration—or even use other search terms or additional words.
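If you like to tinker, a short Python sketch can list the alternate spellings to try when letters may have been misread. The CONFUSABLE table below contains only the two letter pairs mentioned above and is an assumption for illustration; real OCR errors are more varied.

```python
from itertools import product

# Letters that OCR may mistake for each other (only the pairs named above).
CONFUSABLE = {"o": "oe", "e": "eo", "l": "lI", "I": "Il"}


def ocr_variants(term):
    """Return every spelling of term produced by swapping confusable letters."""
    choices = [CONFUSABLE.get(ch, ch) for ch in term]
    return sorted({"".join(combo) for combo in product(*choices)})


print(ocr_variants("Kett"))    # ['Kett', 'Kott']
print(ocr_variants("Harlow"))  # four spellings, including 'HarIow'
```

The list grows quickly for long names with many confusable letters, so in practice you would try only the most likely variants.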

See the “Advanced Search” link on the basic search form? Clicking on that brings up a new search box with more options.

screenshot of the Advanced Search search box on GenealogyBank

Sometimes You Need to Set Search Limits

Consider limiting your search in some cases. For example: once you conduct a broad newspaper search and have your list of results, you can limit your research to a state or a city. You can even search just on a single newspaper title. If you are looking for a certain “type” of newspaper article like an obituary or advertisement, limit your search to that type of article.

screenshot of the Search Results page in GenealogyBank showing the different types of newspaper articles available

Utilize the advanced search’s features by adding keywords to include and/or exclude. For example: with a surname that is also a noun such as “Race,” you may want to type in keywords for the search engine to exclude such as “car” or “track.” In other cases you might want to include keywords. If your ancestor was a railroad worker and you’re hoping to find mentions of that, include the word “railroad” or their job title. Also consider limiting your search by a date or date range.
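Here is a rough Python sketch of what include and exclude keywords do to a list of results. The matches function and the sample sentences are hypothetical and only show the logic: an article must contain the surname and every included keyword, and none of the excluded ones.

```python
import re


def matches(article, surname, include=(), exclude=()):
    """Return True if the article passes the surname, include and exclude filters."""
    words = set(re.findall(r"[a-z]+", article.lower()))
    if surname.lower() not in words:
        return False
    if any(keyword.lower() in words for keyword in exclude):
        return False
    return all(keyword.lower() in words for keyword in include)


articles = [
    "Mr. Race retired from the railroad after forty years",
    "The car race at the track drew a large crowd",
]
for article in articles:
    print(matches(article, "Race", include=["railroad"], exclude=["car", "track"]))
```

Only the first article passes; the second is dropped because it contains the excluded words.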

Need more hints about using GenealogyBank? Watch these helpful YouTube videos.

It’s all in the search. Knowing what type of data you are looking for and how a search engine works can mean the difference between family history research frustration and success.

Tips & Tricks to Search Online Newspapers at GenealogyBank

Introduction: Mary Harrell-Sesniak is a genealogist, author and editor with a strong technology background. In this guest blog post, Mary shows some of the search techniques she uses when researching GenealogyBank’s newspapers collection—to help our readers do more efficient searches and save them time with their family history research.

Every American family has a heritage to celebrate—whether it is a connection with a specific event, such as the arrival of the Mayflower in 1620; a military event, such as the Civil War of 1861-1865; a particular country of origin; or person of interest, such as a president, suffragette or abolitionist.

I’m lucky to have proved connections in my family history to many of the above (alas, no president), and like most family researchers have jumped for joy at finding the documented proof.

Once we find the genealogical connections (sometimes with the help of others’ research), we feel enormous satisfaction. However, many genealogists don’t realize that search engines can be tweaked to shorten searches and make family history research more efficient, in particular the genealogy search engine within GenealogyBank.

The trick to more efficient searching is to experiment with specific targeted keywords, related to events or ancestry, along with adding wildcards (more on that below) that accommodate variations.

Keyword Search: Lineal Descendancy

Let’s start with searches related to specific descendants, using the keywords “lineal descendant,” with or without an added surname.

In this example (long before lineage societies became popular), we read that Mr. Michael Kett, a Quaker, was a lineal descendant from Robert Kett, described as the famous tanner and political reformer in the reign of King Edward the Sixth.

Michael Kett obituary, Providence Gazette newspaper article 27 March 1784

Providence Gazette (Providence, Rhode Island), 27 March 1784, page 2

Doesn’t an ancestral report like that get a genealogist excited!

Most of us are happy to research to an immigrant’s arrival in America, but this gentleman had reportedly traced his ancestry to a man who lived during the reign of King Edward VI of England, whose brief life spanned 1537 to 1553 and who was crowned at the young age of nine.

Search Newspapers for Events

Another suggested query is to incorporate the word descendant with a specific event, such as the arrival of the Mayflower in 1620.

Enclosing the search in quotes, “Mayflower descendant,” produces a different result than if you searched on each term without the quotation marks. The difference is that when you simply input terms without quotes, the search engine will find results whenever the two words are located anywhere within the same article—but if you enclose the terms in quotation marks, the terms have to be next to each other in an article in order to show up on the search results page.

Note: generally the “s” is ignored, along with capitalization, so don’t worry about entering “Mayflower descendants” or “mayflower descendant”; either will suffice.
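The difference between a quoted and an unquoted search can be sketched in Python as follows. This is an illustration only, not the site’s real search code; the function names are mine, and dropping a trailing “s” is a deliberately crude stand-in for however the search engine ignores plurals.

```python
import re


def normalize(word):
    """Lowercase a word and drop one trailing 's' (a crude plural rule)."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


def tokens(text):
    """Split text into normalized words, ignoring punctuation and numbers."""
    return [normalize(w) for w in re.findall(r"[A-Za-z]+", text)]


def all_terms(text, query):
    """Unquoted search: every query word appears somewhere in the text."""
    found = set(tokens(text))
    return all(term in found for term in tokens(query))


def exact_phrase(text, query):
    """Quoted search: the query words appear next to each other, in order."""
    t, q = tokens(text), tokens(query)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


harlow = "a descendant from Mr. Richard Warren, who came in the Mayflower, in 1620"
print(all_terms(harlow, "Mayflower descendant"))     # True
print(exact_phrase(harlow, "Mayflower descendant"))  # False
print(exact_phrase("Society of Mayflower Descendants", "mayflower descendant"))  # True
```

The first text is found only by the unquoted search, since its two words are far apart, while the last line shows capitalization and the final “s” making no difference.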

Mayflower Descendants, Daily Inter Ocean newspaper article 14 April 1896

Daily Inter Ocean (Chicago, Illinois), 14 April 1896, page 10

This obituary for Sarah Harlow of 13 March 1823 mentions that she was a descendant from “Mr. Richard Warren, who came in the Mayflower, in 1620, of the 4th generation.” It was found without using quotation marks around the words Mayflower and descendant.

Sarah Harlow obituary, Repertory newspaper article 13 March 1823

Repertory (Boston, Massachusetts), 13 March 1823, page 4

Accommodating Spelling Variations with Wildcards

Try variations of queries that accommodate spelling variations, by using either a question mark (?) or an asterisk (*). These symbols are known as wildcards: the question mark replaces a single character in a word, and the asterisk takes the place of several characters.

For example, “Harrell” can be spelled in a variety of ways, such as “Harrall” or “Herrell.”

If you want to search for all of these variations at once, substitute vowels with question marks, as in “H?rr?ll.” In addition, many early newspapers sometimes abbreviated “Samuel” as “Saml,” so try entering the given name as “Sam*” or “Sam*l.”
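To see how the two wildcards behave, here is a small Python sketch using the standard library’s fnmatch module, which happens to use the same ? and * symbols. It is an illustration only: I am assuming the asterisk may also stand for zero characters, as it does in fnmatch, and the name list is made up.

```python
from fnmatch import fnmatchcase


def wildcard_search(pattern, names):
    """Return the names matching a pattern with ? and * wildcards, ignoring case."""
    pattern = pattern.lower()
    return [name for name in names if fnmatchcase(name.lower(), pattern)]


# Hypothetical list of names as they might appear in print.
NAMES = ["Harrell", "Harrall", "Herrell", "Harwell", "Samuel", "Saml", "Sam"]

print(wildcard_search("H?rr?ll", NAMES))  # ['Harrell', 'Harrall', 'Herrell']
print(wildcard_search("Sam*l", NAMES))    # ['Samuel', 'Saml']
print(wildcard_search("Sam*", NAMES))     # ['Samuel', 'Saml', 'Sam']
```

Note that “Harwell” is not matched by the first pattern, because the question marks replace only the vowels and the rest of the spelling must still agree.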

screenshot of GenealogyBank's search box looking for Samuel Harrell

When I search for American Revolutionary War patriots, I often find the war described in various ways. One article might mention the Revolution, and another might describe someone as a Revolutionary War patriot. The solution is to abbreviate the term and add a specific surname.

screenshot of GenealogyBank's search box looking for Gilman

Don’t forget that you can direct the genealogy search engine to ignore certain words using the “Exclude Keywords” box.

If you are looking for one of George Washington’s namesakes, it might be useful to ignore the title President, whether it is abbreviated or spelled in full. And if you are repeating a previous search, you might wish to limit the query to the content added to GenealogyBank since your last search. Simply select the “Added Since” drop-down arrow, and limit by date.

screenshot of GenealogyBank's search box looking for George Washington

These newspaper search techniques usually carry over to your favorite Internet search engines.

Many search engines, such as Google, have advanced search options. However, if you can’t spot how to do that, you can still succeed. Without complicating things, you can apply what is known as a Boolean operator to a search query.

The three most common Boolean operators are AND, OR and NOT (in capitals).

  1. AND is usually a given in searches, but if you wish to be specific for search engines that ignore certain terms, be sure to add it.
  2. NOT is equivalent to adding a minus sign (-), and indicates that you want a search that does not include something.
  3. OR is an option that tells the search engine to find one thing or another.
  • Harrell OR Herrell OR Harrall
  • “George Washington” NOT President
  • “George Washington” -President
  • George Washington AND Adams
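For readers who think in code, the example queries in this list can be written as plain Python conditions using and, or and not. This is a sketch of the logic only, with helper names of my own choosing; a real search engine matches whole words and ranks its results, while the has function here does a simple, case-insensitive substring check.

```python
def has(text, phrase):
    """Case-insensitive check that phrase appears in text (substring match)."""
    return phrase.lower() in text.lower()


def harrell_any_spelling(text):
    # Harrell OR Herrell OR Harrall
    return has(text, "Harrell") or has(text, "Herrell") or has(text, "Harrall")


def washington_not_president(text):
    # "George Washington" NOT President (same as using the minus sign)
    return has(text, "George Washington") and not has(text, "President")


def washington_and_adams(text):
    # George Washington AND Adams (unquoted, so each word is separate)
    return has(text, "George") and has(text, "Washington") and has(text, "Adams")


print(harrell_any_spelling("Samuel Herrell of this county"))
print(washington_not_president("President George Washington spoke"))
print(washington_not_president("George Washington Smith, a farmer"))
```

The second line is rejected because of the word “President,” while the namesake in the third line gets through.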

Occasionally you’ll find additional operators, such as the mostly undocumented NEAR in Bing, or AROUND in Google, as well as the ability to search by date ranges.

  • “Susan Smith” 1940..1950 (finds references for this person between two dates)
  • “Egbert Jones” 38..48 (finds a range of numbers connected with this person, such as a specified age)
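The logic behind these range searches can be sketched in Python like this. It is a simplified illustration with a hypothetical function name: it accepts a text when the quoted phrase is present and any number in the text falls inside the range, which is looser than what a real search engine does.

```python
import re


def phrase_with_number_in_range(text, phrase, low, high):
    """Return True if text contains phrase and a number between low and high."""
    if phrase.lower() not in text.lower():
        return False
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    return any(low <= n <= high for n in numbers)


print(phrase_with_number_in_range("Susan Smith was married in 1945", "Susan Smith", 1940, 1950))
print(phrase_with_number_in_range("Susan Smith was born in 1920", "Susan Smith", 1940, 1950))
print(phrase_with_number_in_range("Egbert Jones, aged 42", "Egbert Jones", 38, 48))
```

The first and third examples match, and the second does not because 1920 falls outside the range.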

You’ll need to experiment with the various search engines, and browse their help features. For a reference on search operators, see GoogleGuide’s list: http://www.googleguide.com/advanced_operators_reference.html.

In addition, you’ll find that many popular social network and e-mail programs have additional shortcuts and search options that can be useful for searching.

Please let us know your favorite search techniques in GenealogyBank. Other readers may find them useful!