Structured data on genealogy websites

Originally posted on Medium

In my previous post, I listed several problems with publishing and using genealogy data on the Web. In short, the issues are:

it should be easier to copy data from Web resources to a researcher’s database
data on the Web may be ambiguous
good links to sources should be provided where possible
software could exist to help with the above

Before we can figure out which solutions are viable, let’s look at the current state of genealogy websites. I took a number of websites that contain information of interest to genealogists and assessed how easy it is to collect data from them automatically and unambiguously, and to link back to the site as the source of that information.

Tim Berners-Lee proposed a way of assessing Web data sources in terms of machine-readable access. It is called 5-star Linked Data, and the criteria are described as follows:

★ Available on the web (whatever format) but with an open licence, to be Open Data

★★ Available as machine-readable structured data (e.g. excel instead of image scan of a table)

★★★ as (2) plus non-proprietary format (e.g. CSV instead of excel)

★★★★ All the above plus, Use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff

★★★★★ All the above, plus: Link your data to other people’s data to provide context

I started assigning these stars to genealogy websites, but it soon turned out that I would need my own rating system.

Firstly, I definitely don’t expect all sites to expose data as Open Data. There are paid genealogy sites and private sites, and this is fine. Some data clearly should be kept private, especially data about living individuals. However, a paid or private site can still be machine-readable and citeable. For example, FamilySearch requires an account to view records, yet it exposes its data in GedcomX format. WikiTree also exposes structured data, even for living individuals, to users who have appropriate access.

The “excel instead of image scan of a table” star is also not very relevant to genealogy data. Some sources are exactly that: scans and photos of original materials, and this is what we expect them to be. We don’t want them replaced by transcribed data; we want them extended by it. Instead, I would argue that we should distinguish structured data from data that has to be parsed out of HTML content.

Having a non-proprietary data format is also not a big issue, especially since I haven’t seen genealogy data published in Excel files.

The fourth and fifth stars are actually relevant. Using standard data formats is a big plus. Linking to other data even more so.

Let me phrase my own “star” system:

★☆☆☆☆ Data is referenceable by a unique URL

It should be possible to reference each item of data individually. In particular, a single record, a single person, a single grave, a single family should be referenceable. In contrast, some sites only provide ways to reference lists or groups of items.

☆★☆☆☆ Data is machine-readable

It should be possible to create simple rules that extract data from the site. This means that fields are separated (e.g. first and last name) and are consistently named. In contrast, most sites require HTML scraping with complex rules for dealing with names and dates.
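To make the contrast concrete, here is a minimal Python sketch. The record, the field names (`givenName`, `familyName`, `birthDate`) and the free-text line are all made up for illustration; they are not taken from any particular site. With separated, consistently named fields the extraction rule is trivial, while the scraped line needs a heuristic that can easily guess wrong.

```python
import json

# Hypothetical structured record: every field is separate and consistently named.
structured = json.loads(
    '{"givenName": "Jan", "familyName": "Kowalski", "birthDate": "1876-03-14"}'
)


def from_structured(record):
    # One simple rule per field; nothing has to be guessed.
    return {
        "first": record["givenName"],
        "last": record["familyName"],
        "born": record["birthDate"],
    }


# Hypothetical free-text line, as it might appear inside an HTML page.
scraped = "Jan Maria Kowalski, b. 14 Mar 1876"


def from_text(line):
    # Heuristic: assumes the shape "names, b. date" and that the last word
    # of the name part is the surname. Both assumptions can fail.
    name_part, _, date_part = line.partition(", b. ")
    words = name_part.split()
    return {"first": " ".join(words[:-1]), "last": words[-1], "born": date_part}


print(from_structured(structured))
print(from_text(scraped))
```

Note that the heuristic leaves the date as an unparsed string and would split a two-word surname incorrectly, which is exactly the kind of ambiguity that separated fields avoid.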

☆☆★☆☆ Data is in a well-known format

Instead of having each website publish data in its own format, it is much better to have only a few formats that everyone uses. The most popular choice at the moment seems to be Microdata annotations with schema.org semantics, but GedcomX and DBpedia’s ontologies are also good choices.
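As a rough illustration of why a shared format helps, the sketch below reads Microdata `itemprop` values from a page fragment using only Python’s standard `html.parser`. The HTML fragment is hypothetical, and the collector is deliberately simplified: it assumes a single, flat item with no nested `itemscope`, which real pages often violate.

```python
from html.parser import HTMLParser

# Hypothetical page fragment annotated with Microdata and schema.org's Person type.
HTML = """
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="givenName">Jan</span>
  <span itemprop="familyName">Kowalski</span>
  <time itemprop="birthDate" datetime="1876-03-14">14 March 1876</time>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collects itemprop values from one flat (non-nested) Microdata item."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._current = None  # itemprop whose text content we are waiting for

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if "datetime" in attrs:
            # <time> carries its machine-readable value in the datetime attribute.
            self.props[prop] = attrs["datetime"]
        else:
            self._current = prop

    def handle_data(self, data):
        if self._current and data.strip():
            self.props[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


collector = ItempropCollector()
collector.feed(HTML)
person = collector.props
print(person)
```

The same few lines would work on any site that uses these schema.org property names, which is the point of agreeing on a format.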

☆☆☆★☆ Data contains links to other entities

For example, an entry for a person links to entries for the person’s parents.

☆☆☆☆★ Data contains links to entities on other websites

The data links not only to entries on the same site but also to entries on other websites. Example: DBpedia linking to WikiTree.
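The difference between the fourth and fifth star can be sketched in a few lines of Python. The record below is hypothetical: the URLs are placeholders, and the property names `parent` and `sameAs` are borrowed from schema.org purely as an example. The function sorts a record’s links into those that stay on the same host and those that point elsewhere.

```python
from urllib.parse import urlparse

# Hypothetical record; all URLs and identifiers are made up for illustration.
person = {
    "@id": "https://example.org/person/42",
    "name": "Jan Kowalski",
    "parent": ["https://example.org/person/17", "https://example.org/person/18"],
    "sameAs": ["https://other.example.net/tree/Kowalski-1"],
}


def split_links(record):
    """Separate same-site links (fourth star) from links to other sites (fifth star)."""
    home = urlparse(record["@id"]).netloc
    internal, external = [], []
    for key, value in record.items():
        if key == "@id":
            continue
        urls = value if isinstance(value, list) else [value]
        for url in urls:
            host = urlparse(url).netloc
            if not host:
                continue  # plain text such as a name, not a link
            (internal if host == home else external).append(url)
    return internal, external


internal, external = split_links(person)
print("same site:", internal)
print("other sites:", external)
```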

Here are some websites I looked at:


★★★★★ DBpedia

★★★★☆ Billion Graves

★★★★☆ FamilySearch

★★★★☆ Find a Grave

★★★★☆ Geni

★★☆★★ WeRelate

★★★★☆ WikiTree

★★★☆☆ Graves in Belarus

★★☆☆☆ MyHeritage record

★☆☆★☆ webtrees

★☆☆☆☆ FreeBMD

★☆☆☆☆ FreeCen

★☆☆☆☆ FreeReg

★☆☆☆☆ sejm-wielki.pl

★☆☆☆☆ Szukaj w archiwach

★☆☆☆☆ TNG

★☆☆☆☆ Zielona Góra — cmentarze

☆☆☆☆★ Geneteka

☆☆☆☆☆ straty.pl
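Unlike the original 5-star scheme, these stars are positional: each position stands for one criterion, so a site can earn a later star without the earlier ones. A small Python sketch of that encoding follows; the criterion labels are shortened versions of the ones above, and the function names are my own.

```python
# The five criteria, in the order their stars appear in a rating.
CRITERIA = [
    "referenceable by URL",
    "machine-readable",
    "well-known format",
    "links to other entities",
    "links to other websites",
]


def stars(met):
    """Render a set of satisfied criteria as a positional five-star string."""
    return "".join("★" if criterion in met else "☆" for criterion in CRITERIA)


def parse(rating):
    """Turn a positional star string back into the set of satisfied criteria."""
    return {c for c, ch in zip(CRITERIA, rating) if ch == "★"}


# WeRelate's rating from the list above: every star except "well-known format".
print(stars(set(CRITERIA) - {"well-known format"}))
# Geneteka's rating: only the fifth star.
print(parse("☆☆☆☆★"))
```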