Structured data on genealogy websites
Originally posted on Medium
In my previous post, I listed several problems with publishing and using genealogy data on the Web. In short, the issues are:
- it should be easier to copy data from Web resources to a researcher’s database
- data on the Web may be ambiguous
- good links to sources should be provided where possible
- software could exist to help with the above
Before we can figure out which solutions are viable, let’s look at the current state of genealogy websites. I took some websites that contain information of interest to genealogists and assessed how easy it is to collect data from them automatically and unambiguously, and to link back to the site as the source of the information.
Tim Berners-Lee proposed a way of assessing Web data sources in terms of machine-readable access. It is called 5-star Linked Data, and its criteria are described as follows:
★ Available on the web (whatever format) but with an open licence, to be Open Data
★★ Available as machine-readable structured data (e.g. excel instead of image scan of a table)
★★★ as (2) plus non-proprietary format (e.g. CSV instead of excel)
★★★★ All the above plus, Use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
★★★★★ All the above, plus: Link your data to other people’s data to provide context
I started assigning these stars to genealogy websites, but it soon turned out that I would need my own rating system.
Firstly, I definitely don’t expect all sites to expose data as Open Data. There are paid genealogy sites and private sites, and this is fine. Some data clearly should be kept private, especially when it concerns living individuals. However, a paid or private site can still be machine-readable and citeable. For example, FamilySearch exposes its data in GedcomX format even though access requires signing in to an account. Likewise, WikiTree exposes structured data even for living individuals to those who have appropriate access.
The “excel instead of image scan of a table” star is also not very relevant to genealogy data. Some sources are exactly that: scans and photos of original materials, and this is what we expect them to be. We don’t want them replaced by transcribed data; we want them extended by transcribed data. Instead, I would argue that we should distinguish structured data from data that needs to be parsed out of HTML content.
Having a non-proprietary data format is also not a big issue, especially since I haven’t seen genealogy data published in Excel files.
The fourth and fifth stars are actually relevant. Using standard data formats is a big plus. Linking to other data even more so.
Let me phrase my own “star” system:
★☆☆☆☆ Data is referenceable by a unique URL
It should be possible to reference each item of data individually. In particular, a single record, a single person, a single grave, a single family should be referenceable. In contrast, some sites only provide ways to reference lists or groups of items.
☆★☆☆☆ Data is machine-readable
It should be possible to create simple rules that extract data from the site. This means that fields are separated (e.g. first and last name) and are consistently named. In contrast, most sites require HTML scraping with complex rules for dealing with names and dates.
☆☆★☆☆ Data is in a well-known format
Instead of having each website publish data in its own format, it is much better to have only a few formats that everyone uses. The most popular approach at the moment seems to be Microdata annotations with schema.org semantics, but GedcomX and DBpedia’s ontologies are also good choices.
☆☆☆★☆ Data contains links to other entities
For example, an entry for a person links to entries for the person’s parents.
☆☆☆☆★ Data contains links to entities on other websites
The data links not only to entries on the same site but also to entries on other websites. For example, DBpedia links to WikiTree.
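Note that, unlike the 5-star Linked Data scale, these stars are not cumulative: each position stands for one criterion, so a site can earn the fourth star without having the third. Here is a minimal Python sketch of that encoding. The `Rating` class and its field names are my own hypothetical choices for illustration, not part of any existing tool.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Rating:
    """One flag per criterion, in the order the stars are listed above.

    The class and field names are hypothetical, chosen for illustration.
    """
    unique_url: bool = False
    machine_readable: bool = False
    well_known_format: bool = False
    internal_links: bool = False
    external_links: bool = False

    def stars(self) -> str:
        # Each position is independent: a filled star means that
        # particular criterion is met, regardless of the others.
        return "".join("★" if met else "☆" for met in astuple(self))


# A site with unique URLs, machine-readable data and both kinds of links,
# but with a custom data format (third star missing).
example = Rating(unique_url=True, machine_readable=True,
                 internal_links=True, external_links=True)
print(example.stars())
```

Keeping the criteria as separate flags, rather than a single number from 0 to 5, preserves which criterion a site is missing.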
Here are some websites I looked at:
★★★★★ DBpedia
RDF and all kinds of serializations of the RDF data model
This is the only website that has all the stars.
★★★★☆ Billion Graves
Microdata + schema.org
★★★★☆ FamilySearch
GedcomX
Getting a GedcomX response requires a special HTTP request header:
accept: application/x-gedcomx-v1+json
★★★★☆ Find a Grave
Microdata + schema.org
★★★★☆ Geni
Microdata + schema.org
★★☆★★ WeRelate
Custom data format
Links to external websites are stored as source citations.
★★★★☆ WikiTree
Microdata + schema.org
Links to other sites are embedded in a free-text field.
★★★☆☆ Graves in Belarus
Open Graph
★★☆☆☆ MyHeritage record
Custom HTML properties
★☆☆★☆ webtrees
★☆☆☆☆ FreeBMD
★☆☆☆☆ FreeCEN
★☆☆☆☆ FreeREG
★☆☆☆☆ sejm-wielki.pl
★☆☆☆☆ Szukaj w archiwach
Open Graph annotations
★☆☆☆☆ TNG
There is an option that can be turned on to enable downloading a GEDCOM file.
★☆☆☆☆ Zielona Góra — cmentarze
☆☆☆☆★ Geneteka
One can link only to search results, not to individual records within them. The records may contain links to scans hosted on other websites.
☆☆☆☆☆ straty.pl
This is the only website that didn’t receive any stars. Unique URLs are not available on the website but can be provided by an external service.
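The FamilySearch entry above mentions that a GedcomX response has to be requested with a special `Accept` header. As a rough sketch of what that looks like from a program, here is how such a request could be prepared with Python's standard library. The URL is a placeholder and the helper names are hypothetical; a real site may also require authentication, which is not shown here.

```python
import json
import urllib.request

# Media type taken from the FamilySearch entry above.
GEDCOMX_JSON = "application/x-gedcomx-v1+json"


def build_gedcomx_request(url: str) -> urllib.request.Request:
    """Prepare a GET request that asks the server for GedcomX JSON."""
    return urllib.request.Request(url, headers={"Accept": GEDCOMX_JSON})


def fetch_gedcomx(url: str) -> dict:
    """Send the request and decode the JSON body (performs network I/O)."""
    with urllib.request.urlopen(build_gedcomx_request(url)) as response:
        return json.load(response)


# Placeholder URL, not a real record address.
request = build_gedcomx_request("https://example.org/records/some-record-id")
print(request.get_method(), request.get_header("Accept"))
```

Building the request is kept separate from sending it, so the header can be inspected without touching the network.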
I don’t have access to paid sites, so some of the popular ones are missing from the table. The genscrape library is a valuable source for finding out how data can be obtained from genealogy websites.
As usual, not everything is binary; there are nuances in how the data is represented. For instance, although Geni uses Microdata, the links between family members are not included in the metadata, only in HTML links. Some websites that don’t explicitly expose structured data are quite easy to parse (consistent table headers) while others are not.
Since these are just some sites I picked at random and the information in the table may (hopefully) become outdated, I’m going to set up a GitHub repository to collect a more comprehensive table. EDIT: Repository created at github.com/PeWu/genealogy-websites
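To illustrate what “simple rules” means when a page carries Microdata annotations, here is a minimal sketch using only Python's standard library. The HTML snippet is invented for illustration and does not come from any of the sites above, and the collector is deliberately simplified: it ignores nested items, `meta` elements and repeated properties.

```python
from html.parser import HTMLParser

# Invented example markup using schema.org Person properties.
SAMPLE_HTML = """
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="givenName">Jan</span>
  <span itemprop="familyName">Kowalski</span>
  <span itemprop="birthDate">1850-03-12</span>
  <a itemprop="parent" href="/person/123">Father</a>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collects itemprop values from flat Microdata markup (simplified)."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # property whose text is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        if tag in ("a", "link", "area"):
            # For link elements, the property value is the href.
            self.properties[name] = attrs.get("href", "")
        else:
            self._current = name
            self.properties[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


collector = ItempropCollector()
collector.feed(SAMPLE_HTML)
print(collector.properties)
```

The names are already split into separate fields and the link to the parent is part of the data, so no site-specific rules are needed. Without the annotations, the same information would have to be recovered from the visible text and page layout.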
In the next post, I’d like to take a closer look at existing research about genealogy data formats including the popular ones already used. Are the data formats used today enough to express what is needed for genealogy or do we need something different? Maybe GEDCOM is everything we will ever need?