Basic guide for database design

I have recently worked on a one-year pilot for Linked Data at the University of Oxford. During this pilot I kept coming up against database design decisions that made publishing Linked Data difficult. This page lists some points to consider when designing your database.

To make later Linked Data work easier, please observe the following when designing a database:

Summarise data

Do not use a different file for each record. It is easier to process data automatically if the records are all in the same table. If you want to present or print the data on a per-record basis, then consider a template that "reads" records from the table and produces nice-looking pages. For building Linked Data, the summarised table is far more useful.
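As a rough illustration of the template idea, the sketch below reads a summarised table and produces one page per record. The table contents (including the MS-Odyssey row), the page layout and the function name render_pages are all made up for this example; it is not a recommendation of a particular tool.

```python
import csv
import io
from string import Template

# Hypothetical summarised table: one row per record, all in one place.
TABLE = """shelfmark,author
MS-Iliad,Homer
MS-Odyssey,Homer
"""

# Hypothetical page template; $shelfmark and $author are filled in from each row.
PAGE = Template("Manuscript $shelfmark\nAuthor: $author\n")


def render_pages(table_text):
    """Return one rendered page (a string) per row of the table."""
    reader = csv.DictReader(io.StringIO(table_text))
    return [PAGE.substitute(row) for row in reader]


pages = render_pages(TABLE)
print(pages[0])
```

The data stays in a single table that software can process, while the per-record pages are generated from it whenever they are needed.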

Avoid free-text

Instead of:

Manuscript with shelfmark MS-Iliad carrying text by Homer.

It should be:

Field Data
shelfmark MS-Iliad
author Homer

Why? It is difficult for software to process free text: it has to strip away the sentence structure and work out which entities are being talked about (i.e. MS-Iliad and Homer). It is much easier to identify these when each value sits in its own field.

Keep information separate

Avoid bundling together different entities. For example instead of a record being:

Dimension
height: 20, width: 10, thickness: 5

It should be:

Dimension Value
height 20
width 10
thickness 5

Why? In Linked Data, each entity needs to stand on its own. Splitting a bundled field programmatically is difficult because fields are often not bundled according to any consistent pattern.
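To show why splitting bundled fields is fragile, here is a minimal sketch of a splitter written for the "height: 20, width: 10, thickness: 5" pattern above. The function name split_dimensions and the second input ("20 x 10 x 5") are hypothetical; the second input stands in for a record that was bundled in a different way.

```python
def split_dimensions(bundled):
    """Split 'height: 20, width: 10' into (dimension, value) rows.

    Only works when every record uses 'name: number' pairs separated by commas.
    """
    rows = []
    for part in bundled.split(","):
        name, value = part.split(":")  # raises ValueError if there is no single ':'
        rows.append((name.strip(), float(value)))
    return rows


rows = split_dimensions("height: 20, width: 10, thickness: 5")
print(rows)

# A record bundled differently breaks the same code.
try:
    split_dimensions("20 x 10 x 5")
except ValueError:
    print("cannot split: unexpected format")
```

Every new bundling style needs its own special case, which is why storing each dimension in its own row from the start is the safer design.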

Do not merge cells or use line breaks

When using spreadsheets to produce records do not use the merge cells function.

Instead of:

Value Unit
Height 20 cm
Width 10 cm
Thickness Max thickness 8 cm
Min thickness 5 cm

It should be:

Dimension Value Unit
height 20 cm
width 10 cm
min thickness 5 cm
max thickness 8 cm

Similarly, do not use line breaks within cells to indicate multiple entries. Instead, use a delimiter such as |, which is much easier to process.

Why? It is much easier to "read" the data if it is all in a canonical table on a row-by-row basis. Merged cells and linefeeds break that canonical structure or confuse the rows.
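The following sketch shows how little code is needed to expand a |-delimited cell into one row per value. The table, its column names and the MS-Anthology record are invented for the example, as is the function name expand_rows.

```python
import csv
import io

# Hypothetical table: the "authors" cell holds several values separated by |.
TABLE = """shelfmark,authors
MS-Iliad,Homer
MS-Anthology,Homer|Hesiod
"""


def expand_rows(table_text, column="authors", delimiter="|"):
    """Return one (shelfmark, value) pair per delimited value in the column."""
    pairs = []
    for row in csv.DictReader(io.StringIO(table_text)):
        for value in row[column].split(delimiter):
            pairs.append((row["shelfmark"], value.strip()))
    return pairs


pairs = expand_rows(TABLE)
for shelfmark, author in pairs:
    print(shelfmark, author)
```

Because each record still occupies exactly one row, the table can be read row by row without guessing where one record ends and the next begins.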

Use identifiers

Give identifiers to entities contributing to a record. For example, instead of:

Shelfmark Author
MS-Iliad Homer

It should be:

Manuscript ID Shelfmark Author ID Author
1234 MS-Iliad 5678 Homer

Why? Not having an identifier means that the included entity (e.g. Homer) is "hidden" in text and cannot be matched to other occurrences across the database. Some institutions choose to produce UUIDs as identifiers for each entity.
Note that if there are multiple authors, either a new table would be necessary or multiple rows for MS-Iliad would be required, each with a different author. This indicates the need for a so-called one-to-many relationship across entities, which is difficult to replicate on a single spreadsheet.
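One possible way to model several authors per manuscript is a separate linking table, sketched here with Python's built-in sqlite3 module. The table and column names are assumptions made for this example, and the second author (ID 5679) is a placeholder rather than real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE manuscript (id INTEGER PRIMARY KEY, shelfmark TEXT NOT NULL);
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Linking table: one row per (manuscript, author) pair.
CREATE TABLE manuscript_author (
    manuscript_id INTEGER NOT NULL REFERENCES manuscript(id),
    author_id INTEGER NOT NULL REFERENCES author(id),
    PRIMARY KEY (manuscript_id, author_id)
);
""")

conn.execute("INSERT INTO manuscript VALUES (1234, 'MS-Iliad')")
conn.executemany(
    "INSERT INTO author VALUES (?, ?)",
    [(5678, "Homer"), (5679, "Hypothetical second author")],
)
conn.executemany(
    "INSERT INTO manuscript_author VALUES (?, ?)",
    [(1234, 5678), (1234, 5679)],
)

# List every author linked to manuscript 1234.
authors = [
    name
    for (name,) in conn.execute(
        "SELECT a.name FROM author a "
        "JOIN manuscript_author ma ON ma.author_id = a.id "
        "WHERE ma.manuscript_id = ? ORDER BY a.id",
        (1234,),
    )
]
print(authors)
```

Each manuscript and each author is stored once with its own identifier, and the linking table records which belongs to which, so nothing has to be repeated or squeezed into a single cell.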

Use external authorities

Allow space for external identifiers of entities. Instead of:

Author ID Author name
5678 Homer

It should be:

Author ID Author name External Authority External ID
5678 Homer VIAF 224924963
5678 Homer Wikidata Q6691

Why? Linked Data depends on establishing links with other datasets. This process is known as reconciliation or disambiguation. For example, author 5678 "Homer" is the same person as the one described in VIAF: https://viaf.org/viaf/224924963. It is useful for a database to be able to store external identifiers for entities (and even external labels) to enable this linking. There are many authority files and thesauri that publish identifiers for their records. Make sure that you can capture these when you are building your records.
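Once external identifiers are stored in their own columns, turning them into links is straightforward. The sketch below assumes a simple lookup of URI patterns: the VIAF pattern is taken from the link above, while the Wikidata pattern and the function name external_uri are assumptions made for this example and should be checked against each authority's own documentation.

```python
# URI patterns per authority, keyed in lower case so that "WikiData" and
# "Wikidata" are treated the same. The Wikidata pattern is an assumption.
URI_PATTERNS = {
    "viaf": "https://viaf.org/viaf/{id}",
    "wikidata": "https://www.wikidata.org/entity/{id}",
}

# Rows as in the table above: author ID, name, external authority, external ID.
ROWS = [
    (5678, "Homer", "VIAF", "224924963"),
    (5678, "Homer", "WikiData", "Q6691"),
]


def external_uri(authority, external_id):
    """Build a link for a known authority, or return None if it is unknown."""
    pattern = URI_PATTERNS.get(authority.lower())
    return pattern.format(id=external_id) if pattern else None


links = [external_uri(authority, ext_id) for _, _, authority, ext_id in ROWS]
for link in links:
    print(link)
```

An unknown authority returns None instead of a guessed link, so the gap is visible and can be filled in later.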

Reference images

Do not insert image files in your records. Instead, add only a reference to the location where the image can be seen. Preferably this should be a URL, although in principle a local path works just as well.