Basic guide for database design

I have recently worked on a one-year pilot for Linked Data at the University of Oxford. During this pilot I kept coming up to database design decisions which made publishing Linked Data difficult. This page includes some points to consider when designing your database:

Basic guide for database design

To assist work for Linked Data, when designing a database, please observe the following:

Summarise data

Do not use a different file for each record. It is easier to process data automatically if the records are all in the same table. If you want to present or print the data on a per-record basis, then consider a template to "read" records from the table and produce nice-looking pages. To build Linked Data the summarised table is by far more useful.

Avoid free-text

Instead of:

Manuscript with shelfmark MS-Iliad carrying text by Homer.

It should be:

Field	Data
shelfmark	MS-Iliad
author	Homer

Why? It is difficult for software to process free text, remove the syntax and identify the entities we are talking about (i.e. MS-Iliad and Homer). It is much easier to identify these if there is no syntax.

Keep information separate

Avoid bundling together different entities. For example instead of a record being:

Dimension
height: 20, width: 10, thickness: 5

It should be:

Dimension	Value
height	20
width	10
thickness	5

Why? In Linked Data, each entity needs to stand on its own. Splitting a bundled field programmatically is difficult as often there are no consistent formulas that fields are bundled up.

Do not merge cells or use line breaks

When using spreadsheets to produce records do not use the merge cells function.

Instead of:

		Value	Unit
Height		20	cm
Width		10	cm
Thickness	Max thickness	8	cm
Thickness	Min thickness	5	cm

It should be:

Dimension	Value	Unit
height	20	cm
width	10	cm
min thickness	5	cm
max thickness	8	cm

Similarly do not use linefeeds within cells to indicate multiple records. Use instead a delimeter like | which is much easier to process.

Why? It is much easier to "read" the data if it is all in a canonical table on a row-by-row basis. Merged cells and linefeeds break that canonical structure or confuse the rows.

Use identifiers

Give identifiers to entities contributing to a record. For example, instead of:

Shelfmark	Author
MS-Iliad	Homer

It should be:

Manuscript ID	Shelfmark	Author ID	Author
1234	MS-Iliad	5678	Homer

Why? Not having an identifier means that the included entity (e.g. Homer) is "hidden" in text and cannot be matched to other occurences across the database. Some institutions choose to produce UUIDs as identifiers for each entity.
Note that if there are multiple authors either a new table would be neccessary or multiple rows of MS-Iliad would be required, each with a different author. This indicates the requirement for a so-called one-to-many relationship across entities which is difficult to replicate on a single spreadsheet.

Use external authorities

Allow space for external identifiers of entities. Instead of:

Author ID	Author name
5678	Homer

It should be:

Author ID	Author name	External Authority	External ID
5678	Homer	VIAF	224924963
5678	Homer	WikiData	Q6691

Why? Linked data depend on establishing links with other datasets. This process is known as reconciliation or disambiguation. For example 5678 "Homer" is the same person as the one described in VIAF: https://viaf.org/viaf/224924963. It is useful for a database to be able to store external identifiers for entities (even external labels) to enable this linking. There are many authority files and thesauri which publish identifiers for their records. Make sure that when you are building your records you can capture these.

Reference images

Do not insert image files in your records. Only add the location as a reference to where the image can be seen. Preferably this should be a URL but in theory local paths are equally useful.