Metadata and Layers
I have been wandering around over the last six months or so with a couple of pictures in my head that I have at last managed to commit to paper. The pictures in question try to explain why we would want to layer our metadata, how, through those layers, we can link data standards and, ultimately, why the line between those standards may well begin to grey and eventually disappear altogether.
So let's start by considering the writing of a protocol. Within the protocol document today, we typically see statements such as “Demographics (age, gender, height and weight)” or “The following baseline demographics and disease characteristics will be captured in the eCRF: Sex, age, body weight, height/length, and race.” Through statements such as these, the protocol writer gives us the first indication of what data are to be captured as part of the study, but at a high level, using the name of a “concept” such as AGE, SEX and so on.
Figure 1 - The flow of metadata
The reason I put the word concept in quotes is that this single word conveys many different meanings to many people; it is desperately overloaded. What I mean when using the term is a collective designator for one or more atomic items of data that in combination represent some meaningful notion or idea. AGE is a simple example: it is meaningful to most people, consists of a value and units, and we can provide a good definition for the concept.
So in the first diagram, I show the initial use of the concept within the protocol simply via its name. That protocol document is then converted into a study implementation consisting of a set of paper or electronic CRFs (note that I don't distinguish the usefulness of metadata simply because paper CRFs are used; they need to follow the same rules), with the CRF expanding the concept into its constituent parts. In addition to the basic definition, we will also be interested in how the data are to be captured, how the units are to be presented and other such considerations.
In my simple example, AGE, as I said, will consist of two parts, the AGE value and the associated AGE UNITS, and we will need to capture these items on a CRF. Most of the time the units will be fixed, years most likely, maybe months, but we still need units to make sense of the value. I am of the school that likes to have the units variable held within the system even if the user cannot change it, simply for the purposes of consistency across studies when data are exported. And obviously these data will be collected for the N subjects within the study, hence the multiple CRFs in figure 1.
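To make the idea concrete, here is a minimal sketch of a concept as a named collection of atomic data items. This is my own illustration, not any particular standard's model; the item names and datatypes are assumptions for the example.

```python
from dataclasses import dataclass

# A minimal sketch of a "concept": a collective designator for one or
# more atomic items of data. Names and datatypes here are illustrative.
@dataclass(frozen=True)
class DataItem:
    name: str
    datatype: str

@dataclass(frozen=True)
class Concept:
    name: str
    definition: str
    items: tuple  # the constituent atomic data items

# AGE consists of two parts: the value and its units. The units item is
# kept even when its value is fixed (e.g. always "YEARS"), so exports
# stay consistent across studies.
AGE = Concept(
    name="AGE",
    definition="The length of time a subject has lived.",
    items=(DataItem("AGE", "integer"), DataItem("AGE UNITS", "text")),
)
```

The key design point is that the units travel with the value as part of the concept, even when the CRF hides them from the user.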
Once we have our data collected, we will want to take the values into a tabulation with columns holding the AGE and AGE UNITS values from the N subjects. As we move into the world of the tabulation from that of the CRF, we will need to think about the columns and definitions in relation to the dataset as a whole and information about the dataset itself such as its name.
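The move from collected values to a tabulation can be sketched as below. The dataset name, label and column names are purely illustrative assumptions; the point is that the tabulation carries metadata about the dataset as a whole in addition to the column values.

```python
# Collected CRF values for N subjects (illustrative records).
collected = [
    {"SUBJID": "001", "AGE": 34, "AGEU": "YEARS"},
    {"SUBJID": "002", "AGE": 41, "AGEU": "YEARS"},
]

columns = ["SUBJID", "AGE", "AGEU"]

# The tabulation adds dataset-level metadata (its name and label) on
# top of the per-subject rows pulled from the collected data.
tabulation = {
    "name": "DM",                # dataset-level metadata (assumed name)
    "label": "Demographics",
    "columns": columns,
    "rows": [[rec[c] for c in columns] for rec in collected],
}
```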
Figure 2 - Extract The Common Parts
So what we see is that in each part of the above chain (I stopped at the tabulation rather than adding further use cases such as the analysis dataset, warehousing, data aggregation, etc.) there is a common set of information used within each use case, but each use case also requires its own particular metadata. For these reasons, I created the second diagram, showing in figure 2 that we would like to extract that common core definition and bring it into one place to be shared across the use cases.
This then leads to a desire to organize the metadata in a layered fashion as shown in figure 3. In this simplified view, we want to put at the base of our metadata “stack” our code lists – the controlled terminology – and then in the layer above the definition of our concept. This provides our core building blocks, a definition that is independent of the use case but consistent across all use cases.
On top of these two lower levels, we can then layer the extra metadata applicable to each use case: for the CRF, for example, the presentation information and items such as range/edit checks that we need to assist in collecting data; for the tabulation, added precision around the column definitions, maybe formats, whether a value is required or optional, rules for missing values and so on.
Figure 3 - The Metadata Layers
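The layering in figure 3 can be sketched as a shared core with per-use-case overlays. All of the names and rules below are assumptions made for illustration, not drawn from any specific standard.

```python
# Base layer: code lists (controlled terminology).
code_lists = {"AGEU": ["YEARS", "MONTHS"]}

# Shared concept layer: a use-case-independent definition that points
# at the controlled terminology beneath it.
concept_age = {
    "name": "AGE",
    "items": {"AGE": "integer", "AGEU": {"codelist": "AGEU"}},
}

# CRF overlay: presentation plus range/edit checks for data collection
# (the check itself is an illustrative example).
crf_overlay = {
    "concept": "AGE",
    "prompt": "Age",
    "edit_check": "0 <= AGE <= 120",
}

# Tabulation overlay: column-level precision for the same concept.
tabulation_overlay = {
    "concept": "AGE",
    "column": "AGE",
    "required": True,
    "missing_rule": "leave blank",
}
```

Both overlays reference the same shared concept, which in turn references the same code list, so a change to the core definition flows consistently to every use case.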
One thing to note about the diagram is the warehouse and the fact that I have placed the Study Configuration linked to it. In reality, the Study Configuration will be created as part of the study build process earlier in the life-cycle, but my thought was that the use case that best illustrates its relevance in the metadata world is the desire to aggregate data across studies. This need drives the requirement that we know what is on each of those studies, in the sense of the study construction, all the way down to concepts, variables and code lists. Hence my decision to draw it there for the sake of an uncluttered picture.
Finally, there is submission and review, where the metadata needs to be passed to the regulatory authority; define.xml is one such item, but the annotated CRF and other such artefacts are also applicable.
I have tried to keep this short and sweet while still getting the message across. We should be organising our metadata; it needs to be well structured.
It is good to be back writing again, hopefully a few more blogs will flow over the coming weeks.