Metadata and Layers
I have been wandering around for the last six months or so with a couple of pictures in my head that I have at last managed to commit to paper. The pictures try to explain why we would want to layer our metadata and, through the layers, link data standards, and why the line between those standards may well begin to grey and ultimately disappear altogether.
So let’s start by considering the writing of a protocol. Within the protocol document today, we typically see statements such as “Demographics (age, gender, height and weight)” or “The following baseline demographics and disease characteristics will be captured in the eCRF: Sex, age, body weight, height/length, and race.” Through statements such as these the protocol writer is giving us the first indication of what data are to be captured as part of the study, but at a high level, using the name of a “concept” such as AGE, SEX and so on.
The reason I put the word concept in quotes is that this single word conveys many different meanings to many people; it is desperately overloaded. What I mean when using the term is a collective designator for one or more atomic items of data that in combination represent some meaningful notion or idea. AGE is a simple example. It is meaningful to most people, consists of a value and units, and we can provide a good definition for the concept.
So in the first diagram, I see the initial use of the concept within the protocol simply via use of its name. That protocol document is then converted into a study implementation that consists of a set of paper or electronic CRFs, with each CRF expanding the concept into its constituent parts. (I don’t think metadata is any less useful just because the CRFs are on paper; they need to follow the same rules.) In addition to the basic definition we will also be interested in how the data are to be captured, how the units are to be presented and other such considerations.
In my simple example AGE, as I said, will consist of two parts, the AGE value and the associated AGE UNITS, and we will need to capture these items on a CRF. Most of the time the units will be fixed, most likely years, maybe months, but we still need units to make sense of the value. I am of the school that likes to have the units variable hidden within the system even if the user cannot change it, simply for consistency across studies when data are exported. And obviously these data will be collected for the N subjects within the study, hence the multiple CRFs in figure 1.
Once we have our data collected, we will want to take the values into a tabulation with columns holding the AGE and AGE UNITS values from the N subjects. As we move into the world of the tabulation from that of the CRF, we will need to think about the columns and definitions in relation to the dataset as a whole and information about the dataset itself such as its name.
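As a minimal sketch of that step from CRF to tabulation, the snippet below turns per-subject CRF entries into columns. The class name `AgeCrf`, the function `tabulate`, the column names and the subject values are all illustrative assumptions, not part of any standard or of the figures.

```python
from dataclasses import dataclass


@dataclass
class AgeCrf:
    """One subject's CRF entry for the AGE concept (hypothetical structure)."""
    subject_id: str
    age: int
    age_units: str = "YEARS"  # fixed and hidden on the form, but still stored


def tabulate(crfs):
    """Turn N per-subject CRF entries into the columns of a tabulation."""
    table = {"SUBJID": [], "AGE": [], "AGEU": []}
    for crf in crfs:
        table["SUBJID"].append(crf.subject_id)
        table["AGE"].append(crf.age)
        table["AGEU"].append(crf.age_units)
    return table


# Two made-up subjects; the units column is populated even though nobody typed it.
crfs = [AgeCrf("001", 34), AgeCrf("002", 51)]
table = tabulate(crfs)
```

Because the units travel with every entry, the exported columns look the same whether a study fixed the units or let the user pick them.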
So what we see is that in each part of the above chain – I stopped at the tabulation rather than adding further use cases such as analysis dataset, warehousing, data aggregation etc – there is a common set of information used within each use case, but each use case also requires its own particular metadata. For this reason I created the second diagram: figure 2 shows that we would like to extract that common core definition and bring it into one place to be shared across the use cases.
This then leads to a desire to organize the metadata in a layered fashion as shown in figure 3. In this simplified view, we want to put at the base of our metadata “stack” our code lists – the controlled terminology – and then in the layer above the definition of our concept. This provides our core building blocks, a definition that is independent of the use case but consistent across all use cases.
On top of these two lower levels, we can then add the extra metadata applicable to the use case: for the CRF, for example, the presentation information and such items as range/edit checks that we would need to assist in collecting data. For the tabulation, we want to add precision around the column definitions, maybe formats, whether a value is required or is optional, rules for missing values etc.
One thing to note about the diagram is the warehouse, and that I have placed the Study Configuration linked to the warehouse. In reality, the Study Configuration will be created as part of the study build process earlier in the life-cycle, but my thought was that the use case that best illustrates its relevance in the metadata world is the desire to aggregate data across studies. This need drives the requirement that we know what is in each of those studies in the sense of the study construction, all the way down to concepts, variables and code lists. Hence my decision to draw it there for the sake of an uncluttered picture.
Finally there is submission and review, where the metadata needs to be passed to the regulatory authority. define.xml is one such item, but the annotated CRF and other such items are also applicable.
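The layering can be sketched in code, with the caveat that every class, field name and value below (including the 18 to 65 range check and the column names) is a hypothetical choice made for illustration rather than anything taken from the figures. The point is only that each layer refers downwards: code lists at the base, the concept definition above them, and use-case metadata on top.

```python
from dataclasses import dataclass
from typing import Optional


# Base layer: controlled terminology.
@dataclass(frozen=True)
class CodeList:
    name: str
    terms: tuple


# Second layer: the use-case-neutral concept definition.
@dataclass(frozen=True)
class Item:
    name: str
    datatype: str
    code_list: Optional[CodeList] = None


@dataclass(frozen=True)
class Concept:
    name: str
    definition: str
    items: tuple


# Top layer: use-case metadata that points down at the shared items.
@dataclass(frozen=True)
class CrfField:
    item: Item
    prompt: str
    hidden: bool = False
    valid_range: Optional[tuple] = None  # range/edit check for data entry


@dataclass(frozen=True)
class TabulationColumn:
    item: Item
    column_name: str
    required: bool = True


age_units_list = CodeList("AGE UNITS", ("YEARS", "MONTHS"))
age_value = Item("AGE value", "integer")
age_units = Item("AGE units", "text", age_units_list)
age = Concept("AGE", "Age of the subject", (age_value, age_units))

crf = [
    CrfField(age_value, "Age", valid_range=(18, 65)),
    CrfField(age_units, "Age units", hidden=True),
]
tabulation = [
    TabulationColumn(age_value, "AGE"),
    TabulationColumn(age_units, "AGEU", required=False),
]
```

Both the CRF field and the tabulation column hold a reference to the same `Item`, so a change to the core definition or its code list is made once and seen by every use case.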
I have tried to keep this short and sweet but get the message across. We should be organising our metadata. It needs to be well structured.
It is good to be back writing again; hopefully a few more blogs will flow over the coming weeks.




I think the picture is essentially right. You mentioned controlled terminology (codelists) as the “base”. However, I don’t think we treat codelists correctly in CDISC at this moment.
First, let us look at your example “age units”. In CDISC we defined our own codes for that (YEARS, MONTHS, WEEKS) whereas the rest of the healthcare world (100 times bigger than ours) consistently uses UCUM units. Why did we reinvent the wheel?
Now, let us look at the codelist for “laboratory test”. Wait a minute! We have two of them and they seem to be independent: one for “laboratory test code” and one for “laboratory test name”. With two independent lists, how can a system know that “APTT” (from CL.LBTESTCD) means “Activated Partial Thromboplastin Time” (from CL.LBTEST)? Well, it can’t.
The reason? The SDTM tables.
Now tables are two-dimensional and are fine for providing VIEWS on data. But tables are not the reality, they are just “views”. In our diligence to create controlled terminology, we seem to have forgotten that.
The world is not flat (famous statement of an FDA representative), but it is not cubic either. And those knowing a bit more about geography will know that it is not spherical either.
It is good that we assign “allowable values” to our concepts. But we should stop thinking that these are simple values and not composite objects themselves (e.g. test code + test name).
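To make the idea of a composite term concrete, here is a minimal sketch in which each allowable value carries its code and its name together. The class `LabTestTerm`, the lookup `name_for` and the two-entry list are assumptions for illustration; this is not how CDISC publishes its terminology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabTestTerm:
    """A single allowable value that keeps test code and test name together."""
    code: str
    name: str


# Illustrative entries only; a real list would be far longer.
LAB_TESTS = (
    LabTestTerm("APTT", "Activated Partial Thromboplastin Time"),
    LabTestTerm("ALT", "Alanine Aminotransferase"),
)

BY_CODE = {term.code: term for term in LAB_TESTS}


def name_for(code):
    """Resolve a test code to its test name without any human interpretation."""
    return BY_CODE[code].name
```

With the pair stored as one object, a system can answer the “what does APTT mean” question directly, which two independent flat lists cannot do.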
In other cases we even need to go a step further: composites like:
test code – test name – test position (e.g. SYSBP, systolic blood pressure, sitting/standing/…). And for the units, these may be composite too (just like UCUM treats them): mm[Hg] (millimeter + Mercury) or m[H2O] (meter + water).
For other tests, we may need to add “test body location” (for example for “body temperature”).
So if we need a solid basis for our model, we might first need to completely rethink the way we treat codelists in CDISC.
Jozef
I would agree with your basic premise that code lists in the CDISC world need a good looking at. As you point out, Test Names and Test Codes and the “linking” thereof is certainly an issue. But I still believe that controlled terminology sits at the base of the metadata “stack”, in that as a general principle I think metadata should only make use of items in the lower levels and never use anything from a higher level.
As for tables, yes they are just views or presentations on the data, I would agree. The view of consistent use of metadata across the life cycle does lead to the idea of a use-case neutral data storage format that supports all use cases.
I am not so convinced about some of your ideas regarding some of the observational qualifiers such as position, location, method etc. I see those as data items with associated controlled terminology that qualify an observation. So they would be another atomic part of a “concept” rather than just part of a single coded value holding several items of information (i.e. as BRIDG has modelled them).
Dave
Hi Dave,
fully agree – I did not want to suggest that “location” and “position” should be part of a single coded value. As you state, it is better to treat them as qualifiers for an observation.
I am still puzzled why CDISC decided to create its own controlled terms for units though, ignoring UCUM.
Best regards,
Jozef
Jozef
With respect to UCUM, timing was probably an issue: the core work on SDTM was done from, say, 2000 to 2005 or so (difficult to remember). What was the state of UCUM at that time? The knowledge level of the team at that time was probably also an issue, as were the maturity of UCUM itself and its complexity.
There is always the argument for “let’s get something done” versus the Rolls Royce solution. The problem sometimes – not always – with “let’s get something done” is that it can leave you vulnerable in the future. It is a judgement call and 20/20 hindsight is a wonderful thing!
Dave
Speaking of timing, I was just presenting and preaching the same topic to my data management constituents. Someone from the audience asked how to tabulate the subject’s position (supine, sitting, etc.) when blood pressure was measured. I answered by asking whether the protocol specifies one way or another, and whether we can distinguish the position from the CRF metadata. When the protocol specifies a sitting systolic and diastolic blood pressure, I hope the CRF is set up in such a way that the position is discernible. For example, one can label the CRF question prompt accordingly; or name the collection variables SITTING_SYSBP_VSORRES and SITTING_DIABP_VSORRES; or add variables SYSBP_VSPOS and DIABP_VSPOS (hide them with a default when appropriate). This way, these essential attributes are apparent in the data transfer, and downstream processes such as ETL can take advantage of them.
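As a rough sketch of how a downstream step could take advantage of the second naming option, the function below splits collection variable names of the form POSITION_TESTCODE_TARGET into tabulation rows. The function name `to_vs_rows`, the row layout and the readings are assumptions for illustration, and the approach only works if every collection variable follows that three-part pattern.

```python
def to_vs_rows(crf_record):
    """Split names like SITTING_SYSBP_VSORRES into position, test code and target.

    Assumes every key has exactly three underscore-separated parts.
    """
    rows = []
    for variable, value in crf_record.items():
        position, test_code, target = variable.split("_")
        rows.append({"VSTESTCD": test_code, "VSPOS": position, target: value})
    return rows


# Made-up readings for one subject.
collected = {"SITTING_SYSBP_VSORRES": 120, "SITTING_DIABP_VSORRES": 80}
rows = to_vs_rows(collected)
```

The position never has to be typed by the site; it is carried in the metadata of the collection variable and recovered during the transfer.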
If I may apply the above back to the “concept” concept you mentioned, I would think clarity is ensured when people are instructed and trained to collect sitting systolic and diastolic blood pressure and to be mindful of 1) position; 2) measurement unit; and 3) specific verbiage. So, the concept should be defined upfront. This is no different from SDLC, where user requirements should be well documented and understood before implementation.
Lastly, it is good to see you writing again.
Anthony
I think your second para is the key.
The idea of the “concept” is that it is defined upfront once, with such items as position, location etc. being either collected or pre-defined in any given study, but present in every study one way or the other. That way the data from every study have the same structure irrespective of whether the items are collected or not. As an aside, the “don’t care” values need some thought.
The concepts are then stored in a metadata repository (MDR). In the third figure, the code lists and the AGE item are these pieces.
If we have consistent data then, as you say, downstream processes such as ETL etc can greatly benefit.
So for blood pressure we have the common items such as test code(s), test name(s), position, method, location, time, date … and then value and units for both systolic and diastolic, giving us at least 11 pieces of information. Here we need to be slightly careful: normally we repeat the test code etc. for both sys and dia, so a little thought is necessary, but there are ways to achieve the answer. I would define this once, store it in my MDR and re-use it study after study.
These “atomic” data elements, things I can split no further, combine into a “concept” of blood pressure that can be referenced in a protocol, expanded in the CRF, used to build a tabulation – as Jozef said, a view of the data – using ETL tools and onwards.
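A minimal sketch of that blood pressure concept, under some loud assumptions: the item names are invented, the test code and test name are kept as single slots (which sidesteps the sys/dia repetition question rather than answering it), and `None` stands in for a “don’t care” value that has not been thought through. The values are made up.

```python
# The 11 atomic items of the hypothetical blood pressure concept.
BLOOD_PRESSURE_ITEMS = (
    "test_code", "test_name", "position", "method", "location", "time", "date",
    "systolic_value", "systolic_units", "diastolic_value", "diastolic_units",
)


def build_record(collected, fixed):
    """Combine collected values with study-level fixed values.

    Every study yields the same 11-item structure; an item that is neither
    collected nor fixed is left as None.
    """
    record = {}
    for item in BLOOD_PRESSURE_ITEMS:
        if item in collected:
            record[item] = collected[item]
        else:
            record[item] = fixed.get(item)
    return record


# One study fixes position and units and collects only the readings and date.
fixed = {"position": "SITTING", "systolic_units": "mm[Hg]", "diastolic_units": "mm[Hg]"}
collected = {"date": "2012-01-15", "systolic_value": 120, "diastolic_value": 80}
record = build_record(collected, fixed)
```

A second study that collects position on the CRF would pass it in `collected` instead, and the resulting record would still have exactly the same shape.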
Dave
Thanks for starting the discussion, Dave.
Layering of metadata is very important – the habit of defining objects as the combination of definitions and terminology has been hugely expensive to companies who find themselves with multiple objects with broadly (but not exactly) the same definition but with different names and different terminologies (and no cheap way to bring these together).
I see 4 layers:
1. Definitions of clinical concepts (e.g. systolic blood pressure) together with the identification of all component parts (e.g. method, body position). It should be possible for these to be used across the whole pharmaceutical industry.
2. Terminology used for component parts (e.g. a set of valid values for the body position component of the systolic blood pressure). It is desirable that these be industry standard too, but this will be hard/impossible to achieve for all clinical concepts.
3. Groupings of individual clinical concepts (e.g. those comprising the set of vital signs). It is possible to standardise the more common examples, but no more.
4. Standard definition of operational objects (e.g. as eCRFs, SDTM datasets, company specific datasets). Of these, only CDASH modules and SDTM datasets can be standardised across the industry.
Bullet 1 talks about defining clinical concepts together with their component parts. Including the component parts is very important: there are many ongoing initiatives to define Clinical Data Elements (CDEs) and these efforts only go partway to what industry needs.
Current work to define CDEs does not deliver all the data re-use capabilities needed. For example, the recent Parkinson’s disease standards developed by the National Institute of Neurological Disorders and Stroke (NINDS) and the National Institutes of Health (NIH) have no recorded relationships between CDEs (other than through human interpretation) and no model for developing them, so the CDEs are often very specific and inconsistent in approach. This limits the ability to automate processes and limits the downstream benefit. Here are 4 examples:
CDE1 (CDE is very specific; instructions require reference to CRF page): “Has participant/subject ever regularly taken ibuprofen-based non-aspirin medications, that is, at least two pills per week for 6 months or longer”
Instructions say “If No is answered, skip to question #2”
CDE2 (units are part of the CDE definition): “Record the pulse of the participant/ subject in beats per minute”
CDE3 (2 separate CDEs for weight and weight unit): “Record the weight of the participant/subject. To be collected at the visit, not self-reported. Also, indicate whether weight was measured in pounds (lbs) or kilograms (kg)”
CDE4: “Weight unit of measure, choose either Pounds (lb) or Kilograms (kg)”
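To illustrate what a recorded relationship between CDE3 and CDE4 might look like, here is a small sketch. The `DataElement` structure, the `units_element` link and the `units_for` helper are hypothetical and are not part of the NINDS standards; the identifiers simply reuse the example labels above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataElement:
    identifier: str
    prompt: str
    permitted_values: tuple = ()
    units_element: Optional[str] = None  # explicit, machine-readable link


ELEMENTS = {
    "CDE3": DataElement(
        "CDE3", "Record the weight of the participant/subject", units_element="CDE4"
    ),
    "CDE4": DataElement(
        "CDE4", "Weight unit of measure", permitted_values=("lb", "kg")
    ),
}


def units_for(identifier):
    """Follow the recorded link from a value element to its allowed units."""
    link = ELEMENTS[identifier].units_element
    return ELEMENTS[link].permitted_values if link else ()
```

With the link recorded as data, a tool can find the units for a weight value without a person reading the instructions on the CRF page.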
I would like to add a layer “0” to Simon’s nice list: what you may call the Real-World Phenomenon 1). For layer 3 there is also the need to categorize/classify [clinical/scientific/research/observation] concepts (e.g. for lab tests: hematology, urinalysis) and to relate concepts to each other, besides grouping them together.
–
1) “On the one hand there is your blood pressure itself, the real-world phenomenon which obeys the laws described in a medical textbook (which will tell you about systemic arterial pressure, about systolic and diastolic phases, about fluid dynamics, etc., etc. complicated physics and physiology that will be of practical importance e.g. when designing an instrument that can accurately measure blood pressure or when dealing with a patient who has atrial fibrillation). On the other hand there is a blood pressure observation, another real-world phenomenon, but of an entirely different sort, involving factors such as:
- the position of the patient at the time of measuring (sitting, lying, etc.),
- the tilt of the surface on which the person is lying,
- the variation in measured blood pressure with respiration,
- the instrument used to measure the blood pressure,
- the size of the cuff if a sphygmomanometer is used,
…”
From a blog post on the HL7 Watch blog: http://hl7-watch.blogspot.com/2006/02/is-there-difference-between-person-and.html
See also http://ontology.buffalo.edu/smith/articles/Vital_Sign_Ontology.pdf