An odd title for a blog post on clinical data standards; the words come from a famous essay by Mary Schmich, written for the Chicago Tribune and entitled “Advice, like youth, probably just wasted on the young”. I was reminded of it on the flight home from the CDISC Interchange in Vienna because within the essay are the words “Do one thing every day that scares you”. I sat there, willing the time to pass, pondering the presentations I had seen, the presentation I had given, the conversations I’d had at the conference, and the implications for clinical trial data and the associated data standards. It is time to scare ourselves.
One conversation stuck in my head. Twice during the exchange I was told “you cannot do that”. I suspect the words had more to do with a commercial position than any other criterion, but they struck a chord. The discussion was centred on improving the quality of data within clinical studies as we execute them today.
As the plane headed towards London and I pondered the events of the week, memories of a 60s TV series entered my head: Max Smart and, in particular, the phone in his shoe. I suspect that back in the 1960s such a device was not technically possible, but how many times have we seen science-fiction devices portrayed on TV and in films turn into reality? A quick Google search tells me that the first GSM handsets appeared in the early 1990s (this was after analogue systems the size of briefcases), rapidly reducing in size until we arrive at the handsets of today that allow me to stream an HD film while I sit in the airport waiting for my flight. “Can’t” isn’t a word that entered the vocabulary of those developing digital phone networks. Today Max could quite happily have a handset built into his shoe. He would, however, still look ridiculous using it.
So why am I rambling about an essay from 1997 that ended up as a pop hit and about the development of mobile phones? The answer is that I sense one of those junctions on the road of standards development, and we really do need to grasp it. We are falling behind because we are frozen by fear of what might happen should we place less reliance on the tabular data structure that permeates everything we do today.
Over the past few months I have had quite a few discussions on the topic, and last week I saw an article from Wayne Kubick in Applied Clinical Trials extolling the benefits of “fire” (FHIR) in clinical research. In the article, Wayne refers to biopharma and its tools and processes and closes with the words “it sometimes feels to me that biopharma is also wandering in a frigid wilderness of its own creation”.
I also feel that CDISC is hitting a point in the product lifecycle where adoption is on the increase as a result of the FDA mandate but the product itself has reached the end of its life; we are building wart upon wart, and at some point there will be an almighty bang as something fails. The main CDISC products are now a decade or more old, and I would suggest it is time to look to the future rather than build on the tired foundations of the past. It is time for CDISC 2.0.
And so the CDISC conference, the conversations, my musings on an aeroplane and Wayne’s article prompted me to put down my work for an hour or two.
As always, it is easy to criticise and a lot harder to do. So I have been thinking about this for the past few months (this is a lie; it is more like the last decade, as you will see below) and in March I wrote an informal vision paper that I shared with a few people. I eventually tidied it up and published it here. What I want to do is expand the vision here with a little more detail. I have kept the original words from the vision document in italics and then gone on to expand the ideas and explain my thinking further.
Our efforts should be centered and focused on the creation of BCs [Biomedical Concepts].
I have written at length about Biomedical Concepts; have a look back through the various blog posts. These are nothing more than a logical collection of variables that have some meaning in the real world. They are our raw data. We see such real-world observations within healthcare, and there are various representations of the same knowledge already within healthcare. BCs are research’s representation of that same knowledge. (The question might be asked: why have a separate representation? Well, we have SDTM and we need continuity.)
It is worth noting that these observations exist without SDTM; I can record my blood pressure without any knowledge of SDTM. SDTM, however, cannot exist without the observations. This is why I feel they should be kept separate; the data have a natural form, and that form is not tabular.
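To make the idea concrete, a blood pressure BC might be nothing more than a small, named bundle of typed variables. Here is a minimal sketch in Python; the class and variable names are my own illustrative assumptions, not an official definition:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Variable:
    """One typed variable within a Biomedical Concept."""
    name: str
    value: object
    units: str = ""

@dataclass
class BiomedicalConcept:
    """A logical collection of variables with real-world meaning."""
    concept: str
    variables: list = field(default_factory=list)

# A blood pressure observation exists quite happily without SDTM.
bp = BiomedicalConcept(
    concept="Blood Pressure",
    variables=[
        Variable("systolic", 120, "mmHg"),
        Variable("diastolic", 80, "mmHg"),
        Variable("position", "SITTING"),
        Variable("datetime", datetime(2015, 5, 1, 9, 30)),
    ],
)
```

Nothing here is tabular; the natural form of the observation is a small graph of related values.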
I believe everything we do, Therapeutic Area standards, safety domain development, CDASH and SDTM needs to be based on BCs. Doing so will bring a leap in quality of the definitions we create. We need to aggressively train people to understand them and make it easy to create and curate them.
We should ensure that the existing standards are enhanced such that they are based on BCs; CDASH and SDTM would, in essence, be split into two with a higher-level ‘presentation’ – traditional CRFs and Tabulations – with a lower-level rooted in BCs based on the inherent structure of the data and not an artificial tabular structure constrained by the conflicting demands of transport, storage and presentation.
This is where I bring out a new picture. Pictures such as this are always dangerous; there are many in the industry who like to see particular words used in a certain way, or who want all the detail immediately. They dislike uncertainty. Well, they are going to be disappointed with the next few paragraphs. This is a vision, and not everything is known. None of this is easy; if it were, we would have done it already.
Let’s walk through the diagram. At the top we have our business objects, as we know and love them today, in simplistic form: Protocol through to TLFs. I will restrict myself to these for the present to get the ‘big picture’ across, but I am aware it is not the complete set.
Below each top-level title is a dotted box within which sits a model name. This represents a model that allows users to construct the desired business object in a form that a machine can understand. These models relate to the world we know today; SDTM, for example. We have this model, but it comes as a 400-page PDF document within which there is no precise statement of the model; there are many rules, assumptions abound, and much is hidden. We need a better way of expressing these specifications. It is time for a formal model. Some excellent work has already been done in this area with the CDISC2RDF project (yes, I know that is not its official title; it is something like the PhUSE CS Semantic Technology Working Group, but CDISC2RDF is a nice, short, catchy name), but I feel we need to improve and go further rather than simply reflect what exists.
These models would link to the data world below. The Protocol and CDASH models refer to BCs. SDTM we will also define in terms of BCs. However, we will also need the ‘simple’ derived data used to construct tabulations today; here we hit the age-old question of whether SDTM should or should not contain derived data. I don’t care. We will need a ‘Derived Data’ model that takes raw inputs (data from within BCs) and derives further data. We then have our tabulation model, which allows for definitions of SDTM data domains that can choose to include derived data or not; this is why I don’t really care about the philosophical argument. In some cases I may want pure raw data, in other circumstances raw and derived, but I can define tabulations that meet both use cases. The underlying data remain the same. We start to gain flexibility.
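A ‘Derived Data’ definition could be as simple as a function over raw BC values, carrying explicit lineage back to its inputs. A sketch, with an illustrative derivation (mean arterial pressure) that I have chosen purely as an example:

```python
def derive_map(systolic: float, diastolic: float) -> float:
    """Mean arterial pressure: a simple derivation over raw BC data."""
    return round(diastolic + (systolic - diastolic) / 3, 1)

# Raw values held within a BC instance.
raw = {"systolic": 120, "diastolic": 80}

# The derived value records where it came from; a tabulation model can
# then choose to include it or not, the underlying data are unchanged.
derived = {
    "name": "Mean Arterial Pressure",
    "value": derive_map(raw["systolic"], raw["diastolic"]),
    "derived_from": ["systolic", "diastolic"],  # lineage for free
}
```

Because the derivation is defined against BC variables rather than against a table column, any tabulation that includes the BC can include the derived value, or leave it out.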
ADaM and TLFs may need more complex derived structures, so I have drawn these as separate from the simple structures needed by SDTM. In reality they could be used by any upper-level model as desired. These models link to the BC (raw) and simple derived models to allow the necessary complex analysis structures to be built.
Some important points:
- The data models reflect the data and its natural construction, not how we present it. In the current CDISC world, presentation, structure and transport have all been compressed into one layer and we are paying the price.
- At the data level I have tried to draw the picture to indicate the linked nature of these models. The protocol refers to BCs, derived data are based on the BCs, and the analysis data are based on BCs and derived data; everything is linked, and thus we meet our desire for an end-to-end process. I could traverse the model from protocol to analysis or in reverse. Mapping, traceability and the like become a lot easier and can be handed over to the machine.
- Being able to traverse the data models allows tools to start supporting key processes. For example, working from end points back to forms to ensure we are collecting all the required data becomes a reality.
- The ‘presentation’ models should be able to reflect our current world and thus allow people to continue working as they do today, they can move towards the better world as they are ready to do so and tool support becomes available.
- However, even if BCs are not used directly they can provide a rich source of metadata (in particular, value-level metadata) in traditional form (e.g. spreadsheets) and permit those working in a traditional manner to produce higher-quality data. Industry gains even if it chooses not to take everything on board.
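The ‘traverse the models’ point above is the kind of thing a machine can do trivially once the links exist. A toy sketch, with made-up node names, of walking from an endpoint down to the BCs it rests on to check that everything required is actually being collected:

```python
# A toy linked model: each node lists the nodes it is based on.
links = {
    "endpoint:change_in_sbp": ["derived:sbp_change"],
    "derived:sbp_change": ["bc:blood_pressure"],
    "bc:blood_pressure": [],            # raw data, nothing below it
}

forms = {"bc:blood_pressure"}           # BCs placed on a CRF somewhere

def required_bcs(node):
    """Walk the links to find every BC an endpoint ultimately rests on."""
    if node.startswith("bc:"):
        return {node}
    found = set()
    for child in links[node]:
        found |= required_bcs(child)
    return found

# Any BC an endpoint needs but no form collects is a gap in the study.
missing = required_bcs("endpoint:change_in_sbp") - forms
```

The same walk in reverse answers “which endpoints does this form feed?”; traceability in both directions falls out of the structure.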
As a quick aside, note the sideways pyramid organisation of the nodes and links represented in the Analysis Data Model part of the diagram. This ‘graph’ is a picture I have held in my head for a long time as part of a standards vision. It emerged from a conversation I had at Raleigh-Durham airport over a decade ago, when I sat discussing these issues for a couple of hours while travel arrangements were disrupted by a thunderstorm. Our model should be navigable from protocol to endpoint.
Returning to the presentation models, consider the form model based on BCs. Systems can render what the user wants to see and feels comfortable with: a form. But what we now have is an operational entity, the form, that is based on common data models (BCs). It doesn’t matter how many different flavours of the form I create; the data will be compatible because they are all based on the same underlying definition. The variability of the variable-based world can be controlled and reined in. It also means that across sponsors we can start to drive consistency at an observation level. SDTM will be defined in terms of BCs, and thus we start to reduce the ‘where do I put X’ questions; we get our desired mapping and traceability for free.
You will note at the lower level there is a grey bar labelled Transport. This could be many things, some XML format, HL7 FHIR, Semantic Web formats, even tabular. By organising our world above correctly the transport format becomes much less important and we decouple our data and presentation from how we transport our data.
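To illustrate the decoupling: the same BC instance, unchanged, can be pushed through entirely different transports. A sketch using JSON and a flat CSV row as stand-ins for the real candidates:

```python
import csv
import io
import json

# One BC instance; the definition and the data do not change below.
bp = {"concept": "Blood Pressure", "systolic": 120,
      "diastolic": 80, "units": "mmHg"}

# Transport 1: JSON, e.g. for a RESTful service.
as_json = json.dumps(bp)

# Transport 2: flattened to a tabular row for a traditional pipeline.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=bp.keys())
writer.writeheader()
writer.writerow(bp)
as_csv = buf.getvalue()
```

The choice of transport becomes a late, cheap decision rather than something baked into the standard itself.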
It is not necessary that everything be done in a ‘boil the ocean’ fashion. We obviously cannot do that, but we can proceed almost BC-by-BC; our world can improve day-to-day with the introduction of individual BCs. TAs can be asked to produce the core set of BCs in a consistent and electronic form from a library of well-constructed patterns / templates that are in turn based on a common framework (the BRIDG model). Once a core set is available, additional BCs can be added.
We have a lot of the definitions for our metadata already. They are tucked away in spreadsheets, in define.xml files, in data and in models available in the healthcare world. We need to organise them, and we have to do it BC-by-BC. This allows us to improve our world day-by-day. What it does mean is that we need to be able to mix and match our improved BC world with the variable-based world. What we do need to do is build the models to bring clarity to our world, and do this relatively quickly.
I do stress we need to do this in an incremental way. As we move forward we will learn and we will need to adjust. We will need to test. Industry will move forward at different speeds and regulators will not be able to accommodate this all at once. We need to be smart.
The desire would then be to capture study data in a BC format. Again, we cannot boil the ocean. Forms today are variable-based. But we can replace a small set of variables with the equivalent BC as the relevant BC becomes defined. Again, we improve our world BC-by-BC, form-by-form, domain-by-domain, day-by-day.
I won’t add to this at the moment.
Captured data could then be kept in a structure based on the BC metadata. Tabular structures can be derived from the raw data.
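Deriving a tabular structure from BC-shaped raw data is essentially a pivot. A sketch, loosely modelled on SDTM-style findings rows (one row per test); the mapping itself is my own illustration, not a defined standard:

```python
# A BC instance holding two raw results for one subject.
bp = {"concept": "Blood Pressure", "subject": "001",
      "results": {"SYSBP": 120, "DIABP": 80}}

def to_vs_rows(bc):
    """Derive VS-like tabulation rows from a single BC instance."""
    return [
        {"USUBJID": bc["subject"], "VSTESTCD": testcd,
         "VSORRES": result, "VSORRESU": "mmHg"}
        for testcd, result in bc["results"].items()
    ]

rows = to_vs_rows(bp)   # one row for SYSBP, one for DIABP
```

The tabulation is a view computed from the raw data; change the view definition and you get a different table, without touching what was captured.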
This would be quite a step but I do believe it is possible. I hope to be able to demonstrate this by the end of the year.
This would then enable a move towards linked data and easier data integration.
As we move towards more of a linked data world with high-quality metadata, integration becomes easier. As data share the same structure and the same definition, the effort required to bring such data together reduces.
As we progress we start to see more consistent data across sponsors as consistent metadata is deployed.
We could then place such data and metadata into different transport formats. We will always need to be able to place such data into a rectangular form, but BCs also allow us to ‘upgrade’ to formats such as HL7 FHIR. I have had some preliminary discussions on this topic with others in the industry.
I touched on this topic above. By organising our world better we remove our dependence upon the transport format. Then we are free to choose one, or maybe support multiple.
This would facilitate and ease the integration of research and healthcare.
One benefit of supporting HL7 FHIR would be that it would assist in integrating with healthcare. It is early days, but there is a lot of weight behind FHIR within the HL7 community. If their efforts go well, it may well make sense to piggyback on that success. Early investigations suggest that BCs could be seen as FHIR resources. If so, linking to healthcare resources (observations) would be a natural step to take, and using a common transport mechanism makes sense. Note that HL7 are scaring themselves; FHIR is really HL7 V4.
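To give a flavour of what ‘a BC as a FHIR resource’ might look like, here is our blood pressure BC expressed as a FHIR Observation. The component structure is FHIR’s own; the surrounding detail (values, status) is illustrative:

```python
# A blood pressure BC expressed as an HL7 FHIR Observation resource,
# with one component per measured variable (LOINC 8480-6 systolic,
# 8462-4 diastolic). A sketch, not a conformant profile instance.
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org",
                         "code": "85354-9",
                         "display": "Blood pressure panel"}]},
    "component": [
        {"code": {"coding": [{"system": "http://loinc.org",
                              "code": "8480-6"}]},
         "valueQuantity": {"value": 120, "unit": "mmHg"}},
        {"code": {"coding": [{"system": "http://loinc.org",
                              "code": "8462-4"}]},
         "valueQuantity": {"value": 80, "unit": "mmHg"}},
    ],
}
```

The shape is strikingly close to a BC: a named concept carrying a set of coded, typed variables.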
Such BCs would also allow for a different approach to the protocol. I have long considered that BCs would allow us to think of the protocol more as a subject timeline of assessments (each assessment being a BC) rather than the current rectangular schedule of assessments approach. Again, I have had preliminary discussions on this topic.
I have long pondered if some of the structures we use to create protocols today actually cause more issues than they solve. I have never really had a chance to investigate but I wonder if some of the structures we use stem from a paper world and the desire to see the study design on a single piece of paper rather than what is optimum. I just have a feeling that observations (BCs) on a subject timeline might offer some nice solutions, but as I say, this is something that is some way off.
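The ‘subject timeline’ idea can be sketched very simply: a protocol as an ordered list of assessments, each assessment a BC, rather than a rectangular grid. The study days and BC names below are invented for illustration:

```python
# A protocol fragment as a subject timeline of assessments, each
# assessment being a BC, rather than a schedule-of-assessments grid.
timeline = [
    {"day": 1,  "bc": "Informed Consent"},
    {"day": 1,  "bc": "Blood Pressure"},
    {"day": 1,  "bc": "Weight"},
    {"day": 8,  "bc": "Blood Pressure"},
    {"day": 15, "bc": "Blood Pressure"},
]

def assessments_on(day):
    """What does this subject have scheduled on a given study day?"""
    return [entry["bc"] for entry in timeline if entry["day"] == day]
```

The familiar rectangular schedule becomes just one possible rendering of this timeline, rather than the master structure everything else is forced into.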
Taking a small diversion: I have seen some discussion recently about linking eClinical systems, contrasting the use of such things as ODM with architectures such as microservices and RESTful interfaces. I am starting to think that the data layer and the associated models would lend themselves to the definition of a range of services that tool vendors could then implement to facilitate integration across tools. It is one of those five-minute conversations I have had with myself; not fleshed out, but interesting.
So, returning to Max Smart and Mary Schmich: I don’t want to hear “we can’t”. I am rather fed up of “can’t”. If we don’t adapt we will continue to add a wart here and a wart there until we lose sight of what was underneath. We need to scare ourselves, and we need to scare ourselves every day for the next five years until we build something better. It is time to start building CDISC 2.0 and give industry its phone in its shoe.
Feel free to comment, we need a discussion. Finally, as we all know only too well, wear sunscreen.