This post relates to the presentation given on Wednesday 27th April at the CDISC EuroInterchange in Vienna. As in previous years this post provides a narrative to go with the slides as my slides tend to be rather pictorial. Hopefully the write-up will provide those who attended something useful to refer to post the event while, for those who could not attend, it will provide access to the talk.
The post consists of the submitted short abstract followed by a slide-by-slide commentary. The slides in PDF format can be found here.
With the impending arrival of the new version of define.xml (v2.1) that will support version management at the domain level, constant new terminology releases, core standards updates and the increasing number of Therapeutic Area User Guides, version management is fast becoming an important topic. One of the biggest issues is determining the impact of any version change.
This presentation, based on past practical experience of a number of implementations that have used Biomedical Concepts (Research Concepts) and also a new semantic web implementation, will:
- Show a semantic web MDR in action
- Show how the semantic technologies help to address the version management issues
- Demonstrate version management of terminology, forms and domains
- Show working automated impact assessments to assist users in seeing the impact of version management changes
The presentation will draw on practical experience of a new MDR implementation that employs Biomedical (Research) Concepts (BCs) based upon the ISO11179 Metadata Standard, the ISO21090 Healthcare Datatype Standard, the BRIDG model and BCs taken from the CDISC Therapeutic Area development work.
Slides 1 & 2
The usual title slides.
Slide 3
This slide is a scene setter. A new version of the SDTM IG is published by CDISC and we, of course, all jump for joy. We all know that is not quite the reaction. Even today the release mechanism is effectively paper. It is hard to consume and there is a lot to learn: what’s new, what’s changed, etc.
Slide 4 – SDTM Versions
A reflection of what we have seen over the last 15 years; 5 SDTM versions since 2004 and v3.1.1. I went back to v3.1.1 as it could be argued that v3.1.1 is the first version of SDTM that made sense for sponsors to use for production work. The last major release was in late 2013 and the next version, IG version 3.3, is scheduled for the middle of this year. As can be seen, we are getting a new version approximately every 2 years.
Slide 5 – Therapeutic Areas (TAs)
This slide contains a couple of images. The main one is a screen shot from the CDISC website taken a week or so ago. I just wanted to highlight the number of TAs that we will see published this year; 13 between Q2 and Q4. That is a big number. Even if each only impacts or creates a single domain, that is 13 domains.
Each TA usually impacts more than one domain. The second image is an extract from an analysis I performed about a year ago looking at a few TAs and the domains that each impacted. I looked at 11 TAs that had some form of impact on 47 domains. Managing this level of change is hard.
Slide 6 – Terminology Changes
And, of course, it is not just SDTM and domains. It is CDASH, SEND, ADaM, terminology and it is also the relationships between those items.
This is an extract from a report from an MDR that details changes to the submission values within the CDISC terminology (SDTM and QRS). The upper table indicates the code list and the item within that has changed and then a column for each CDISC release (the quarterly releases) up until the latest March 2016 release. The changes are marked with an index and the change detailed within the second table.
These changes are particularly important to be aware of as these submission values are those contained within study data. It is therefore important to know about them so that any impact can be assessed.
As an aside, it is worth noting the FDA’s Study Data Technical Conformance Guide and section 6.1.2 on the use of controlled terminologies. In the document FDA talks about the use of old and new versions and use of versions during the pooling of data.
Slide 7 – FDA Feedback
At the recent PhUSE Computational Sciences meeting in Silver Spring the FDA provided their now customary feedback on submissions to date (the presentations can be found here). I have just extracted a couple of slides that relate to terminology just to continue the theme from the previous slide and I have highlighted a couple of the statistics quoted by the FDA.
- Unit code lists for each domain. This requires better metadata from the design of the study (the forms used therein)
- 62% of applications had CT issues. This is a significant number and again shows the need for control and precision within studies.
As an aside, also note the request from the FDA to use define.xml version 2.0 and move away from version 1.0.
Slide 8 – So …
So from this quick look at terminology and SDTM (ignoring CDASH, ADaM, SEND etc) we can see we need to help ourselves. We need control of our standards, we need precision about what is being used, we need visibility of what we are using at any point, we need to be aware of the relationships between standards (SDTM and terminology versions for example), we need visibility of changes and we need to be able to easily deploy the standards into an operational environment.
It is for these reasons that I decided that we need better tools to help us. You can do this in Excel for example but it is hard, so very hard.
So I have developed a metadata tool designed to work with the CDISC standards using semantic technology to help overcome these issues. The other major difference is that the tool is based on the use of Biomedical Concepts.
Slide 9 – Biomedical Concepts
Biomedical Concepts (BCs) are there to try and help solve some of these issues. Biomedical Concepts simply try and bring all of our knowledge about our data into useful self-contained packages that can be deployed across the life-cycle. We want to bring clarity to what goes with what, what is valid and what is not. We want complete definitions with ALL terminology defined with the appropriate subsets all in a form that the human AND the machine can understand.
We can then use these definitions within a variety of scenarios such as the basis for creating all of our business objects, supporting the end-to-end process and, very importantly, providing traceability. They can also prove very valuable in impact analysis when, for example, new versions of terminology are produced.
I did not want to go into detail about BCs as I have written about them enough in the past. The slides contain the links to a set of slides and an article from the PhUSE 2015 conference.
Slide 10 – Biomedical Concept Example
I just included this slide to give a quick picture of what a BC looks like for those unfamiliar with the idea. The ‘tree’ picture on the slide represents DIABP with its various attributes. We have our normal definition of Test Code, Test Name, Position, a value for the result, a code list for the units (only mmHg is valid for DIABP) as well as the code list for the valid body positions. This has been derived from a spreadsheet, seen on the slide, that was posted by CDISC some time ago. There is no new knowledge here; we have all seen these in many studies. What is new is the structure. The attributes are in the right place; the relationships, the self-contained nature and the terminology bindings are important.
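To make the 'tree' idea concrete, here is a minimal sketch of a BC as a self-contained structure with its terminology bindings. It is an illustration only, not how the tool stores BCs: the class names, the item keys and the `is_valid` helper are all assumptions, and the position terms are example values rather than the full code list.

```python
from dataclasses import dataclass, field


@dataclass
class BcItem:
    """One leaf of the BC tree, e.g. the result units."""
    name: str
    # Terms bound to this item; an empty list means a free (uncoded) value.
    valid_terms: list[str] = field(default_factory=list)


@dataclass
class BiomedicalConcept:
    """A self-contained package of items with their terminology bindings."""
    identifier: str
    items: dict[str, BcItem]

    def is_valid(self, item_name: str, value: str) -> bool:
        """True if the value is allowed for the named item."""
        terms = self.items[item_name].valid_terms
        return not terms or value in terms


# Illustrative content only; the position terms are assumed examples.
diabp = BiomedicalConcept(
    identifier="DIABP",
    items={
        "test_code": BcItem("Test Code", ["DIABP"]),
        "test_name": BcItem("Test Name", ["Diastolic Blood Pressure"]),
        "position": BcItem("Position", ["SITTING", "STANDING", "SUPINE"]),
        "result_value": BcItem("Result Value"),
        "result_units": BcItem("Result Units", ["mmHg"]),
    },
)

print(diabp.is_valid("result_units", "mmHg"))  # True
print(diabp.is_valid("result_units", "kPa"))   # False
```

The point of the structure is that both the human and the machine can ask the same question of it: what is valid for this item?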
Slide 11 – Define Once, Use Many
Update 29th April 2016: References to Research Concepts changed to Biomedical Concepts. The terms are synonymous. ‘Cut & paste’ is a dangerous thing!
This is an old slide of mine, one I have been using for a very long time. But it shows the linkage across the silos that Biomedical Concepts can help bridge.
First the protocol. We often see elements like this within visit descriptions indicating the tests and procedures to be performed within a visit. Here we want Blood Pressure captured, which is composed of Systolic and Diastolic pressure. Is this one concept or two? I will ignore this since experience to date is showing that being able to group two Biomedical Concepts into a higher level Biomedical Concept may well be beneficial and thus I could have BP composed of DIABP and SYSBP. The use of the Biomedical Concept name within the protocol immediately links the protocol to the Biomedical Concept and can help with subsequent study build processes. You can see that collections of, say, Lab Tests (a panel) would be useful, as would being able to state individual tests, questionnaires that relate to several Biomedical Concepts and so on. Using the Biomedical Concept names could also help with writing protocols and aid clarity on what data are to be collected.
Obviously, also in the protocol, is the Visit & Assessment Schedule (Table/Schedule of Events/Assessments/Procedures) which lays out the visit versus assessments for the study. This tends to be more at a CRF level (but is not always, it might be more a collection of BCs) but obviously it could have a direct link to the Biomedical Concepts. It would be desirable to express this table in terms of Biomedical Concepts, Forms built from Biomedical Concepts or groups of Biomedical Concepts.
As an aside two aspects that also are worth considering:
- Linking Biomedical Concepts to inclusion/exclusion criteria to allow for structured protocol IEs that could be machine evaluated.
- The Statistical Analysis Plan and ‘analysis concepts’ and the use of Biomedical Concepts therein is something that also needs to be investigated, but that is some way off.
The CRF built to capture the data also links to the Biomedical Concept. From the protocol we obviously know which Biomedical Concepts are to be captured and thus each field on the CRF references the corresponding element within the Biomedical Concept defining the data to be captured along with question text, code lists etc. On this form obviously we need to refer to both the SYSBP and DIABP Biomedical Concepts. But here we have a common field, the position. It will be defined within both concepts but we only want to collect it once so the form will refer to the same item within both concepts.
And then the tabulation, the SDTM domain. The VSPOS variable (column) references (points to) the position definitions within the Biomedical Concepts (the other variables similarly pointing to the appropriate item within the Biomedical Concept). This allows the VSTESTCD and VSTEST variables (as well as other fixed values) to be set based on the content within the Biomedical Concept.
Now I have a chain: the protocol specified BP, which is composed of SYSBP and DIABP. These Biomedical Concepts are used to define the CRF, the CRF fields referring to the items within the Biomedical Concepts (the leaves of the trees seen earlier), and the variables within the VS domain also refer to the same items. I have linked my protocol to my CRF and to my tabulation. I can work forward. Equally, and very importantly, I can work back. I get traceability for free.
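The chain can be sketched in a few lines. This is a simplified illustration under my own assumptions: the identifiers (`protocol:BP`, `crf:VitalSigns.position` and so on), the link list and the two trace functions are hypothetical and are not taken from any tool.

```python
from collections import defaultdict

# (source, target) pairs: the source refers to, or contains, the target.
# All identifiers are hypothetical and for illustration only.
LINKS = [
    ("protocol:BP", "bc:SYSBP"),
    ("protocol:BP", "bc:DIABP"),
    ("bc:SYSBP", "item:SYSBP.position"),
    ("bc:DIABP", "item:DIABP.position"),
    ("crf:VitalSigns.position", "item:SYSBP.position"),
    ("crf:VitalSigns.position", "item:DIABP.position"),
    ("sdtm:VS.VSPOS", "item:SYSBP.position"),
    ("sdtm:VS.VSPOS", "item:DIABP.position"),
]

forward = defaultdict(set)
backward = defaultdict(set)
for source, target in LINKS:
    forward[source].add(target)
    backward[target].add(source)


def users_of(node, prefix):
    """Nodes of one kind (by prefix) that point directly at the node."""
    return sorted(n for n in backward[node] if n.startswith(prefix))


def trace_forward(protocol_element, prefix):
    """Protocol element -> BCs -> items -> CRF fields or SDTM variables."""
    found = set()
    for bc in forward[protocol_element]:
        for item in forward[bc]:
            found.update(users_of(item, prefix))
    return sorted(found)


def trace_back(variable):
    """SDTM variable -> items -> BCs -> protocol elements."""
    found = set()
    for item in forward[variable]:
        for bc in users_of(item, "bc:"):
            found.update(users_of(bc, "protocol:"))
    return sorted(found)


print(trace_forward("protocol:BP", "crf:"))   # ['crf:VitalSigns.position']
print(trace_forward("protocol:BP", "sdtm:"))  # ['sdtm:VS.VSPOS']
print(trace_back("sdtm:VS.VSPOS"))            # ['protocol:BP']
```

Note that the position is collected once on the form but points at the item in both concepts, as described above.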
Slide 12 – Terminology I
Within the system I have imported the last 11 versions of the CDISC terminology. Why the last 11? Well this is the set that has been exported from the NCI Thesaurus in a semantic format (OWL format). These files can be directly imported into the system from CDISC eSHARE (a load takes a couple of minutes for a new quarterly release). Once loaded each release can be searched and browsed and the changes between versions examined.
Slide 13 – Terminology II
Changes across all versions can be examined to see what has changed and further detail can then be examined and drilled into.
Slide 14 – Terminology III
This is a screen shot of one code list and the changes. Here C101841 has had a couple of changes since December 2013; the code list submission value was modified in December 2014 and the definition modified in June of 2014.
Slide 15 – Terminology IV
This slide is here just to make the changes more visible during the actual presentation. Screen shots are sometimes difficult to project at a conference, so I added this for clarity of one of the changes.
Slide 16 – Terminology V
This screen shot is then a further decomposition of the code list into one of the items. Here we are looking at C100040 and the changes made to this item. Two changes to the submission value have taken place, in June 2014 and then again in December 2014. These are important to note. If SDTM data had been generated in March 2014 then checking against a later version of terminology will cause errors to be detected. This is an example of why visibility of changes and precision about which versions are used are important.
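The version problem can be shown with a small sketch. The release keys follow the dates mentioned above, but the submission values are placeholders (I have deliberately not reproduced the real values) and the `check` function is a hypothetical helper, not part of any validation tool.

```python
# Release date -> {C code: submission value}.
# The submission values are placeholders, not the real CDISC values.
RELEASES = {
    "2014-03": {"C100040": "ORIGINAL VALUE"},
    "2014-06": {"C100040": "FIRST REVISION"},
    "2014-12": {"C100040": "SECOND REVISION"},
}


def check(values, release):
    """Return the data values that are not submission values in the release."""
    allowed = set(RELEASES[release].values())
    return [value for value in values if value not in allowed]


# Data produced against the March 2014 release.
data = ["ORIGINAL VALUE"]

print(check(data, "2014-03"))  # []
print(check(data, "2014-12"))  # ['ORIGINAL VALUE']
```

The data have not changed; only the version being checked against has, which is why the version used needs to be recorded with the same care as the data.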
Slide 17 – Biomedical Concept I
Having got a firm base of terminology in place I can then build a BC. For the purposes of this talk I created a simple BC based on a template (all BCs are based on a template so as to ensure we get consistency across BCs and we don’t create BCs that don’t fit with others).
This screen shot is the creation of Hip Circumference with the test code, the test name, date & time, the result value and the result units. Other items within the template are not used.
Slide 18 – Biomedical Concept II
This next slide is a zoom-in on the unit code list I set up. I have made a mistake, but a deliberate one.
- On slide 12 you will notice a green tick against the December 2015 version of the CDISC terminology. This is my ‘current’ version, the one I am currently using.
- I know that in the March 2016 release the AU submission value is modified.
- Therefore I can use this to simulate change during impact analysis.
So I used the AU unit for the purposes of the demonstration.
Slide 19 & 20 – Form I & II
On these two slides I then create a form within the system to use the BC we just created. Form creation is very quick because I have all the information in the BC. So it is really just a case of:
- Enter the form label and identifier.
- Add the BC to the form.
The system itself can fill in the details as it knows the details of the BC.
Slide 21 – CRF and Annotated CRF
It is always nice to see things in ways we are familiar with. So this slide shows that very simple form in traditional CRF and aCRF form.
But how did that form get annotated?
Slide 22 & 23 – Domain I & II
The system has also imported semantic definitions of the SDTM using the CDISC2RDF project outputs. As a result the system has a definition for the standard domains and associated variables.
Update 28th April 2016: As per Kerstin’s comment the CDISC2RDF project is better referred to as the PhUSE Semantic Technology project. The resulting outputs were then put out for public review by CDISC and published on the CDISC website. The problem is, CDISC2RDF is a nice catchy name for the work!
All the user needs to do is associate the BC with a given domain. In this case I have associated the Hip Circumference BC with the Vital Signs domain. In the background the system then uses a mapping of SDTM domains to BRIDG and an equivalent mapping of BC Templates to BRIDG to make connections between variables and individual parts of the BC.
As an aside, this BRIDG mapping is important, not so much that BRIDG is used, but that a standard reference framework is used to ensure that SDTM and BCs are built against a consistent model to ensure data are placed into the correct place. See the references above for more detail on this.
So now we have a form linked to a BC and the domain linked to a BC. The system can then navigate these links to auto-annotate the form. These linkages are our linked metadata.
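As a rough sketch of how such an auto-annotation could work, assume that both the BC items and the domain variables have been mapped to paths in a common reference model. Everything below is an assumption for illustration: the `ref:` paths are made up and are not real BRIDG paths, the item keys and form labels are hypothetical, and the choice of VS variables is mine.

```python
# Hypothetical reference-model paths; real BRIDG paths are different.
BC_ITEM_TO_REF = {
    "HIPCIRC.test_code": "ref:observation.code",
    "HIPCIRC.date_time": "ref:observation.dateTime",
    "HIPCIRC.result_value": "ref:result.value",
    "HIPCIRC.result_units": "ref:result.unit",
}

# Assumed mapping of VS domain variables to the same reference paths.
VARIABLE_TO_REF = {
    "VSTESTCD": "ref:observation.code",
    "VSDTC": "ref:observation.dateTime",
    "VSORRES": "ref:result.value",
    "VSORRESU": "ref:result.unit",
}

# Form field label -> BC item the field refers to.
FORM_FIELDS = {
    "Date and time": "HIPCIRC.date_time",
    "Result": "HIPCIRC.result_value",
    "Units": "HIPCIRC.result_units",
}


def annotate(form_fields):
    """Annotate each field with the variable sharing its reference path."""
    ref_to_variable = {ref: var for var, ref in VARIABLE_TO_REF.items()}
    return {
        label: ref_to_variable.get(BC_ITEM_TO_REF[item])
        for label, item in form_fields.items()
    }


print(annotate(FORM_FIELDS))
# {'Date and time': 'VSDTC', 'Result': 'VSORRES', 'Units': 'VSORRESU'}
```

Neither the form nor the domain needs to know about the other; the common reference path is what joins them.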
Slide 24 – Linked Metadata
The next slide has animations on it in four steps. It is there to overview what has been undertaken using the metadata repository.
- First we add terminology in a semantic form (the green). This is our terminology level.
- Then we built a BC binding the appropriate terminology to the Test Code and the units (the blue). This is our data level and is a representation of the data in the real world, not how we wish to view it for a particular operational purpose. This is rather important.
- Then we created a form based on the BC (the orange). This is in the operational layer.
- And we linked the BC to a domain (the red). Again, this is in our operational layer.
Note the solid and dotted lines on the diagram. The solid lines represent the links between items that belong with each other, the parts of a terminology release, the parts of a BC etc. The dotted lines represent references to items we wish to use, a BC using a terminology item for example. But all of these are held within the database in a semantic form as linked definitions (triples). Within the database there is actually no technical difference between the links, they are all triple references.
Note the connections at the operational level down into the data level and back up. These connections provide the basis for traceability (another theme in the FDA’s Technical Conformance Guide) and the route to automation.
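The idea that every link is just a triple can be sketched with plain tuples. This is a toy in-memory stand-in for a triple store, not an RDF implementation; the identifiers and the two predicate names (`hasItem` for the solid lines, `refersTo` for the dotted ones) are assumptions of mine.

```python
# (subject, predicate, object). All names are hypothetical.
TRIPLES = {
    # Solid lines: parts that belong together.
    ("term:CT_V1", "hasItem", "term:UNIT_CODELIST"),
    ("term:UNIT_CODELIST", "hasItem", "term:UNIT_ITEM"),
    ("bc:HIPCIRC", "hasItem", "bc:HIPCIRC.result_units"),
    # Dotted lines: references to items we wish to use.
    ("bc:HIPCIRC.result_units", "refersTo", "term:UNIT_ITEM"),
    ("form:HIP_FORM", "refersTo", "bc:HIPCIRC"),
    ("domain:VS", "refersTo", "bc:HIPCIRC"),
}


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in TRIPLES
        if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])
    )


# What uses the BC? The same query works for solid and dotted links alike.
for triple in match(o="bc:HIPCIRC"):
    print(triple)
# ('domain:VS', 'refersTo', 'bc:HIPCIRC')
# ('form:HIP_FORM', 'refersTo', 'bc:HIPCIRC')
```

The solid versus dotted distinction lives only in the predicate; the storage and the query mechanism are identical.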
Slide 25 – Impact Analysis I
Having created the definitions and then loaded a new version of say the CDISC Terminology we can then perform an automated impact analysis checking for any changes in the terminology and then assessing the impact of changes. Changes may be for any field within a code list item but here I have focused on submission values as they have significant impact on actual datasets.
As we can see, the expected change in C73686 is noted and the subsequent impact on the BC and onwards to the Form and VS domain is noted in the report. The report takes a short time to run (the more changes the longer the report).
Slide 26 – Impact Analysis II
Again an animated slide. The second green Terminology version is introduced; Vn+1. We can link the two versions as the same entity (CDISC Terminology in our example), the child items can be matched by their identifiers (C Codes in this case) and then the attributes compared to seek out differences (in this case submission values).
When a difference is detected the linked metadata can be traversed following the links so as to assess the impact. So for the code list item modified we can see where the item is referenced (which BCs). From there we continue navigating the links to see which forms reference that BC and the same with Domains. The thick red lines represent the navigation through the linked metadata (or linked data, metadata being data) to seek out our desired results.
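The two steps, comparing versions and then following the links, can be sketched as follows. This is a minimal illustration under stated assumptions: the item codes and submission values are placeholders, the `REFERENCED_BY` table is a hypothetical stand-in for the linked metadata, and a simple breadth-first walk is my choice of traversal.

```python
from collections import deque

# Two versions of a terminology: {item identifier: submission value}.
# Identifiers and values are placeholders, not real C codes or terms.
V_N = {"C_ITEM_1": "VALUE A", "C_ITEM_2": "VALUE B"}
V_N1 = {"C_ITEM_1": "VALUE A2", "C_ITEM_2": "VALUE B", "C_ITEM_3": "VALUE C"}

# Who references what: item -> BCs, BC -> forms and domains.
REFERENCED_BY = {
    "C_ITEM_1": ["bc:HIPCIRC"],
    "bc:HIPCIRC": ["form:HIP_FORM", "domain:VS"],
}


def changed_items(old, new):
    """Match items by identifier and compare their submission values."""
    return sorted(
        code for code in old.keys() & new.keys() if old[code] != new[code]
    )


def impact(start):
    """Walk the references breadth-first, collecting everything affected."""
    seen, queue, affected = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for user in REFERENCED_BY.get(node, []):
            if user not in seen:
                seen.add(user)
                affected.append(user)
                queue.append(user)
    return affected


report = {code: impact(code) for code in changed_items(V_N, V_N1)}
print(report)
# {'C_ITEM_1': ['bc:HIPCIRC', 'form:HIP_FORM', 'domain:VS']}
```

Items that only exist in the new version (`C_ITEM_3` here) are not reported as changes because nothing can reference them yet.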
Slide 27 – Upgrade
Having all this information to hand also allows for automatic upgrade. So having determined the impact we can decide if we are happy to move forward with the new version of the terminology. If so we can allow the system to upgrade the links within the metadata.
Again an animated slide showing the links being updated from the BC to the new code list items.
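An upgrade of the links might look something like the sketch below. It assumes each BC item holds a reference made up of a terminology version and an item identifier; the identifiers, the `upgrade` function and the decision to leave unmatched references untouched are all my assumptions.

```python
def upgrade(links, old_version, new_version, new_codes):
    """Re-point references from the old version to the new one.

    A reference is only moved if its item still exists in the new
    version; anything else is left as-is for manual review.
    """
    upgraded = {}
    for bc_item, (version, code) in links.items():
        if version == old_version and code in new_codes:
            upgraded[bc_item] = (new_version, code)
        else:
            upgraded[bc_item] = (version, code)
    return upgraded


# BC item -> (terminology version, item identifier). Placeholder identifiers.
links = {
    "bc:HIPCIRC.result_units": ("2015-12", "C_ITEM_1"),
    "bc:HIPCIRC.test_code": ("2015-12", "C_ITEM_9"),
}

# Identifiers present in the new release (C_ITEM_9 is deliberately absent).
new_release_codes = {"C_ITEM_1", "C_ITEM_2"}

after = upgrade(links, "2015-12", "2016-03", new_release_codes)
print(after["bc:HIPCIRC.result_units"])  # ('2016-03', 'C_ITEM_1')
print(after["bc:HIPCIRC.test_code"])     # ('2015-12', 'C_ITEM_9')
```

Returning a new dictionary rather than modifying the links in place keeps the old state available should the upgrade need to be reviewed or rolled back.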
Slide 28 – And Then …
The benefit of the semantic approach is the ability to expand and link to new content. So we can add in the new areas as we desire.
As we stated above, the protocol may wish to reference BCs (they are only observations that we refer to by name in the protocol document). We could introduce a protocol model as it becomes available and link it with our existing metadata definitions.
The same approach can be taken with the analysis world. Work is on-going in this area within the PhUSE working groups. As models appear they can be linked to the existing infrastructure. The links I have shown here are just ideas at the present but datasets based on derived data that is itself based on captured raw data is an obvious first simple model.
But whatever the model, the semantic web helps us expand and adjust as we learn.
Slide 29 – Summary
So in summary. We need to be far more aware of versions and the relationship between various versions. This allows us to have greater understanding of our data. We need good version management. By doing so we are then in control, we know what is going on and our work gains precision and visibility to issues that may otherwise be hidden. Impact assessment becomes a lot easier.
To do this we need good tool support (and hence the reason behind a lot of my work). It is hard to achieve using manual methods or Excel. It’s possible but it is very hard. As a tool and a technology, the semantic web is well suited to this sort of application.