Over the last couple of weeks I have been working to improve Glandon’s support for SDTM. This includes support for the SDTM model, the implementation guides and the domains as actually implemented in studies. I have been keen to model the full set of relationships: SDTM Model variables, model classes, the use of classes as the basis for Implementation Guide (IG) domains, and the link from a sponsor’s actual implementation of domains back to a firm base in the model and the IG. This is all part of trying to build a better SDTM model. I also want domain-level version control because I believe that, as we move forward, we will need to manage at the domain level so that targeted changes can be made in a more timely fashion; this is also something that will come with define.xml v2.1.
Besides the question of what the SDTM model actually is (not an easy one to answer, as it is hidden in the PDF text of the model and IG documents), another issue was finding a source of metadata for the various SDTM releases that would allow an electronic load. There are three sources:
- The SDTM Model and IG documents (PDF)
- CDISC eSHARE downloads (Microsoft Excel)
- The PhUSE CDISC 2 RDF project outputs (semantic web RDF)
The first option was never going to happen. Too much like hard work and it would take forever. The eSHARE downloads are electronic and cover the three desired versions of SDTM IG 3.1.2, 3.1.3 and 3.2 with the associated SDTM Models 1.2, 1.3 and 1.4 respectively. However, a big minus is these downloads lack any of the SDTM rules in an electronic form. The CDISC2RDF project does have the rules but only covers the first two versions. So none of the routes offers a perfect solution.
After much deliberation and several mugs of tea I decided to create a semantic model reflecting the SDTM ‘model’ and use the Excel sheets as the source of data/metadata. What follows are the results of that experience, which might be useful to others.
The eSHARE downloads share a format with some subtle differences. These are noted in the following table. Note, the list may not be exhaustive. Each download consists of a single sheet and combines the model definition with that for the IG domains (the downloads are really more IG focused).
|Column||Description||Used by Model||Used by IG||Format / Notes|
|Seq. For Order||Order within the domain||Y||Y||Positive integer|
|Observation Class||The model class||Y||Y||For:
|Domain Prefix||The domain prefix||Y||Code list. See C66734|
|Variable Name (minus domain prefix)||The variable name minus the prefix||Y||Y||Free text|
|Variable Name||The variable name||Y||Y||Free text. Should be upper case and <= 8 characters|
|Variable Label||The variable label||Y||Y||Free text. Should be <= 40 characters|
|Type||The type||Y||Y||Num, Char. For 3.1.2 the field is blank for Code Lists and ISO 8601, i.e. when the Controlled Terms or Format column is populated.|
|Controlled Terms or Format||Controlled terms or Format||Y||For:
Can be blank for some IG entries, e.g. RELREC
|CDISC Notes (for domains) Description (for General Classes)||Notes||Y||Y||Free text|
|Core||The compliance for the variable||Y||
The table speaks to some of the difficulties in using the files and the differing content across the three versions. A few other observations about the Excel files just to give a flavour of some of the issues:
- There is no version information in the files: no statement of the Model or IG version to which the definitions relate, other than a tab name in an inconsistent format.
- The main sheets are inconsistently named across the three files.
- The Format field refers to a code list using the submission value as the key. No version is quoted, either at the variable level, for a domain or for the IG as a whole.
- There are errors in the sheets. Occasionally you come across blank values where there should be values stated. For example, a few datatypes are missing along with a few core values.
- Some useful information is not present. I have already mentioned the rules. Other information that would be useful includes domain structure and domain long names (code list C66734 could be used for the latter but, without a version, if a label changes you don’t know which one is correct).
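The constraints from the table and the gaps listed above lend themselves to a simple pre-load check. What follows is only a minimal sketch, assuming each sheet row has already been read into a dictionary keyed by the column headings; the function name `check_row`, the `require_core` flag and the sample rows are hypothetical and are not part of Glandon or the eSHARE files.

```python
from typing import Dict, List

# Column headings as they appear in the table above.
NAME = "Variable Name"
LABEL = "Variable Label"
TYPE = "Type"
CORE = "Core"
FORMAT = "Controlled Terms or Format"


def check_row(row: Dict[str, str], require_core: bool = False) -> List[str]:
    """Return a list of human-readable problems found in one sheet row."""
    problems: List[str] = []
    name = (row.get(NAME) or "").strip()
    label = (row.get(LABEL) or "").strip()
    dtype = (row.get(TYPE) or "").strip()

    if not name:
        problems.append("missing variable name")
    else:
        if name != name.upper():
            problems.append("variable name is not upper case")
        if len(name) > 8:
            problems.append("variable name longer than 8 characters")

    if len(label) > 40:
        problems.append("label longer than 40 characters")

    if not dtype:
        # The 3.1.2 sheet leaves Type blank when the format column is
        # populated, so only flag a blank Type when both are empty.
        if not (row.get(FORMAT) or "").strip():
            problems.append("missing type")
    elif dtype not in ("Num", "Char"):
        problems.append("unexpected type: " + dtype)

    # Assumption: the caller decides whether a Core value is expected.
    if require_core and not (row.get(CORE) or "").strip():
        problems.append("missing core value")

    return problems


# Made-up rows, for illustration only.
good_row = {NAME: "XXTESTCD", LABEL: "Example Short Name", TYPE: "Char", CORE: "Req"}
bad_row = {NAME: "toolongname", LABEL: "x" * 41, TYPE: ""}
```

Running the check over every row before the load gives a list of values that need to be filled in by hand, rather than discovering them part-way through.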
However, by making a few assumptions, filling in the missing values and allowing for the differences in format, a semantic load can be made. It populates three SDTM Models, the base variables (the --[Var Name] versions), classes based on those base variables and then the IG versions with their associated IG domains, each domain being based on a domain class.
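To make the shape of that load a little more concrete, here is a rough sketch of grouping IG rows into domains and linking each variable back to its generic model variable. It is an illustration under assumptions rather than the Glandon implementation: the class and function names are hypothetical, the rows are made up, and it assumes that an unprefixed variable carries the same text in both name columns.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IgVariable:
    name: str       # name as used in the domain
    based_on: str   # generic model variable the IG variable derives from


@dataclass
class IgDomain:
    prefix: str
    based_on_class: str
    variables: List[IgVariable] = field(default_factory=list)


def build_domains(rows: List[Dict[str, str]]) -> Dict[str, IgDomain]:
    """Group sheet rows into domains, keeping the sheet's ordering."""
    domains: Dict[str, IgDomain] = {}
    for row in sorted(rows, key=lambda r: int(r["Seq. For Order"])):
        prefix = row["Domain Prefix"]
        domain = domains.setdefault(
            prefix, IgDomain(prefix, row["Observation Class"])
        )
        name = row["Variable Name"]
        short = row["Variable Name (minus domain prefix)"]
        # Prefixed variables map to the "--" form of the model variable;
        # unprefixed ones are assumed to map to a variable of the same name.
        based_on = "--" + short if name == prefix + short else name
        domain.variables.append(IgVariable(name, based_on))
    return domains


# Made-up rows for a fictional "XX" domain.
rows = [
    {"Seq. For Order": "2", "Observation Class": "Findings",
     "Domain Prefix": "XX", "Variable Name": "XXTESTCD",
     "Variable Name (minus domain prefix)": "TESTCD"},
    {"Seq. For Order": "1", "Observation Class": "Findings",
     "Domain Prefix": "XX", "Variable Name": "STUDYID",
     "Variable Name (minus domain prefix)": "STUDYID"},
]
domains = build_domains(rows)
```

In a semantic load the `based_on` values would become links between resources rather than strings, but the grouping logic is much the same.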
Having loaded the three SDTM Model versions and the three IG versions I executed a couple of SPARQL queries. I had seen a recent LinkedIn question about SDTM and 40 character label names, so I ran the first one you can see in the image on the right. This one asks for all Model Variables where the length of the label is greater than 40 characters. It outputs the URI, the variable name and the offending label. I am not sure whether Model variable labels have to be 40 characters or fewer in the same way as IG variable labels do, but I thought I would check. The results indicate that the label for --TESTCD is over the limit in all three model versions.
More interesting are the results for the IGs. Change the type from ModelVariable to IgVariable (see the query below) and we get the result shown in the third image below. We have 14 results: two from internal version 2 (IG 3.1.3) and 12 from internal version 3 (IG 3.2).
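For readers who cannot see the images, a query of this kind might look roughly like the one generated below. This is a sketch only: the `ex:` namespace and the `ex:name` predicate are hypothetical placeholders and not the actual Glandon schema, and the helper `long_label_query` is introduced purely for illustration. The only parts taken from the text are the two type names and the 40 character limit.

```python
# Hypothetical namespace; the real schema prefix is not given in this post.
PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX ex: <http://example.org/sdtm#>\n"
)


def long_label_query(rdf_type: str, max_len: int = 40) -> str:
    """Build a SPARQL query listing variables whose label exceeds max_len."""
    return (
        PREFIXES
        + "SELECT ?var ?name ?label\n"
        + "WHERE {\n"
        + f"  ?var rdf:type ex:{rdf_type} .\n"
        + "  ?var ex:name ?name .\n"      # assumed predicate for the variable name
        + "  ?var rdfs:label ?label .\n"
        + f"  FILTER (STRLEN(STR(?label)) > {max_len})\n"
        + "}\n"
    )


model_query = long_label_query("ModelVariable")
ig_query = long_label_query("IgVariable")
```

Switching between the model and IG checks is then just a matter of changing the type passed in, which mirrors the one-word change described above.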