Sunday, November 10, 2013

Submissions in XML

The CDISC XML Technologies Team is currently working on an ODM-XML based format for SDTM, SEND and ADaM submissions (or exchange). This format is envisaged to replace the old SAS Transport 5 (SAS-XPT, .xpt) format. The immediate advantages are obvious:
  • no more 8-, 40-, and 200-character limitations (on variable names, variable labels, and character values, respectively)
  • limitations on test codes disappear (these, for example, prevented us from using a LOINC code for LBTESTCD, as LOINC codes start with a digit, like 1234-5)
  • supplemental qualifiers can remain in their parent domain/dataset. All that needs to be done is to flag them as such in the define.xml file (e.g. using Role="SUPPLEMENTAL QUALIFIER")
  • no more splitting of information over different fields (the COVAL1, COVAL2, ... disaster) or even datasets ("banishing" information of more than 200 characters to supplemental qualifier datasets)
  • perfect fit with define.xml 1.0 and 2.0: validation of SDTM/SEND/ADaM datasets against the define.xml now really becomes a piece of cake. Both the metadata (define.xml) and the data (the new format) now use the same format - both are extended ODM.
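To make the old restrictions concrete, here is a minimal Python sketch of the kind of checks SAS-XPT forces on us. The function names and the regular expression are my own illustration (they are not part of any CDISC tool), and the limits are counted in characters for simplicity.

```python
import re

# Limits of SAS Transport 5 as used for SDTM (counted in characters here for simplicity).
MAX_NAME_LEN = 8      # variable names
MAX_LABEL_LEN = 40    # variable labels
MAX_VALUE_LEN = 200   # character values

# Assumed rule for test codes: at most 8 characters, only letters, digits
# and underscores, and not starting with a digit.
TESTCD_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,7}")


def xpt_violations(name, label, value):
    """Return a list of the SAS-XPT limits that a name, label or value exceeds."""
    problems = []
    if len(name) > MAX_NAME_LEN:
        problems.append(f"name longer than {MAX_NAME_LEN} characters")
    if len(label) > MAX_LABEL_LEN:
        problems.append(f"label longer than {MAX_LABEL_LEN} characters")
    if len(value) > MAX_VALUE_LEN:
        problems.append(f"value longer than {MAX_VALUE_LEN} characters")
    return problems


def is_valid_testcd(code):
    """Return True if the code satisfies the assumed test code rule above."""
    return TESTCD_PATTERN.fullmatch(code) is not None


print(is_valid_testcd("GLUC"))    # a conventional short test code
print(is_valid_testcd("1234-5"))  # the LOINC-style code from the example above
print(xpt_violations("LBTESTCD", "Lab Test or Examination Short Name", "x" * 201))
```

With an XML-based format none of these checks are needed any more, which is exactly the point of the list above.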
These are the advantages for the use of the standards (SDTM, SEND, ADaM) themselves. Further great advantages are:
  • we can now really achieve end-to-end, as we now have one format to transport information from study design to submission. Of course the contents will differ, but at least we do not need to switch between formats (and technologies) during the process
  • XML is the format of choice for exchange of information in the modern world. This also means that an enormous number of software programs and software libraries are available for working with XML
  • real vendor-neutrality: as ODM is an open standard (SAS-XPT was only semi-open and very hard to implement in software), anyone with some basic XML knowledge can now develop great software with great features that work with SDTM/SEND/ADaM datasets. In the >20 years of SAS-XPT for SDTM, I haven't seen a single successful third-party software program using it.
As the industry will need a transition period, the XML Technologies Team will also provide some tools like:
  • tools to transform existing SAS-XPT datasets into the new format 
  • tools to transform files in the new format to the old SAS-XPT format (but who would like to do so?) 
  • tools or scripts for loading the datasets in popular statistical software packages 
  • a viewer for inspecting datasets in the new format 
Development of that viewer is my task in the team. I called it the "Smart SDS-XML Viewer" as the name of the new standard will probably be "SDS-XML" and "smart" as the viewer will have capabilities and features that will go far beyond what the SASViewer could do.
The latter was just a viewer for SAS-XPT files: it was not "SDTM-savvy", and it did not even understand what SDTM is about or how it works.

The picture below shows a few of the first features that have been implemented so far:

  • simple SDTM/SEND/ADaM validation such as uniqueness of the USUBJID in the DM dataset
  • check whether the subject is really present in the DM dataset
  • validation whether all required/expected fields really have a value 
  • validation of dates: is the date a real existing date (2013-03-32 is not), does RFENDTC really come after RFSTDTC? 
  • calculation of age from BRTHDTC (when present) and RFSTDTC and checking against the value given in AGE 
  • display of "date of first study medication exposure" and "date of last study medication exposure" as retrieved from the EX dataset in the DM dataset. The latter means that we can now remove RFXSTDTC and RFXENDTC from the DM domain - they should never have been there as they are copied from EX
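The date and age checks in the list above can be sketched in a few lines of Python. This is not the viewer's own code, just an illustration: the function names are hypothetical, a DM record is assumed to be a plain dictionary of strings, AGE is assumed to be given in whole years, and partial dates (such as 2013-03) are simply reported as "not a complete date".

```python
from datetime import date


def parse_iso_date(text):
    """Return a date for a complete ISO 8601 date (YYYY-MM-DD), else None."""
    if not text:
        return None
    try:
        # Only the date part is used, so "2013-03-05T10:00" is accepted too.
        return date.fromisoformat(text[:10])
    except ValueError:
        return None


def age_in_years(birth, reference):
    """Completed years between the birth date and the reference date."""
    years = reference.year - birth.year
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1
    return years


def check_dm_record(record):
    """Return a list of findings for one DM record (a dict of strings)."""
    findings = []
    parsed = {}
    for name in ("BRTHDTC", "RFSTDTC", "RFENDTC"):
        raw = record.get(name, "")
        if raw:
            parsed[name] = parse_iso_date(raw)
            if parsed[name] is None:
                findings.append(f"{name}: '{raw}' is not a valid complete date")
    birth = parsed.get("BRTHDTC")
    start = parsed.get("RFSTDTC")
    end = parsed.get("RFENDTC")
    if start and end and end < start:
        findings.append("RFENDTC is before RFSTDTC")
    if birth and start and record.get("AGE") not in (None, ""):
        expected = age_in_years(birth, start)
        if int(record["AGE"]) != expected:
            findings.append(f"AGE is {record['AGE']} but {expected} was calculated")
    return findings


for finding in check_dm_record({
    "USUBJID": "ABC-001",
    "BRTHDTC": "1970-06-15",
    "RFSTDTC": "2013-03-01",
    "RFENDTC": "2013-03-32",
    "AGE": "43",
}):
    print(finding)
```

For the sample record, both the impossible RFENDTC and the AGE that disagrees with BRTHDTC and RFSTDTC are reported.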
The second screenshot shows how easily supplemental qualifiers can now be visualized: the picture shows the right side of the DM table where the supplemental DM qualifiers are shown.

The columns containing these are colored somewhat differently (that information is retrieved from the define.xml). For ease of use, the USUBJID column has been shifted.
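As a rough sketch of how such a flag could be read from the define.xml, the Python snippet below looks for ItemRef elements with Role="SUPPLEMENTAL QUALIFIER". The XML fragment, the OIDs and the role value are assumptions for illustration only, as the final conventions have not been published yet.

```python
import xml.etree.ElementTree as ET

ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"

# Hypothetical define.xml fragment: the OIDs and the role value are made up.
DEFINE_FRAGMENT = """\
<ItemGroupDef xmlns="http://www.cdisc.org/ns/odm/v1.3" OID="IG.DM" Name="DM" Repeating="No">
  <ItemRef ItemOID="IT.DM.USUBJID" Mandatory="Yes" Role="Identifier"/>
  <ItemRef ItemOID="IT.DM.AGE" Mandatory="No" Role="Record Qualifier"/>
  <ItemRef ItemOID="IT.DM.RACEOTH" Mandatory="No" Role="SUPPLEMENTAL QUALIFIER"/>
</ItemGroupDef>
"""


def supplemental_item_oids(xml_text):
    """Return the ItemOIDs of all ItemRefs flagged as supplemental qualifiers."""
    root = ET.fromstring(xml_text)
    return [
        ref.get("ItemOID")
        for ref in root.iter(f"{{{ODM_NS}}}ItemRef")
        if ref.get("Role") == "SUPPLEMENTAL QUALIFIER"
    ]


print(supplemental_item_oids(DEFINE_FRAGMENT))
```

A viewer can then use the returned OIDs to decide which columns to color differently.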

Other features that have been implemented, but which can be better demonstrated using a movie (soon to come, stay tuned) are one-click "jumping" to the corresponding record in the DM dataset (and back), one-click jumping from a comment record in CO to its parent record in another dataset and few-click jumping from a RELREC record to its parent records.
Of course the software also allows sorting and filtering. For example, one can first load the DM dataset, filter all subjects above a certain age, and then load other datasets for those subjects only. This feature will probably make the life of reviewers much easier.
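The filtering idea can be illustrated with a small Python sketch. Records are assumed to be plain dictionaries, and the function names are hypothetical, not those used in the viewer.

```python
def subjects_older_than(dm_records, min_age):
    """Return the set of USUBJIDs in DM whose AGE is above min_age."""
    return {
        record["USUBJID"]
        for record in dm_records
        if record.get("AGE") not in (None, "") and int(record["AGE"]) > min_age
    }


def records_for_subjects(records, usubjids):
    """Keep only the records that belong to the selected subjects."""
    return [record for record in records if record["USUBJID"] in usubjids]


dm = [
    {"USUBJID": "ABC-001", "AGE": "43"},
    {"USUBJID": "ABC-002", "AGE": "67"},
    {"USUBJID": "ABC-003", "AGE": ""},
]
lb = [
    {"USUBJID": "ABC-001", "LBTESTCD": "GLUC"},
    {"USUBJID": "ABC-002", "LBTESTCD": "GLUC"},
]

selected = subjects_older_than(dm, 60)
print(records_for_subjects(lb, selected))
```

Only the laboratory records of subject ABC-002 remain, as that is the only subject older than 60 in this made-up example.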

Another small feature I implemented is highlighting of values (--STRESN) that are outside the reference range (defined by --STNRLO and --STNRHI) for all findings datasets.
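The range check itself is simple, as this Python sketch shows. It assumes a record is a dictionary of strings and that a missing or non-numeric limit means "no limit on that side"; the function name is my own choice for this example.

```python
def to_float(value):
    """Convert a value to float, returning None when it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def is_out_of_range(record, prefix):
    """Return True if --STRESN lies outside the range --STNRLO to --STNRHI."""
    value = to_float(record.get(prefix + "STRESN"))
    low = to_float(record.get(prefix + "STNRLO"))
    high = to_float(record.get(prefix + "STNRHI"))
    if value is None:
        return False  # nothing to compare
    if low is not None and value < low:
        return True
    if high is not None and value > high:
        return True
    return False


record = {"LBTESTCD": "GLUC", "LBSTRESN": "7.2", "LBSTNRLO": "3.9", "LBSTNRHI": "6.1"}
print(is_out_of_range(record, "LB"))
```

The two-letter domain prefix (here "LB") replaces the "--" in the variable names, so the same function can be used for all findings datasets.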

Now you will probably ask what the cost of this viewer software will be. The answer is "nothing". It will become available for free as "open source" with a license similar to that of OpenCDISC. So reviewers at the FDA will be able to use it for free from day 1, and users at sponsor companies will have the same tool available that the FDA reviewers are using. Even more important: as the tool will be open source, everyone can extend it and add great new features, for example for analysis, visualization, etc.

Stay tuned for more information and the public release announcement!


  1. You say "XML is the format of choice for exchange of information in the modern world". Well, I think "modern programmers" prefer JSON over XML. From a data architecture perspective I would say that a "modern model" is the key aspect. However, even if you have a "modern model" capable of organizing data as well as metadata and terminology, i.e. RDF, you can have it expressed in formats that are easier to use, such as JSON-LD and Turtle, or more complicated, such as RDF/XML.

  2. Hi Kerstin, I agree with you that essentially the transport format is unimportant as long as it does not constrain the model (as unfortunately SAS-XPT is doing). Also unfortunately, the current SDTM model is just two-dimensional (an idea from the relational database world), but "polluted" with derived variables that do not belong in such a model, such as RFXSTDTC and RFXENDTC and in most cases all the --DY variables.
    I disagree with you on the JSON issue: in the medical world I haven't really seen many JSON implementations. Electronic health records (like HL7-CDA) are all using XML. There has been some discussion within HL7 to develop JSON-based standards, but the idea seems to have been abandoned.
    But, as said, it essentially does not matter for "modern programmers". Unfortunately such people seem not to be present at the regulatory authorities ...

  3. Hi Jozef, I totally agree about the constraining SAS-XPT format, and also that the SDTM "model", or rather structure(s), is two-dimensional and includes some awkward constructs. As shown in the FDA/PhUSE Semantic Technology project, you can represent such structures using a basic RDF schema with classes such as DataElement, DataCaptureField and DataSet.

    Re. JSON, I think it's interesting to follow the discussion on using it for FHIR.