Saturday, November 16, 2013

Submissions in XML - first results

As already stated in my previous post, creating SDTM/SEND/ADaM datasets in XML has a considerable number of advantages. Today I want to demonstrate a few of them as implemented in the "Smart SDS-XML Viewer", a software tool that we will make available as open source when the final specification is published by CDISC.

The first advantage I want to demonstrate is real linking between datasets (also known as "joins"). For example, the FDA has insisted that the DM domain contain the date/time of first and last treatment, even though this information is already present in the EX dataset (EXSTDTC of the first record and EXENDTC of the last record). So the variables RFXSTDTC and RFXENDTC were created in the DM domain. That this violates the third normal form of good relational database design was obviously not taken into account. The reason the FDA wanted this is that their tool (SASViewer) was not able to link records in the EX and DM datasets. The SASViewer can only read SAS-Transport-5 files but has no idea what they mean.

With the new XML-based format, linking between datasets becomes very easy. It took me less than an hour to implement such a lookup for the date/time of first and last treatment, so that this information appears as a tooltip on USUBJID in the DM table. Here is the screenshot:

Both dates were taken from the EX dataset and are displayed in a very user-friendly format.
One thing that could easily be added (I haven't done so) is the number of days between the two dates, shown as a fourth line in the tooltip. Programming this would take me about 15 minutes, I guess. If I had to do this from the SAS-Transport-5 files, however, I don't think I would have a chance (at least not without having to use SAS).
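
To make the idea of such a lookup concrete, here is a minimal sketch in Python. It is not the code of the "Smart SDS-XML Viewer"; the subject IDs, the sample records and the function names are invented for illustration. It assumes that the EX records are already available in memory and that all EXSTDTC and EXENDTC values are complete ISO 8601 date/times (partial dates would need extra handling). It takes the earliest EXSTDTC and the latest EXENDTC per subject, and also computes the number of days between the two.

```python
from datetime import datetime

# Hypothetical EX records; a real viewer would read these from the XML dataset.
EX_RECORDS = [
    {"USUBJID": "S-001", "EXSTDTC": "2013-01-10T08:30", "EXENDTC": "2013-01-24T08:30"},
    {"USUBJID": "S-001", "EXSTDTC": "2013-01-25T09:00", "EXENDTC": "2013-03-01T09:15"},
    {"USUBJID": "S-002", "EXSTDTC": "2013-02-03T10:00", "EXENDTC": "2013-02-17T10:00"},
]


def treatment_window(usubjid, ex_records):
    """Return (first start, last end) for one subject, or None if no EX records exist."""
    rows = [r for r in ex_records if r["USUBJID"] == usubjid]
    if not rows:
        return None
    first = min(datetime.fromisoformat(r["EXSTDTC"]) for r in rows)
    last = max(datetime.fromisoformat(r["EXENDTC"]) for r in rows)
    return first, last


def tooltip(usubjid, ex_records):
    """Build the tooltip text for a USUBJID cell in the DM table."""
    window = treatment_window(usubjid, ex_records)
    if window is None:
        return f"{usubjid}\nNo exposure records"
    first, last = window
    # Calendar days between the two dates, ignoring the time of day.
    days = (last.date() - first.date()).days
    return "\n".join([
        usubjid,
        f"First treatment: {first:%Y-%m-%d %H:%M}",
        f"Last treatment: {last:%Y-%m-%d %H:%M}",
        f"Days between: {days}",
    ])


print(tooltip("S-001", EX_RECORDS))
```

The point of the sketch is only that the join lives in the viewing software, so DM itself does not need to carry a copy of the EX dates.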

A second advantage I would like to demonstrate is that supplemental qualifiers can now easily be kept in the parent domain. Most tools that generate SDTM datasets keep the supplemental qualifiers in the parent domain until the very last moment before the SAS-Transport-5 datasets must be generated. At that moment they are split off and "banned" to a separate SUPPQUAL dataset, such as SUPPDM for additional qualifiers of DM. The reason (I guess) that this was a requirement in SDTM is that there was no way to indicate in the dataset itself that a variable is a supplemental qualifier.
With the new format, however, this is not necessary anymore. If the supplemental qualifier is marked as such in the define.xml, there is no reason to "ban" it to another dataset. Software can then make sure that these variables are shown as being supplemental, e.g. by a different color.

The following screenshot shows how this has been done in the "Smart SDS-XML Viewer". In our study, there are 6 supplemental qualifiers for DM. Instead of "banning" them to a SUPPDM dataset, they were simply retained in the DM dataset. In the define.xml, they have been marked as supplemental by setting the value of the "Role" attribute to "SUPPLEMENTAL QUALIFIER". As the software also reads the metadata, it knows what to do with these variables: in this case, it colors them blue.
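
A small sketch of how software could pick up that marker from the metadata follows. This is an assumption-laden illustration, not the viewer's actual code: the XML fragment is a simplified, namespace-free stand-in for a real define.xml, it assumes that the "Role" attribute sits on the ItemRef elements, and the OIDs and function names are made up.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for the DM ItemGroupDef of a define.xml.
# A real define.xml uses XML namespaces and contains many more attributes.
DEFINE_FRAGMENT = """
<ItemGroupDef OID="IG.DM" Name="DM">
  <ItemRef ItemOID="IT.DM.USUBJID" Role="IDENTIFIER"/>
  <ItemRef ItemOID="IT.DM.AGE" Role="RECORD QUALIFIER"/>
  <ItemRef ItemOID="IT.DM.COMPLT16" Role="SUPPLEMENTAL QUALIFIER"/>
</ItemGroupDef>
"""


def supplemental_item_oids(define_xml):
    """Collect the ItemOIDs whose Role marks them as supplemental qualifiers."""
    root = ET.fromstring(define_xml.strip())
    return {
        ref.get("ItemOID")
        for ref in root.iter("ItemRef")
        if ref.get("Role") == "SUPPLEMENTAL QUALIFIER"
    }


def column_color(item_oid, supplemental):
    """Pick a display color: blue for supplemental qualifiers, black otherwise."""
    return "blue" if item_oid in supplemental else "black"


supplemental = supplemental_item_oids(DEFINE_FRAGMENT)
for oid in ("IT.DM.USUBJID", "IT.DM.AGE", "IT.DM.COMPLT16"):
    print(oid, column_color(oid, supplemental))
```

Because the marker comes from the metadata, the dataset itself needs no naming convention and no separate SUPPDM file.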

A third advantage I would like to demonstrate is the lifting of the 8-, 40-, and 200-character limitations, which caused so much pain in the past. In the following screenshot, the label for the variable COMPLT16 is displayed as a tooltip when the user hovers the mouse over the column header. In SAS-Transport-5, labels were limited to 40 characters, a limitation we can now get rid of.

Similarly, we can now get rid of the 200-character limitation. The SDTM forces us to split values of more than 200 characters over different variables and even different datasets (here too, there is a "banning" to a SUPPQUAL dataset). The split even has to be made between words, and not in the middle of a word. In the CO domain (Comments domain), comment values have to be split and distributed over different variables, e.g. over COVAL, COVAL1 and COVAL2.
None of this is needed when using the new format. We do not split anything, as there is no reason anymore to do so.
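
For contrast, this is roughly the kind of splitting that the 200-character limit forced on us. The sketch below is only an illustration in Python, with made-up function names and a made-up comment text. It relies on the standard textwrap module to break between words, and a single word longer than the limit is deliberately left unbroken.

```python
import textwrap


def split_for_xpt(text, limit=200):
    """Split text into chunks of at most `limit` characters, breaking only at whitespace."""
    return textwrap.wrap(text, width=limit, break_long_words=False, break_on_hyphens=False)


def to_coval_variables(text, limit=200):
    """Distribute a long comment over COVAL, COVAL1, COVAL2, ... as the old format required."""
    chunks = split_for_xpt(text, limit)
    names = ["COVAL"] + [f"COVAL{i}" for i in range(1, len(chunks))]
    return dict(zip(names, chunks))


# Made-up comment of 499 characters: 100 four-letter words separated by single spaces.
comment = " ".join(["word"] * 100)
for name, value in to_coval_variables(comment).items():
    print(name, len(value))
```

With the XML-based format, the whole comment simply stays in COVAL.
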
The following snapshot shows a record in the CO dataset as displayed by the "Smart SDS-XML Viewer":


  1. Hi Jozef,

    Having non-standard variables as part of the domain, instead of having them in separate Supplemental Qualifiers, is not related to the transport format. This could be implemented in SAS Version 5 transport files just as easily. In fact, the CDISC SDS team published a proposal for this already in 2011. There are many good reasons to have an alternative for SAS XPT files, but this is not one of them.

    Lex Jansen

    Disclaimer: The opinions expressed above are my personal thoughts and may not reflect the opinions of my employer (SAS).

    1. Hi Lex,

      I agree that it is possible, and should have been implemented.

      However, what I understood from some SDTM people is that this was not acceptable (already a long time ago?) for the FDA (which FDA?), as it is not possible to flag those variables as being non-standard within the SAS XPT files themselves. At that time (I mean when SUPPQUAL was first defined) we did not have define.xml either, so reviewers had no way (except for looking into the spec or implementation guide) to find out whether a variable is standard or not when looking at the datasets. Of course one could have used a naming convention (I do not like naming conventions though), e.g. that all non-standard variables should have the name starting with "SUPPEFFI" in the LZZT-DM example.

      Thanks for pointing me to this!

    2. Last sentence should of course have been:
      "that all non-standard variables should have the name starting with "SUPP", like e.g. "SUPPEFFI" in the LZZT-DM example."