Saturday, August 8, 2015

SDTM/SEND/ADaM Validation - how it should be done

This is my personal opinion.
It is not the opinion of CDISC, not the opinion of any of the CDISC development teams I belong to or do not belong to (although I guess that some team members will probably strongly agree with it), nor the opinion of OpenCDISC (who I guess will strongly disagree).

How should validation of SDTM, SEND or ADaM datasets be done?
We have the CDISC standards specifications, the implementation guides (further abbreviated as "IG") and we have the define.xml. Furthermore we currently have two formats for the datasets themselves: SAS Transport 5 and Dataset-XML. Only the former is currently (unfortunately) being accepted by the FDA.

First let me state the most important point of all: Define.xml is leading.

This means that what is in the define.xml file that is part of the submission is "the truth". This file contains all metadata of the submission, whether it be an SDTM, SEND or ADaM submission.
So all submitted datasets should be validated against the metadata in the define.xml.
Now you will say "Wait a minute Jozef, being valid against the metadata in the define.xml doesn't mean that the datasets are in accordance with what is written in the corresponding IG".

That is correct! So the first step should always be to validate the contents of the define.xml against the contents of the IG. For example, if controlled terminology (CT) is expected for a specific variable, this should be declared in the define.xml, using a CodeList, and the contents of the CodeList should match the controlled terminology published by CDISC, meaning that every entry in the define.xml codelist should appear in the published controlled terminology, unless the latter has been defined as "extensible". It does not mean that the CodeList should have exactly the same contents as the CDISC-CT list, as usually only a fraction of the coded values is used in the actual submission. For example, the CDISC-CT for lab test codes contains several hundred terms, but the define.xml should only list those that have actually been used in the study.
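The codelist check described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `check_codelist` is hypothetical, the term sets are shortened made-up extracts rather than real CDISC-CT content, and it assumes that the define.xml codelist and the published CT have already been read into plain collections of strings.

```python
def check_codelist(define_terms, published_terms, extensible):
    """Return define.xml codelist entries that violate the published CT.

    Every define.xml entry must appear in the published list,
    unless the published codelist is extensible.
    """
    unknown = sorted(set(define_terms) - set(published_terms))
    # Sponsor-added terms are acceptable for extensible codelists.
    return [] if extensible else unknown


# Hypothetical, heavily shortened extracts for illustration only.
published = {"ALB", "ALT", "AST", "GLUC"}

# Only the terms used in the study are listed: a subset is fine.
print(check_codelist(["ALT", "GLUC"], published, extensible=False))

# A term that is not in a non-extensible list is reported.
print(check_codelist(["ALT", "XYZ"], published, extensible=False))

# The same term is accepted when the list is extensible.
print(check_codelist(["ALT", "XYZ"], published, extensible=True))
```

Note that the check is deliberately one-directional: it never complains about published terms that are missing from the define.xml codelist.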
Also the maximal lengths, labels and datatypes from the define.xml should be checked against the IG. For example, test codes may be at most 8 characters long and test names at most 40. If your define.xml states that the maximal length for LBTESTCD is 7, that is completely OK; if it states that the maximal length of LBTESTCD is 9, that is a violation.
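As a rough sketch of the length check, assuming the lengths declared in the define.xml have already been extracted into a dictionary keyed by variable name. The names `IG_MAX_LENGTH`, `ig_max_length` and `length_violations` are hypothetical, and only the two limits mentioned above (8 for test codes, 40 for test names) are covered.

```python
# Upper bounds from the two examples in the text; not a complete list.
IG_MAX_LENGTH = {"TESTCD": 8, "TEST": 40}


def ig_max_length(variable):
    """Return the limit for a --TESTCD or --TEST variable, else None."""
    # "TESTCD" is tried first; a name ending in "TESTCD" does not end in "TEST".
    for suffix in ("TESTCD", "TEST"):
        if variable.endswith(suffix):
            return IG_MAX_LENGTH[suffix]
    return None


def length_violations(define_lengths):
    """Return the variables whose declared length exceeds the limit."""
    violations = {}
    for variable, length in define_lengths.items():
        limit = ig_max_length(variable)
        if limit is not None and length > limit:
            violations[variable] = length
    return violations


# A declared length of 7 is fine, 9 is a violation.
print(length_violations({"LBTESTCD": 7, "LBTEST": 40}))
print(length_violations({"LBTESTCD": 9, "LBTEST": 40}))
```

A declared length below the limit is never reported, which matches the LBTESTCD example above.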
Unfortunately, the latest IGs still publish datatypes as "character" and "numeric", whereas define.xml uses much better and more granular datatypes. Fortunately, the define.xml specification itself provides a mapping (see section 4.2.1 "data type considerations" of the define.xml 2.0 specification). Other things, such as the order in which the variables appear, can also be checked at this point.

Once all the information in the define.xml has been validated against the IG and everything is OK, one can proceed with the second step: validation of the datasets against the define.xml. The largest part of this validation consists of checking whether each variable value is of the correct datatype (as defined in the define.xml), is one of the values in the codelist provided by the define.xml (when applicable), and is not longer than the length defined in the define.xml. At this point, the label for each variable (when using SAS Transport 5) can also be checked against the one provided in the define.xml, which in turn must match the one from the IG (at least when it is not a sponsor-defined domain).
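The per-value checks of this second step could look roughly like the following Python sketch. It assumes the variable metadata has already been read from the define.xml into a small `VariableDef` object (a hypothetical class, not part of any CDISC library), that values arrive as strings, and it handles only the "text", "integer" and "float" datatypes; dates, times and the other define.xml datatypes are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariableDef:
    """Hypothetical container for metadata read from the define.xml."""
    name: str
    datatype: str                       # "text", "integer" or "float" only
    length: Optional[int] = None        # maximal length, if declared
    codelist: Optional[frozenset] = None  # allowed values, if any


def matches_datatype(value, datatype):
    """Check a string value against a (simplified) define.xml datatype."""
    converters = {"integer": int, "float": float}
    converter = converters.get(datatype)
    if converter is None:
        return True  # "text" and unhandled datatypes are not checked here
    try:
        converter(value)
    except ValueError:
        return False
    return True


def validate_value(value, var):
    """Return a list of issues for one value of one variable."""
    if value == "":
        return []  # empty values are treated as missing, not as invalid
    issues = []
    if not matches_datatype(value, var.datatype):
        issues.append(f"{var.name}: '{value}' is not of type {var.datatype}")
    if var.length is not None and len(value) > var.length:
        issues.append(f"{var.name}: '{value}' is longer than {var.length}")
    if var.codelist is not None and value not in var.codelist:
        issues.append(f"{var.name}: '{value}' is not in the codelist")
    return issues


lbtestcd = VariableDef("LBTESTCD", "text", 8, frozenset({"ALT", "GLUC"}))
lbseq = VariableDef("LBSEQ", "integer")
print(validate_value("ALT", lbtestcd))
print(validate_value("CHOLESTEROL", lbtestcd))
print(validate_value("1.5", lbseq))
```

Whether an empty value is acceptable depends on other metadata (such as whether the variable is mandatory), which this sketch does not look at.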

Now there are some "rules" that cannot be described in the define.xml. In most cases these are cross-domain or cross-dataset rules, although there are also some "internal-domain" rules such as uniqueness of USUBJID in DM, and uniqueness of the combination of USUBJID-xxSEQ in each domain. But even then, some of the rules can be checked using define.xml: for example there is the rule that ARMCD (or ACTARMCD) must be one from the TA (trial arms) domain/dataset. In the define.xml, a codelist will be associated with the variable ARMCD for TA. So one should test whether the value of ARMCD in the TA dataset is one from the associated codelist in the define.xml (note that e.g. "SCRNFAIL" should also appear in the define.xml codelist when there was at least one screen failure). When a "planned" or "actual" ARMCD/ACTARMCD is then present in another domain/dataset, and its value also corresponds to one of the entries in the define.xml codelist, the value is valid, without any need for direct cross-dataset validation.
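The two "internal-domain" uniqueness rules mentioned above are easy to sketch in Python. The example assumes each dataset has been loaded as a list of dictionaries, one per record; the function name `duplicate_keys` and the sample records are made up for illustration.

```python
from collections import Counter


def duplicate_keys(records, key_vars):
    """Return the key combinations that occur more than once."""
    counts = Counter(tuple(rec[var] for var in key_vars) for rec in records)
    return sorted(key for key, n in counts.items() if n > 1)


# Made-up records for illustration.
dm = [{"USUBJID": "S-001"}, {"USUBJID": "S-002"}, {"USUBJID": "S-001"}]
lb = [
    {"USUBJID": "S-001", "LBSEQ": 1},
    {"USUBJID": "S-001", "LBSEQ": 2},
    {"USUBJID": "S-002", "LBSEQ": 1},
]

# USUBJID must be unique in DM.
print(duplicate_keys(dm, ["USUBJID"]))

# USUBJID-xxSEQ must be unique in each domain (here: LB).
print(duplicate_keys(lb, ["USUBJID", "LBSEQ"]))
```

The same function serves both rules; only the list of key variables changes per domain.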

Once validation of the datasets against the define.xml has been done (step 2), one can proceed with step 3: validation against any remaining rules that cannot be described in the define.xml. Mostly these will be cross-domain rules, for example: "EX record is present, when subject is not assigned to an arm: Subjects that have withdrawn from a trial before assignment to an Arm (ARMCD='NOTASSGN') should not have any Exposure records" (FDA rule FDAC49). I took one of the FDA rules on purpose here, as it is a clear rule. The SDTM-IG, for example, does not really provide such explicit, clear rules, so validation software implementors usually base their rules on their own interpretation of the IG, which is very dangerous. It is very important that the rules are very clear and not open to different interpretations. Both the FDA and a CDISC team have published or will be publishing sets of rules (note that even the "FDA SDTM rules" are not always clear, and some are even completely wrong in some cases). Ideally, such rules are human-readable as well as machine-interpretable and machine-executable. Unfortunately, the published FDA rules for SDTM are also only "text", leaving them open again to different implementations. We have however already done some work on an XQuery implementation for the FDA rules, and one of our students recently implemented the CDISC rules in XQuery for the SDTM-DM domain. The plan is to extend this work, further develop and publish the XQueries in cooperation with CDISC and the FDA, and make these rules available over RESTful web services (so that they can be retrieved by computers without human intervention).
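To show what "machine-executable" could mean for a rule like FDAC49, here is a minimal sketch. It is written in Python purely for illustration and is not the XQuery implementation mentioned above; the function name `exposure_for_unassigned` and the sample records are hypothetical, and the datasets are assumed to be loaded as lists of dictionaries.

```python
def exposure_for_unassigned(dm_records, ex_records):
    """Sketch of FDAC49: return subjects with ARMCD='NOTASSGN' in DM
    that nevertheless have at least one record in EX."""
    not_assigned = {
        rec["USUBJID"] for rec in dm_records if rec.get("ARMCD") == "NOTASSGN"
    }
    offenders = {
        rec["USUBJID"] for rec in ex_records if rec["USUBJID"] in not_assigned
    }
    return sorted(offenders)


# Made-up records for illustration.
dm = [
    {"USUBJID": "S-001", "ARMCD": "PLACEBO"},
    {"USUBJID": "S-002", "ARMCD": "NOTASSGN"},
    {"USUBJID": "S-003", "ARMCD": "NOTASSGN"},
]
ex = [
    {"USUBJID": "S-001", "EXSEQ": 1},
    {"USUBJID": "S-002", "EXSEQ": 1},
]

# S-002 was never assigned to an arm but has an exposure record.
print(exposure_for_unassigned(dm, ex))
```

The point is that the rule text and the executable check can be kept side by side, so that there is only one interpretation of the rule.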

The figure below depicts (of course simplified) the process again:



The figure speaks of SDTM, but the same is equally applicable for SEND and ADaM.

Short summary:
  • step 1: for SDTM/SEND/ADaM submissions, validate the contents of the define.xml against the IG
  • step 2: validate the datasets against the define.xml
  • step 3: for all other "rules" (mostly cross-dataset/domain validation), validate the datasets against rules published by CDISC and FDA, if possible using human-readable, machine-executable rules.
Comments are of course very welcome!