Sunday, February 5, 2017

SDTM-IG v3.2 Conformance Rules v1.0 - First implementation experiences

This weekend, I started implementing the just-published "SDTM-IG v.3.2 Conformance Rules v.1.0" under the umbrella of the "Open Rules for CDISC Standards" initiative, an initiative of a number of CDISC volunteers (not a formal team) to implement CDISC conformance rules in a vendor-neutral, open (non-proprietary), free format that is machine-executable but also human-readable. For this, the W3C open standard "XQuery" was selected, as it makes the rule implementations independent of the tool used, i.e. anyone can build tools that implement XQuery and use the published rules.

This is not new: the earlier published FDA, PMDA and CDISC ADaM rules have also been implemented in XQuery (as far as they make sense) and can be downloaded and used by everyone.

The newly published SDTM-IG rules come as an Excel file. This is unfortunate, as this format does not allow the rules to be used directly in software: the rules need to be read (by a human), interpreted, and then "translated" into software. This is far from ideal, as it leaves a lot of "wiggle room" in the interpretation of the rules. Fortunately, however, the team published the rules with pseudocode, consisting of a "precondition" like "VISITNUM ^= null" (meaning: the value of VISITNUM is not null) and the rule itself, like "VISITNUM in SV.VISITNUM". So the rule can be read as "when VISITNUM is not null, then its value must be found in the SV dataset as a VISITNUM value".
This is a great step forward relative to the FDA conformance rules, which are "narrative text only" rules, and which sometimes seem to be just the result of the "CDISC complaint box" at the White Oak Campus.
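
To give an idea of what such a rule looks like in XQuery, here is a minimal sketch of the VISITNUM check above. It works on Dataset-XML files and, for simplicity, assumes ItemOIDs of the form "IT.<domain>.<variable>"; a real implementation would resolve the ItemOIDs through the define.xml instead:

    (: sketch: flag records whose VISITNUM does not occur in the SV dataset :)
    declare namespace odm = "http://www.cdisc.org/ns/odm/v1.3";
    let $sv-visitnums :=
        doc("SV.xml")//odm:ItemData[ends-with(@ItemOID, '.VISITNUM')]/@Value
    for $record in doc("VS.xml")//odm:ItemGroupData
    let $visitnum := $record/odm:ItemData[ends-with(@ItemOID, '.VISITNUM')]/@Value
    where $visitnum and not($visitnum = $sv-visitnums)
    return <error>VISITNUM {string($visitnum)} not found in the SV dataset</error>

The precondition ("VISITNUM is not null") is simply the "where $visitnum" part (in Dataset-XML, null values are not serialized at all), and the rule itself is the "not(...)" part.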

It is now Sunday noon, and I have already been able to implement about 40 rules (there are 400 of them), so there is still a lot of work to do. But I would already like to share my first impressions:
  • Not all rules are implementable. Some of the rules are currently not "programmable", as they require information that is not in the datasets or in the define.xml. An example is "The sponsor does not have the discretion to exclude permissible variables when they contain data" (rule CG0015). Essentially, this is about traceability back to the study design and the collected datasets (both usually in CDISC ODM format). Maybe in the future the "Trace-XML extension", developed by Sam Hume, can help solve this.
    The Excel worksheet has a column "Programmable" (Y/N/C) and a column "Programmable Flag Comment", but I noticed that these are not always correct: I found rules that were stated to be non-programmable but which I think can be programmed, and vice versa.
  • Not all rules are very clear. Most of them are, thanks to the published "pseudo-code", but sometimes this is not enough. An example is:
    "Variable Role = IG Role for domains in IG, Role = Model Role for custom domains". Now, the "Role" is not in the dataset, and usually also not in the define.xml, as a "Role" is only necessary for non-standard variables, and is otherwise supposed to be the one from the IG or the model. So what is the rule here? What is checked? I have no idea!
    In other cases, I needed to look into the "public comments" document to understand the details of the rule. It would have been great if the published document had also contained a "rule details" column with extra explanation, including the answers to the questions from the public review period.
  • A good rule has a precondition, a rule, and a postcondition. The first two are present, but the postcondition is missing. The postcondition describes what needs to be done when the rule is obeyed or violated; in our case, this would normally be an error or warning message.
  • Good rules are written in such a way that they are easy to program. Rules like "VISIT and VISITNUM have a one-to-one relationship" are not ideal: they are better split into two rules, one stating something like "for each unique value of VISITNUM, there must be exactly one value of VISIT", and the other stating "for each unique value of VISIT, there must be exactly one value of VISITNUM". This is easier to implement and, at least as important, allows generating a violation message that is much clearer and more detailed (see the first sketch after this list). Also, from the description of some of the other rules, it is clear that the rule developers did not test (by writing code) whether they are easy to implement or not.
  • There were also many things that I liked a lot:
  • The document does not distinguish between errors and warnings. The word "error" does not even appear in the worksheet. Good rules are clear and can either be violated or not (nothing in between). Therefore, setting something to "warning" is never a good idea in rule making, as it usually has no consequences (with the exception of the yellow card in soccer maybe). The use of "Warning" in the FDA rules has generated a lot of confusion. For example, in the Pinnacle21 validation software, you get a warning when the "EPOCH" variable is absent from your dataset ("FDA expected variable not found"), but you also get a warning when it is present ("Model permissible variable added into standard domain"). So, whatever you do, you always get a warning on "EPOCH"!
  • More than in the FDA rules, the define.xml is considered to be "leading". For example rule CG0019: "Each record is unique per sponsor defined key variables as documented in the define.xml". This is not only a very well implementable rule (see the second sketch after this list), it is also much better than the (in my opinion not entirely correct) implementation in the Pinnacle21 tool, which usually leads to a large number of false positives, as it completely ignores the information in the define.xml.
    But also here, some further improvement is possible. For example rule CG0016: "... a null column must still be included in the dataset, and a comment must be included in the define.xml to state that data was not collected". It does not state where in the define.xml the comment must appear (my guess: as a def:CommentDef referenced by the ItemDef for that variable, but that is not stated), or what the contents of the comment should be. So how can I implement this rule in software? Is the absence of a def:CommentDef for that variable sufficient to make it a violation?
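
To illustrate the proposed split of the one-to-one rule, here is a minimal XQuery sketch of the first half ("for each unique value of VISITNUM, there must be exactly one value of VISIT"), with the same Dataset-XML and ItemOID assumptions as in the earlier sketch:

    declare namespace odm = "http://www.cdisc.org/ns/odm/v1.3";
    for $visitnum in distinct-values(
        doc("SV.xml")//odm:ItemData[ends-with(@ItemOID, '.VISITNUM')]/@Value)
    let $visits := distinct-values(
        doc("SV.xml")//odm:ItemGroupData[odm:ItemData[ends-with(@ItemOID, '.VISITNUM')]/@Value = $visitnum]
            /odm:ItemData[ends-with(@ItemOID, '.VISIT')]/@Value)
    where count($visits) > 1
    return <error>VISITNUM {$visitnum} maps to {count($visits)} different VISIT values: {string-join($visits, ', ')}</error>

The mirror rule (exactly one VISITNUM per unique VISIT) is obtained by simply swapping the roles of the two variables, and each half can then report exactly which value is in violation.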
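And to show how well a rule like CG0019 lends itself to implementation, here is a sketch that takes the key variables from the define.xml (the ItemRefs carrying a "KeySequence" attribute) and reports duplicate records. It uses the XQuery 3.0 "group by" clause, and the DM dataset and the file names are of course just examples:

    declare namespace odm = "http://www.cdisc.org/ns/odm/v1.3";
    let $define := doc("define.xml")
    let $key-oids :=
        for $ref in $define//odm:ItemGroupDef[@Name = 'DM']/odm:ItemRef[@KeySequence]
        order by xs:integer($ref/@KeySequence)
        return string($ref/@ItemOID)
    for $record in doc("DM.xml")//odm:ItemGroupData
    let $key := string-join(
        for $oid in $key-oids
        return string($record/odm:ItemData[@ItemOID = $oid]/@Value), '|')
    group by $key
    where count($record) > 1
    return <error>{count($record)} DM records share the same key values: {$key}</error>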
In the software world, when there is an open specification, there usually is a so-called "reference implementation". This means that anyone is allowed to create their own implementation of the specification, but for a well-defined test set, the results must be exactly the same as those generated by the reference implementation. Other implementations may add additional features, excel in performance, and so on.
Ideally, the source code of the reference implementation is open, so that everyone can inspect the details of the implementation.

For these kinds of rules (as well as the ones from the FDA, PMDA, ...), we would like a reference implementation to be published together with the rules. This reference implementation should be completely open, and written in such a way that the rules are at the same time human-readable and machine-executable. Our XQuery implementation comes close to this.
The people behind the "Open Rules for CDISC Standards" initiative will surely discuss this with CDISC. So maybe somewhere in the future you will hear or read about a reference implementation of the "SDTM-IG v.3.2 Conformance Rules v.1.0" written in XQuery!

The rules that I have implemented so far can be downloaded from "http://xml4pharmaserver.com/RulesXQuery/index.html" (the FDA and PMDA rule implementations can also be found there). You can inspect each of the rules, even if you have never used XQuery before; you can use them in your own software (even in SAS), and you can try them out with the "Smart Dataset-XML Viewer" by copying the XML file with the rules into the folder "Validation_Rules_XQuery" (just create it if it is not there yet); the software will "see" them immediately.
We are currently also implementing a RESTful web service for these rules, allowing applications to always (down)load the latest version of each rule (no more need to "wait until the next release ... maybe next year ...").

Keep checking the website, as I intend to make rapid progress with the implementation of these 400 rules: I will try to add a few new rule implementations every day. I hope to have everything ready by Easter (2017, of course).

And if you like this and would like to cooperate in the "Open Rules for CDISC Standards" initiative, or would like to provide financial support (so that we can outsource part of the work), just mail us, and we will get you involved! Many thanks in advance!

And last but not least: congratulations to the "SDTM Validation team" for this great achievement! We are not completely there yet, but with this publication, a great step forward was made!

Thursday, January 5, 2017

Generating Define-XML: the Pinnacle21 roundtrip test

In my previous post, I presented our new "Define.xml Designer" software, which implements all "best practices for generating define.xml", but also allows generating very good define.xml files for legacy studies for which the SAS-XPT files are already present, but no define.xml exists yet.

It looks, however, as if many people are still using the "Pinnacle21 Community Define.xml Generator", probably because it is free and uses Excel as the input format. The price for that is that there is no user manual, no support, and no graphical user interface. As there is neither a manual nor a GUI, the originators advise users to load an existing define.xml into the tool, generate the Excel worksheet from it, adapt the worksheet for the current study, and then generate the new define.xml from the worksheet with the tool. This usually results in a number of "trial-and-error" cycles, each time changing the worksheet and trying again, until the desired define.xml is obtained. However, for someone who knows the basic principles of XML (my students at the university learn these in less than 3 hours), I presume adapting the define.xml using an XML editor is considerably faster (and one understands what one is doing!).

A good test for such software is always to do a "round trip", i.e. taking a correct file, loading it into the tool, and then exporting it again. In the case of the Pinnacle21 Define.xml Generator, this means loading an existing define.xml, exporting it to an Excel worksheet, and then generating a new define.xml from that worksheet, without making any changes to it.
Ideally, the source define.xml and the newly generated define.xml should be 100% identical: no information should be lost, no new information should have been added, and existing information should not have been changed.

Round-tripping is a typical quality test for software. So we performed this test on the Pinnacle21 software (v.2.2.0), using the sample SDTM define.xml 2.0 file that comes with the standard's distribution.
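
For those who want to repeat such a test themselves: a first, very strict comparison can be done with a single line of XQuery (the file names are of course just examples):

    (: returns true() only when the two documents are structurally identical :)
    deep-equal(doc("define_original.xml"), doc("define_roundtrip.xml"))

As "deep-equal" is also sensitive to whitespace in text nodes, a fair comparison normalizes the whitespace in both files first; and when it returns false(), one still needs an XML-aware "diff" (or small queries like the ones further below) to find out where the differences are.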

What are the results?


Let us first check whether any information was lost in the roundtrip. This is what we found:
  • We found that the "Originator" attribute on the "ODM" element disappears, as well as the "SourceSystem" and "SourceSystemVersion" attributes. These contain important information about who (which organization) and what system generated the define.xml. As there is no manual, we could not find out how this important information can be reintroduced using the tool.
  • We also found that the "label" of many of the variables had disappeared (the "Description" element under the "ItemDef" element). This is the case when the variable is a "valuelist" variable. Inspection of the worksheet generated by the tool revealed that there is indeed no "Label" column in the "ValueLevel" tab. Maybe one should add one there manually, but as there is no user manual, there is no way to find out. This also means that a define.xml generated this way (without labels for value-level variables) is not only essentially invalid, but also not very usable for reviewers, as they cannot find out what the valuelist variable is about (a sketch for detecting such missing labels follows after the figure below).
  • Additionally, all "SASFormatName" attributes disappeared. Now, "SASFormatName" is an optional attribute, but it can be valuable to have it in the define.xml when the define.xml of one study is used as a template for the define.xml of a subsequent (similar) study (reuse).
The Pinnacle21 tool removes some of the important attributes on the ODM element (colored red)
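
By the way, such label-less variable definitions are easy to find with a small query like the following sketch ("define_roundtrip.xml" is just an example file name):

    declare namespace odm = "http://www.cdisc.org/ns/odm/v1.3";
    for $itemdef in doc("define_roundtrip.xml")//odm:ItemDef[not(odm:Description)]
    return <warning>ItemDef {string($itemdef/@OID)} ({string($itemdef/@Name)}) has no label (no Description element)</warning>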




Let us now check whether any information was added (silently) that was not in our original define.xml at all. 
  • Rather surprisingly, we found that a number of variable definitions were automatically added, although they were not in the original define.xml. When a variable is defined once (e.g. STUDYID, USUBJID) and referenced many times (i.e. by each dataset), the Pinnacle21 tool refuses this kind of "reuse" and creates separate variable definitions for STUDYID and USUBJID, a new one for each dataset. So, whereas our original define.xml had only one definition of STUDYID (with OID "IT.STUDYID"), the newly generated define.xml has over 30 of them (with OIDs "IT.TA.STUDYID", "IT.TE.STUDYID", "IT.DM.STUDYID", etc.). The same applies to USUBJID: instead of a single definition of USUBJID, we suddenly have over 30 (such duplicates are easy to detect; see the sketch below).
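
A small query immediately reveals such duplicated definitions (same example file name as above):

    declare namespace odm = "http://www.cdisc.org/ns/odm/v1.3";
    for $name in distinct-values(doc("define_roundtrip.xml")//odm:ItemDef/@Name)
    let $defs := doc("define_roundtrip.xml")//odm:ItemDef[@Name = $name]
    where count($defs) > 1
    return <finding>{count($defs)} separate ItemDef definitions for variable {$name}</finding>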
Did the tool change any information from our original define.xml file?
We found the following:
  • All OIDs (the identifiers) were altered, except for most of those of the valuelists (but not all of them) and of the codelists. It looks as if in many cases the tool assigns the OIDs itself, without the user having any influence on this. As the OIDs are arbitrary, this is not a disaster, but it again means that one cannot use one define.xml as a template for the next one, especially when one has company-standardized OIDs for SDTM, SEND, or ADaM variables.
The Pinnacle21 tool changes all the OIDs in the define.xml (or reassigns them)


We were shocked by the finding that the tool also alters the "Study OID" without any notice. In the original define.xml its value is "cdisc01"; in the newly created define.xml it is "CDISC01.SDTM-IG.3.1.2". We again suspect that the user has no influence on the assignment of the "Study OID". The same applies to the OID and Name attributes of the "MetaDataVersion" element and the contents of its "Description" element: all of these were changed by the tool without any notice.



OIDs of "Study" and "MetaDataVersion" have been altered, as well as "MetaDataVersion Name" and the "MetaDataVersion Description"





You might now ask yourself how our own "XML4Pharma Define.xml Designer" scores in the "roundtrip test". Well, you can easily find out by requesting a trial version of the software and performing the roundtrip test yourself. This will also allow you to discover how user-friendly this new software is.


Conclusion: the Pinnacle21 "Define-XML Generator" does a pretty good job of generating a (prototype) define.xml starting from an Excel worksheet. The "round trip test" however shows that the user does not have any influence at all on how the OIDs are generated. Worse is that the labels for the "ValueList" variables are missing. Maybe this can be circumvented by adding an extra "Label" column to the worksheet for them, but as there is no user manual, there is no way to find out.
This means that the generated define.xml still requires manual editing (best done using an XML editor; there are some free ones). This triggers the question whether taking an existing define.xml and adapting it for a new study with an XML editor isn't the faster way, with the additional advantage that one knows what one is doing.
There are considerably better define.xml-generating software tools on the market, with nice GUIs and wizards (including our own "Define.xml Designer"). These are not free, but their cost is very reasonable, e.g. only a fraction of what the "Pinnacle21 Enterprise Edition" costs.