Friday, July 7, 2017

CDISC-CT offers more than you think!

But even CDISC doesn't know that ...

Now that classes, exams and thesis corrections are all done, I finally found the time today to work on one of my favorite topics: connecting CDISC Controlled Terminology (CDISC-CT) with other (and better) terminologies.

As I already stated in prior posts, CDISC-CT just consists of lists, with almost no relations between the terms in those lists. The only (indirect) relations that are described are those between --TESTCD (test code) and --TEST (test name) through the NCI code.
For example, in CDISC-CT the only relationship between "diastolic blood pressure" (DIABP, NCI code C25299) and "systolic blood pressure" (SYSBP, NCI code C25298) is that both are vital signs tests (NCI code C66741). But so are "height" (HEIGHT, C25347) and "body frame size" (FRMSIZE, C49680). So as far as CDISC-CT is concerned, the relationship between SYSBP and DIABP is exactly the same as the one between SYSBP and HEIGHT. We all know, however, that systolic and diastolic blood pressure are highly related, and that there is no relation at all between systolic blood pressure and height.
All this knowledge is NOT in CDISC-CT, as it only contains ... lists.

However, other people are smarter and have developed UMLS, the "Unified Medical Language System". UMLS tries to connect all terminologies in medicine and healthcare, and very fortunately this includes CDISC-CT (through NCI controlled terminology).
Pretty recently, the National Library of Medicine (NLM) made a RESTful web service available for working with UMLS. It allows one to submit a term or code from one system (e.g. CDISC-CT, LOINC, SNOMED-CT) and then ask for all terms related to it (parents, children, mappings to other terminology systems, and much more). One can then use the result list in different ways, e.g. pick a related term, submit it and ask for its related terms, get the list, pick ...
In this way, one can perform "chaining" and build networks of related terms with different kinds of relationships, not only within a specific coding system, but also between coding systems.
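The chaining idea can be sketched in a few lines of Python. This is only an illustration, not the software described in this post: the `chain` function and the `fetch_related` callable are hypothetical names, and the lookup is a tiny in-memory stub instead of a real UMLS call. The stub contains only the ALB / ALBGLYCA parent-child relation mentioned further down.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Relation = Tuple[str, str]  # (kind of relation, related term)


def chain(start: str,
          fetch_related: Callable[[str], List[Relation]],
          max_depth: int = 2) -> Dict[str, List[Relation]]:
    """Breadth-first chaining: look up a term, then look up each related term."""
    network: Dict[str, List[Relation]] = {}
    queue = deque([(start, 0)])
    while queue:
        term, depth = queue.popleft()
        # Skip terms we have already looked up, and stop at the depth limit.
        if term in network or depth > max_depth:
            continue
        relations = fetch_related(term)
        network[term] = relations
        for _kind, related in relations:
            queue.append((related, depth + 1))
    return network


# Hypothetical stand-in for the web service: a hard-coded relation table.
STUB_RELATIONS: Dict[str, List[Relation]] = {
    "ALB": [("child", "ALBGLYCA")],
    "ALBGLYCA": [("parent", "ALB")],
}


def stub_lookup(term: str) -> List[Relation]:
    return STUB_RELATIONS.get(term, [])


network = chain("ALB", stub_lookup)
for term, relations in network.items():
    print(term, relations)
```

In a real setup, `fetch_related` would wrap the web service request, and a filter on the kind of relation or on the source vocabulary could be applied before queuing the related terms.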

The RESTful API is not so easy to work with, as it requires registration and works with "ticket granting tickets". Such a ticket granting ticket is used to retrieve a "ticket", which is only valid for a single REST request. Also, the response comes as JSON, which is not my strong suit yet, so I transform it to XML, which is then parsed to retrieve the information.
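The JSON-to-XML step can be done with the Python standard library alone. The following is a minimal sketch, not the code I actually use: the `json_to_xml` helper is a hypothetical name, and the sample response is invented for illustration, so its field names (`result`, `ui`, `name`) are assumptions and not the documented shape of the UMLS responses. It also assumes that all JSON keys are valid XML element names.

```python
import json
import xml.etree.ElementTree as ET


def json_to_xml(value, tag="response"):
    """Recursively turn parsed JSON into an ElementTree element."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        # Each key becomes a child element (assumes keys are valid XML names).
        for key, child in value.items():
            element.append(json_to_xml(child, key))
    elif isinstance(value, list):
        # Array members have no name in JSON, so wrap each one in <item>.
        for item in value:
            element.append(json_to_xml(item, "item"))
    elif value is not None:
        element.text = str(value)
    return element


# Invented sample payload, for illustration only.
sample = '{"result": [{"ui": "C64431", "name": "ALB"}]}'

root = json_to_xml(json.loads(sample))
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
print(root.find("result/item/ui").text)
```

Once the response is available as XML, the usual XPath-style navigation (here `root.find(...)`) can be used to pick out codes and names.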

My first "chaining" experiments were pretty successful. I developed some simple software that allows one to submit a CDISC-CT, LOINC or SNOMED-CT term (others to come), and then (from the response) produces a list of mapped terms (in other systems), parent terms (in the same or another system) and child terms. The user can then select one of the related items and submit it for further chaining. At the moment, the software is still very simple, and choices must still be entered through the console.

[SCREENSHOT TO COME HERE]

As I already stated, if you submit a CDISC-CT term and ask for its parent and child terms, you would not expect to get far, as such relations are barely present in CDISC-CT itself. The nice thing, however, is that the smart people at NIH and NLM added them as far as possible. So you will e.g. find that CDISC "ALBGLYCA" (Glycated Albumin test) is a child of "ALB" (Albumin test), although that is not described in CDISC-CT at all.

I first did something very simple: I submitted CDISC-CT "ALB" (C64431) and asked for the parent and child elements. Remember that CDISC-CT as published by CDISC does not provide any such information. Here is a selection of the result (only the most interesting terms are displayed):


OK - extremely simple, but already much more than is in CDISC-CT itself!

I did something very similar for CDISC-CT "DIABP" (C25299). Here are the results in a simple tabular way (instead of a picture). Again, this is a selection only:


Note that NCI term C54706 is not even in CDISC SDTM-CT!

I also tried out "chaining". I again started from DIABP and first looked for the child terms, picked one of them, looked for its child elements, ... Here is a partial result:

And similarly, but then looking for ancestors:


I can already hear a lot of my colleagues scream "why don't you use semantic web and RDF?"
They are completely right! But I am still at the beginning: exploring the possibilities of the RESTful web services, thinking about filters (no, I do not want all the MeSH translations in my results), thinking about how to use this in a way that makes sense, and optimizing my code (for each RESTful request, one needs to retrieve a new "ticket", which makes it pretty slow).

I already have a master's project in mind for a good student: building a graphical interface around this, so that the user can just click on a node, and either all parents, all children, all mappings to other code systems, or all of them are retrieved through the RESTful web service and then displayed, with the possibility of user-selected filters, ...




All I wanted to show today is that when using UMLS, there is more in CDISC-CT than one thinks, but even CDISC does not know that ...


Saturday, July 1, 2017

Submission dataset validation: Regular expressions versus XQuery

An interesting post recently showed up at the Pinnacle21 forum regarding validation of ISO-8601 durations. For those not familiar with ISO-8601 durations, this is about expressing time spans ("durations") in a machine-readable format. For example, a duration of 1 week is expressed as "P1W", a duration of 1 month and 21 days as "P1M21D". You can even have more complicated durations such as "3 days and 5 milliseconds" which is expressed as "P3DT0.005S".

The discussion was about rule FDAN039 (a SEND rule), which states "Value of Duration, Elapsed Time, and Interval variables (--DUR, --ELTM, --EVLINT) must conform to the ISO 8601 international standard". This rule is a consequence of section 4.1.4.3 "Intervals of Time and Use of Duration for --DUR Variables" in the SDTM-IG and section 4.4.3 "Intervals of Time and Use of Duration for --DUR Variables" in the SEND-IG.

First, note that the rule as formulated by the FDA (was it?) is not 100% correctly defined: it only requires conformance to ISO 8601, so it suggests that "2017-07-01", which is a valid ISO-8601 date but not a duration, is a valid value for e.g. "CLELTM", which it isn't. Rules should be exact!

The discussion on the Pinnacle21 forum was about the regular expression to validate this rule. Here is a snapshot:


Got it? Understood it?


I am not going to show you the alternative the forum contributor proposed, as it is as unreadable as the one above ...

So, what would the rule implementation look like if we were (finally) allowed to submit data in XML? In that case, we could easily use XQuery, and make the rule implementation independent of the software used. Such rule implementations are already publicly available for all SDTM, SEND and most ADaM validation rules published by CDISC, FDA and PMDA.
Here is the core of the rule FDAN039 expressed in XQuery:

The first 8 lines check whether the value is a valid "week" duration expression ("PnW"). The last line checks whether the value is either a valid XML-Schema duration (which is a 1:1 implementation of the ISO-8601 "duration", except for the "week" duration) or a valid "week" duration.
An error is then returned when the value is neither a valid XML-Schema duration nor a valid "week" duration:
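For readers without an XQuery engine at hand, the same two-step logic can be sketched in Python. This is not the XQuery from the rule implementation, only an illustration of the idea: Python has no built-in XML-Schema duration type, so the check is written out by hand under my reading of the lexical form (optional minus sign, "P", then nY nM nD, then optionally "T" followed by nH nM nS, with decimals allowed only for seconds). All function names are made up for this example.

```python
import re


def is_week_duration(value: str) -> bool:
    """True for the ISO 8601 week form "PnW", e.g. "P1W"."""
    return re.fullmatch(r"P[0-9]+W", value) is not None


def _components_ok(part: str, designators: str, fractional: str = "") -> bool:
    """Check that `part` is a run of <number><designator> items in the given order."""
    if re.fullmatch(r"(?:[0-9]+(?:\.[0-9]+)?[A-Z])+", part) is None:
        return False
    position = 0  # enforces the order, e.g. Y before M before D
    for number, designator in re.findall(r"([0-9]+(?:\.[0-9]+)?)([A-Z])", part):
        index = designators.find(designator, position)
        if index == -1:
            return False  # unknown, repeated or out-of-order designator
        if "." in number and designator not in fractional:
            return False  # decimals are only accepted where explicitly allowed
        position = index + 1
    return True


def is_schema_duration(value: str) -> bool:
    """Hand-written approximation of the XML-Schema duration lexical form."""
    body = value[1:] if value.startswith("-") else value
    if not body.startswith("P"):
        return False
    date_part, separator, time_part = body[1:].partition("T")
    if separator and not time_part:
        return False  # "T" must be followed by at least one time component
    if not date_part and not time_part:
        return False  # a bare "P" is not a duration
    return ((not date_part or _components_ok(date_part, "YMD")) and
            (not time_part or _components_ok(time_part, "HMS", fractional="S")))


def violates_fdan039(value: str) -> bool:
    """Error when the value is neither a schema duration nor a week duration."""
    return not (is_schema_duration(value) or is_week_duration(value))


for candidate in ["P1W", "P1M21D", "P3DT0.005S", "2017-07-01"]:
    print(candidate, "error" if violates_fdan039(candidate) else "ok")
```

The point is the same as for the XQuery: each function does one small, checkable thing, and the rule itself is the single line in `violates_fdan039`.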

Which of the two, the regular expression (from Pinnacle21) or the XQuery, do you find more readable? And for which of the two is it easier to check that the rule has been implemented correctly?

Time to move away from SAS Transport 5 and to move to a modern format like XML ... It would make validation so much easier ...

Your comments are welcome as always!