Wednesday, March 28, 2018

Annotating Clinical Research Protocols

As promised, here is some more information about the protocol annotation tool that I developed over the last few months.

The tool reads a protocol as a simple text file, which is then converted to XML. Once loaded, the user can select text and is then prompted for the coding system to use for the annotation:

Currently, about 20 different coding systems can be used. The main dialog only shows the ones most needed for protocols. Clicking "UMLS - Multiple Coding Systems" displays the other ones:

Now, don't think that we ship all these coding system libraries within the software. No, we use RESTful web services to connect to publicly available services, such as the UMLS RESTful web services. Some of these RESTful web services were developed by us, and they are also publicly available from our public XML4Pharma server (so YOU can use them too).
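As a sketch of what such a RESTful lookup amounts to on the client side: the selected text and the chosen coding system are simply encoded into a query URL. The endpoint path and parameter names below are hypothetical - each actual service (UMLS, our XML4Pharma server) defines its own interface.

```python
from urllib.parse import urlencode

def build_lookup_url(base_url, text, coding_system):
    """Build a 'best match' lookup URL for a coding web service.

    The path '/bestmatch' and the parameter names are illustrative
    assumptions, not the interface of any specific service.
    """
    query = urlencode({"string": text, "codingSystem": coding_system})
    return f"{base_url}/bestmatch?{query}"

url = build_lookup_url("https://example.org/rest", "pulse rate", "CDISC-TESTCD")
print(url)
```

The point is only that the client side is trivial: selected text in, candidate codes out, with all terminology knowledge living on the server.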

Once a coding system is selected, the selected text is submitted to the RESTful web service, and the best possible match is asked for. Here is a short movie showing how this works for CDISC test codes:

In this case, trial study parameters (TSPARMCD) and values (TSVAL) are being derived.
Near the bottom, one can see how each annotation is stored. A UUID is generated to identify the annotation, which is stored internally together with the text coordinates (start index and length), the coding system, and the code. When exported to the XML file, this looks like:
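Such an annotation record can be sketched in a few lines of code (Python here for brevity, although the tool itself is written in Java; the element and attribute names are illustrative, not the tool's actual XML vocabulary):

```python
import uuid
import xml.etree.ElementTree as ET

def make_annotation(start, length, coding_system, code):
    """Create one annotation element.

    Attribute names are illustrative assumptions - the tool's real
    XML vocabulary may differ.
    """
    return ET.Element("Annotation", {
        "ID": str(uuid.uuid4()),       # UUID identifying the annotation
        "Start": str(start),           # start index of the selected text
        "Length": str(length),         # number of characters selected
        "CodingSystem": coding_system, # e.g. CDISC-CT, LOINC, SNOMED-CT
        "Code": code,                  # the assigned code
    })

ann = make_annotation(1042, 10, "CDISC-CT", "VSTESTCD.PULSE")
print(ET.tostring(ann, encoding="unicode"))
```

Because only coordinates are stored, the protocol text itself stays untouched and the annotations can be layered on top of it.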

When selecting "pulse rate":

and again choosing "CDISC Test Codes", one gets:

resulting in an annotation: "CDISC CT VSTESTCD" in the XML:

One can of course also use any of the other coding systems. Here is a short movie about assigning LOINC codes (unfortunately rarely practiced in protocols!) to lab tests:

We also applied the system to annotations for inclusion and exclusion criteria:

This allows one to extract the inclusion and exclusion criteria from the "annotated protocol", together with the "trial parameters", and automatically generate a CTR-XML (CDISC Clinical Trial Registries) file, which can then easily be converted to the SDTM datasets TS and TI.

Unfortunately, CDISC does not yet encourage the use of SNOMED-CT coding (another "not invented here" case). SNOMED-CT coding can however be enormously useful, especially when data needs to be retrieved from electronic health records (these usually use LOINC and SNOMED-CT coding).
Here is a short movie about SNOMED-CT coding of some text in the protocol:

REMARK: not all movies use the same version of the software: the software is in constant evolution.


Unfortunately, protocol writers often use simple tables or even embedded pictures to display the study workflow or "schedule of events" (which in this form I usually call the "schedule of disaster"). There is however an international standard for workflows: BPMN2 (Business Process Model and Notation), which even has an XML implementation. I have used this standard a lot in the past, e.g. for study design, but I haven't seen a single protocol yet where BPMN2 is used, and especially not in XML. What a pity!

Future prospects:

If you watched the movies and know something about Machine Learning (ML) and Artificial Intelligence (AI), think about the following: you annotate 10 protocols and then feed these as input to an ML program. As you will know, protocols often look similar, or at least have similar elements. Can you then imagine the ML program annotating the 11th protocol by itself, without human interaction? Some parts, such as the inclusion/exclusion criteria, are easy prey for such ML systems! And then imagine that you also have a metadata repository containing standardized CRFs or CRF templates. With a bit of luck, this combination could allow automating 80% or more of the study design, AND at the same time generate the SDTM trial design datasets and a CTR-XML dataset for submission to clinical trial registries.


Using a simple Java program in combination with RESTful web services allows protocols to be annotated with codes from over 20 coding systems used in medical informatics, healthcare and biology. Such annotations allow for much more precise instructions to the sites on what exactly should be done.
Currently, sites in many cases interpret the instructions in different ways. For example, when you instruct the site to measure "albumin in blood", you might obtain results of 20 different tests with 30 different units.
I consider this a first step towards the "e-Protocol". I don't claim this "annotated protocol" is the best way on the road to an "e-Protocol", but at least it is a first step.

Comments are of course very welcome!

Thursday, March 22, 2018

The machine-readable SDTM-IG

One of the major problems with the CDISC SDTM Implementation Guide (SDTM-IG) is that it is a PDF document. The last published version (SDTM-IG v.3.2) is just under 400 pages long, containing information for almost 50 domains.

As the SDTM-IG (as PDF) is not well machine-readable, implementing its contents in software means that one needs to read each line, implement each table, go through all the "assumptions", and first interpret and then implement every rule (which is not even designated as a "rule" in the text). This and other factors have led to validation software that completely over-interprets the IG and has also implemented many of the "rules" incorrectly, delivering many false positive errors. Unfortunately, this "buggy" software is also used by the FDA for all incoming SDTM submissions!

So implementing an SDTM-IG in software (the same applies to the SEND-IG) is not only a huge task, but also error-prone - a lot of copy-and-paste may be involved. As it requires human interpretation during the implementation, each software package for generating SDTM uses its own interpretation of the standard, which of course undermines the meaning of the word "standard".

So wouldn't it be better if we had a machine-readable SDTM-IG, published by CDISC? Software could then simply read the electronic version and implement it, replacing weeks of writing code that results in yet another interpretation of the standard. The call for such an electronic version has existed for many years, but each time it is requested during the public review, the so-called "CDISC disposition" is: "considered for the future".

Now, I am sick and tired of hearing "considered for the future" every time for almost 10 years now, so I decided to start working on a machine-readable version myself. This is also a lot of work (but maybe only 10% of the overall effort of developing a completely new version of the SDTM-IG). Fortunately, I could convince four of my undergraduate students to do this in the scope of their "Bachelor project", in which they essentially learn all the aspects of working in a project team, but must also deliver a technical result. So the SDTM-IG in XML was their envisaged result.

They did a good job and delivered the result well in time. It wasn't perfect (SDTM was new to them), so I still needed to make a few corrections and some additions (I am still adding new "features"). We also developed an XSLT stylesheet to create a human-readable view of the standard, which is >99% identical to what is found in the PDF published by CDISC. This means that we created a machine-readable document which at the same time displays exactly like the PDF document.

In the XML version of the IG, all the domains are grouped per class, and for each domain, the variables are defined in XML elements:

For example, the definition of the variable EXDOSFRM:

Note that extra information was added:
  • The "modern" (XML) datatype of the variable, which is also used in the define.xml
  • When a CDISC codelist is attached, the NCI code of that codelist.
This would allow a define.xml template to be generated automatically for each domain.
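To make that idea concrete, here is a minimal Python sketch that reads a domain fragment and derives define.xml-like "ItemDef" entries per variable. The XML fragment and all element/attribute names below are invented for illustration - the actual vocabulary of our SDTM-IG XML differs - and the NCI codelist code shown is only an example.

```python
import xml.etree.ElementTree as ET

# A hypothetical fragment in the spirit of the machine-readable IG;
# the real element and attribute names in our XML version may differ.
IG_FRAGMENT = """
<Domain Name="EX" Class="Interventions">
  <Variable Name="EXDOSFRM" Label="Dose Form" DataType="text"
            CodeListNCICode="C66726"/>
  <Variable Name="EXDOSE" Label="Dose" DataType="float"/>
</Domain>
"""

def define_template(domain_xml):
    """Derive a define.xml-like ItemDef template (as dicts), one per variable."""
    domain = ET.fromstring(domain_xml)
    items = []
    for var in domain.findall("Variable"):
        item = {
            "OID": f"IT.{domain.get('Name')}.{var.get('Name')}",
            "Name": var.get("Name"),
            "DataType": var.get("DataType"),
        }
        if var.get("CodeListNCICode"):  # a CDISC codelist is attached
            item["CodeListRef"] = var.get("CodeListNCICode")
        items.append(item)
    return items

for item in define_template(IG_FRAGMENT):
    print(item)
```

Because the datatype and the NCI codelist code are carried in the machine-readable IG itself, the template never needs to be typed out by hand.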

For codelists that are "sponsor defined", the "*" from the PDF is replaced by a machine-readable XML attribute:

The "Assumptions" were also implemented as XML, at this moment still as simple narrative:

However, the sentences about "variables generally not used ..." (whatever that means - such sentences should not appear in a standard) were structured, and elements were created for them:

so that software can use this information to generate an "info message" when the user tries to add one of these variables to the domain.

In a number of cases, the students could also add the "examples", formatted as XHTML within the XML (as is also done in HL7 FHIR):

Here, some of the table formatting was done in the XML, but that can of course also be taken care of by the stylesheet.

Rules are important. Unfortunately, they are "hidden" in the SDTM-IG as simple text, which easily leads to different implementations. So, we also started adding some of the rules in machine-readable pseudo code. For example:

In the future, we will add the rules as XQuery code ("Open Rules for CDISC Standards"), which will be enormously useful once the FDA moves away from SAS-XPT and finally accepts Dataset-XML as the file format for submissions. We will use the "SDTMIG v.3.2 Conformance Rules" published by CDISC - unfortunately long after the publication of the IG itself. Essentially, such rules should be published at the same time as the standard itself. Using our approach, they can be published as part of the standard (XML) document itself.
Note that these rules are already available as XQuery, so we only need to transfer them from there into the XML for the SDTM-IG.
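To illustrate why machine-readable rules matter, here is a sketch of how software could evaluate one such rule over SDTM-like records. The rule shown ("dose units are required when a dose is populated") is a simplification invented for illustration, not a literal CDISC conformance rule, and in our actual work the rules are XQuery, not Python.

```python
# Evaluate one machine-readable rule over SDTM-like records (as dicts).
# Rule (illustrative): if EXDOSE is populated, EXDOSU must be populated.

def check_dose_units(records):
    """Return the indexes of records that violate the rule."""
    violations = []
    for i, rec in enumerate(records):
        if rec.get("EXDOSE") not in (None, "") and not rec.get("EXDOSU"):
            violations.append(i)
    return violations

records = [
    {"EXDOSE": 10, "EXDOSU": "mg"},
    {"EXDOSE": 20, "EXDOSU": ""},    # violation: dose without units
    {"EXDOSE": None, "EXDOSU": ""},  # no dose, so no units needed
]
print(check_dose_units(records))  # [1]
```

When the rule text lives inside the standard as structured data, every implementation evaluates the same logic instead of its own reading of a PDF sentence.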

So, we have everything as machine-readable XML, opening a lot of opportunities. But how does this all display when a normal human wants to see the information? In that case, the stylesheet is applied and the result is displayed in the browser. Here are a few screenshots:

And for the "Assumptions":

or the examples:

It looks almost 100% identical to what is seen when inspecting the PDF file.

All this work was done by four undergraduate students as part of a not-all-too-large project.
So you may ask: why did the SDTM team not take this approach?

Honestly: I don't know.
Maybe they never cared about machine-readable standards? Or do they prefer "business as usual"?

I often have the strong impression that SDTM-IGs are still being developed using Excel worksheets and Word documents. The better way is surely to develop them using a database, in which each variable, each assumption, each rule, even each example is a row in a table in the database.
Of course one can also use a native XML database.
It then only takes one click to generate the XML version of the IG, and a second to generate the human-readable display of the IG.
That database system could be SHARE. Unfortunately, this is not yet the case: SHARE currently only contains non-machine-readable versions of the standards. Instead of publishing the standards to SHARE, they should be developed within SHARE. This would also enable services (e.g. using RESTful technologies) like those that have so far only been implemented by some visionary volunteers.

Is this SDTM-IG in XML freely available? Yes and no. It is still a prototype and not 100% ready yet. But yes, it is available to companies that want to fund the further development at my institute at the university. You can find the coordinates here.

Saturday, February 17, 2018

If you only have a hammer ...

... everything looks like a nail.
and if you only know the XPT format, everything looks like a table ...

Why this comparison? The trigger was a recent proposal of the CDISC QRS ("Questionnaires, Ratings and Scales") team to add a table to SDTM, a so-called "Trial Lookup Table" ("TL" domain), containing metadata, such as information about instruments. They provided a number of examples: one with codes assigned to questions on a questionnaire, one about questions on a questionnaire sharing an identical set of possible answers (a "scale"), and one about "logically skipped items". The latter sounds very much like CDISC-ODM "skip questions", but we are unfortunately not allowed to use ODM, nor its "tabular" equivalent Dataset-XML, for submissions.

Essentially, the table ("domain") they propose is nothing other than an "entity-attribute-value" model, a type of table that can contain almost anything, as it is about key-value pairs.

For example, for the simple case that a question can have answers ranging from 0 to 4:

 The proposed table is:

i.e. they need 5 rows to explain to the reviewer that for that specific question (specified by TLVAR1, TLVARVAL1, ...) the possible values are 0 to 4.
That there is an ODM file with the study design that states exactly this in a much simpler way does not come up in the minds of the people who proposed this, as all they know is ... XPT tables!

Fortunately, we do have a few open-minded volunteers within CDISC who think beyond tables ("the earth is not flat"), so Sally Cassells of Next Step Clinical Systems immediately demonstrated how this can be done in a much simpler way in the define.xml. For example, for the "scale sample", this information simply goes into a "ValueList" in the define.xml. The human-readable presentation of this (I do want to shield you from the XML, although it is extremely simple) is:

and for the scale values (0-4) themselves in the codelist (which was already in the ODM in the study design):

So how do we educate these people who can only think in terms of tables that the clinical research world is not flat either? I would propose that every member of any of the SDTM development teams must first attend a define.xml course (where we explain such things very well) before they come up with "yet another table" proposal.

And if you now say: "well Jozef, then you need to take an SDTM training too", I can say "I did"!

Saturday, January 13, 2018

Why changing "Submission Value" into "Preferred Term" is a bad idea

The CDISC-CT team recently published a new Controlled Terminology Package 33 for public review. At the same time, a proposal for changing the column header from "CDISC Submission Value" to "CDISC Preferred Term" was published:

In this blog, I will explain why this is a bad idea and why CDISC members should protest against it.

You can already find my own protest here:

First of all, we need to take into account that CDISC controlled terminology is based on tradition rather than on science. CDISC controlled terminology is a set of "lists", without any relations between the terms. CDISC members can ask to add terms based on their own, local usage of a term.
For example, last autumn I asked to add "centimeter mercury column" to the "UNIT" list, as in the country I originate from (Belgium) blood pressure is (by tradition) measured in "centimeter mercury column" rather than in "millimeter mercury column". So CDISC added it to the list. What is however not visible from that list is the relation between "centimeter mercury column" and "millimeter mercury column". As a human, I know that 1 cmHg = 10 mmHg. But how does my computer know that? Does CDISC-CT make it possible to know how to convert "pounds per square inch" into "millimeter mercury column"? If CDISC allowed UCUM notation, such unit conversions could easily be automated. And how does my computer know that (for CDISC codes) "SEVERE" is worse than "MODERATE", which is worse than "MILD"? None of this is part of CDISC-CT.
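To illustrate why UCUM notation makes such conversions automatable: each UCUM-coded unit reduces to a base unit by a fixed factor, so conversion is pure arithmetic. The toy table below covers only the three pressure units mentioned here; a real implementation would use a full UCUM library or one of the conversion web services.

```python
# A minimal sketch of UCUM-based unit conversion for pressure units.
# This is a toy factor table, not a full UCUM implementation.

TO_PASCAL = {
    "mm[Hg]": 133.322,   # UCUM code for millimeter mercury column
    "cm[Hg]": 1333.22,   # centimeter mercury column
    "[psi]": 6894.76,    # pounds per square inch
}

def convert(value, from_unit, to_unit):
    """Convert a pressure value between UCUM-coded units via pascal."""
    return value * TO_PASCAL[from_unit] / TO_PASCAL[to_unit]

print(convert(12, "cm[Hg]", "mm[Hg]"))  # 120.0, since 1 cmHg = 10 mmHg
```

With the terms-as-lists approach, this relation is simply absent: "centimeter mercury column" and "millimeter mercury column" are two unrelated strings.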

Also, CDISC is publishing codelists for things it has no authority over. For example, it publishes "lists" of microorganisms (codelist MICROORG), whereas specialists in the field have developed taxonomies (for example NCBI), and SNOMED-CT also has a full taxonomy of microorganisms:

The NCBI and SNOMED-CT taxonomies of microorganisms are based on science; the CDISC "list" of microorganisms is based on allowing members to add terms according to how they traditionally name a microorganism locally. In the CDISC-CT list of microorganisms, you will not find any information on how these organisms are related to each other - it is just a list.

There are some cases where these "lists" based on tradition make sense, for example for "vital signs test code" (VSTESTCD/VSTEST), although this is also already covered by a scientific taxonomy developed by LOINC:

We do need to realize that LOINC is not yet used in every hospital, although it is mandated for use in electronic health records in many countries and by the US "Meaningful Use" program. So such a VSTESTCD codelist can be used as a temporary solution, but it should not be forever.

So, the proposal to change the column header from "CDISC Submission Value" to "CDISC Preferred Term" suggests that in the whole clinical research process (and thus not only in submissions to regulatory authorities) we should start using terms that are based on tradition, and forget about all the science. It suggests that instead of writing "Glucose" in our protocols, we should start writing "GLUC"; or that instead of writing "measure the number of Metamyelocytes/100 leukocytes" (LOINC code 28541-1), we should put "BASOMM", as that is the CDISC "preferred term", then also add a "method" from the CDISC "METHOD" codelist, and add further terms from other CDISC-CT lists to complete the description of "measure the number of Metamyelocytes/100 leukocytes, use LOINC code 28541-1".

Changing the designation "CDISC Submission Value" into "CDISC Preferred Term" would be a very dangerous evolution. It would isolate us further from other standardization organizations with which we overlap in application area. It would amount to telling these SDOs: "We don't need you."
And it would mean that CDISC completely says goodbye to the use of concepts that are based on science.

A second major problem is that CDISC controlled terminology is tightly bound to the 30-year-old, obsolete SAS Transport 5 format (XPT format), with its 8-character and 40-character limitations. This format is only used within CDISC; no other industry worldwide uses it anymore. For example, CDISC "test codes" (--TESTCD) are limited to 8 characters, which must be ASCII characters and may not start with a number. Test names (--TEST) are limited to 40 characters and must be ASCII characters. This has led to some idiotic test codes and names, such as "Corpuscular HGB Conc Distribution Width" as the "test name" for "test code" "CHDW" (NCI-ID C139068), where the word "Concentration" needed to be shortened to "Conc" because of the 40-character limitation. Also, "CHDW" is meaningless as a mnemonic, due to the 8-character limitation for --TESTCD.
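These XPT-driven constraints can be checked mechanically. Here is a small sketch implementing only the rules as stated above (8 ASCII characters, no leading digit, for codes; 40 ASCII characters for names):

```python
# Check the XPT-era constraints on CDISC test codes and test names,
# exactly as described in the text above.

def xpt_testcd_ok(code):
    """--TESTCD: max 8 ASCII characters, may not start with a number."""
    return (len(code) <= 8
            and code.isascii()
            and not code[:1].isdigit())

def xpt_test_ok(name):
    """--TEST: max 40 ASCII characters."""
    return len(name) <= 40 and name.isascii()

print(xpt_testcd_ok("CHDW"))    # True - but meaningless as a mnemonic
print(xpt_testcd_ok("8315-4"))  # False - a LOINC code starts with a digit
print(xpt_test_ok("Corpuscular HGB Concentration Distribution Width"))  # False - 48 chars
```

Note how a perfectly good pre-coordinated code such as LOINC "8315-4" fails the --TESTCD constraints on two counts: the hyphen-free 8-character limit is not the issue, the leading digit is.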

So, if this proposal were accepted, we would be pinning everything we do in terminology, whether in submissions or in non-regulated research, to the outdated XPT format. It means that everything that is "CDISC preferred":
  • is limited to 8 characters when it is a code
  • is limited to 40 characters when it is a name or description
  • is not allowed to contain any characters outside the ASCII range - so no "ñ", "ü", "á" (Spanish characters), no German characters like "ß" or "ü", no Norwegian characters like "å" or "æ", no Japanese, no Chinese, no Arabic, no Korean, no ...
  • may not start with a number

Do we really want this? Do we really want to tell people who do not submit to regulatory authorities, but who do want to use CDISC standards, that they should keep away from LOINC, UCUM, SNOMED-CT and NCBI coding, and instead use CDISC terms that
  • are used nowhere else in the world
  • are based on tradition
  • are not based on science at all?
Do we want to tell them that their codes should be no longer than 8 characters, and that non-ASCII characters are not allowed as these do not comply with "CDISC preferred"? Should we force them to implement the limitations of the XPT format in their systems? Very probably, they do not use SAS-XPT at all.
This CDISC-CT proposal indeed looks like "megalomania" to me.

It is already bad/sad/mad enough that for submissions, we are obliged to use controlled terminology that is not based on science, and now the CDISC-CT team wants to extend this to everything we do in clinical research. Have they really gone mad?

If you agree and/or feel the same way, please comment directly to CDISC on their JIRA "issue" site. You will need an account; if you don't have one, you can create one there. Please take into account that this account is not the same as your "CDISC members" account.

Your comments here are of course always welcome!

Sunday, December 10, 2017

SDTM and CDISC-CT: fit for e-Source?

The title of this post should have been something like "CDISC SDTM and Controlled Terminology post-coordinated versus pre-coordinated", but then most people would probably have no idea what I am talking about. So a little bit of explanation first.

CDISC SDTM uses "post-coordinated" controlled terminology. This means that controlled terms are combined "as needed", so that test descriptions can be built "as required". The consequence is that the result is dynamic, the ontology is "what you see", and any combination of terms is possible. So essentially, the combination of e.g. LBTESTCD=Albumin with LBSPEC=Blood and LBMETHOD=dipstick is valid, although you can't test albumin in blood using the dipstick method (that method is only available for albumin in urine).
"Post-coordination" has its advantages. It brings (some) order into chaos. It is especially useful when it is not known in advance (or cannot be envisaged) which tests will be performed.

Most systems in healthcare use "pre-coordination". This means that all possible combinations are assembled in advance and, when meaningful, obtain a single code. So not all combinations are possible. An example of such a system is LOINC. In LOINC, you won't find a code for "albumin in blood measured using dipstick", but you will find a code (1751-7) for "albumin in serum or plasma measured quantitatively as mass/volume". Pre-coordinated terms are (must be) precise: each code should uniquely describe a term (a test in this case).
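The difference can be made concrete in a few lines. In a post-coordinated model, any combination of codelist terms is structurally valid, including the impossible "albumin in blood by dipstick"; in a pre-coordinated system such as LOINC, the impossible combination simply has no code. (The structures below are illustrative sketches, not an actual CDISC or LOINC API.)

```python
# Post-coordination assembles a test description from independent
# codelists, so nothing in the terminology itself forbids a combination
# that is physically meaningless - the example from the text:

def post_coordinate(testcd, spec, method):
    """Any combination of valid codelist terms is structurally valid."""
    return {"LBTESTCD": testcd, "LBSPEC": spec, "LBMETHOD": method}

# Structurally fine in SDTM, although no such test exists:
invalid_but_accepted = post_coordinate("ALB", "BLOOD", "DIPSTICK")

# A pre-coordinated system assigns one code per meaningful test, so the
# impossible combination simply has no code (tiny excerpt, not an API):
LOINC_EXCERPT = {
    "1751-7": "Albumin [Mass/volume] in Serum or Plasma",
    # ... no entry exists for "albumin in blood by dipstick"
}

print(invalid_but_accepted["LBSPEC"])  # BLOOD
```

In the pre-coordinated world, validity is decided when the code is minted; in the post-coordinated world, it must be policed (or not) at every use.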

CDISC SDTM findings domains have been developed to bring "order into chaos" - essentially, the paper world, or the world where protocols do not precisely describe which tests need to be performed. For example, in the famous LZZT protocol we find the following tests defined: "Urinalysis: Color, Specific gravity, pH, Protein, Glucose, Ketones, Bilirubin, ". That's it. Not very precise. The problem with this is that each site can (and probably will) perform different tests. For example, for "glucose in urine", LOINC lists over 20 different tests (even when excluding all the "post" and "challenge" tests). When submitted, post-coordination is necessary, but the results will not be comparable between sites, studies and sponsors. Even the combination of LBTESTCD (essentially the analyte), LBSPEC (the specimen, e.g. "urine") and LBMETHOD does not at all guarantee a unique combination. So it is no wonder that the FDA recently mandated the use of LBLOINC, i.e. it requires (as of 2020) that the unique LOINC identifier is additionally provided.

The problem, however, is not limited to laboratory tests alone. For example, there has been a discussion on the CDISC wiki about the "Ebola vital signs CRF", about how the important test "highest temperature in the last 24 hours" must be annotated for SDTM. Using SDTM, it cannot be done, as there is no way to express "in the last 24 hours".

The solution is, however, simple when using LOINC: the LOINC code 8315-4 "Body temperature 24 hour maximum" describes this test exactly.

Note that the argument "pre-coordination could result in an explosion of new CT terms ..." is nonsense if CDISC finally allows LOINC to be used (it is not a problem in healthcare ...).
All this means that our current SDTM findings variables are not always able to describe tests exactly, even when using post-coordination.

Nowadays, we see that research data are more and more extracted from electronic health records (EHRs) and hospital information systems (HIS), rather than collected separately ("e-Source"). There are even voices that say: "in 5 years from now, everything will be e-Source". Data from e-Source is almost always pre-coordinated, i.e. it uses pre-coordinated terminology like LOINC, SNOMED-CT, etc.
When e-Source data is used and submitted, the pre-coordinated terminology must be translated to post-coordinated terminology, which is arbitrary, ambiguous, and not always possible, as the "highest temperature in the last 24 hours" example clearly shows. For lab tests, take the LOINC tests 5792-7, 22705-8 and 25428-4 as an example: all three would be modeled in SDTM as LBTESTCD=GLUC, LBSPEC=URINE and LBMETHOD=TEST STRIP. One can only distinguish them by looking at the results themselves and at the units used.
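This information loss can be sketched in a few lines: translating each of the three LOINC strip tests to SDTM yields the identical triple, so the distinction between the tests disappears in the translation.

```python
# The three urine glucose test-strip LOINC codes from the text all
# collapse onto the same post-coordinated SDTM triple.

loinc_strip_tests = ["5792-7", "22705-8", "25428-4"]

def to_sdtm(loinc_code):
    """Post-coordinated translation - identical for all three tests."""
    return {"LBTESTCD": "GLUC", "LBSPEC": "URINE", "LBMETHOD": "TEST STRIP"}

mapped = {code: to_sdtm(code) for code in loinc_strip_tests}

# Three distinct LOINC codes, one indistinguishable SDTM description:
distinct = {tuple(sorted(m.items())) for m in mapped.values()}
print(len(loinc_strip_tests), len(distinct))  # 3 1
```

The mapping is not invertible: given the SDTM triple alone, there is no way back to the original LOINC code.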

Both examples "maximum temperature in the last 24 hours" and "glucose in urine by test strip" demonstrate that information loss is possible or even unavoidable. So, even when the test is exactly described by a pre-coordinated code (LOINC, SNOMED, …), we are forced to submit using a post-coordinated system with loss of information or test uniqueness.

This leads me to an important conclusion: the current SDTM is not fit for use with e-Source.
It is great for the paper world and for classic EDC where data is collected separately from the healthcare world.

How can we do better? Especially when the statement "everything will be e-Source in 5 years from now" becomes true.
In the past, I published an article "An Alternative CDISC Submission Domain for Laboratory Data (LB) for Use with Electronic Health Record Data" in the "European Journal of Biomedical Informatics" (EJBI), in which I proposed that, at least for laboratory data coming from e-Source, the typical LBTESTCD, LBTEST, LBSPEC and LBMETHOD be replaced by a set of variables that align with the 6 dimensions of LOINC.

However, this only provides a solution for laboratory data using LOINC. There are more coding systems used in e-Source data. For example, for microbiology data, NCBI coding is often used. This means that when using e-Source, (pre-coordinated) data using NCBI coding must be translated to one or more of the SDTM variables in the SDTM domain, which uses its own CDISC controlled terminology - with guaranteed loss of information, as NCBI is much more specific.
Essentially, all this means that we need an alternative "e-Source" domain for each of the existing SDTM findings domains. These new domains can be much simpler than the existing SDTM domains, as much of the information for which several variables are needed in the "classic" domains can now go into one single variable, the "test code". As these domains need to be "code system neutral", the core variables in these "e-Source" domains would be "test code", "code system" and maybe "test name". The latter is not even necessary, as there is a 1:1 relationship with "test code", and it can easily be looked up automatically by computer systems, e.g. using one of the many RESTful web services from NLM, UMLS, NIH, HIPAA etc.
So, for example, for the "e-Source LB" domain, the core variables would be "Study ID" through "Sponsor-Defined Identifier", and then "test code", "code system", "original test result" and "original result units" (using UCUM). The classic LBCAT, LBSCAT, LBSPEC and LBMETHOD can be removed, as they are all already included in the pre-coordinated "test code". Note that I avoid assigning variable names, as e.g. "LBTESTCD" would mean completely different things in the two variations of the LB domain. In the e-Source domain it would mean "the unique test code", whereas in the classic domain, LBTESTCD is essentially misleading, as it specifies the analyte and not the test (note that --TESTCD has a different meaning depending on the domain in classic SDTM).
In the "e-Source" LB domain, the first records in example 1 of the SDTM-IG (page LB-5) would look like:

Study Identifier | Subject ID | Sequence Number | Test Code | Code System | Original Result | Original Result Units (UCUM)

Using UCUM notation is important, as UCUM notation is almost always used in e-Source, and we don't want information loss or conversion errors. Even more, UCUM allows automated conversions (e.g. for the "standardized result"), using one of the available RESTful web services (from the NLM, and our own).
The next columns in the "e-Source" LB domain would then be the "reference range indicators" and the "standardized results". The latter could then (at least for quantitative results) use e.g. the "LOINC proposed unit".
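Putting the proposal together, one record of such an "e-Source" LB domain might look like the following sketch. The column names follow the proposal above; the values, and the fact that variables are deliberately left unnamed, are illustrative.

```python
# One record in the proposed "e-Source" LB domain: a single
# pre-coordinated test code replaces LBCAT/LBSCAT/LBSPEC/LBMETHOD.
# All values below are made up for illustration.

record = {
    "StudyIdentifier": "STUDYX",
    "SubjectID": "P0001",
    "SequenceNumber": 1,
    "TestCode": "1751-7",            # pre-coordinated LOINC code
    "CodeSystem": "LOINC",           # makes the domain code-system neutral
    "OriginalResult": 42,
    "OriginalResultUnits": "g/L",    # UCUM notation
    "StandardizedResult": 42,        # could use the "LOINC proposed unit"
}
print(record["TestCode"], record["CodeSystem"], record["OriginalResultUnits"])
```

Note how the specimen, method and category are all implied by "1751-7" plus "LOINC", so no separate variables are needed for them.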
Similarly, for the Ebola "highest temperature in the last 24 hours", which cannot be described exactly at all in classic SDTM, the "e-Source" VS domain could contain a record like:

Study Identifier | Subject ID | Sequence Number | Test Code | Code System | Original Result | Original Result Units (UCUM)

Here too, VSCAT (and VSSCAT) are not used, as their content is already comprised in the test code 8315-4. In most cases, even VSPOS will be unnecessary (e.g. for blood pressure), as the position is already included in the LOINC code.

As it is clear that the current SDTM is not fit for use with e-Source, we make a first proposal for a set of "e-Source" findings domains, using pre-coordinated coding systems (as already used in e-Source), and using UCUM as much as possible for unit notation.
These "e-Source" domains are not meant to replace the "classic" SDTM domains, as those retain their value for the "classic" case where data is collected separately (paper, classic EDC). These "classic" domains can only be deprecated when "everything is e-Source" - so maybe in 5 years from now?
Please note that with this first proposal, I do not encourage the use of "tables" for regulatory submissions. On the contrary: in the longer term, we need to move to submission of "biomedical concept" data points or "resources".
But that's another discussion.