Wednesday, November 5, 2014

ValueList web services for SDTM

Today, I added three more web services, which have to do with SDTM value lists:

  • Validate whether a CDISC unit is a correct unit for a given (VS) test code
  • Validate whether a Vital signs "position" (VSPOS) is a correct "position" for a given (VS) test code
  • Validate whether a Character Result/Finding in Standard Format text value (EGSTRESC) for a given ECG Test code (EGTESTCD) is a valid value
These web services are based on the excellent work of Anthony Chow (CDISC), who published these lists (as Excel) on the CDISC website under "CT Mapping/Alignment Across CodeLists". 
All I did was generate a relational database from these lists, and then write the new web services using RESTful technology.

For example, when you submit:
meaning "validate whether 'mmHg' is a valid CDISC unit for the vital signs test 'DIABP'" to the web service, the string "true" is returned. When you however submit,
the string "false" will be returned.
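The server-side logic behind such a validation service essentially boils down to a table lookup. Here is a minimal local sketch of that logic (the endpoint itself is a RESTful HTTP call in reality, and the small sample table below is illustrative only, not the published value list):

```python
# Illustrative sample of (VSTESTCD, CDISC unit) pairs from the value lists.
# In the real service this is a table in the relational database.
VALID_UNITS = {
    ("DIABP", "mmHg"),
    ("SYSBP", "mmHg"),
    ("TEMP", "C"),
}

def valid_unit_for_test(testcd: str, unit: str) -> str:
    """Return the string 'true' or 'false', as the web service does."""
    return "true" if (testcd, unit) in VALID_UNITS else "false"

print(valid_unit_for_test("DIABP", "mmHg"))  # true
print(valid_unit_for_test("DIABP", "kg"))    # false
```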

This is not "rocket science" at all - it took me one evening to create and add the tables to an existing relational database, and about 3 hours to develop, test, and roll out the web services (and write this blog ...).

This is just a first simple example of what is possible. One can easily do similar things for other, even far more complicated domains such as the LB domain (laboratory) - although in my opinion, we should completely rewrite the LB domain, using LOINC for the tests and UCUM for the units.

The complete list (for the new ones, look near the bottom) of the available web services can be found at:

Are these the kind of web services you would like to see implemented with SHARE? Please let me (and CDISC) know ... 

Saturday, November 1, 2014

More web services

In the last weeks and months, with the help of my students at the university, I developed some more web services for use with CDISC standards, especially for use with SDTM and CDISC controlled terminology. Almost all of these have been implemented in the "Smart Dataset-XML Viewer".
I got very enthusiastic reactions to these web services, especially during a demo for CDISC officials, for the Japanese CDISC User Group (CJUG), and yesterday during a "Dataset-XML CDISC Training" in Copenhagen, Denmark. As I was also able to test the services thoroughly during the last weeks, and found no major issues, I decided today to make the specification of these web services publicly available. You can find them at:

These are all RESTful web services, so extremely easy to implement in your own applications, and as RESTful services are based on HTTP, you can even test them in a browser.

Suggestions for new services and for improvements are as always very welcome. Please just contact me at

Monday, October 13, 2014

Bringing SUPPQUALs back to their parent dataset

In CDISC SDTM, values for non-standard variables (i.e. sponsor-defined variables) need to be submitted in separate data sets, the so-called "supplemental qualifier" or "SUPPQUAL" data sets.
So for example, if you have a variable "ENDPOINT" in your lab data, you do not submit these in your LB data set, you submit them in a SUPPLB data set.
The reason for this is a historical one: in the past, there was no way to "mark" non-standard variables in the data sets, as these needed to be provided in SAS Transport 5 format. You could of course mark them as such in the define.xml file, but as the define.xml format and the SAS Transport 5 format are pretty incompatible, tools like the SASViewer could not exploit this meta-information. So you could indeed put values for non-standard variables in the parent data set, but the viewer would not e.g. color them differently, unless of course you wrote a special viewer. Understandably, however, no one wanted to invest in such a venture.

Everything changed when the Dataset-XML format was published. It replaces SAS Transport 5, is based on both CDISC ODM and define.xml, and aligns perfectly with the latter. So you can now simply put the values of non-standard variables (like the "ENDPOINT" variable in our case) in the normal data set, and declare in the define.xml that it is a non-standard variable. As Dataset-XML and define.xml work together perfectly, tools extract the metadata from the define.xml and apply it to the variable values when loading the values themselves from the Dataset-XML files.

This makes it possible, for example, to do data validation during loading, such as checking whether the value to be loaded is of the right data type, whether it obeys its associated codelist, and so on (I could write several pages about this ...).

For non-standard variables, if you add 'Role="SUPPLEMENTAL QUALIFIER"' to the variable's ItemRef, then the system knows that this is a non-standard variable and can e.g. color it differently in the viewer. That is how we implemented it in the "Smart Dataset-XML Viewer".

Now, keeping the non-standard variable in the parent data set instead of "banning them" to a SUPPQUAL data set is not yet allowed by SDTM. This might change once the FDA starts accepting Dataset-XML as an exchange format.
Currently, the FDA is piloting the use of Dataset-XML, but still wanted the participants of the pilot to submit their data in almost exactly the same way as when using SAS Transport 5, i.e. with non-standard variables in SUPPQUAL datasets, which I think is a lost opportunity.

So I was looking into a way to bring non-standard variables back into their parent data set when loading them in the "Smart Dataset-XML Viewer". With a SUPPLB of 65,000 records, however, this is not easy (i.e. it would be time consuming), as for each SUPPLB record you need to find the parent record in the parent LB data set, which itself also has over 60,000 records. So I started thinking about a "smart" solution.
This weekend I found it. I observed that in most SDTM datasets, the records are more or less ordered by USUBJID. So if you first do a quick scan over all the records of the LB data set, and for each unique USUBJID keep the row numbers of the first and last occurrence in a "hashmap", you later only need to search for the parent record within this limited range.
To my surprise, generating the hashmap took only 25 milliseconds, although all 60,000 records need to be scanned. Merging the non-standard variable values back into their parent data set then became very efficient: a speed increase by a factor of over 20 was observed. As such, it took less than 2.5 minutes to merge the 65,000 SUPPLB records back into the 60,000 LB records. Here is a screenshot:
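The row-range index described above can be sketched as follows (a simplified sketch using plain dicts instead of Dataset-XML rows, and assuming the SUPPQUAL records point at their parent via IDVARVAL holding the parent's --SEQ value):

```python
def build_index(records):
    """Scan the parent data set once and map each USUBJID to the
    (first, last) row numbers where it occurs - the 'hashmap'."""
    index = {}
    for row, rec in enumerate(records):
        subj = rec["USUBJID"]
        first, _ = index.get(subj, (row, row))
        index[subj] = (first, row)
    return index

def merge_supp(parent, supp, index):
    """Copy each SUPP-- value into its parent record, searching only
    the row range for that subject instead of the whole data set."""
    for s in supp:
        first, last = index[s["USUBJID"]]
        for row in range(first, last + 1):
            if parent[row]["SEQ"] == s["IDVARVAL"]:
                parent[row][s["QNAM"]] = s["QVAL"]
                break

lb = [{"USUBJID": "001", "SEQ": 1}, {"USUBJID": "001", "SEQ": 2},
      {"USUBJID": "002", "SEQ": 1}]
supplb = [{"USUBJID": "001", "IDVARVAL": 2, "QNAM": "ENDPOINT", "QVAL": "Y"}]
merge_supp(lb, supplb, build_index(lb))
print(lb[1]["ENDPOINT"])  # Y
```

The single scan is linear in the number of parent records; each SUPPQUAL record then only probes the handful of rows belonging to its subject, which is where the factor-20 speedup comes from.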

The variables "ENDPOINT" and "LBTMSHI" are the non-standard ones, and although they were loaded from the SUPPLB data set, they nicely show up in the LB table (I moved the columns to the left for better visibility).

The "trick" that made this possible was essentially creating sort of an index on USUBJID.

So I wonder whether the reviewers at the FDA create indexes on their (SAS?) tables in their review systems to be able to easily combine information from different tables. Do they?

P.S. I will release this new version of the "Smart Dataset-XML Viewer" on Sourceforge in the next few days. In that release, "bringing back SUPPQUAL data to their original data set" will be an optional feature (it still takes extra time). So you will have a checkbox for switching the feature on or off (default is off):

It would be great if someone would then try out this new feature on data sets that have millions of records, and then report back to me!

Monday, September 29, 2014

The meaning of "Unit"

Last week, I worked on a mapping between the CDISC-CT [UNIT] codelist and UCUM. For every "unit" published by CDISC (there are about 550 of them), I tried to find an appropriate UCUM notation. I then used the mapping (which was done by extending the Excel worksheet provided by CDISC) to generate a relational database, and also generated a RESTful web service.
So, when you make an HTTP request "" the corresponding UCUM notation {beats}/min will be returned. Similarly, if you submit "", then "mm[Hg]" will be returned.
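Behind the REST layer, the service is essentially a mapping-table lookup. A minimal local stand-in, using the two example mappings just mentioned:

```python
# Tiny illustrative extract of the CDISC-unit-to-UCUM mapping table;
# the real service queries a relational database over HTTP.
CDISC_TO_UCUM = {
    "beats/min": "{beats}/min",
    "mmHg": "mm[Hg]",
}

def to_ucum(cdisc_unit: str) -> str:
    """Return the UCUM notation, or an empty string if unmapped."""
    return CDISC_TO_UCUM.get(cdisc_unit, "")

print(to_ucum("mmHg"))       # mm[Hg]
print(to_ucum("beats/min"))  # {beats}/min
```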

I then implemented this web service in the Smart Dataset-XML viewer: when the user right-clicks a cell with a unit (e.g. --ORRESU or --STRESU value), the web service is triggered and the UCUM notation is shown (when the cell value is a valid unit from the [UNIT] list). A few screenshots are shown below:

In some cases, the CDISC notation follows the UCUM notation, but this is surely not always the case; especially for non-SI units the deviations are considerable.

What difficulties did I encounter during the mapping exercise?
Quite a few ...
Some CDISC "units" are not units at all. For example "Virtual Pixel" (NCI C71620).
Other "units" mix up objects ("what it is about") and units. For example "g/mol Creatinine". UCUM has recognized that this bad habit exists and has solved it with so-called "annotations" (see the UCUM specification). So the UCUM notation for this is "g/mol{creatinine}".
In my opinion, CDISC should control annotations for use in clinical research, not the units themselves.
A difficulty that arose, and cost me quite an amount of time, is the "unit" "U/kg". The CDISC definition is: "An arbitrary unit of substance content expressed in units of biological activity per unit of mass equal to one kilogram. Unit per kilogram is also used as a dose calculation unit expressed in arbitrary units per one kilogram of body mass". This sounds like a dual definition, i.e. "U" is used for two different things. When a unit of biological or catalytic activity is meant, the UCUM unit "U" can be used, which is equal to 1 umol/min:

So when a biological activity is meant, the corresponding UCUM notation for "U/kg" would simply be "U/kg", which is equal to 1 umol/min/kg.

When "arbitrary units per one kilogram of body mass" is meant (second part of the CDISC definition), then it is something arbitrary, and depending on what is measured. In such a case, an annotation must be used. So, in the second case, the UCUM notation must be {Unit}/kg.

Is it OK that a "CDISC unit" means two completely different things depending on the use case? I don't think so. Is "arbitrary unit" a unit at all? Isn't the wording "arbitrary units" a "contradictio in terminis" anyway?

Do you also think CDISC should stop developing controlled terminology for "units" and use UCUM?

Your reactions are as always highly appreciated.

Saturday, September 20, 2014

SDTM: let the service do the work - not the dataset

This week, I found some time to continue working on SDTM. Or better: on services for SDTM. In my previous blog entry, I already showed how web services can help working with controlled terminology such as the LOINC codelist for laboratory tests.
I now extended this to CDISC controlled terminology (CDISC-CT) in general, based on the work of my student Wolfgang Hof. First, I downloaded the latest CDISC-CT (June 26) from the NCI website as a set of XML files. Starting from these, I generated and populated a database with about 6 tables. I then wrote some RESTful services so that remote applications can retrieve information for answering questions like:
  • what is the test name for test code XYZ?
  • what is the NCI code for test code or test name XYZ (or the other way around)?
  • what is the CDISC definition of controlled term ABC?
  • are there any synonyms for controlled term ABC?
I hope to make these services available for the general public in the next few weeks.

Then I implemented a good number of these services in the "Smart Dataset-XML Viewer". Here is a screenshot as an example:

What you see is that when the user hovers the mouse over a test code (in this case an LBTESTCD: SPGRAV), the web service is triggered, and the test name and NCI code are retrieved from the remote server/database and displayed as a tooltip on the cell contents.
When the user right-clicks the LBTESTCD cell, the web service is triggered and looks up the "CDISC definition" for the given test code and displays it in a separate window (left upper corner).
When the user right-clicks the LOINC code for this test (in this case 2965-2), a request is sent to the RESTful web service of the "National Library of Medicine", returning the address of a website with explanations about the test, which is then displayed in a browser window that pops up.

On the right, you also see some yellow-colored cells. These indicate that there is something special about the data. In the current case, the cell is colored because its value is lower than the low normal range limit. This is not done by a web service, but by the viewer software itself. Thus, when using this feature, the SDTM variable LBNRIND is superfluous and can be removed from the SDTM specification ("let the service do the work - not the dataset"). Other such features that are already present in the "Smart Dataset-XML Viewer" are:

  • show date of first and last exposure in the DM dataset (retrieved from EX)
  • show --DY value on any --DTC value (calculated from difference with RFSTDTC in DM)
  • show visit name on VISITNUM (retrieved from TV)
Essentially this means that many of the SDTM variables (all those that are "derived") are superfluous. We estimate that about 1 in 3 SDTM variables could be removed from the SDTM-IG, as they can be calculated "on the fly" from the data that are already present in the datasets, or retrieved by a web service. For example, all --TEST variables are superfluous, as their values can be obtained from a web service.
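As a worked example of such an on-the-fly derivation, here is the --DY calculation the viewer performs from a --DTC value and RFSTDTC (following the SDTM convention that there is no study day 0: day 1 is the reference start date itself):

```python
from datetime import date

def study_day(dtc: date, rfstdtc: date) -> int:
    """Derive the SDTM --DY value: the day count relative to RFSTDTC,
    with no day 0 (the reference start date itself is day 1)."""
    delta = (dtc - rfstdtc).days
    return delta + 1 if delta >= 0 else delta

print(study_day(date(2014, 1, 10), date(2014, 1, 1)))   # 10
print(study_day(date(2013, 12, 30), date(2014, 1, 1)))  # -2
```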

Now, this is just the tip of the iceberg. So many other things are possible which can considerably contribute to data quality in SDTM submissions. A few examples:

  • the web service informs about what the usual units for the test or observation are. For example: mm[Hg] for SYSBP and DIABP, cm and [in_i] (inches) for HEIGHT, no units for SPGRAV. This can be used to test whether the combination of ORRES and ORRESU is reasonable and acceptable
  • if it were allowed to use UCUM notation for ORRESU/STRESU (unfortunately it is not yet, although all EHR systems and Hospital Information Systems work with UCUM - it is even mandated by Meaningful Use), then the value of --STRESN could be calculated automatically. The combination of the value of --TESTCD with --ORRES and --ORRESU could be sent to the web service with the request "please calculate the standardized numerical value", as the web service already knows to what unit the value must be standardized for the specific test. This would even enable such normalizations when some of the values for blood pressure are e.g. in [psi] (pounds per square inch)
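Such a standardization is simple arithmetic once the units are unambiguous. A sketch for the blood-pressure case (the conversion factor follows from 1 atm = 760 mm[Hg] = 14.6959 [psi]; the function name is my own, not a real service endpoint):

```python
# 1 [psi] = 760 / 14.6959 mm[Hg], i.e. about 51.715 mm[Hg]
PSI_TO_MMHG = 760.0 / 14.6959

def standardize_pressure(value: float, unit: str) -> float:
    """Return a pressure value standardized to mm[Hg]."""
    if unit == "mm[Hg]":
        return value
    if unit == "[psi]":
        return value * PSI_TO_MMHG
    raise ValueError(f"no conversion from {unit} to mm[Hg]")

print(round(standardize_pressure(2.32, "[psi]"), 1))  # 120.0
```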
In my opinion, these are the kind of features and services people will expect from SHARE in the future. SHARE should be more than a repository of standard specifications; it should behave as a semi-intelligent system that helps sponsors and reviewers improve the data quality of electronic submissions.

For those who like these features of the "Smart Dataset-XML Viewer" and these web services, I am still working on improving the features and extending them, and I hope to make a new (branched off) version of the Viewer available on the Sourceforge website within the next 1-3 weeks. So please remain a bit patient ...

Comments are of course always welcome!

Wednesday, September 3, 2014

LOINC Web Services

In my previous post, I showed how a simple web service (using REST) can be used to retrieve information about a LOINC code from a remote public server. In the "Smart Dataset-XML Viewer", when the user hovers the mouse over a LOINC code, additional information about that LOINC code is displayed as a tooltip. I have now extended this principle by connecting to a web service from the National Library of Medicine, named MedLinePlus Connect. What this web service returns is a snippet of XML containing a reference to a website with a lot of explanation about the given test. I implemented this such that when the user right-clicks a LOINC code in the "Smart Dataset-XML Viewer", the MedLinePlus web service is called, the website reference is retrieved, and the user's default browser is opened with the given URL. So when I right-click "3094-0" in the viewer, a new browser window pops up, giving me the MedLinePlus information about the corresponding "Blood Urea Nitrogen" (BUN) test:

Cool, isn't it?

Is this rocket science? No, not at all - it only cost me 3 hours (including writing this blog) to implement this in the "Smart Dataset-XML Viewer".

MedlinePlus also has similar web services for SNOMED-CT, ICD-9 and ICD-10, and medications (RXCUI). So what I want to do in the next few days is see whether we can implement more of these web services in the "Smart Dataset-XML Viewer".

Saturday, August 30, 2014

Why LBLOINC is so important - web services

LOINC is a worldwide recognized coding system for laboratory tests. It contains over 72,000 codes for lab and vital signs tests, and is used in almost every hospital in the world.
The use of LOINC coding is mandatory in most electronic health record systems and in the standards for exchanging their records, like HL7 CDA. Also, in order to achieve semantic interoperability, the US "Meaningful Use" program mandates the use of LOINC coding (see e.g.

Still, CDISC refuses to mandate the use of LOINC in SDTM LB (laboratory) datasets. Instead, it has developed its own controlled terminology for lab tests, and has published this on the NCI vocabulary website. The latest version of CDISC-CT for lab tests has somewhat more than 2,200 test codes, so it is considerably less detailed and less granular than LOINC. Even worse, the CDISC-CT is just a list, whereas LOINC is a 5-dimensional system.

Essentially, this means that sponsors or their service providers must map whatever they get from the labs to CDISC-CT, which is not only laborious, but also prone to error. A few years ago, a "CDISC LOINC most used codes" list was published, but it does not contain the CDISC-CT codes, and is also no longer maintained. Since then, the number of CDISC-CT codes has multiplied.

In the SDTM LB domain, there is a variable LBLOINC (LOINC code), but it is marked as "permissible" and the explanation is: "Dictionary-derived LOINC Code for LBTEST", meaning that the LOINC code should be derived from LBTESTCD/LBTEST, and not the other way around.
So we have the ridiculous situation that the recommended workflow is to derive LBTESTCD/LBTEST from what the lab is delivering (although the lab itself is using LOINC internally) and then derive LBLOINC from LBTESTCD/LBTEST. You can already guess what will come out of such a workflow.

The fact that LBLOINC is "permissible" has led to the practice that LOINC codes are omitted from SDTM submissions. I have only seen a small fraction of SDTM submissions where they are provided. This also means that the FDA cannot compare lab test results between different studies and sponsors, as LBTESTCD is not sufficiently granular and is ambiguous.

One of the arguments that people have used against the use of LOINC codes in SDTM is that FDA reviewers cannot deduce what the test exactly is just from the LOINC code, which is a 4-5 digit number plus 1 check digit. Better said: the review tools of the FDA are not connected to a LOINC database for LOINC-code lookups.

Therefore, we have developed a LOINC lookup "RESTful" web service and implemented it in the "Smart Dataset-XML Viewer". It's all still a prototype, but I would like to present the first results anyway.

The development of the prototype took me 3 evenings and a Saturday afternoon: one evening for installing the LOINC database on my application/web server, 2 evenings for programming the web service (server side), and the Saturday afternoon for implementing the client side in the "Smart Dataset-XML Viewer". So this is not "rocket science", and I wonder why NCI has not yet implemented such web services for their vocabularies.

Here is a snapshot of a first result:

This is a snapshot of the view of an SDTM LB dataset in the "Smart Dataset-XML Viewer", containing an LBLOINC column (note that I moved it to the left, just after the LBTESTCD column).
When the user hovers the mouse over an LBLOINC cell, the web service is triggered, and the code (in this case 26515-7) is sent to the LOINC terminology server. The web service returns the short LOINC name (showing the 5 dimensions), the LOINC "common" name (a well-understandable short description for "normal" people), and the example UCUM unit or units given in the LOINC database. This information is then combined and shown as a tooltip on the cell.

It is important to note that the "Smart Dataset-XML Viewer" does NOT have a LOINC database; it uses a "web service" on a remote server that is generally available to anyone or any application connected to the internet.

We are currently developing a series of such web services and will make them generally available. One of my students has already developed a --TESTCD lookup service, which I still need to implement on a public server, and we will also develop a "UCUM conversion" web service that e.g. allows calculating a blood pressure in mm[Hg] from a blood pressure in [psi] (pounds per square inch).

Such web services will hopefully trigger CDISC to adhere to worldwide healthcare accepted semantic standards for laboratory tests and for units, instead of developing their own controlled terminologies for these.

If you would like to try out our LOINC webservice, just send me an e-mail, and I will provide you with the details how to query it.

Saturday, April 26, 2014

Submissions in XML - the next step: web services

In one of my previous entries, I showed how some SDTM variables are essentially superfluous, as they are derived and the derivation can just as well be done by the tool that reads the data, such as a viewer. Such features were already implemented in the "Smart Dataset-XML Viewer" (freely available from SourceForge). Using this viewer, the user can have the values for all "--DY" variables calculated "on the fly", as well as the values for RFXSTDTC (first date of study treatment exposure) and RFXENDTC (last date of study treatment exposure), which are automatically retrieved from the "EX" dataset and displayed in the "DM" dataset.

One of my students (Wolfgang Hof) has taken the idea a step further and has developed a web service (using RESTful web service technology) for retrieving information about test codes (like LBTESTCD, VSTESTCD, ...). He developed a simple database and server software, and also generated a few test clients. He also implemented the service in the "Smart Dataset-XML Viewer". So if the user hovers the mouse over a cell containing an LBTESTCD value, the web service is triggered, a request is sent to the server, and the received information about the test code is displayed as a tooltip:

Now, what are the consequences of this?

First of all, it means that we could get rid of the variable LBTEST, as there is a 1:1 relationship between LBTESTCD and LBTEST, and the web service takes care of retrieving the value of LBTEST anyway. So essentially we could remove the variable LBTEST from the "required" variables in the SDTM-IG, also meaning that our LB submission files would decrease in size by something like 10-20%. That is surely something the FDA will like.

But take it another step further and imagine that we could query SHARE using a web service, asking for additional information about the "CHOL" test code: which are the "preferred standardized units", what the FDA regards as the normal range, and much much more, and use that information in the viewer, e.g. marking values the FDA regards as "out of range". Also, SHARE already "knows" that "CHOL" is a test in the category "chemistry", so we could also get rid of LBCAT.

A review environment using web services like the one our student Wolfgang Hof developed, together with SHARE - wouldn't this be a great step forward?

Thursday, March 27, 2014

Why FHIR is so cool - also for clinical research

Last week I took an "HL7 FHIR" course under the motto "life long learning".
I became enthusiastic about FHIR as it takes away many of my concerns about HL7-CDA.

FHIR (pronounced "fire") has been developed by software people, i.e. by "ordinary" IT people, and in such a way that it can be used by people like you and me without needing to learn every detail of the HL7-v3 RIM.
I have been teaching HL7-CDA at companies, and am teaching CDA at the university. It is extremely hard to explain. When doing so I get questions like "why is a medication an Act?" or even worse "why is a medication an observation?", or "why is a document an Act - that simply cannot be true".

When using FHIR, however, you do not need to know about the RIM, acts and observations, or roles and entities. In FHIR, a medication is called a "Medication" - that's it! Furthermore, FHIR is based on RESTful web services, which is a modern way of exchanging information, and it works with snippets of information (named "resources") rather than necessarily with complete documents.

Why is this also so important for clinical research?

When an investigator, for example, wants to know about all the medications that were prescribed to a subject in the last five years by different healthcare providers, he can either query all these systems, which have different interfaces (supposing he is allowed to do so and has access), or ask these systems to generate a standardized document like a CCD or CCDA in the USA, or an "Arztbrief" or "Entlassungsbrief" in Austria (a kind of discharge letter). All of these are "documents", meaning that one first needs to find the right section in the document (probably by OID), then inspect the "code" and "codeSystem" to be sure the entry is really about a medication, and then extract the information. As a medication is just an "observation" there, this is not very easy to automate.
Also, will a hospital be willing to generate a CCD either "on the fly" or in "offline mode" for a medication that was prescribed 5 years ago? And will it want to reveal information about the patient that is needed to make the document valid, but has nothing to do with the request?

When using FHIR, a medication is a "Medication" and the investigator can query the systems where the subject has records using a RESTful web service and a standardized FHIR interface, returning snippets of information (FHIR resources) instead of a whole document. So only the relevant information is provided.

Also for the patients themselves, FHIR is, or will be, a huge improvement. For example, if a patient wants to make a plot of all his/her cholesterol values of the last years as a function of time, he cannot easily do so from a set of CDA documents, as CDA was never developed for such a use case. He/she will probably need to copy-paste the values from the CDA document (or from the HTML view of it) into an Excel worksheet or other application, and then try to generate the plot.
With FHIR, the patient can just use a tool that uses a RESTful web service to query the system with the LOINC code of the cholesterol test (or tests), and get back a list of values with the test dates. Making the plot from this list is then probably a piece of cake (generic software for patients can become available).
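Such a query is just an HTTP GET against the FHIR Observation endpoint with the LOINC code as a search parameter. A hypothetical sketch (the server base URL and patient id are placeholders; 2093-3 is the LOINC code for serum/plasma cholesterol):

```python
from urllib.parse import urlencode

def cholesterol_query(base: str, patient_id: str) -> str:
    """Build a FHIR search URL for all cholesterol Observations
    of one patient, sorted by date."""
    params = {
        "code": "http://loinc.org|2093-3",  # system|code token search
        "patient": patient_id,
        "_sort": "date",
    }
    return f"{base}/Observation?{urlencode(params)}"

print(cholesterol_query("https://fhir.example.org", "123"))
```

The server answers with a Bundle of Observation resources, each carrying a value, a unit (UCUM), and an effective date, which is exactly the list the plotting tool needs.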

So if FHIR with its web services is made available for use by patients, it could also be used for clinical research (of course only with authorization of the subject). Getting all the medications of the last five years would then become much easier than it is now by extraction from CDA documents (which were never designed for such a purpose), if the latter can be obtained from the different providers at all.

Your comments are very welcome as usual!