Saturday, November 7, 2015

Submission standards and monopolies

This morning, and actually the whole last week, was frustrating in one respect. I had to retrieve some personal information from the Austrian tax portal "Finanzonline". It started when I received a letter from the authorities saying that I hadn't reacted to a message sent to me on that portal (I didn't even know there was such a message, as there never has been any E-mail notification). I then had to reset my password for some reason, so I tried, and expected to receive an E-mail allowing me to do the final step. Instead I got the message that another paper letter would be sent to me. I wasn't at home when the postman wanted to deliver it, so I had to pick it up at the post office: the postman was not allowed to drop it in my post box (as it was "official"). To my surprise I did not get a new password - I got two.
It then took me another 15 minutes to get the portal entry working, among other things because it wanted me to change these two passwords and required me to enter additional information, i.e. my social security number, which is already known to them (beyond username and the two passwords). Finally I got into the portal and read the message, which said I needed to provide a few documents. I looked for a PDF upload button. There wasn't one. I looked for an e-mail address (including in the letters I got) where I could send the PDFs to. There wasn't one either. The only thing I found was a phone number: if I had questions I would need to use that (phone is much more secure than E-mail, isn't it? Just ask Mrs Merkel, Germany's chancellor). Essentially, I experienced the portal as one of the top 3 most user-unfriendly systems I have ever used (the two others are the German tax portal, which is even worse, and the SAP portal at our university).

So it came as a surprise when a colleague told me that this tax portal is a prize winning portal! I googled somewhat and found the following announcement on the portal's website itself:

Even if your German is good (mine is), can you read it? I can't - they did not even get the encoding of German characters right ... If you got such a product description on Amazon, would you still buy from them? Looks like a prize winner in the land of the half blind...

Now, what does this have to do with SDTM submission standards?

Just like the tax authorities here, the regulatory authorities for new medications and therapies have a monopoly. They do not need to compete like the Chicago Cubs in baseball (see Wayne's blog). So there is no real drive for IT innovation, except sometimes from the general public via the parliament (see e.g. the article in Medpage Today).

In my not so short life, I have always been one of the first to try out new IT systems, as soon as they became available, that aimed to make people's lives easier. I started using the internet at a time when you still needed a phone modem, and transferring a 10MB dataset took a whole day. I started using internet banking when less than 1% of the customers of my bank did. So I have gained quite some experience. My experience with IT systems of organizations that have a monopoly is the following:
  • They haven't got a working portal for their "customers" (example: "Deutsche Rentenversicherung")
  • If they have, it is often extremely user-unfriendly (example: the Austrian tax portal)
  • Within the organization, they use outdated technology
  • They generally distrust e-mail 
  • No publicly described web services are available
  • They have no idea what XML and JSON are, or have prejudices against them
Recognize some things in the scope of SDTM and ADaM submissions?

Suppose each of these organizations had a large set of web services for which the API is publicly available. Then vendors could compete on tools for interacting with the systems of these organizations. For example, for the Austrian tax authorities, Austrian citizens could then choose between different portals for communicating with the authorities, the portals being owned by different companies or organizations competing against each other in user-friendliness.

But this doesn't solve the problem of the drive to modernize...

Any exceptions to these? Just let us know...

Saturday, August 8, 2015

SDTM/SEND/ADaM Validation - how it should be done

This is my personal opinion.
It is not the opinion of CDISC, nor the opinion of any of the CDISC development teams I belong or do not belong to (although I guess that some team members will probably strongly agree with it), nor the opinion of OpenCDISC (who I guess will strongly disagree).

How should validation of SDTM, SEND or ADaM datasets be done?
We have the CDISC standards specifications, the implementation guides (further abbreviated as "IG") and we have the define.xml. Furthermore we currently have two formats for the datasets themselves: SAS Transport 5 and Dataset-XML. Only the former is currently (unfortunately) being accepted by the FDA.

First let me state the most important thing of everything: Define.xml is leading.

This means that what is in the define.xml file that is part of the submission is "the truth". This file contains all the metadata of the submission, whether it be an SDTM, SEND or ADaM submission.
So all submitted datasets should be validated against the metadata in the define.xml.
Now you will say "Wait a minute Jozef, being valid against the metadata in the define.xml doesn't mean that the datasets are in accordance with what is written in the corresponding IG".

That is correct! So the first step should always be to validate the contents of the define.xml against the contents of the IG. For example, if controlled terminology (CT) is expected for a specific variable, this should be declared in the define.xml using a CodeList, and the contents of the CodeList should match the controlled terminology published by CDISC: every entry in the define.xml codelist should appear in the published controlled terminology, unless the latter has been defined as "extensible". This does not mean that the CodeList should have exactly the same contents as the CDISC-CT list, as usually only a fraction of the coded values is used in the actual submission. For example, the CDISC-CT for lab test codes contains several hundred terms, but the define.xml should only list those that have actually been used in the study.
Also the maximal lengths, labels and datatypes in the define.xml should be checked against the IG: for example, a maximum of 8 characters for test codes and 40 for test names. If your define.xml states that the maximal length for LBTESTCD is 7, that is completely OK; if it states that the maximal length of LBTESTCD is 9, that would be a violation.
Unfortunately, the latest IGs still publish datatypes as "character" and "numeric", whereas define.xml uses much better and more granular datatypes. Fortunately, the define.xml specification itself provides a mapping (see section 4.2.1 "data type considerations" of the define.xml 2.0 specification). Other things, like the order in which the variables appear, can also be checked at this point.
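A minimal sketch of this first validation step, assuming the define.xml metadata has already been extracted into plain Python dictionaries (a real implementation would of course parse the ODM/define.xml structures directly; the CT excerpt and IG limits below are illustrative stand-ins):

```python
# Sketch of step 1: validate define.xml metadata against the IG and the
# published CDISC controlled terminology. The dictionaries are hand-written
# stand-ins for content parsed from define.xml and the CT publication.

# Illustrative excerpt of published CDISC-CT
published_ct = {
    "LBTESTCD": {"extensible": True,
                 "terms": {"ALB", "ALP", "ALT", "AST", "GLUC"}},
    "SEX": {"extensible": False,
            "terms": {"F", "M", "U", "UNDIFFERENTIATED"}},
}

# IG limits: test codes max 8 characters, test names max 40
ig_max_lengths = {"LBTESTCD": 8, "LBTEST": 40}

def validate_define_against_ig(define_codelists, define_lengths):
    errors = []
    for var, codelist in define_codelists.items():
        ct = published_ct.get(var)
        if ct and not ct["extensible"]:
            for term in codelist - ct["terms"]:
                errors.append(f"{var}: '{term}' not in non-extensible CDISC-CT")
    for var, length in define_lengths.items():
        limit = ig_max_lengths.get(var)
        if limit is not None and length > limit:
            errors.append(f"{var}: declared length {length} exceeds IG maximum {limit}")
    return errors

# A subset codelist is fine; an unknown term in a non-extensible codelist
# and a declared length of 9 for LBTESTCD are both violations.
print(validate_define_against_ig({"SEX": {"F", "M", "X"}}, {"LBTESTCD": 9}))
```

Note that a define.xml declaring only {"ALB", "ALT"} for LBTESTCD passes: a subset of the published terminology is exactly what is expected.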

Once all the information in the define.xml has been validated against the IG and everything is OK, we can proceed with the second step: validation of the datasets against the define.xml. The largest part of this validation consists of checking whether each variable value is of the correct datatype (as defined in the define.xml), is one of the codelist values provided by the define.xml (when applicable), and whether its length is not longer than defined in the define.xml. At this point, also the label for each variable (when using SAS Transport 5) can be checked against the one provided in the define.xml, which again must match the one from the IG (at least when it is not a sponsor-defined domain).
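This second step can be sketched as follows, again with the variable metadata written out by hand as a stand-in for what would be read from the define.xml:

```python
# Sketch of step 2: validating dataset records against define.xml metadata
# (datatype, maximal length, codelist membership). The metadata dictionary
# is an illustrative stand-in for parsed define.xml content.

metadata = {
    "LBTESTCD": {"datatype": str, "maxlength": 8, "codelist": {"ALB", "GLUC"}},
    "LBSTRESN": {"datatype": float, "maxlength": None, "codelist": None},
}

def validate_record(record):
    errors = []
    for var, value in record.items():
        meta = metadata.get(var)
        if meta is None:
            continue
        if not isinstance(value, meta["datatype"]):
            errors.append(f"{var}: wrong datatype {type(value).__name__}")
        if meta["maxlength"] and len(str(value)) > meta["maxlength"]:
            errors.append(f"{var}: value longer than {meta['maxlength']}")
        if meta["codelist"] and value not in meta["codelist"]:
            errors.append(f"{var}: '{value}' not in define.xml codelist")
    return errors

print(validate_record({"LBTESTCD": "GLUC", "LBSTRESN": 5.4}))   # []
# 'CHOLESTEROL' is too long AND not in the codelist: two violations
print(validate_record({"LBTESTCD": "CHOLESTEROL", "LBSTRESN": 5.0}))
```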

Now there are some "rules" that cannot be described in the define.xml. In most cases these are cross-domain or cross-dataset rules, although there are also some "internal-domain" rules, such as uniqueness of USUBJID in DM, and uniqueness of the combination USUBJID-xxSEQ in each domain. But even then, some of the rules can be checked using the define.xml: for example, there is the rule that ARMCD (or ACTARMCD) must be one from the TA (trial arms) domain/dataset. In the define.xml, a codelist will be associated with the variable ARMCD for TA. So one should test whether the value of ARMCD in the TA dataset is one from the associated codelist in the define.xml (note that e.g. "SCRNFAIL" should also appear in the define.xml codelist when there was at least one screen failure). When "planned" or "actual" ARMCD/ACTARMCD is then present in another domain/dataset, and its value also corresponds to one of the entries in the codelist of the define.xml, the value is valid, even without needing to do any direct cross-dataset validation.

Once validation of the datasets against the define.xml has been done (step 2), one can proceed with step 3: validation against any remaining rules that cannot be described in the define.xml. Mostly these will be cross-domain rules, for example: "EX record is present when subject is not assigned to an arm: Subjects that have withdrawn from a trial before assignment to an Arm (ARMCD='NOTASSGN') should not have any Exposure records" (FDA rule FDAC49). I took one of the FDA rules on purpose here, as it is a clear rule; the SDTM-IG does not really provide such explicit, clear rules, so validation software implementors usually base their rules on their own interpretation of the IG, which is very dangerous. It is very important that the rules are very clear and not open to different interpretations. Both the FDA and a CDISC team have published or will be publishing sets of rules (remark that even the "FDA SDTM rules" are not always clear, and some are even completely wrong in some cases). At best, such rules are both human-readable and machine-interpretable and machine-executable. Unfortunately, the published FDA rules for SDTM are only "text", leaving them open again to different implementations. We have however already done some work on an XQuery implementation of the FDA rules, and one of our students recently implemented the CDISC rules in XQuery for the SDTM DM domain. The plan is to extend this work, further develop and publish the XQueries in cooperation with CDISC and the FDA, and make these rules available over RESTful web services (so that they can be retrieved by computers without human intervention).
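To make the idea of such a cross-domain rule concrete, here is a minimal sketch of FDAC49 (our actual implementation uses XQuery, as mentioned above; the DM and EX records below are hypothetical):

```python
# Sketch of cross-domain FDA rule FDAC49: subjects withdrawn before arm
# assignment (ARMCD = 'NOTASSGN' in DM) should have no EX (exposure) records.
# The datasets are lists of dictionaries standing in for the real domains.

def check_fdac49(dm, ex):
    not_assigned = {r["USUBJID"] for r in dm if r["ARMCD"] == "NOTASSGN"}
    exposed = {r["USUBJID"] for r in ex}
    return sorted(not_assigned & exposed)   # subjects violating the rule

dm = [{"USUBJID": "S-001", "ARMCD": "A"},
      {"USUBJID": "S-002", "ARMCD": "NOTASSGN"}]
ex = [{"USUBJID": "S-001", "EXDOSE": 10},
      {"USUBJID": "S-002", "EXDOSE": 10}]   # S-002 violates FDAC49

print(check_fdac49(dm, ex))   # ['S-002']
```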

The figure below depicts (of course simplified) the process again:

The figure speaks of SDTM, but the same is equally applicable for SEND and ADaM.

Short summary:
  • step 1: for SDTM/SEND/ADaM submissions, validate the contents of the define.xml against the IG
  • step 2: validate the datasets against the define.xml
  • step 3: for all other "rules" (mostly cross-dataset/domain validation), validate the datasets against rules published by CDISC and FDA, if possible using human-readable, machine-executable rules.
Comments are of course very welcome!

Saturday, June 6, 2015

FDA starts embracing LOINC

Two friendly colleagues from pharma companies pointed me to very recent FDA publications about LOINC, the worldwide coding system for laboratory test codes in healthcare.
The first is the "Study Standards Resources" page mentioning LOINC as a standard used by the FDA (better "was", as the mention was removed a few days later). The second, a few days later, is however much more important: an article in the Federal Register titled "Electronic Study Data Submissions; Data Standards; Support for the Logical Observation Identifiers Names and Codes".

You might want to read it before continuing with this blog entry.

I have been "fighting" for several years now to have LOINC recognized as the identifying coding system for lab tests in SDTM submissions. CDISC however developed its own coding system for lab tests, which is inconsistent and does not allow tests to be uniquely identified. My formal request to put the further development of CDISC-CT lab codes (LBTESTCD/LBTEST) on hold, and gradually move to LOINC, did however not make me many friends in the CDISC Controlled Terminology Team - on the contrary.

In its newest Federal Register publication, the FDA requests comments on this initiative, especially about "the Agency recognizes that the high level of granularity inherent in LOINC has presented coding challenges and that these challenges have led to the creation of subsets of LOINC to help facilitate coding", with the specific question:

"Should FDA identify a LOINC subset for its use case?"

I think this is a good idea. LOINC has over 72,000 test codes. LOINC itself has already published a "top 2000+ LOINC codes" list, i.e. a list of the 2,000 most used codes. Also in Austria, where the use of LOINC coding is mandatory in the national electronic health record system (ELGA), a subset has been published which should preferentially be used. And then we have the very old "LOINC codes for common CDISC tests", which has unfortunately not been maintained in recent years.

It is important that such a subset be a "recommendation" only; otherwise people will start "pushing" tests whose codes are not in the list into one of the test codes that is in the list, thus essentially falsifying the information. If the FDA would recommend the 2000+ list, or would pick up the old CDISC list again (and modernize it), this would be a very wise step, as e.g. the 2000+ list covers 98% of all the tests done in hospitals (and probably also in research).

There is however no technical limitation to allowing the full LOINC list of test codes, as there are now RESTful web services available for looking up test codes and their meaning. You can find the complete list here. The National Library of Medicine even has a LOINC web service with very detailed information about many of the tests, which is queried using the LOINC code. This web service, and other CDISC-CT web services, have already been implemented in the "Smart Dataset-XML Viewer", a software tool for inspecting SDTM submissions in Dataset-XML format. They can also easily be implemented in any modern software tool, including the ones used by the FDA.
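As a sketch of what such a lookup could look like in code: the endpoint below is the NLM "Clinical Table Search Service" for LOINC; treat the exact URL and the layout of the JSON response as assumptions to be verified against the service's own documentation.

```python
# Sketch of looking up a LOINC code over a RESTful web service.
# The endpoint is the NLM Clinical Table Search Service; verify the URL
# and response layout against the service documentation before relying on it.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://clinicaltables.nlm.nih.gov/api/loinc_items/v3/search"

def loinc_query_url(code):
    # Build the query URL for a single LOINC code
    return BASE + "?" + urlencode({"terms": code})

def lookup(code):
    # Network call - returns the parsed JSON response of the service
    with urlopen(loinc_query_url(code)) as response:
        return json.load(response)

print(loinc_query_url("2345-7"))   # 2345-7 = serum/plasma glucose
```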

The FDA starting to embrace LOINC (will CDISC soon follow?) is a major step forward. I have been "fighting" for this for several years, so you can imagine that my weekend is now already perfect ...

Tuesday, April 21, 2015

The DataSet-XML FDA Pilot - report and first comments

The FDA very recently published a report on their pilot of the use of Dataset-XML as an alternative to SAS-XPT (SAS Transport 5) for SDTM, SEND and ADaM submissions. Here is a short summary of this report together with my comments.

The pilot was limited to SDTM datasets. The FDA departments that were involved were CDER and CBER. The major objectives of the pilot were:
  • ensuring that data integrity was maintained when going from SAS datasets (sas7bdat) to Dataset-XML and back.
  • ensuring that the Dataset-XML format supports longer variable names, labels and text fields than SAS-XPT (which has limitations of 8 characters for names, 40 for labels and 200 for text fields).
Comment: Unfortunately, the following was not part of the testing in the pilot:
  • the capability of transporting non-ASCII-7 characters (another limitation of XPT)
Six sponsors were selected out of fourteen candidates. The selection criteria can be found in the report and will not be discussed here.

Two types of tests were performed:
a) test whether the Dataset-XML files can be transformed into sas7bdat and can be read by FDA data analysis software (e.g. JMP)
b) test whether data integrity is preserved when converting sas7bdat files to Dataset-XML files and then back to sas7bdat files

Comment: Although the report doesn't mention this at all, I heard that one of the submissions was generated using Java/Groovy. This also proves that correct Dataset-XML files can be generated by tools other than statistical software (which was pretty hard with XPT). I.m.o. this is a valuable result that should have been mentioned.

Further, sponsors were asked to submit Dataset-XML files that contain variable names longer than 8 characters, variable labels longer than 40 characters, and text content longer than 200 characters. The goal of this was to test whether Dataset-XML files can (sic) "facilitate a longer variable name (>8 characters), a longer label name (>40 characters) and longer text fields (>200 characters)".

Comment: well, that is something we already know for many many years ...

Issues found by the FDA.


During the pilot, a number of issues were encountered, all of which could be resolved.
  • Initially, testing was not successful due to a memory issue caused by the large dataset size. This issue was resolved after the SAS tool was updated to address the high memory consumption
Comment: Well-designed modern software that parses XML should not use more memory than software parsing text files or SAS-XPT. See e.g. my comments about memory usage with technologies like VTD-XML in an earlier blog. It is a myth that processing large XML files consumes much memory. XML technologies like SAX are even known for using only small amounts of memory. The issue could however quickly be resolved by the SAS people who were cooperating in the pilot (more about that later).
  • Encoding problems in a define.xml file
Comment: this has nothing to do with Dataset-XML itself. What happened was that a define.xml used curly quotes ("MS Office quotes") as the delimiters of XML attributes. Probably the define.xml was created either by copy-paste from a Word document or generated from an Excel file. These "curly quotes" are non-standard and surely not supported by XML.
Generating define.xml from Excel files or Word documents is extremely bad practice. See my blog entry "Creating define.xml - best and worst practices". Ideally, define.xml files should be created even before the study starts, e.g. as a specification of what SDTM datasets are expected as a result of the study.
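You can see the curly-quote problem for yourself with any standards-conforming XML parser: it only accepts straight quotes (" or ') around attribute values, so a define.xml pasted together from Word will simply not parse (the ItemDef snippet below is illustrative):

```python
# Demonstration of why "curly quotes" break a define.xml: an XML parser
# only accepts straight quotes around attribute values.
import xml.etree.ElementTree as ET

good = '<ItemDef OID="IT.LBTESTCD" Name="LBTESTCD"/>'
# Same element, but with curly quotes as produced by a Word copy-paste
bad = '<ItemDef OID=\u201cIT.LBTESTCD\u201d Name=\u201cLBTESTCD\u201d/>'

ET.fromstring(good)          # parses fine
try:
    ET.fromstring(bad)
except ET.ParseError as e:
    print("not well-formed XML:", e)
```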

  • A problem with an invalid (?) variable label in the define.xml
Comment: The FDA found that "Dataset-XML requires consistency between the data files and define.xml". Now, there is something strange about the statement regarding the variable label, as labels do not appear at all in the Dataset-XML files. What I understood is that the define.xml file that came with the Dataset-XML files had one label that was not consistent with the label in the original XPT file. With Dataset-XML, the define.xml becomes "leading", and that is exactly how it HAS to be. With XPT, there is a complete disconnect between the data files and the define.xml.
So yes, Dataset-XML requires that you put effort into providing a correct define.xml file (as it is leading), and that is a good thing.

File sizes


A major concern of the FDA is and always has been the file sizes of Dataset-XML files. Yes, XML files usually are larger than the corresponding XPT files. However, this does not have to be the case.
The largest files in an SDTM submission usually are the SUPPQUAL files.
SUPPQUAL files can be present for several reasons:
  • text values longer than 200 characters. Starting from the 201st character, everything is "banned" to SUPPQUAL records. This is not at all necessary when using Dataset-XML.
  • non-standard variables (NSVs). According to the SDTM-IG, NSVs may not appear in the "parent" dataset, but must be provided in the very inefficient SUPPQUAL datasets. The latter can then also grow quickly in size. If NSVs were allowed to remain in the parent dataset (marked as NSVs in the define.xml), we would mostly not need SUPPQUAL datasets at all, and so the largest files would disappear from our submissions. Unfortunately the report does not give us any information about what the largest files were.
Let's take an example: I took a classic submission which has an LB.xpt file with laboratory data of 33MB and a SUPPLB.xpt file of 55MB. So the SUPPQUAL file SUPPLB.xpt is considerably larger, although it only contains data for 2 variables (the LB.xpt file has data for 23 variables). The corresponding Dataset-XML files have sizes of 66 and 40MB, so they are somewhat larger than the XPT files. If one now brings the 2 NSVs back into the parent records, the Dataset-XML file is 80MB in size (and there is no SUPPLB.xml file at all), so smaller than the sum of the LB.xpt and SUPPLB.xpt files.
Of course, one could also move NSVs to the parent dataset when using XPT.

In the pilot, the FDA observed file size increases (relative to XPT) of up to 264%, and considers this to be a problem. Why?
It cannot be memory consumption when loading in modern analysis tools. As I have shown before, modern XML technologies like SAX and VTD-XML are known for their low memory consumption.
Disk cost can also not be an issue. The largest submission was 17GB in size, which comes at a disk cost of 0.51 US$ (3 dollar cents per GB).
So what is the issue? Here is the citation from the report:
"Based on the file size observations, Dataset-XML produces much larger files than XPORT, which may impact the Electronic Submissions Gateway (ESG)".
OK, can't the ESG handle a 17GB submission? If not, let us zip the Dataset-XML files. Going back to my 80MB LB.xml file: when I do a simple zipping, it reduces to 2MB, so the file size is reduced by a factor of 40! If I did the same with my 17GB submission, it would reduce to a mere 425MB (for the whole submission), something the ESG can surely handle. So, what's the problem?
Wait a minute Jozef! Didn't we tell you that the ESG does not accept .zip files?
A few thoughts:
  • would the ESG accept .zzz files? (A little trick we sometimes use to get zipped files through e-mail filters: just rename the .zip file to .zzz and it passes...)
  • would the ESG accept .rar files? RAR is another compression format that is also very efficient.
  • why does the ESG not accept .zip files? We will ask the FDA. Is it fear of viruses? PDFs can also contain viruses, and modern virus scanners can easily scan .zip files for viruses before unzipping. Or is it superstition? Or the mistaken belief that zipping can change the file contents?
  • modern software tools like the "Smart Dataset-XML Viewer" can parse zipped XML files without the need to unzip them first. Also SAS can read zipped files.
Compression of XML files is extremely efficient, so those claiming that large file sizes can lead to problems (I cannot see why) should surely use a compression method like ZIP or RAR.
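Both points - the excellent compression ratio of repetitive XML markup, and parsing a zipped XML file without unzipping it first - can be demonstrated in a few lines using only the standard library (the file content below is synthetic, standing in for a large Dataset-XML file):

```python
# Compress a Dataset-XML-like file and parse it straight out of the ZIP
# archive. Repetitive XML markup compresses extremely well, which is why
# ratios of 1:20 and better are common for Dataset-XML.
import io
import zipfile
import xml.etree.ElementTree as ET

# Synthetic stand-in for a large Dataset-XML file: many similar ItemData elements
xml_data = ('<ODM><ItemGroupData ItemGroupOID="LB">'
            + '<ItemData ItemOID="IT.LBTESTCD" Value="GLUC"/>' * 100000
            + '</ItemGroupData></ODM>').encode("utf-8")

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("lb.xml", xml_data)
compressed = buffer.getvalue()
print(len(xml_data), "bytes raw,", len(compressed), "bytes zipped")

# Parse directly from the archive - no temporary unzipped copy needed
with zipfile.ZipFile(io.BytesIO(compressed)) as zf:
    with zf.open("lb.xml") as f:
        root = ET.parse(f).getroot()
print(len(root[0]), "ItemData records read from the zipped file")
```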

 A few things that were not considered in the pilot:


  • data quality
    The combination of Dataset-XML and define.xml allows better data quality checks than XPT does. Tools can easily validate the contents of the Dataset-XML against the metadata in the define.xml. With XPT this is much harder, as it requires a lot of "hard coding". Although OpenCDISC supports Dataset-XML, it does not (yet, or only in a very limited way) validate Dataset-XML against the information in the define.xml file
  • the fact that define.xml is "leading" brings a lot of new opportunities. For example, the define.xml can be used to automatically generate a relational database (e.g. by transformation into SQL "CREATE TABLE" statements), and the database can then be automatically filled from the information in the Dataset-XML files (e.g. by transformation into SQL "INSERT" statements). This is also possible with XPT, but much, much harder when SAS is not available.
  • this brings us to another advantage of Dataset-XML. As it is XML, it is really an "open" and "vendor-neutral" format. So software vendors, but also the FDA itself, could create software tools that do very smart things with the contents of the submission. This too seems not to have been considered during the course of the pilot.
  • non-ASCII support. As already stated, XPT only supports ASCII-7. This means that e.g. special Spanish characters are not supported (there are 45 million Spanish-speaking people in the US). XML can (and does by default) use UTF-8 encoding (essentially this means "Unicode"), supporting a much, much larger character set. This is one of the main reasons why the Japanese PMDA is so interested in Dataset-XML: XML easily supports Japanese characters, for which there is no support at all in ASCII-7.
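The database idea from the list above can be sketched very compactly: turning define.xml dataset metadata into a "CREATE TABLE" statement is little more than a datatype mapping. The metadata dictionary and the type mapping below are illustrative stand-ins, not the official define.xml datatype names:

```python
# Sketch: generate an SQL "CREATE TABLE" statement from define.xml-style
# dataset metadata. The metadata dictionary is a hand-written stand-in for
# parsed define.xml content; the type mapping is illustrative.

sql_types = {"text": "VARCHAR", "integer": "INTEGER", "float": "REAL",
             "date": "DATE", "datetime": "TIMESTAMP"}

def create_table(dataset, variables):
    cols = []
    for name, meta in variables.items():
        sql_type = sql_types[meta["datatype"]]
        if meta["datatype"] == "text" and meta.get("length"):
            sql_type += f"({meta['length']})"
        cols.append(f"  {name} {sql_type}")
    return f"CREATE TABLE {dataset} (\n" + ",\n".join(cols) + "\n);"

print(create_table("LB", {
    "USUBJID":  {"datatype": "text", "length": 20},
    "LBTESTCD": {"datatype": "text", "length": 8},
    "LBSTRESN": {"datatype": "float"},
}))
```

The matching "INSERT" statements would then be generated by walking the ItemData elements of the Dataset-XML files.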


What's next?


Very little information is provided in the report. It only states:
"FDA envisages conducting several pilots to evaluate new transport formats before a decision is made to support a new format". So it might be that the FDA conducts other pilots with other proposed formats, such as semantic web technology, or maybe even CSV (comma separated values).
There are also some rumours about a Dataset-XML pilot for SEND files.



The pilot was successful as the major goals were reached: ensuring data integrity during transport, and ensuring that Dataset-XML supports longer variable names, labels and text values.
The FDA keeps repeating its concerns about file sizes, although these can easily be overcome by allowing NSVs to be kept in the parent dataset, and by allowing compression techniques, which are very efficient for XML files.

Some further personal remarks


I have always found it very strange that the FDA complains about file sizes. It was the FDA who asked for new derived variables in the SDTM datasets (like the --DY variables) and for duplicate information (e.g. test name --TEST, which has a 1:1 relationship with test code --TESTCD). Derivations like the --DY calculation can easily be done by the FDA tools themselves (it is also one of the features of the "Smart Dataset-XML Viewer"), and e.g. the test name can easily be retrieved using a web service (see here and here). Removing these unnecessary derived or redundant variables from the SDTM would reduce the file sizes by at least 30-40%.



Special thanks are due to SAS and its employee Lex Jansen, who is a specialist in both SAS and XML (well, I would even state that he is a "guru"). Lex spent a lot of time working together with the FDA people and resolving the issues. Special thanks are also due to a number of FDA people whom I cannot mention by name here, for their openness and the many good discussions with a number of the CDISC XML Technology team volunteers.

Friday, March 27, 2015

Creating define.xml - best and worst practices

Define.xml is a CDISC standard, based on XML, allowing sponsors to provide metadata for SDTM, ADaM and SEND electronic submissions to the FDA and the Japanese Ministry of Health.
But define.xml is more than that: it is also a very good way to exchange metadata for SDTM, ADaM and SEND datasets between partners in the clinical trial world.

As a CDISC define.xml instructor, I am often asked what the best practices are to generate define.xml files that not only conform to the standard, but also correctly and concisely describe the metadata of SDTM, ADaM or SEND datasets.

In this blog, I will discuss a few of what are, in my opinion, the best practices for generating define.xml. Although never asked for, I will also list what i.m.o. are the worst practices for creating define.xml.

In the following, I will use SDTM a lot as an example, but usually the same applies to ADaM and SEND.

Best practices

If you are a sponsor outsourcing the generation of SDTM datasets (e.g. to a CRO), the best practice is to generate a define.xml that can be used by the company that will need to deliver the SDTM datasets, as a specification of what has to be delivered. This means that it might be an extremely good idea to generate a define.xml for a study even before the study has started. If you have had similar studies before for which you already have a define.xml, this is pretty straightforward. The define.xml that you supply as a specification does not need to be complete, but if you designed your study well (taking into account that you want to submit it later), it should be nearly complete.

Here is a slide set of my colleague Philippe Verplancke about this, even showing how a define.xml can be used as a direct source for CRF design for the next study.

Another best practice is to design your study using a study design tool, generate an ODM metadata file from it, and do the mapping between study design and SDTM even before the study starts. There are several good software tools on the market for doing so. I have a customer who even does the complete mapping between ODM and SDTM long before the study starts, as a quality control on the study design. The idea is very simple: "if we cannot map the study design to SDTM now (before the study starts), we will be in big trouble later". The define.xml itself is used to store the mapping instructions.

Even if your study has already started, it is a very good practice to do this mapping long before the database closes. Partial data can be used to test the mappings, and if you do it well, you only need to run the mappings once (maybe taking half an hour or so) after database closure. And you already have your define.xml - you only need to remove the mapping instructions from it.

The reason for this is that the best tools on the market all use the same method: while generating the mappings (usually using drag-and-drop to generate mapping scripts that can then be refined), a define.xml is kept in the background (also keeping the mappings) and automatically synchronized. So if you change something in the mapping, this is automatically reflected in the underlying define.xml.

Whether you use a special tool or a statistical package, what is essential is that you have a process in place in which your define.xml is fully synchronized with your generated or to-be-generated SDTM datasets.

Other good practices

Learn XML, or have someone in your team who has XML knowledge. My undergraduate students learn XML in just two lectures (each of 90 minutes) and one or two exercise afternoons. Learning XML is easy.
With a little XML knowledge, you can generate a define.xml file for your SDTM datasets starting from an existing sample file (like the one published by the CDISC define.xml team). Estimated effort: about 1-2 days. Use an XML editor (and not NotePad or WordPad or the like - that is a guaranteed disaster) - there are even some very good XML editors that are free.
When you edit a define.xml file, be sure to regularly validate it against the define.xml XML-Schema (most XML editors have this functionality), which ensures that your basic structure is correct. Read the "define.xml validation whitepaper" for more information. There are some good tools on the market for validating define.xml files; however, do avoid those that do not allow you to validate your define.xml against the define.xml XML-Schema (such tools have implemented non-official "rule interpretations", which are the vendor's own interpretations of the define.xml specification).
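Schema validation is also easy to automate outside an XML editor. A minimal sketch using the third-party lxml package follows; the miniature schema is purely illustrative - in practice you would validate against the official define.xml XSD published by CDISC:

```python
# Sketch of validating an XML document against an XML Schema using lxml.
# The tiny schema below is illustrative; real validation uses the official
# define.xml XSD.
from lxml import etree

xsd = etree.XMLSchema(etree.fromstring("""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ItemDef">
    <xs:complexType>
      <xs:attribute name="OID" type="xs:string" use="required"/>
      <xs:attribute name="Name" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""))

valid = etree.fromstring('<ItemDef OID="IT.LBTESTCD" Name="LBTESTCD"/>')
invalid = etree.fromstring('<ItemDef OID="IT.LBTESTCD"/>')   # Name missing

print(xsd.validate(valid))     # True
print(xsd.validate(invalid))   # False - required attribute missing
```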

There are also a few tools on the market that allow you to generate / edit a define.xml in a user-friendly way without needing to see the XML itself (several new ones have been announced). As long as such a tool allows you to validate your define.xml against the XML-Schema and against the other rules from the define.xml specification, allows you to inspect the source XML, and allows you to inspect the result using the stylesheet, it may be a good choice.

Bad practices

A bad practice that is often followed is trying to generate a define.xml file from a set of SDTM files (XPT files) using a "black box" tool (post-generation). You can use such a tool to generate a "first guess" define.xml file for your study, but you will still need either to edit it using an XML editor, or to use one of the tools mentioned before, to complete and fine-tune the define.xml.
Many users however expect such tools to generate a correct and ready-to-submit define.xml file. They inspect the result using the stylesheet (without inspecting the source XML or validating it against the XML-Schema), and are surprised that it doesn't work or doesn't provide what they expect.
One such tool allows you to write the "instructions" in an Excel worksheet and then use that to "automatically" create a define.xml file. There is, however, no manual explaining how to write these instructions: you are told to take an existing define.xml file (for example one published by CDISC), convert it to an Excel worksheet with such "instructions", and learn from that how to write or complete the worksheet for your own submission. In the time needed to figure out how that works, you could already have become an XML expert!

Many people think that define.xml is what they see in the browser (using the stylesheet), not realizing (or not wanting to realize) that there is an XML file behind it - a file that is machine-readable and can be used for much more than just displaying the study metadata. Unfortunately, most reviewers at the FDA do not realize this either.

Worst practices

Some companies have similar processes in place where the define.xml is generated post-SDTM, usually using Excel worksheets, or even Word documents. Especially the latter is extremely dangerous, as XML usually uses UTF-8 encoding (the international standard), while your Word document might use an encoding that is incompatible with UTF-8. So if you copy-and-paste from a Word document into an XML document, and you have characters beyond 7-bit ASCII, don't be surprised when you see the strangest things in the result.
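The danger is easy to demonstrate. A minimal Python sketch of what happens when Windows-1252 bytes (the typical Word encoding on Windows) are read back as UTF-8 (the unit string is of course just an invented example):

```python
text = "µmol/L"  # a unit string containing a non-ASCII character

# Correct round trip: write and read as UTF-8
ok = text.encode("utf-8").decode("utf-8")

# Broken round trip: bytes written as Windows-1252 (cp1252) but read as UTF-8
garbled = text.encode("cp1252").decode("utf-8", errors="replace")

print(ok)        # µmol/L
print(garbled)   # �mol/L - the single 'µ' byte is not valid UTF-8
```

A strict XML parser would reject the broken bytes outright instead of showing a replacement character - which is why "the strangest things" appear when such content is pasted into a define.xml.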

To summarize:
  • create your define.xml before your study starts
  • use a tool or process that keeps your define.xml in sync with your mappings between study design and SDTM
  • post-SDTM generation of define.xml is a "last resort" method - try to avoid it
  • do not expect "black-box" tools to generate a "submission-ready" define.xml file for you
  • avoid the use of Excel worksheets and Word documents to generate a define.xml
  • learn XML - it is not difficult! Even if you use good tools, it will help you understand what you have created
Also read our newest post with even more information.

Wednesday, February 25, 2015

CDISC Therapeutic Areas and LOINC

In a recent discussion with an important CDISC representative (well, they are all important), I heard the remark that the therapeutic area user guides (TAUGs) may be the perfect forum to introduce people to LOINC coding for lab tests. So I downloaded them all, and searched them for:
  • whether LOINC is mentioned at all
  • whether there are lab tests described which are / may be important for the specific therapeutic area
  • whether the important lab tests that are described come with their LOINC code
  • whether example LB domain tables have the LOINC code (in LBLOINC) included
So, let's go!

TAUG multiple sclerosis

  • mentions that the --LOINC variable is not to be used in the NV and OE domains
  • no lab tests described
TAUG virology
  • LOINC mentioned many times as the standard reference terminology for tests in the VR and PR domains
  • Shows up in many of the example tables, for example PFTSTRCD=48005-3, PFTSTRNM=LOINC
  • No specific laboratory tests described
TAUG Influenza
  • LOINC not mentioned at all
  • A few lab tests are described (e.g. nucleic acid amplification techniques - NAAT) for which LOINC coding exists (e.g. 68987-7, 38270-5), but the LOINC codes are not provided
TAUG Cardiovascular Studies
  • LBLOINC field and values provided in all LB example tables
  • LOINC itself is not further explained; it is assumed to be known by the users
TAUG Asthma
  • LOINC not mentioned at all
  • Different lab tests extensively described (e.g. Leukotriene E4, Immunoglobulin E), but no LOINC codes provided, though they exist (e.g. 33344-3, 62621-8, ...)
TAUG Alzheimer (2.0)
  • Mentioned as standard technology for genomic tests (PFTCVNM, PFTTESTCV)
  • No specific laboratory tests described
TAUG Diabetes
  • LOINC not mentioned at all
  • A good number of proposed lab tests are defined in detail, but no LOINC codes are given, although they exist for each of the described tests
So, the use of LOINC seems to vary widely: in some cases, LOINC is used all the way (and even expected to be known by the user - cardiovascular), while in other cases, the authors of the TAUG do not even seem to be aware of LOINC as a coding system for the exact identification of lab tests.

In the case of the TAUG diabetes, I added the LOINC codes myself to one of the tables describing a set of lab tests. It took me less than 10 minutes to do so (thanks WWW!). Here is the result:

Personally, I strongly believe that providing/suggesting LOINC codes for lab tests in a TAUG is a very good thing, as they can be copied into the protocol, so that the protocol writers / investigators / SDTM generators do not need to interpret / guess which test exactly must be done or was done.


Sunday, February 8, 2015

The power of UCUM: unit conversions using web services

Yesterday, one of my smaller dreams came true.

Some years ago, when I started working with electronic health records (EHRs), I discovered UCUM, the "Unified Code for Units of Measure", which is used for all quantitative data in them. So I started teaching UCUM to my students. UCUM also publishes an XML file, the "ucum-essence.xml" file, which in principle makes automated unit conversions possible, as it allows one to decompose each string in UCUM notation into the base units (m, g, s, ...).

So I found a student, Milos Ilic, who agreed to develop a Java program for unit conversion with a nice graphical user interface as part of his Bachelor thesis. He succeeded in doing so (well done!) - here is a snapshot of his application:

It allows the user to select a property (pressure in this case), then a source unit plus a prefix (cm water column in this case) and a target unit plus a prefix (mm mercury column in this case), and then performs the unit conversion.
Just to be clear: the application does not use a conversion table; it decomposes both cm[H2O] and mm[Hg] into their base units (m, g, s), allowing it to generate the conversion factor "on the fly".
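The idea can be sketched in a few lines of Python. The tiny hand-coded table below stands in for what is actually parsed from ucum-essence.xml; the factors are the (approximate) UCUM-defined values in pascal:

```python
# Each unit decomposes into (factor, dict of base-unit exponents).
# Hand-coded subset standing in for what a real tool parses from ucum-essence.xml.
UNITS = {
    "cm[H2O]": (98.0665,  {"g": 1, "m": -1, "s": -2}),  # conventional cm of water
    "mm[Hg]":  (133.3224, {"g": 1, "m": -1, "s": -2}),  # conventional mm of mercury
}

def convert(value, source, target):
    src_factor, src_dims = UNITS[source]
    tgt_factor, tgt_dims = UNITS[target]
    if src_dims != tgt_dims:  # base-unit exponents must match, else not commensurable
        raise ValueError("units are not commensurable")
    return value * src_factor / tgt_factor

result = convert(100.0, "cm[H2O]", "mm[Hg]")
print(round(result, 2))  # about 73.56
```

Because both units reduce to the same base-unit exponents, the conversion factor is simply the ratio of the two scale factors - no lookup table of unit pairs is ever needed.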

But this was just a first step, as the Java application (though excellently made) still has a number of disadvantages:
  • it requires a human to use it
  • you can't call it from within your own application e.g. saying "please let me know how many mm mercury column 100 cm water column is"
  • it does not support more complex, composite UCUM strings like "mm[Hg]/min pressure decrease"
  • it does not support annotations, like in mmol{creatinine}/ml

So a "web service" was due.
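To illustrate the idea - not the actual service; the path layout, JSON field name, and the two supported units below are invented for the sketch - here is a toy RESTful conversion endpoint in pure Python, queried by a client in the same script:

```python
import json
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical factor table (values in pascal); a real service would
# derive these from ucum-essence.xml instead of hard-coding them.
TO_PASCAL = {"cm[H2O]": 98.0665, "mm[Hg]": 133.3224}

class ConvertHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # invented path layout: /convert/<value>/<source-unit>/<target-unit>
        parts = [urllib.parse.unquote(p) for p in self.path.strip("/").split("/")]
        if len(parts) == 4 and parts[0] == "convert" \
                and parts[2] in TO_PASCAL and parts[3] in TO_PASCAL:
            answer = {"conversion_result":
                      float(parts[1]) * TO_PASCAL[parts[2]] / TO_PASCAL[parts[3]]}
            body = json.dumps(answer).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ConvertHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# client side: any application can now ask the question over plain HTTP
url = "http://127.0.0.1:%d/convert/100/%s/%s" % (
    server.server_address[1],
    urllib.parse.quote("cm[H2O]", safe=""),
    urllib.parse.quote("mm[Hg]", safe=""),
)
with urllib.request.urlopen(url) as response:
    result = json.loads(response.read())["conversion_result"]
print(round(result, 2))
server.shutdown()
```

The point of the RESTful style is visible here: the whole question lives in the URL, and the answer comes back as machine-readable JSON, so any application - not just a human with a GUI - can use it.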

And yesterday, I could finally finish writing (and especially testing) the code. The service is a RESTful web service, allowing any modern computer application to ask it questions like: "please let me know how much 0.25 pounds per square inch per minute is in millimeter mercury per hour". Here is the result when displayed in a browser:

You can find all the details about the web service at:

So how could this web service be used in healthcare and in clinical research? Here are a few possibilities:
  • allow EHR implementations to do unit conversions automatically
  • if CDISC allowed / mandated UCUM units (it only very partially does so, for lab units), automate the process of converting original results (--ORRES) to standardized units (--STRESN)
  • allow FDA tools to validate whether a conversion between the original and the standardized result has been done correctly
  • allow the FDA to compare lab results across studies and sponsors (they can't right now) - another requirement for this is that they use LOINC for lab test codes
  • and much much more
The web service is not perfect yet, there still are a small number of limitations which we are working on.
But I do encourage you to at least use this web service in your pilot applications!

Friday, January 23, 2015

ValueList Web Services in the "Smart Dataset-XML Viewer"

I now also implemented these services into the "Smart Dataset-XML Viewer". I still need to QC it and will then make the new version available through SourceForge.

Here is a snapshot of a VS dataset. It validates without errors or warnings in OpenCDISC (2.0).

See something special?

Now I let the "Smart Dataset-XML Viewer" validate the data using the following web services:

  • check whether a CDISC unit is a correct unit for a given (VS) test code
    ({testcode}/{cdiscunit} )
  • check whether a Vital signs "position" (VSPOS) is a correct "position" for a given (VS) test code
    ({testcode}/{position} )
Here is the result:

The "Smart Dataset-XML Viewer" finds the following problems:
  • mm[Hg] is not a valid unit for VSTESTCD=SYSBP (second row)
    (remark that this data point came from an EHR, where UCUM notation is mandatory, but CDISC still does not allow UCUM...)
  • cm is not a valid unit for VSSTRESU with VSTESTCD=SYSBP (same row)
  • cm is not a valid unit for VSORRESU with VSTESTCD=SYSBP
    obviously a data management error (although a mapping error cannot be excluded)
  • SITTING is not a valid VSPOS ("position") with VSTESTCD=HEIGHT
Once again, this dataset passed without errors/warnings through OpenCDISC.
The reason is that the latter does not implement these kinds of plausibility rules. It just checks, for example, whether "cm" is a valid member of the [UNIT] codelist (which it is). But of course "cm" is not applicable to a blood pressure.
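The difference between the two kinds of checks is easy to sketch in Python (the two lookup tables below are invented excerpts, not the full CDISC controlled terminology; a real implementation would query the web service or database instead):

```python
# Invented excerpt of the CDISC [UNIT] codelist
UNIT_CODELIST = {"cm", "in", "kg", "mmHg", "beats/min"}

# Invented excerpt of a test-code-to-valid-units mapping
VALID_UNITS_PER_TEST = {
    "SYSBP":  {"mmHg"},
    "HEIGHT": {"cm", "in"},
}

def codelist_check(unit):
    """The OpenCDISC-style check: is the unit a member of [UNIT] at all?"""
    return unit in UNIT_CODELIST

def plausibility_check(testcode, unit):
    """The web-service-style check: is the unit plausible for this specific test?"""
    return unit in VALID_UNITS_PER_TEST.get(testcode, set())

passes_codelist = codelist_check("cm")                   # True: "cm" is in [UNIT]
passes_plausibility = plausibility_check("SYSBP", "cm")  # False: not for a blood pressure
print(passes_codelist, passes_plausibility)
```

The first check is satisfied by "cm" anywhere in the dataset; only the second one catches that "cm" makes no sense for a systolic blood pressure.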

Now, one could implement such plausibility rules in software (hardcode it as OpenCDISC mostly does for other rules), but why do that (with zero transparency) when a web service is available?

I must explicitly thank Anthony Chow (CDISC), who published these rules in the form of an Excel worksheet (see "CT Mapping/Alignment Across CodeLists" at the CDISC-CT website).
All I did was move this information into a database and write the RESTful web service for it.

This kind of functionality is exactly what CDISC users want to see in SHARE. My implementation is just a prototype or "proof of concept", and of course I am talking with CDISC about how this kind of web service could be provided by the real SHARE.