Friday, January 20, 2012

What ODM (and SDTM) can learn from HL7-CDA

There has been a lot of discussion about using HL7-v3 messages in XML for submissions to the FDA. Some people at the FDA in particular (who are not XML experts at all) are in favor of this: they expect it to bring better integration with EHRs. But they forget that HL7-v3 messages are not about EHRs at all: they confuse HL7-v3 messages with CDA (the latter using a subset of HL7-v3).
Their logic is similar to: "If I have a truck (transport format) that can carry cows, and the same truck can carry oranges, then I will be able to breed cows that can produce orange juice".
More reasons for not choosing HL7-v3 for (SDTM) submissions can be found in my old article "Ten good reasons why an HL7-XML message is not always the best solution as a format for CDISC standard - and especially not for SDTM data". Recent postings from Gartner have also indicated that HL7-v3 messages have been a big failure.

I am currently being trained by HL7 Austria, which will (hopefully) make me an HL7 standards expert and, in particular, make me even better at working with CDA. If all goes well, this will also lead to HL7 certification.

As I also teach CDA at the university (as the Austrian EHR system "ELGA" will be based on CDA), I see more and more clearly (also technically) how information from EHRs can be used in clinical research: the answer is NOT in using HL7-v3-XML for FDA submissions, nor for data collection: we have CDISC ODM for that.

However, there are a few things we can learn from CDA, which we may want to introduce in ODM, or at least allow (through standardized extensions). Here is a first list:

- Use of universal Object Identifiers (OIDs). These are not the current OIDs of ODM, but the ones used in CDA and ISO-21090. For example:

The "Object Identifier" here is "2.16.840.1.113883.6.1" which is the worldwide recognized identifier for the LOINC controlled terminology for lab tests, which is used by almost every hospital in the world that has computers. "8480-6" is the LOINC code for "systolic blood pressure".

In ODM, we would probably use an "Alias" for this, e.g.:
<Alias Context="LOINC" Name="8480-6"/>
or  <Alias Context="2.16.840.1.113883.6.1" Name="8480-6"/>

But better would be that we could incorporate this into the ItemDef, e.g. using:

Don't pin me down on the "cda:" prefix for the extension namespace: it was chosen arbitrarily.
Note also that I added a "MeasurementUnitRef" and an "Alias", the latter to indicate that this data point should later be mapped to SDTM "VSORRES" for "VSTESTCD=SYSBP" (the latter belonging to the CDISC controlled terminology).
Note also that this snippet contains all the information needed to retrieve data from an EHR (in CDA or CCD) into a CRF automatically, provided the systolic blood pressure in the EHR is coded using LOINC (it could also have been coded in SNOMED-CT, which would just add another line to the above snippet).
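How such an automatic lookup could work can be sketched in a few lines of Python. This is only an illustration under assumptions: the ItemDef fragment, the OID "IT.SYSTOL" and the heavily simplified, namespace-free observation are made up for this example, and the Alias uses the LOINC OID as its Context, as in the second Alias variant shown earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical ODM ItemDef fragment (ODM namespace omitted for brevity)
ITEM_DEF = """
<ItemDef OID="IT.SYSTOL" Name="Systolic blood pressure" DataType="integer">
  <Alias Context="2.16.840.1.113883.6.1" Name="8480-6"/>
</ItemDef>
"""

# Hypothetical, heavily simplified CDA-like observation (namespace omitted)
OBSERVATION = """
<observation>
  <code code="8480-6" codeSystem="2.16.840.1.113883.6.1"/>
  <value value="132" unit="mm[Hg]"/>
</observation>
"""


def prefill(item_def_xml, observation_xml):
    """Return (ItemOID, value, unit) if the observation's code matches
    one of the ItemDef's Alias entries, otherwise None."""
    item_def = ET.fromstring(item_def_xml)
    observation = ET.fromstring(observation_xml)
    code = observation.find("code")
    value = observation.find("value")
    for alias in item_def.findall("Alias"):
        same_system = alias.get("Context") == code.get("codeSystem")
        same_code = alias.get("Name") == code.get("code")
        if same_system and same_code:
            return item_def.get("OID"), value.get("value"), value.get("unit")
    return None


result = prefill(ITEM_DEF, OBSERVATION)
print(result)
```

The match is made on the pair (code system OID, code), never on a display name, so a second Alias for SNOMED-CT would be picked up by the same loop.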

Though CDA-R2 does not use ISO-21090 datatypes (yet - it is envisaged for CDA-R3), many people have been asking or even demanding that ODM replace its own datatypes with ISO-21090 datatypes. We must, however, take into account that ISO-21090 datatypes can currently only be used in captured data, not in definitions of to-be-captured data. So it is currently not possible (without adding new features to the ODM standard) to define that "the systolic blood pressure will be captured as a 'physical quantity' (PQ datatype)". One can only state that "the systolic blood pressure has been captured as a 'physical quantity'".

So for example, in CDA-R2, a systolic blood pressure observation is written as:

whereas in ODM it would be written as:
<ItemData ItemOID="IT.SYSTOL" Value="132">
    <MeasurementUnitRef MeasurementUnitOID="MU.mmHg"/>
</ItemData>
Note that CDA/ISO-21090 uses the code and codeSystem (in this case SNOMED-CT), which is more or less equivalent to referencing the ODM OID of the ItemDef in "ItemOID". I say more or less, as each has its advantages and disadvantages.
The CDA/ISO-21090 "notation" has the advantage that it is universal: a machine will immediately understand that the data point is about a systolic blood pressure.
The disadvantage is that a test code will not always be available for a (new?) test in a clinical study.
Note also that the CDA/ISO-21090 "notation" contains more information, such as the capture date and time, which would be a separate data point in ODM.
So why not combine the best of both worlds?

Consider the construct:

Not valid ODM you say!?
This is valid ODM, using a "vendor" extension, allowing elements from the CDA-R2 standard or ISO-21090 standard to appear within the "ItemData" element.
Using this simple and valid construct, it is possible to combine the best of both worlds, and even to directly insert a data point from an EHR into a CRF.
Now, I am not an absolute supporter of ISO-21090, for at least two reasons:
  • You have to pay to obtain a copy of the specification, and it is not cheap: ISO will charge you 238 Swiss Francs (currently US$ 250 or € 200). The document is copyrighted, so passing a copy to your colleague is illegal unless you pay another 238 Swiss Francs.
    Therefore, I do not really consider ISO-21090 an open standard.
  • The XML is bad: have a look again at the last snippet. Did you notice something weird?
    Have a look at the date "19990229". This is not an ISO-8601 date in the XML sense. It is not even a valid date: there was no February 29th in 1999.
    But if you validate against the schema or the Schematron, this error will go unnoticed.
    CDISC uses ISO-8601 in its XML (and also in SDTM), and if you validated some CDISC XML containing this date (now in the correct XML notation) as "1999-02-29", the validation engine would immediately and loudly protest. So the HL7 people made a big mistake here!
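The difference between the two kinds of checking can be illustrated with a small Python sketch. Note the assumption: the digits-only regular expression below is just a stand-in for a purely lexical check; it is not the actual pattern from the HL7 schemas. The calendar check uses Python's standard `date.fromisoformat`.

```python
import re
from datetime import date

# Assumption: a digits-only pattern stands in for a purely lexical check.
# It is NOT copied from the HL7 schemas.
EIGHT_DIGITS = re.compile(r"^[0-9]{8}$")


def passes_pattern_check(value):
    """Lexical check only: eight digits, no calendar logic."""
    return EIGHT_DIGITS.match(value) is not None


def is_valid_iso8601_date(value):
    """Calendar-aware check of an extended-format date (YYYY-MM-DD)."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


print(passes_pattern_check("19990229"))     # the lexical check lets it through
print(is_valid_iso8601_date("1999-02-29"))  # the calendar check rejects it
print(is_valid_iso8601_date("2000-02-29"))  # 2000 was a leap year
```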
Another thing you might have noticed is the use of "mm[Hg]" in the "unit" attribute. This is UCUM notation (Unified Code for Units of Measure). Although healthcare all over the world uses UCUM units, CDISC decided to develop its own controlled terminology for this. In my opinion, it would be better if CDISC also used only UCUM units.

So what we still need in order to make better use of EHRs in clinical research today is a mapping between the CDISC controlled terminology for units of measure and UCUM units. Of course, it would be better to deprecate our own controlled terminology for units and use only UCUM units.

But for the moment, we could use the following construct in ODM:

<MeasurementUnit OID="mmHg" Name="millimeter mercury"

which again defines the link between the CDISC controlled terminology and UCUM.
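In software, such a link boils down to a two-way lookup table. The Python sketch below is only an illustration: the handful of unit pairs is my own small sample, not an official mapping, and a real table would have to be curated and maintained.

```python
# Illustrative sample only: a real CDISC-CT-to-UCUM mapping would have to be
# curated. The pairs below are assumptions chosen for this example.
CDISC_TO_UCUM = {
    "mmHg": "mm[Hg]",
    "kg": "kg",
    "LB": "[lb_av]",
    "cm": "cm",
    "IN": "[in_i]",
}

# Reverse direction, e.g. for data arriving from an EHR with UCUM units
UCUM_TO_CDISC = {ucum: cdisc for cdisc, ucum in CDISC_TO_UCUM.items()}


def to_ucum(cdisc_unit):
    """Return the UCUM code for a CDISC unit term, or None if unmapped."""
    return CDISC_TO_UCUM.get(cdisc_unit)


def to_cdisc(ucum_unit):
    """Return the CDISC unit term for a UCUM code, or None if unmapped."""
    return UCUM_TO_CDISC.get(ucum_unit)


print(to_ucum("mmHg"))
print(to_cdisc("[lb_av]"))
```

Returning None for an unmapped unit, rather than guessing, keeps gaps in the mapping visible.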

So, though CDA is not perfect (and HL7-v3 messages are a disaster), there still are a few things we can learn from CDA. Most of it can already be implemented (thanks to the "vendor" extension mechanism), in ODM as well as in a future SDTM-XML.
Our truck can then carry both clinical research and healthcare information at the same time, even linking the two perfectly.

That is what "CDISC end-to-end" is really about!

Thursday, January 12, 2012

SDTM in XML - the data themselves

Now that we have made the define.xml more logical (and much more end-to-end-friendly), we can do the same for the data themselves.
We do not need VSTEST anymore (as it is a "synonym" or "display" variable, and listed in the metadata), so I commented it out. We can also move the units of measurement to where they belong, i.e. as an attribute of the data point rather than as an attribute of the record.

This leads e.g. to the following SDTM-XML:

Note that there is no explicit VSORRESU or VSSTRESU anymore: the units have been attached directly to VSORRES and VSSTRESN.

Coming from the "flat" SDTM-XML representation (see post of xxxx-xx-xx), I would call this "minimal-invasive multidimensional SDTM in XML" (the world is not round!).

There is some similarity with how HL7-v3, HL7-CDA/CCD and ISO-21090 handle such information, e.g.:

We see that the unit of measure ("unit" attribute of the "value" element) is directly attached to the data point itself.
There are, however, also some major differences with ODM:
  • HL7-v3 (as far as I know) does not have a construct for isolating the metadata. It does not know about planning a visit, planning which forms are used in a visit, which questions to ask, etc.
    In HL7-v3 every data point is an "observation", and one cannot see whether that "observation" was planned or "ad hoc", i.e. the physician spontaneously decided to do a specific test or to make a specific observation.
    Therefore, there is also no referencing to a specific metadata section.
Note also, in the above HL7-v3 snippet (it comes from a CCD document), the use of "xsi:type", which is currently disputed within HL7 as it cannot be validated well and essentially has nothing to do with XML-Schema types.

Note also the bad usage of date formats (in the element "effectiveTime"), which does not follow the ISO-8601 rules for XML dates. The disastrous effect is that e.g. a date "2011-02-29" (which does not exist) is a valid date in HL7-v3.

On the other hand, HL7-V3 uses a lot of "code" and "codesystem" with OIDs (unique object identifiers). In the snippet the codesystem "2.16.840.1.113883.6.96" stands for SNOMED-CT, and the code "271649006" stands for "systolic blood pressure" (NEVER trust what is in "DisplayName"!).
CDISC, however, decided to generate its own controlled terminology (unfortunately without OIDs), which means that we urgently need mappings between CDISC-CT and coding systems used in healthcare such as SNOMED-CT, ICD-10 and LOINC, if we really want to enable integration between healthcare and clinical research.

Another very nice thing in CDA is the use of UCUM units of measurement, which I highly recommend. There is currently no good placeholder in ODM for adding UCUM units of measure (except maybe for the "Alias" element), so I think the next version of ODM should have an additional attribute that gives the UCUM code for each unit of measure used in the study. The great advantage is that UCUM codes make it easy to transform one unit into another (e.g. from pounds to kilograms).
But that's another topic for one of the next blog entries.
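As a small foretaste, here is a Python sketch of such a conversion. It is only an illustration under assumptions: the tiny factor table is hand-written for this example (a real implementation would derive the factors from the UCUM definitions), and it covers mass units only. The factor for the avoirdupois pound, 0.45359237 kg, is the exact international definition.

```python
# Hand-written sample table (assumption for this sketch): UCUM code -> factor
# that converts a value in that unit to kilograms.
KG_PER_UNIT = {
    "kg": 1.0,
    "g": 0.001,
    "[lb_av]": 0.45359237,  # avoirdupois pound, exact by definition
}


def convert_mass(value, from_ucum, to_ucum):
    """Convert a mass between two UCUM units listed in KG_PER_UNIT."""
    if from_ucum not in KG_PER_UNIT or to_ucum not in KG_PER_UNIT:
        raise ValueError(f"No conversion known for {from_ucum} -> {to_ucum}")
    kilograms = value * KG_PER_UNIT[from_ucum]
    return kilograms / KG_PER_UNIT[to_ucum]


weight_kg = convert_mass(150, "[lb_av]", "kg")
print(round(weight_kg, 2))
```

Because every unit is expressed relative to one base unit, adding a new unit needs only one new factor, not one per pair of units.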

Back to our SDTM-XML data snippet. I call it minimal-invasive because it deviates only slightly from a two-dimensional representation of the data.

But if we look more carefully, we can see a lot of things we can improve further:
  • Do we need the data point (SDTM variable) "DOMAIN"? The fact that we have ItemGroupOID="MyStudy:VS", which is a reference to the "ItemGroupDef" with OID "MyStudy:VS", which in turn has the attribute "Domain" with value "VS", already gives us that information.
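That the domain can be derived from the metadata is easy to show in Python. The fragments below are simplified assumptions made for this sketch: the ODM namespace and most required attributes are left out, and the ItemOID is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified metadata fragment (namespace and other attributes omitted)
METADATA = """
<MetaDataVersion>
  <ItemGroupDef OID="MyStudy:VS" Name="Vital Signs" Domain="VS" Repeating="Yes"/>
</MetaDataVersion>
"""

# Simplified data fragment; note that it carries no DOMAIN data point
DATA = """
<ItemGroupData ItemGroupOID="MyStudy:VS" ItemGroupRepeatKey="1">
  <ItemData ItemOID="MyStudy:VS.VSTESTCD" Value="SYSBP"/>
</ItemGroupData>
"""


def domain_of(item_group_data_xml, metadata_xml):
    """Look up the Domain of the ItemGroupDef that the data refer to."""
    item_group_data = ET.fromstring(item_group_data_xml)
    metadata = ET.fromstring(metadata_xml)
    wanted_oid = item_group_data.get("ItemGroupOID")
    for item_group_def in metadata.findall("ItemGroupDef"):
        if item_group_def.get("OID") == wanted_oid:
            return item_group_def.get("Domain")
    return None


domain = domain_of(DATA, METADATA)
print(domain)
```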

Wednesday, January 11, 2012

Other strange things in define.xml

Although some people will protest, I am still stating that the SDTM standard has been written with SAS XPT in mind. The 8-character, 40-character and 200-character limitations in SDTM do have a source: the ancient SAS XPT format.
Another major problem of SAS XPT is that it essentially describes two-dimensional tables, similar to tables in relational databases. But even if the SDTM is a blueprint for databases, database specialists will still find a lot of strange things in the specification and implementation guide.

"The world is not flat" has been preached by Armando Oliva from the FDA, stating that also the FDA would like to go to a multi-dimensional model for SDTM submissions. Unfortunately, what they are proposing is a set of HL7-v3 messages, not really knowing what they are talking about.

Multi-dimensional models for SDTM would make life (and CDISC end-to-end) considerably easier, and would allow us to get rid of many of the strange and illogical constructs in define.xml.

Let us e.g. have a look at the pair VSTESTCD and VSTEST.
According to the SDTM standard, VSTEST is a "synonym" qualifier of VSTESTCD (the standard speaks about "equivalent terms for a --TESTCD"). So VSTEST is NOT an attribute of the SDTM record; it should be an attribute of VSTESTCD.
But how is this made visible in SDTM datasets and in define.xml?
It isn't.

Even worse, both VSTESTCD and VSTEST have controlled terminology, i.e. there is an associated CodeList in define.xml for each of them. Let's have a look:

Here is the codelist for VSTESTCD:

and here the one for VSTEST:

We see that define.xml uses CodedValue = Decoded Value.
But how do we now know that "BMI" corresponds to "Body Mass Index"?
These are related 1 to 1, aren't they?
Maybe we know, but there is no way a machine can understand this.

So, what's wrong?
The reason for all this is the flatness of the SDTM, due to the choice of SAS XPT as a transport format.

For me, VSTEST (i.e. the test name) is just a "display variable" for VSTESTCD, i.e. it is not really necessary, and when using XML, one could just display it when needed, e.g. as a tooltip in the HTML that is generated by the stylesheet.
I will soon write a separate blog entry about how this can be done and what it could look like.

So, ideally, in the define.xml there should NOT be a variable VSTEST, only a VSTESTCD, and the ItemDef for VSTESTCD should look like:

Remark the use of SDSVarName to keep the SDS (SDTM) Variable name, and the correct use of the "Name" attribute containing the test name (description), so that we do not need VSTEST anymore.
Here is the associated codelist:

It clearly shows that "Adipose Tissue" is the vital sign test name for the vital sign test code "BODYFAT", a relation that cannot be found out with the current SDTM constructs.

Next time, we will see how this can be further extended for units of measurement (--ORRESU, --STRESU) and valuelists.

SDTM in XML - the metadata

The previous post showed a snippet of ODM-XML that could be used (and is used by a number of vendors) to store SDTM data in an XML format.
Now, we do already have the metadata for this set of SDTM data in XML format: it is the define.xml.
For example for the VS domain we may find:

(P.S. The order of the attributes is unimportant in XML, the browser just lists them in alphabetical order)

I have set Mandatory="Yes" for those variables that are "required", and Mandatory="No" for those that are expected or permissible. The reason is that ODM has the rule that a data point MUST have a value when Mandatory="Yes", and because a data point can be null (or absent) in SDTM even when the variable is expected, we need to set Mandatory="No" for expected variables. A typical example is VSORRES, which can be null if the test was not done.
In my personal opinion, it was a design error in SDTM to have "expected" variables. They should have been called "conditionally required", and the rules should have been stated.
But I must admit that ODM does not have a good construct for "conditionally required" data points either.

The careful reader will already have noticed that the SDTM data in XML format in the previous post do not have an entry for VSSTAT and VSREASND.
There is a very good reason for that: in ODM, only the data points for which there is a value are listed within an "ItemGroupData". Only when one explicitly wants to state that a data point has been set to NULL does one use the "IsNull" attribute (for further details, see the ODM specification).
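The three possible situations (a value, an explicit NULL, or simply absent) can be told apart as in the following Python sketch. The fragment and its ItemOIDs are made up for this illustration, and the ODM namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical data fragment: VSORRES is explicitly null, VSSTAT is absent
ITEM_GROUP_DATA = """
<ItemGroupData ItemGroupOID="MyStudy:VS" ItemGroupRepeatKey="1">
  <ItemData ItemOID="VS.VSTESTCD" Value="SYSBP"/>
  <ItemData ItemOID="VS.VSORRES" IsNull="Yes"/>
</ItemGroupData>
"""


def status(item_group_data_xml, item_oid):
    """Return 'value', 'explicitly null' or 'absent' for one data point."""
    item_group_data = ET.fromstring(item_group_data_xml)
    for item in item_group_data.findall("ItemData"):
        if item.get("ItemOID") == item_oid:
            if item.get("IsNull") == "Yes":
                return "explicitly null"
            return "value"
    return "absent"


for oid in ("VS.VSTESTCD", "VS.VSORRES", "VS.VSSTAT"):
    print(oid, "->", status(ITEM_GROUP_DATA, oid))
```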

One thing I have never liked in define.xml is the "abuse" of the "Name" attribute to hold the SDTM domain name (in ItemGroupDef) or the SDTM variable name (in ItemDef). In ODM, the "Name" attribute holds a short description (for display) of the variable, as free text (so not enumerated). In define.xml, this short description is kept in def:Label instead. In my opinion, it shouldn't be. That the developers of define.xml chose to add an extra attribute (def:Label) is strange, as ODM already has an attribute for the domain name, i.e. "Domain" on ItemGroupDef, and one for the SDS/SDTM variable name, i.e. "SDSVarName" on ItemDef. So the better solution would have been, e.g.:

<ItemGroupDef OID="MyStudy:VS" Name="Vital Signs" Domain="VS" Repeating="Yes">...</ItemGroupDef>

which I think would make more sense, especially as it makes end-to-end easier (fewer transformations are necessary). Probably (but I am not 100% sure) the developers of define.xml did not choose this solution because "Domain" is not a mandatory attribute in ODM.