Saturday, February 27, 2010 | Posted by Will FitzHugh at 10:15 PM
Advances in Genome Biology and Technology Conference 2010
Many feet, many paths to EHR
I’ll forgive a news commentator for inaccuracy under fire. But in this era of “health IT,” with intense focus and funding for creating electronic medical records, is this kind of thinking old fashioned? Is it acceptable to state that someone should seek treatment where their medical records reside? Even a former president?
Well… yes. Many strident cases are made for “digitizing” medical records through the wounded warrior stories, in which families of injured veterans cart around suitcases of medical records from doctor to doctor, seeking treatment and answers. This is one of the business cases that drives the Nationwide Health Information Network (NHIN), which seeks to provide the trust fabric for secure electronic health information exchange. One of the promises of the NHIN is that you can leave your folder/binder/suitcase at home, because Dr. Smith can obtain the test results taken by Dr. Joshi. It's a fantastic vision of a primary benefit of electronic health records.
I work with the wonderful people behind the NHIN; I work to create the testing and validation processes that will ensure that the systems exchanging health information across this network are capable of doing so. I’ll write more about the work being done, but I work on the NHIN and had to be my own personal NHIN at my doctor’s office last week – Dr. S--- faxed my test results to me, and the e-fax rendered a pdf on my laptop, which I showed to Dr. P--- in his office. I work on the NHIN and realize we’ve got a long way to go before its promise can be realized – we’ve got a long way to go before it would be considered absurd that Bill Clinton would get treatment at a hospital just because his medical records were there.
Why aren’t we there yet? Government, and businesses big and small, are attempting to clear the path to make the sharing of health information simpler – everyone from Dell to seemingly Pizza Hut is getting into the effort. We’d like to hear your thoughts on the largest challenges – time, money, technical hurdles, privacy concerns, workflow modifications, access consent, human nature? – how will conditions unfold so that Bill Clinton can access his medical records anywhere? What will that look like – will Clinton own his information (in a chip in his arm, in his iPhone, in HealthVault or Google Health et al, or in his suitcase full of medical records), or will the data reside where it originated, shareable with others? Is our first step defining standards for machines to exchange data, or defining a way for people to share information? We’re heading to the Healthcare Information and Management Systems Society conference (HIMSS) next week and will report on our conversations, musings, and discoveries. In the meantime, please share your own comments and ideas.
Thursday, January 21, 2010 | Posted by Brent Gendleman at 11:23 AM
What Gets You Up at 5AM? Focus: Rare Disease Community
As part of 5AM's deeply ingrained commitment to understanding the facets of the community we seek to serve, this marks the first of a series of entries focused on learning what gets our clients, partners and the collective set of biomedical stakeholders up at 5am!
The team at 5AM was thrilled to participate in the series of meetings organized by Genetic Alliance and the NIH's Office of Rare Disease Research. The volume and diversity of people so willing to share their perspectives, ideas, operational models and concerns surrounding rare disease issues and research was simultaneously informative and inspirational. The sense of urgency, combined with considered approaches, gives birth to a vision that can drive change now and for generations to come. We appreciate that the use of registries, the exposure of biorepositories and the protocols that inform them, and the ability to associate consented data - be it patient reported such as Family Health History - or provider collected clinical, imaging or molecular data - will lead to a greater understanding of disease. We are optimistic this will ultimately provide the means to inform treatment and hopefully, cures.
After demoing our Biospecimen Locator at the Registry and Repository Boot Camp, we sat down to figure out how best to collaborate with the variety of forces - government, non-profit, commercial, and patients and their families - to move the collective, burgeoning vision forward, focusing on what we can do today. The Biospecimen Locator enables the exchange of information to facilitate research by moving specimens from the freezer to the hands of researchers.
We hope you will consider leveraging the collective work we've done with a wide variety of stakeholders as we all look to harmonize standardized common data elements, vocabulary, and open source software to facilitate research collaborations that produce results. Please take advantage of our offer to do a demo of this open source software that can be used today to simplify the visibility and exchange of biospecimens.
I'll close with one of the brilliant quotes in a breakout session from Dr. Carolyn Compton, Director of the Office of Biorepositories and Biospecimen Research, when talking about how to propel change:
- You can make people do it
- You can pay people to do it
- You can show people VALUE and they will do it themselves
Monday, January 4, 2010 | Posted by Andy Evans at 10:28 AM
Extreme Visualization Makeover #1: Genome.gov’s “Published Genome-Wide Associations” Chart
Our first subject provides us with a really great teaching moment on color semiology. Semiology is a favorite term-of-art in visualization – simply put, it means “pertaining to communication through signs and symbols.” We use it to mean any system of signification – whether it’s through icons, color coding, numbering systems, mapping symbols, etc. Semiology encompasses the art and science of choosing the right way to signify things.
Which brings us back to the first subject of our series. Genome.gov publishes a quarterly summary graphic showing the loci of all the SNP-trait associations with p-values < 1.0 x 10^-5, plotted as colored dots on a graphic representation of the human chromosome complement.

Now, this chart has grown over time to encompass 104 such traits (as of this writing). The authors of the chart have chosen to differentiate the traits through color semiology – each trait is assigned a unique colored dot. Take a moment to view the full-size graphic (click on the thumbnail above – the chart will open in a new window). Now – see if you can uniquely identify all of the orange dots, and the traits they represent. It’s pretty tricky, isn’t it? I actually had to resort to Photoshop’s color-sampling tool to tell some of them apart. It’s virtually impossible for a fully-sighted individual – imagine how tough this chart is for someone with color-compromised vision.
The problem here is that color semiology is not appropriate for such a large value range. We simply don’t do well differentiating 104 different colors from a field of dots. Compounding the issue is the effect of the Gestalt Color Principle – our brains want to group together things with really similar colors, which can be useful in some instances, but here it just makes matters worse.
In 1969, Brent Berlin and Paul Kay published a groundbreaking study of color naming across cultures, in which they proposed that there were really 11 fundamental (or “focal”) colors that everyone could easily differentiate, most likely based on some underlying physiological or neurological principle. The Berlin-Kay palette was extended to include cyan by the visualization guru Colin Ware, giving us 12 colors that are reasonably safe to use for ordinal color semiology in infographics and data visualization. What do I mean by “ordinal” color semiology? Data dimensions that are sets of things (like categories, without quantitative interrelationships), rather than continuous, ordered, quantitative values, are what I’m calling ordinal here (statisticians would usually call them nominal or categorical). We can also use color for quantitative values – in fact, we can even split color into its three component subdimensions – hue, saturation and value – and use each of these to represent a separate quantitative dimension. Heatmaps and terrain relief are examples of such quantitative color semiology. We still have to be careful though, because we’re fairly bad at discerning specific quantitative values in a color (hue, saturation, or value) range.
But in the graphic we’re considering here, the authors are trying to use ordinal color semiology for 104 separate ordinal values. By now, you should understand why this is disastrous. It’s nearly 10 times as many colors as the Berlin-Kay set. It’s a set-up for failure.
So, how might we improve matters? One approach might be to employ a hybrid semiology – for instance, grouping the traits into manageable sets (with 12 or fewer sets in total) and encoding these sets with color semiology. Then, within each set, numbering the traits (numeric semiology). Let’s see how this might work, using chromosome 18 as a guinea pig. First, here’s what we’re starting with (excerpted from the original document):

Notice how your eyes and mind have to work to make sense of this, even though it’s just one chromosome from the whole diagram – you can do it, but it’s not intuitive or fast. Also, notice that the two Type 1 diabetes dots may actually look slightly different in color, due to their proximity to dots of different colors – this kind of color interaction is another hazard of using lots of different colors jumbled together to represent things. If there were only 12 well-differentiated colors on the diagram this would not be as big of a problem, but on the full diagram with 104 colors, there are too many things that are “lavender-mauve-ish” – so these kinds of visual effects become meaningful.
Now, let’s rework it a bit by sorting our traits into categories, and assigning a “Berlin-Kay safe” color to all traits in the same category. Then we’ll number within each category and put the numbers on the dots.

Suddenly, you can find things! It works well in both directions – whether you start from the legend or from the loci on the chromosome. This solution will scale up to quite a large number of traits without losing its efficacy, as long as the number of categories stays at 12 or fewer.
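For anyone who wants to try the hybrid scheme on their own data, here is a minimal sketch of the bookkeeping: one color per category, then a running number within each category. The color names are my reading of the 12-color palette discussed above, and the trait names and categories (apart from Type 1 diabetes) are placeholders I made up for illustration.

```python
# Twelve color names, taken here as an assumption about the extended palette.
# Swap in whatever well-differentiated palette you trust.
SAFE_COLORS = [
    "red", "green", "yellow", "blue", "black", "white",
    "pink", "cyan", "gray", "orange", "brown", "purple",
]


def build_legend(trait_categories):
    """Map each trait to a (color, number) pair.

    trait_categories: dict of trait name -> category name.
    Traits in the same category share a color; numbering restarts
    at 1 inside each category.
    """
    categories = sorted(set(trait_categories.values()))
    if len(categories) > len(SAFE_COLORS):
        raise ValueError("more categories than safely distinguishable colors")

    color_of = dict(zip(categories, SAFE_COLORS))
    counters = {category: 0 for category in categories}

    legend = {}
    for trait in sorted(trait_categories):
        category = trait_categories[trait]
        counters[category] += 1
        legend[trait] = (color_of[category], counters[category])
    return legend


# Hypothetical input: only "Type 1 diabetes" comes from the chart itself.
traits = {
    "Type 1 diabetes": "metabolic",
    "Trait A": "cancer",
    "Trait B": "cancer",
    "Trait C": "cardiovascular",
}

legend = build_legend(traits)
for trait, (color, number) in legend.items():
    print(f"{color} {number}: {trait}")
```

The hard limit on categories is deliberate: once you run out of colors, the function refuses rather than quietly reusing one, which is exactly the failure the original chart runs into.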
Is this the only solution (or even the best) to this problem? Probably not. There are other possibilities as well – one approach would be to split the diagram into multiples – duplicate copies of the whole diagram, broken out by some value, such as the categories assigned above (i.e., a diagram showing only cancers, another showing only cardiovascular loci, etc.). A possible disadvantage to this approach would be that you would no longer see the proximity of seemingly unrelated traits on the same chromosome, which might hinder insight into linkages, etc.
I hope this stimulates your thinking about choosing the right way to signify things. Please comment – do you see another way to approach this? Do you think I’m way off base (or right on)? Also, if you encounter any other charts, visualizations or infographics that you think could use a makeover, please send me a link and I’ll add them to the list for consideration.
Thanks for reading, and be sure to check back here for future installments of Extreme Visualization Makeovers!
Tuesday, December 15, 2009 | Posted by Todd Parnell at 2:57 PM
Done, Done, Done, and Done
Private branch commits: Commits to private branches can be made at any time, without restriction. The code doesn't even have to compile. This permits daily checkpointing and keeps us from losing work.
Mainline development commits: Commits here affect other team members, so we have several requirements. Every commit must be tied to a tracker item. Peer review is required, no matter how minor the change. Code must pass continuous integration (CI), which brings in things such as coding conventions and the regression suite (and the sombrero). These requirements ensure traceability, that CI always passes, and that at least two sets of eyes have seen every piece of code.
Resolving a subtask: Individual tracker items are the lowest level of granularity we typically talk about at daily standups. Resolving a subtask requires not only committed code, but also high quality unit tests, removal of fixmes in the code, and capture of any known limitations or escapes as additional subtasks. We don't measure velocity here, but we have found that the team gets a good sense of progress by watching subtasks move to resolved.
Resolving backlog items: This is where most scrum definitions of done center, and where the team measures velocity. Our definition is similar to others - beyond resolving all subtasks, resolving a backlog item requires that: Functional and non-functional requirements have been met. Integration tests exist. Escaped or deferred functionality is captured. In totality, the code is deployment-ready.
We're done, done, done, and done.
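For the curious, the mainline rules above are simple enough to write down as a checklist in code. This is only a sketch of the idea, not our actual tooling: the `MainlineCommit` class, its field names, and the tracker ID format are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class MainlineCommit:
    tracker_id: str       # hypothetical format, e.g. "PROJ-123"
    peer_reviewed: bool   # someone other than the author has looked at it
    ci_passed: bool       # conventions and regression suite are green


def mainline_violations(commit):
    """Return the list of mainline rules this commit breaks (empty if none)."""
    problems = []
    if not commit.tracker_id.strip():
        problems.append("commit is not tied to a tracker item")
    if not commit.peer_reviewed:
        problems.append("commit has not been peer reviewed")
    if not commit.ci_passed:
        problems.append("commit has not passed continuous integration")
    return problems


good = MainlineCommit(tracker_id="PROJ-123", peer_reviewed=True, ci_passed=True)
bad = MainlineCommit(tracker_id="", peer_reviewed=False, ci_passed=True)

print(mainline_violations(good))
print(mainline_violations(bad))
```

Private branch commits would skip this check entirely, which is the whole point of having them.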
Monday, December 14, 2009 | Posted by Will FitzHugh at 4:09 PM
Mammography Screening - Don't Panic!
The main idea that caused so much consternation was that women aged 40-49 should no longer routinely undergo mammograms to look for signs of breast cancer, a reversal of the previous guidelines, which had recommended routine screening for that age group. But let's look at what these new guidelines really say:
- "The USPSTF recommends biennial screening mammography for women aged 50 to 74 years"
- "The decision to start regular, biennial screening mammography before the age of 50 years should be an individual one and take patient context into account, including the patient's values regarding specific benefits and harms"
So it doesn't say that women in their 40's should not get mammograms. It says they should weigh the risks and benefits themselves (and with their doctor, obviously, although I wish it had said that explicitly) and make their own decision.
So why not get mammograms earlier? The term 'screening' is key here. A screening test is a test that is given when there's no prior evidence or risk for a condition. A cholesterol test is a screening test, for instance, to look for evidence of heart problems.
You have to remember that by its definition a screening test is given to a large number of people, only a small fraction of whom have the condition being tested for. So even if such a test is relatively accurate, there will be a large number of false positive results. In the case of mammography, a false positive (and I hesitate to use the word 'positive' here) is an abnormal result when in fact the patient does not have cancer. There's a useful statistic to quantify this called positive predictive value (PPV). PPV is the fraction of people with a positive result who actually have the condition being tested for. For women aged 40-49, mammography has a PPV of 2-4%. So that means that only a small percentage of women in that age group who have an abnormal mammogram actually have cancer.
The critics of these guidelines have said this is fine since we want to catch as many cancers as early as possible. But you have to take into account that no test is without risk, and that more invasive procedures, such as biopsies, are done when a mammogram is abnormal. The mammogram itself and the subsequent procedures also have a monetary cost.
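To see why a fairly accurate test can still have such a low PPV, it helps to run the arithmetic. The sketch below uses the standard formula (true positives divided by all positives). The prevalence, sensitivity and specificity values are made-up round numbers chosen for illustration; they are not published figures for mammography.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Fraction of positive results that are true positives.

    prevalence:  fraction of the screened population that has the condition
    sensitivity: fraction of people with the condition who test positive
    specificity: fraction of people without the condition who test negative
    """
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Illustrative inputs only (assumptions, not measured values).
ppv = positive_predictive_value(prevalence=0.003,
                                sensitivity=0.80,
                                specificity=0.90)
print(f"PPV = {ppv:.1%}")
```

With these inputs the PPV comes out at roughly 2.4%: the condition is rare, so even a 10% false positive rate among the healthy majority swamps the true positives.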
What the guidelines say is that women under 50 should get mammograms if they want to, or if they or their doctor feels there are other reasons for them to have a high risk of breast cancer. If one has a family history of breast cancer, or a genetic risk factor, then those would be good reasons to have earlier mammograms, in my non-medical opinion.
So this is a case of reasonable scientific conclusions being misinterpreted. However, I do think there needs to be more research done to say how many cases of cancer would be missed if mammograms were only done for women aged 40-49 who had some other risk factor. That would make it more concrete for women about what risk they were running by not having mammograms before age 50.
Monday, December 7, 2009 | Posted by J Ireland at 4:16 PM
The Case for Complexity
So, to you Neil Saunders, mostly I say “Amen, brother! I’m with you.” My physics background also inclines me toward boiling a problem down to its essence – the bare facts upon which complex analyses can be built. As much as I believe this, however, I also believe that ontologies and other meta-data play an essential role in integrative analyses. Let me illustrate…
We recently were working with a client to integrate multiple –omic data sets (gene expression, GWAS, proteomics, etc.). Similar to Neil, our first step was to extract the data from their individual repositories, ontologies, schemas, etc., and boil these very disparate data sets down to their greatest common divisor (GCD). “Greatest common divisor” is an apt term here because in selecting what data was included in our simplified model we looked both for what was “common” to all data types and the analyses that would follow and also what was “greatest” in the sense that wherever possible we wanted to take processed results over raw data. In Neil’s case, his GCD was the feature-probe-value triplet. For us, the GCD was a combination of probe, sample and value. It would take another blog post (or more) to go into this project and design decisions, but for now I wanted to emphasize the similarity with Neil’s approach and perhaps holler another “Amen!”
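To make the probe-sample-value idea concrete, here is a rough sketch of what boiling two differently shaped sources down to one flat model might look like. It is not the project's code: the `Measurement` type, the two loader functions, and the input layouts and key names (`snp`, `cohort`, `neg_log10_p`) are all hypothetical.

```python
from typing import NamedTuple


class Measurement(NamedTuple):
    probe: str    # whatever was measured: an array probe, a SNP, a peptide
    sample: str   # label of the sample or cohort it was measured in
    value: float  # a processed result rather than raw data, where available


def from_expression_matrix(matrix):
    """Flatten a nested dict of probe -> sample -> value."""
    return [
        Measurement(probe, sample, float(value))
        for probe, row in matrix.items()
        for sample, value in row.items()
    ]


def from_gwas_rows(rows):
    """Flatten GWAS-style rows; the key names here are assumptions."""
    return [
        Measurement(row["snp"], row["cohort"], float(row["neg_log10_p"]))
        for row in rows
    ]


expression = {"probe_1": {"sample_x": 7.2, "sample_y": 6.8}}
gwas = [{"snp": "snp_1", "cohort": "sample_x", "neg_log10_p": 5.3}]

unified = from_expression_matrix(expression) + from_gwas_rows(gwas)
for measurement in unified:
    print(measurement)
```

Every record now has the same three fields regardless of where it came from, which is all the simplified model promises.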
But hold on! Before tossing all ontologies onto the scrap pile of failed good ideas (somewhere between the Apple Lisa and Pepsi Clear, I imagine), let’s take a step back. After extracting the multiple, disparate data out of their various schemas and into a single, clean model, what can we do with it? The answer: not much. In our project, although the data was all in the same form, the domains were still very different. How do we relate the value we got from a SNP to one from a transcript? Should data for a sample labeled “breast cancer” be grouped with data from samples labeled “human mammary carcinoma”? The truth is although we had a unified data model, the data was far from being integrated.
So what did finally bring us to an integrated data set? Ontologies, thesauruses, mappings, etc. In our case we used multiple sets of mapping data to take our primary data values from probes to a common feature – a gene (Entrez). We used nomenclatures like SNOMED and MeSH to normalize our samples. Only after leveraging this ontological information could we work with the data in any meaningful way.
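In code, this mapping step amounts to a pair of lookups applied to every record. The sketch below uses tiny hand-written tables with placeholder identifiers; real probe-to-gene mappings and SNOMED or MeSH terms would come from the published resources, and the canonical label "breast carcinoma" is just a stand-in I chose, not an actual term from either vocabulary.

```python
from collections import defaultdict

# Hypothetical lookup tables standing in for real mapping resources.
PROBE_TO_GENE = {
    "probe_1": "GENE_A",
    "snp_1": "GENE_A",
    "probe_2": "GENE_B",
}
SAMPLE_SYNONYMS = {
    "breast cancer": "breast carcinoma",
    "human mammary carcinoma": "breast carcinoma",
}


def normalize_sample(label):
    """Map a free-text sample label onto one canonical term."""
    cleaned = label.strip().lower()
    return SAMPLE_SYNONYMS.get(cleaned, cleaned)


def integrate(measurements):
    """Group (probe, sample, value) records by (gene, canonical sample)."""
    grouped = defaultdict(list)
    for probe, sample, value in measurements:
        gene = PROBE_TO_GENE.get(probe)
        if gene is None:
            continue  # unmapped probes are dropped in this sketch
        grouped[(gene, normalize_sample(sample))].append(value)
    return dict(grouped)


records = [
    ("probe_1", "Breast Cancer", 1.0),
    ("snp_1", "human mammary carcinoma", 2.0),
    ("probe_9", "breast cancer", 3.0),
]
print(integrate(records))
```

Only after this step do the expression value and the SNP value end up in the same bucket, which is the difference between a unified model and integrated data.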
It didn’t stop there. Once we had the data mapped to common feature and sample terminology, we then utilized ontologies like GO to form and test hypotheses on biological function. Our plan going forward is to leverage disease ontologies, pathway information and other meta-data to go beyond simple lists of differential features and go towards true biological understanding. Although coming up with a good ontological model capturing the entities and relationships relevant in biology is hard and complex, it is still a good way to capture our biological knowledge in a form that can be directly applied in analyses.
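One common way to test a hypothesis about biological function is a hypergeometric enrichment test: given a list of interesting genes, is a particular GO term over-represented compared to chance? I'm not claiming this is exactly the test we ran; it's a minimal sketch of the general technique, and the gene counts in the example are invented.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, overlap):
    """One-sided hypergeometric p-value for term enrichment.

    population: number of genes considered overall
    annotated:  number of those genes annotated with the term
    selected:   number of genes in the list of interest
    overlap:    number of selected genes that carry the annotation

    Returns the probability of seeing at least `overlap` annotated genes
    when `selected` genes are drawn at random without replacement.
    """
    total = comb(population, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Invented numbers: 10 genes, 5 annotated, 5 selected, all 5 annotated.
p = enrichment_p_value(population=10, annotated=5, selected=5, overlap=5)
print(p)
```

In practice you would run this across many terms and correct for multiple testing, which this sketch leaves out.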
Finally, lest we forget, although dealing with a different schema/ontology for each type of data is annoying, it is far better than the alternative of having no such schema/ontology in place. This was painfully clear in our project when dealing with the relative uniformity of gene expression data in repositories such as GEO compared to the state of proteomic and even GWAS data.
To sum up, I agree, Neil – you’re on to something. Getting your different data sets into a simple, consistent model is the way to go. We shouldn’t try to build a complex schema/ontology to record all things for all data. However, once you’ve got the data in this simple form and are ready to move on to the analysis, I think you’ll find the ontologies to be indispensable.