One of the most frustrating elements of practicing medicine is how often I am required to admit that I don't know. I am most frustrated when a reasonable and intelligent patient poses a logical question about the natural history of some common disease and I must reply that we simply don't have good information to address it.
Much of the data on frequency of disease, particularly in ambulatory contexts, comes from administrative and billing databases. These databases were not set up to study disease frequency but mostly to track the collection of money. We are fortunate that someone decided to add a few additional data fields for whatever reason. Even so, their usefulness is limited because the diagnostic data and classification schemes they rely on are very flawed.
There are many additional examples of how broad collection of data or samples paid huge dividends years later in realms that could never have been anticipated when the collections were started. Serum and tissue banks historically were initiated for essentially open-ended purposes. When new pandemics such as HIV erupted, serum banks were used to look for where the disease originated. When new models of disease pathophysiology are considered, tissues banked for other purposes allow for rapid testing of hypotheses.
The same applies to data collection. Perhaps no other entity embodies this better than Google. They are data hoarders like no other. They do not make decisions regarding what they save. When the time comes to look at an interesting question, they do not worry about whether they saved the relevant data. They simply save everything. Last year an article was published in Nature describing how, by simply looking at Google cold and flu queries, they could demonstrate where influenza outbreaks were occurring well before the CDC tracking system identified these peaks.
Within the world of clinical research, we have good news and bad news. The good news is that we have data collection and analysis tools that make possible things which were simply not possible before. The bad news is what we are called upon to do before we can deploy them. Because of privacy concerns, we are now asked to define what information we want to collect and why. I would argue that if we have learned anything in the recent past, it is that we cannot anticipate which part of what we collect (whether data or samples) will be of most value, and that the best collection strategy is to collect as much as possible. Key breakthroughs in understanding may happen as much through chance as anything else. Chance favors the prepared, and in the world of data, being prepared means you have collected the data.