There is a buzz out there in the health care delivery world about the promises of artificial intelligence (AI). There are fears among physicians that they might be replaced by computers. There is excitement and there is fear and there is hype. In my opinion, at this point there is mostly hype. I believe this because, for most of the important tasks we might delegate to AI, we are missing one key element. AI is not something programmed; it is something learned, and in order to learn, a computer needs validated data sets that contain unambiguous right and wrong answers. Therein lies the rub.
The recent article in the New Yorker by Siddhartha Mukherjee (AI - New Yorker AI v. MD) describes studies done at Stanford in which researchers trained computers on images taken from patients diagnosed with melanoma.
Thrun, who had maintained an adjunct position at Stanford, enlisted two students he worked with there, Andre Esteva and Brett Kuprel. Their first task was to create a so-called “teaching set”: a vast trove of images that would be used to teach the machine to recognize a malignancy. Searching online, Esteva and Kuprel found eighteen repositories of skin-lesion images that had been classified by dermatologists. This rogues’ gallery contained nearly a hundred and thirty thousand images—of acne, rashes, insect bites, allergic reactions, and cancers—that dermatologists had categorized into nearly two thousand diseases. Notably, there was a set of two thousand lesions that had also been biopsied and examined by pathologists, and thereby diagnosed with near-certainty...
...Thrun, Esteva, and Kuprel then widened the study to include twenty-five dermatologists, and this time they used a gold-standard “test set” of roughly two thousand biopsy-proven images. In almost every test, the machine was more sensitive than doctors: it was less likely to miss a melanoma. It was also more specific: it was less likely to call something a melanoma when it wasn’t. “In every test, the network outperformed expert dermatologists,” the team concluded, in a report published in Nature.

So should our dermatology brethren be afraid that Watson and its progeny will supplant the mole-spotting workforce in dermatology? Perhaps, but there is a flaw in this work. What does it mean to use "biopsy proven" images? What exactly does a biopsy prove? It may not prove anything, and therein lies the problem. The teaching sets upon which machine learning is based may be validated (or not) by a not-so-shiny gold standard.
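To make those two metrics concrete, here is a minimal Python sketch (the labels are invented for illustration, not taken from the study) showing how sensitivity and specificity are computed against a "biopsy-proven" reference. The point to notice is that both numbers are calculated relative to the reference labels, so they can only be as trustworthy as the reference itself.

```python
def sensitivity_specificity(predictions, reference):
    """Score melanoma calls against a reference standard (here, hypothetical
    'biopsy-proven' labels).

    Sensitivity = fraction of reference-positive lesions the model flags.
    Specificity = fraction of reference-negative lesions the model clears.
    Both are only as good as the reference labels themselves.
    """
    tp = sum(p and r for p, r in zip(predictions, reference))
    fn = sum((not p) and r for p, r in zip(predictions, reference))
    tn = sum((not p) and (not r) for p, r in zip(predictions, reference))
    fp = sum(p and (not r) for p, r in zip(predictions, reference))
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: True means "melanoma" per the model / per the biopsy
preds = [True, True, False, False, True, False]
ref   = [True, False, False, True, True, False]
print(sensitivity_specificity(preds, ref))  # (0.667, 0.667)
```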
In a recent paper published in the British Medical Journal, Elmore et al (BMJ) examined the reproducibility of histologic diagnosis of melanoma. The results are a bit concerning and call into question the gold-standard status of anatomic pathology and its ability to "prove" anything. The best concordance found was about 80%, for lesions believed by experts to be frankly malignant. That means any training set the computer viewed likely had an error rate of at least 20% built in. For the more subtle lesions, the concordance rates hovered around 50% (and some lower). How is that better than a coin flip?
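A small simulation makes the point, under the simplifying assumption that pathologist discordance behaves like random label noise (real discordance is not random): even a hypothetically perfect classifier, scored against labels that are wrong 20% of the time, can appear at best about 80% accurate, and at a 50% label error rate its measured performance is indistinguishable from a coin flip.

```python
import random

random.seed(0)

def apparent_accuracy(n_lesions, label_error_rate):
    """Evaluate a *perfect* classifier against noisy reference labels.

    Each lesion has a true diagnosis (melanoma or benign). The reference
    label disagrees with the truth at `label_error_rate`, mimicking
    pathologist discordance. The classifier always outputs the truth,
    yet it is scored against the noisy reference, so its measured
    accuracy is capped by label quality.
    """
    correct = 0
    for _ in range(n_lesions):
        truth = random.random() < 0.5  # true diagnosis
        noisy_label = truth if random.random() > label_error_rate else not truth
        prediction = truth             # hypothetically perfect classifier
        correct += (prediction == noisy_label)
    return correct / n_lesions

# ~80% concordance for frankly malignant lesions -> ~20% label error
print(apparent_accuracy(100_000, 0.20))  # ≈ 0.80
# ~50% concordance for subtle lesions -> a coin flip
print(apparent_accuracy(100_000, 0.50))  # ≈ 0.50
```

The same noise, of course, also sits inside the teaching set, so the machine is not only measured against a flawed standard, it learns from one.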
Training machines to make diagnoses from flawed teaching sets will still generate AI, but perhaps more artificial ignorance than artificial intelligence.
1 comment:
I was asked to comment on an AI project to detect basal cell carcinoma. It may have been this same program out of Stanford, or at least one from a tech company in Palo Alto, that had been applied to diagnosing basal cell carcinoma, again with higher reliability than the dermatologists it was put up against. It took an image of an actual lesion and compared it against the image database, which through "neural net learning" had become expert at diagnosing this skin cancer, for which pathology is highly reliable, unlike melanoma.

My question was: how long would it take to evaluate several dozen lesions, if not the entire face, using this technique? I can examine the entire face and detect one or more basal cell cancers greater than 3 mm with accuracy exceeding 98% in about one minute, and I probably stretch even that time out to a minute to make it appear I'm being thorough. I never got an answer as to how much time the machine takes to do the same.

The bottom line is that there are other aspects to the diagnostic process than simply a fancy matching program marketed for its "AI" capability. And, as the example with melanoma illustrates, the mantra from the inception of the digital computer holds: garbage in, garbage out, however one wants to dress it up as "neural net learning." And I wish I had been the one to coin the term "artificial ignorance."