Every so often these days, a study comes out proclaiming that AI is better at diagnosing health problems than a human doctor. These studies are enticing because the healthcare system in America is woefully broken and everyone is searching for solutions. AI presents a potential opportunity to make doctors more efficient by handling much of their administrative busywork, freeing them to see more patients and thereby driving down the ultimate cost of care. There is also the possibility that real-time translation would help non-English speakers gain improved access. For tech companies, the opportunity to serve the healthcare industry could be quite lucrative.
In practice, however, it seems that we are not close to replacing doctors with artificial intelligence, or even really augmenting them. The Washington Post spoke with multiple experts, including physicians, to see how early tests of AI are going, and the results were not reassuring.
Here is one excerpt, in which Christopher Sharp, a clinical professor at Stanford Medical, uses GPT-4o to draft a recommendation for a patient who contacted his office:
Sharp picks a patient query at random. It reads: “Ate a tomato and my lips are itchy. Any recommendations?”
The AI, which uses a version of OpenAI’s GPT-4o, drafts a reply: “I’m sorry to hear about your itchy lips. Sounds like you might be having a mild allergic reaction to the tomato.” The AI recommends avoiding tomatoes, using an oral antihistamine — and using a steroid topical cream.
Sharp stares at his screen for a moment. “Clinically, I don’t agree with all the aspects of that answer,” he says.
“Avoiding tomatoes, I would wholly agree with. On the other hand, topical creams like a mild hydrocortisone on the lips would not be something I would recommend,” Sharp says. “Lips are very thin tissue, so we are very careful about using steroid creams.
“I would just take that part away.”
Here is another, from Stanford medical and data science professor Roxana Daneshjou:
She opens her laptop to ChatGPT and types in a test patient question. “Dear doctor, I have been breastfeeding and I think I developed mastitis. My breast has been red and painful.” ChatGPT responds: Use hot packs, perform massages and do extra nursing.
But that’s wrong, says Daneshjou, who is also a dermatologist. In 2022, the Academy of Breastfeeding Medicine recommended the opposite: cold compresses, abstaining from massages and avoiding overstimulation.
The problem with tech optimists pushing AI into fields like healthcare is that medicine is not the same as consumer software. We already know that Microsoft’s Copilot 365 assistant has bugs, but a small mistake in your PowerPoint presentation is not a big deal; mistakes in healthcare can kill people. Daneshjou told the Post she red-teamed ChatGPT alongside 80 others, including both computer scientists and physicians, posing medical questions to the chatbot, and found it offered dangerous responses twenty percent of the time. “Twenty percent problematic responses is not, to me, good enough for actual daily use in the health care system,” she said.
Of course, proponents will say that AI can augment a doctor’s work rather than replace it, and that doctors should always check the outputs. And it is true: the Post interviewed a Stanford physician who said two-thirds of doctors there with access to a platform that records and transcribes patient meetings with AI use it so they can look patients in the eye during the visit instead of looking down to take notes. But even there, OpenAI’s Whisper technology seems to insert completely made-up information into some recordings. Sharp said Whisper erroneously inserted into a transcript that a patient attributed a cough to exposure to their child, which the patient never said. One incredible example of bias from training data that Daneshjou found in testing was an AI transcription tool that assumed a Chinese patient was a computer programmer without the patient ever offering that information.
AI could potentially help the healthcare field, but its outputs have to be thoroughly checked, and then how much time are doctors actually saving? Furthermore, patients have to trust that their doctor is actually checking what the AI produces. Hospital systems will have to put checks in place to make sure this is happening, or else complacency might seep in.
Fundamentally, generative AI is just a word prediction machine: it generates text based on statistical patterns in large amounts of training data without really understanding the underlying concepts it is returning. It is not “intelligent” in the same sense as a real human, and it is especially not able to understand the circumstances unique to each specific individual; it is returning information it has generalized from what it has seen before.
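To make that concrete, here is a toy sketch in Python of what "predicting the next word" from frequency counts looks like. Real language models are vastly more sophisticated, and the tiny "corpus" below is entirely made up for illustration, but the basic principle is the same: the program picks whichever word most often followed the previous one, with no notion of whether the resulting advice is sound.

```python
from collections import Counter, defaultdict

# A made-up, tiny "training corpus" -- purely illustrative.
corpus = (
    "use hot packs for mastitis . "
    "use cold packs for sprains . "
    "use hot packs for sore muscles ."
).split()

# Count which word tends to follow which.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the statistically most common next word -- right or wrong."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

# "hot" follows "use" most often in this toy corpus, so that is the answer,
# regardless of whether heat is clinically appropriate for the patient.
print(predict_next("use"))  # -> hot
```

The point of the sketch is not that ChatGPT works like a word-frequency table; it is that prediction from past patterns, however sophisticated, is not the same thing as clinical judgment.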
“I do think this is one of those promising technologies, but it’s just not there yet,” said Adam Rodman, an internal medicine doctor and AI researcher at Beth Israel Deaconess Medical Center. “I’m worried that we’re just going to further degrade what we do by putting hallucinated ‘AI slop’ into high-stakes patient care.”
Next time you visit your doctor, it might be worth asking if they are using AI in their workflow.