Researchers developed a new way to identify how diseases might cause or influence one another by analyzing scientific literature and validating the results using real-world patient data. They searched through PubMed articles for phrases suggesting that one disease leads to another, then standardized those disease names using ICD-10-CM medical codes to keep the data consistent.
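The literature-mining step can be illustrated with a minimal sketch. The cue phrases, the tiny name-to-code lookup, and the function name `extract_pairs` are all assumptions made for illustration; the article does not list the actual phrases or matching rules the researchers used.

```python
# Hypothetical cue phrases that suggest one disease leads to another.
CAUSAL_CUES = ("leads to", "results in", "causes")

# Toy lookup from disease names to ICD-10-CM category codes (illustrative subset).
ICD10 = {
    "hypertension": "I10",
    "heart failure": "I50",
    "type 2 diabetes": "E11",
    "chronic kidney disease": "N18",
}


def extract_pairs(sentence):
    """Return (cause_code, effect_code) pairs suggested by one sentence."""
    text = sentence.lower()
    pairs = []
    for cue in CAUSAL_CUES:
        idx = text.find(cue)
        if idx == -1:
            continue
        # Diseases named before the cue are treated as causes, after it as effects.
        left, right = text[:idx], text[idx + len(cue):]
        causes = [name for name in ICD10 if name in left]
        effects = [name for name in ICD10 if name in right]
        pairs.extend((ICD10[c], ICD10[e]) for c in causes for e in effects)
    return pairs


print(extract_pairs("Chronic hypertension often leads to heart failure in older adults."))
```

Mapping names to codes at extraction time means that differently worded mentions of the same disease end up as the same node later on.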
To test whether these suggested relationships were credible, the team combined five validation measures. They looked at how strongly the two diseases were statistically associated in the UK Biobank dataset, whether the timing of diagnoses followed the expected pattern (with the “cause” usually diagnosed before the “effect”), and how frequently the relationship appeared in the literature. They also tested how dependent the diseases were on each other and asked GPT-4, a large language model, to assess the plausibility of each connection. All five measures were then combined into a single confidence score for each relationship.
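As a rough sketch of how five measures might be combined with equal weight, the example below averages them. The measure names and the assumption that each one has already been scaled to the range 0 to 1 are hypothetical; the article does not describe the actual scaling or formula.

```python
from statistics import mean

# Hypothetical names for the five validation measures described above.
MEASURES = (
    "association",
    "temporal_order",
    "literature_frequency",
    "dependency",
    "llm_plausibility",
)


def confidence_score(scores):
    """Equal-weight average of five measures, each assumed to lie in [0, 1]."""
    missing = [m for m in MEASURES if m not in scores]
    if missing:
        raise ValueError(f"missing measures: {missing}")
    for m in MEASURES:
        if not 0.0 <= scores[m] <= 1.0:
            raise ValueError(f"{m} must be between 0 and 1")
    return mean(scores[m] for m in MEASURES)


example = {
    "association": 1.0,
    "temporal_order": 1.0,
    "literature_frequency": 0.0,
    "dependency": 0.0,
    "llm_plausibility": 0.5,
}
print(confidence_score(example))
```

An equal-weight average is the simplest choice, but it lets a strong result on one measure offset a weak result on another.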
Using these scores, the researchers built a large network of disease-to-disease connections, taking care to avoid circular chains in which a disease would end up as an indirect cause of itself. The final network included 1,860 diseases connected by 7,589 directional links that represent likely cause-and-effect relationships.
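One possible way to keep such a network free of circular chains is sketched below: links are added from highest to lowest confidence, and any link that would close a loop is skipped. This greedy strategy and the function names are assumptions for illustration; the article does not say how the researchers handled cycles.

```python
from collections import defaultdict


def reachable(graph, start, target):
    """Depth-first search: is target reachable from start along existing links?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph[node])
    return False


def build_acyclic_network(scored_edges):
    """Keep (cause, effect, score) links, best first, skipping any that form a cycle."""
    graph = defaultdict(set)
    kept = []
    for cause, effect, score in sorted(scored_edges, key=lambda e: e[2], reverse=True):
        # Adding cause -> effect creates a cycle if cause is already reachable from effect.
        if cause == effect or reachable(graph, effect, cause):
            continue
        graph[cause].add(effect)
        kept.append((cause, effect, score))
    return kept


edges = [("A", "B", 0.9), ("B", "C", 0.8), ("C", "A", 0.4)]
print(build_acyclic_network(edges))
```

In this toy input, the low-confidence link from C back to A is dropped because A already leads to C through B.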
From more than 16,000 relevant sentences in the literature, the researchers identified 8,191 unique potential causal links between diseases. When compared to randomly selected disease pairs, the identified relationships showed much stronger associations in actual patient data. The diagnosis timing analysis further supported these relationships, with most cause diseases appearing before their effects in patient histories. A manual review by experts found that 84 percent of the sampled links appeared to be accurate.
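The diagnosis-timing check can be pictured with a small sketch that counts how often the suspected cause was diagnosed first among patients who have both conditions. The record layout (a mapping from ICD-10-CM code to first diagnosis date) and the toy patients are invented for illustration and are not UK Biobank data.

```python
from datetime import date


def fraction_cause_first(patients, cause, effect):
    """Share of patients with both diagnoses whose cause was diagnosed first."""
    both = [p for p in patients if cause in p and effect in p]
    if not both:
        return None  # no patient has both diagnoses, so nothing can be said
    cause_first = sum(1 for p in both if p[cause] < p[effect])
    return cause_first / len(both)


# Toy records: each patient maps a disease code to a first diagnosis date.
patients = [
    {"I10": date(2010, 3, 1), "I50": date(2015, 6, 1)},
    {"I10": date(2012, 1, 15), "I50": date(2013, 9, 30)},
    {"I10": date(2018, 5, 20), "I50": date(2016, 2, 10)},
    {"I10": date(2011, 7, 4)},
]
print(fraction_cause_first(patients, "I10", "I50"))
```

A fraction well above one half would be consistent with the proposed direction, although timing alone cannot establish causation.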
The disease network wasn’t just theoretical—it also proved useful in practice. When the researchers applied it to polygenic risk scores (which estimate a person’s genetic risk for certain diseases), it improved predictive accuracy. For example, adding diseases that are commonly caused by coronary heart disease—such as heart failure, myocardial infarction, and angina—boosted prediction accuracy by as much as 22.9 percent. The network also helped reveal genetic variants that influence diseases indirectly, by acting through other conditions, offering a clearer view of complex genetic pathways.
Despite these promising results, the approach has a few important limitations. Because it relies on published literature, the method may be affected by publication bias, where some diseases or relationships are studied and reported more often than others. The text-mining approach could miss some connections due to subtle or complex language, or include incorrect ones when the meaning of a sentence is unclear. The validation process used data mainly from the UK Biobank, which may not represent all populations equally well. The method is also limited to binary outcomes—whether someone has a disease or not—so it doesn’t apply to continuous health traits like blood pressure or cholesterol levels. Lastly, the confidence score gives equal weight to all five validation measures, which might not reflect their actual importance or reliability in every case.
By Impact Lab