Generating hypotheses is a fundamental step in the scientific method, but it is increasingly challenging due to the ever-growing volume of observational data from which hypotheses are derived. Papers are published at an unmanageable pace, growing 8-9% per year; in PubMed alone, two papers are published every minute. At this scale, researchers still lack automated support for hypothesis generation: given the current state of the art in a scientific field, what does the landscape of all plausible hypotheses look like? Which hypotheses will lead to a chain of derived hypotheses that maximizes impact? Which hypothesis paths might raise the most serious ethical concerns? Scientists generally rely on gut feeling and personal preference for such choices: having an AI rank and link thousands or millions of hypotheses in a safe and trustworthy manner remains, today, utopian.
A decade ago, the advent of “big data” fueled the idea that this could be achieved simply by leveraging the increasing amounts of available data and computing power: it would suffice to provide enough of both and let algorithms “discover interesting patterns” in the data that would act as hypotheses. However, such patterns are very common in large datasets, are usually circumstantial, and are affected by dataset bias, and therefore rarely lead to interesting hypotheses. Moreover, blindly relying on circumstantial data patterns can lead to untrustworthy and ethically unsafe hypothesis paths. It has consequently become clear that pattern mining alone is not enough to generate safe and trusted hypotheses.
Description logics are known to be adequate for representing a domain in a principled, controlled and trusted way, and they provide sound, domain-independent reasoning algorithms with good expressivity while preserving efficiency. They are also a good fit for scientific reasoning because they operate under the open-world assumption (e.g. “absence of evidence is not evidence of absence”). This project adds description logic-based reasoning on top of data patterns mined from observational datasets to increase the trust and safety of automatically generated hypotheses. Specifically:
- It applies well-understood knowledge engineering techniques to large corpora to build a Hypothesis Ontology (HO) from their most important concepts (classes) and the relations connecting these concepts (properties), especially those common among trustworthy and ethically safe hypotheses.
- It populates this ontology with instances found by mining data patterns in large datasets, building a Hypothesis Knowledge Graph (HKG).
- It runs a semantic reasoner to derive the whole entailment graph of plausible hypotheses in a controlled, trustworthy and safe way, using the W3C PROV standard to represent plausible hypothesis provenance pathways in the Hypothesis Provenance Graph (HPG).
- It investigates heuristics and algorithms to identify and prune branches of unsafe and untrustworthy hypothesis pathways in the entailed graph.
The HO, HKG, HPG and pruning heuristics will be openly published so that the scientific community can continue to enrich, extend, and use them as building blocks for more scalable, semantically rich, safe, and trustworthy hypothesis generation in the scientific discovery process.