Cancer Research Empowered by the Cloud and Machine Learning

Finding novel ways to treat and eradicate cancer, the second leading cause of death in the United States, continues to be the core focus for the National Institutes of Health (NIH) National Cancer Institute (NCI). In 2016, NCI issued a challenge to the global cancer research community to accelerate efforts to eradicate cancer. In response, cancer researchers recommended that NCI establish a central data ecosystem to provide a repository for cancer data that would allow sharing of data generated by research for all cancer types.

NCI’s Cancer Research Data Commons (CRDC) was born of that request and was created as a place for data discovery, patient participation, and disease surveillance in the interest of rapidly developing novel treatments and therapies.

CRDC’s petabytes of data and billions upon billions of individual data points include everything from genomics, proteomics, imaging, and cancer models to clinical trials, cohort studies, and patient demographics. Its data volume and its numbers of users and contributing investigators continue to grow. Today, CRDC empowers researchers with state-of-the-art visualization, analysis, and interoperability tools in a flexible, cloud-based computational environment.

GDIT works with NCI’s CRDC to provide a flexible cloud environment enabling indexing of massive data sets and large-scale processing for applications such as artificial intelligence and machine learning.

types of cancer can be studied by researchers
research program datasets
to calculate more than 6 billion correlations in 3 hours
research publications cited CRDC in 2022

In collaboration with NCI, GDIT streamlined the research pipeline for cancer researchers using the Google BigQuery enterprise data warehouse. Because data from thousands of NCI molecular-level files has been consolidated and transformed into Google BigQuery tables, researchers can rapidly access and analyze that data alongside their own. BigQuery is a cloud-based data warehousing and business intelligence solution that enables users to analyze large datasets with machine learning tools.
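To make "analyze the data alongside their own" concrete, here is a minimal sketch of the kind of SQL join involved. It runs against an in-memory SQLite database as a local stand-in, not against BigQuery, and every table name, column name, and value in it (`shared_expression`, `my_cohort`, the sample IDs) is a hypothetical placeholder rather than part of the CRDC schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding two tiny, made-up tables."""
    conn = sqlite3.connect(":memory:")
    # Stand-in for a shared, consolidated molecular table (hypothetical schema).
    conn.execute(
        "CREATE TABLE shared_expression (sample_id TEXT, gene TEXT, expression REAL)"
    )
    # Stand-in for a researcher's own annotations (hypothetical schema).
    conn.execute("CREATE TABLE my_cohort (sample_id TEXT, cohort TEXT)")
    conn.executemany(
        "INSERT INTO shared_expression VALUES (?, ?, ?)",
        [
            ("S1", "TP53", 2.0),
            ("S2", "TP53", 4.0),
            ("S3", "TP53", 6.0),
            ("S4", "TP53", 10.0),
            ("S1", "KRAS", 1.0),
        ],
    )
    conn.executemany(
        "INSERT INTO my_cohort VALUES (?, ?)",
        [("S1", "control"), ("S2", "control"), ("S3", "treated"), ("S4", "treated")],
    )
    return conn


def mean_expression_by_cohort(conn, gene):
    """Join the shared table with the researcher's table; average per cohort."""
    query = """
        SELECT c.cohort, AVG(e.expression) AS mean_expression
        FROM shared_expression AS e
        JOIN my_cohort AS c ON c.sample_id = e.sample_id
        WHERE e.gene = ?
        GROUP BY c.cohort
        ORDER BY c.cohort
    """
    return conn.execute(query, (gene,)).fetchall()


if __name__ == "__main__":
    for cohort, mean_value in mean_expression_by_cohort(build_demo_db(), "TP53"):
        print(cohort, mean_value)
```

The point of the sketch is the shape of the query: once shared data sits in tables, a researcher's own sample list becomes one more table to join against, rather than a reason to download and parse thousands of files.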

The cloud-scale resources also enable researchers to leverage computationally intensive tools such as AI and ML to analyze vast amounts of data, uncovering complex patterns and relationships that may lead to breakthroughs in cancer diagnostics, treatment, and personalized medicine.

As one example, our team created the capacity to take MRI scans, join them with genetic data, and then use artificial intelligence and machine learning models to identify correlations between individual gene mutations and tumor volume or shape.
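As a rough illustration of that join-then-correlate idea, the sketch below matches per-patient mutation flags to imaging-derived tumor volumes and computes a plain Pearson correlation between them. This is not GDIT's pipeline or model: the function names, patient IDs, and values are hypothetical, and a simple correlation coefficient is swapped in where the real work uses AI and ML models.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def mutation_volume_correlation(mutations, volumes):
    """Join two per-patient records on patient ID, then correlate them.

    mutations: dict of patient ID -> bool (hypothetical: mutation present?)
    volumes:   dict of patient ID -> float (hypothetical: tumor volume from MRI)
    Patients missing from either record are dropped, like an inner join.
    """
    shared_ids = sorted(set(mutations) & set(volumes))
    flags = [1.0 if mutations[pid] else 0.0 for pid in shared_ids]
    vols = [float(volumes[pid]) for pid in shared_ids]
    return pearson(flags, vols)


if __name__ == "__main__":
    # Made-up toy data; P5 has no imaging record and is excluded by the join.
    demo_mutations = {"P1": True, "P2": True, "P3": False, "P4": False, "P5": True}
    demo_volumes = {"P1": 31.0, "P2": 28.5, "P3": 12.0, "P4": 9.5}
    print(mutation_volume_correlation(demo_mutations, demo_volumes))
```

Correlating a yes/no mutation flag with a continuous measurement this way is a point-biserial correlation. Repeating it across many genes and many imaging features is what turns a single calculation into the billions of correlations that call for cloud-scale resources.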

This is possible with the unique blend of talent and expertise from GDIT’s doctors, biochemists, mathematicians, data scientists and engineers, who together develop the technical solutions needed to meet the CRDC’s incredibly important mission.

“Our team is annotating data, making it available to everyone in the world, securing it in the cloud, and applying machine learning algorithms – all in an effort to expand what we know about cancers and to accelerate the development of treatments,” GDIT Bioinformatics Director and Program Principal Investigator Dr. David Pot said.
