Data Science for HCPs

Data Engineering: The Foundation of Data Science Efforts

David Hughes, BSN; Victor M Gehman, PhD; Michael Becich, MS; Amal Katrib, PhD; David Whewell; Erwan Rivet, MBA, MSME; Kelly Leyden, MRes; Anisha Keshavan, PhD; Fatima Rubio da Costa, PhD

04/09/2020

Data engineering can be broadly defined as the acquisition and transformation of data into a form tailored for a data science effort. It is the process and set of tools used to transform disorganized, unstructured, and often unclean data into something organized, structured, and maintainable.

Data engineering solutions may follow established recipes for retrieving and transforming data, but they often require bespoke pipelines (the start-to-finish life cycle of data) that meet the specific challenges of the data being used and the requirements of the data science approaches being explored. There is no cookie-cutter solution to data engineering; as the saying goes, “If you have seen one data engineering pipeline, you have seen one data engineering pipeline.”

This discussion will walk through the journey of data engineering from source to secure transport, storage, and versioning to its ultimate deployment.

Data Sources: Where Does Data Come From?

A wide range of expert-curated resources should be carefully selected and interrogated to comprehensively explore a clinical question of interest. For example, clinical pathways are constructed by leveraging data-driven knowledge. Sources include, but are not limited to, electronic health records (EHRs, which are more comprehensive than EMRs), payers (claims data), patient-reported outcomes (PROs), peer-reviewed biomedical literature (eg, PubMed/MEDLINE), clinical study findings (eg, ClinicalTrials.gov), and National Drug Code (NDC) directories. With data sources like these, pathways can track adherence, provide decision support, surface clinical intelligence, and ultimately facilitate evidence-based standards of care. Other resources available for biomedical investigations include, but are not limited to, imaging archives (eg, the National Biomedical Imaging Archive), medical devices and mobile apps, wearables (eg, wristbands), and paper-based clinical notes.

The diversity of the multimodal data collected nowadays renders data engineers an integral part of a data science team. These individuals are tasked with providing high-quality data and a dependable infrastructure, enabling the extraction of meaningful predictive and prescriptive insights that funnel into the design of innovative health care solutions.

Data Formats: Structure vs Topology

Many people think of data as it exists in a spreadsheet program like Microsoft Excel, but data can exist in numerous file formats and contain various types and attributes. Data formats describe the structure of data when it is physically stored. When data is stored on a server, in cloud storage, or on a laptop, it is said to be “at rest.” Data is commonly found in structures such as flat files (JSON, CSV, text), database records, documents, data frames, and several others. Advanced structures exist that support specific analytical needs; for example, graphs are rapidly becoming important for population analytics and insights from time series of patient data. The state of graph databases in 2020 is reviewed in Graph Technology Landscape 2020.

The topology, or shape, of data is a description of the properties of the data contained in a structured format. A simple example of data’s topology, keeping with the spreadsheet analogy, is how many rows and columns are in the spreadsheet. Important properties include the volume of data (eg, 1 million rows), data types (eg, string, integer, float, and varchar), and dimensions (eg, 256 rows by 256 columns), among others. Knowing the topology of a data set guides its transportation, preparation, storage, and consumption by end users (eg, data scientists, analysts, clinicians, and businesses).
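To make the idea of topology concrete, here is a minimal sketch that reports the number of rows, the number of columns, and a rough data type for each column of a small CSV file. The sample data, the column names, and the helper names `infer_type` and `describe_topology` are assumptions made up for illustration; real data sets usually need more careful type handling (dates, missing values, and so on).

```python
import csv
import io

# Hypothetical sample standing in for a flat file "at rest".
SAMPLE_CSV = """patient_id,age,weight_kg,site
p001,54,70.5,north
p002,61,82.0,south
p003,47,65.2,north
"""


def infer_type(values):
    """Return the narrowest of integer, float, or string that fits every value."""
    for type_name, caster in (("integer", int), ("float", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


def describe_topology(csv_text):
    """Summarize row count, column count, and a per-column type guess."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {
        "rows": len(rows),
        "columns": len(columns),
        "types": {c: infer_type([r[c] for r in rows]) for c in columns},
    }


topology = describe_topology(SAMPLE_CSV)
print(topology)
```

A summary like this is often the first thing worth computing for a new data set, because it informs how the data should be moved, stored, and consumed.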

In the past, enterprises attempted to shoehorn data of different structures/topologies together into one single tool or solution, such as a database. More modern tools such as Hive, Presto, Impala, and many others allow indexing, discovery, and even joining of data stored in dramatically different formats. This mitigates transformation problems and enables multiple data formats to be used simultaneously – each tailored for a specific purpose.
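As a small-scale illustration of joining data held in different formats, the sketch below combines a CSV table with a JSON document on a shared key. It is not how Hive, Presto, or Impala work internally; it only shows the underlying idea in memory. The sample data, the `patient_id` key, and the function name `join_on_patient_id` are assumptions for illustration.

```python
import csv
import io
import json

# Hypothetical demographics stored as CSV.
DEMOGRAPHICS_CSV = """patient_id,age
p001,54
p002,61
"""

# Hypothetical visit records stored as JSON.
VISITS_JSON = """[
  {"patient_id": "p001", "visit_date": "2020-01-15"},
  {"patient_id": "p001", "visit_date": "2020-03-02"},
  {"patient_id": "p002", "visit_date": "2020-02-20"}
]"""


def join_on_patient_id(csv_text, json_text):
    """Inner-join CSV rows and JSON records that share a patient_id."""
    demographics = {
        row["patient_id"]: row for row in csv.DictReader(io.StringIO(csv_text))
    }
    joined = []
    for visit in json.loads(json_text):
        person = demographics.get(visit["patient_id"])
        if person is not None:  # keep only visits with a matching patient
            joined.append({**person, **visit})
    return joined


joined_rows = join_on_patient_id(DEMOGRAPHICS_CSV, VISITS_JSON)
for row in joined_rows:
    print(row)
```

Neither source had to be converted into the other's format before the join, which is the practical benefit the query tools above provide at much larger scale.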

Data Storage and Transport

Data is stored in a format on a given medium, whether in cloud-based storage (eg, AWS S3, Microsoft Azure Storage, Box, or Google Drive), on a laptop as a flat file, in a database located in a data center, or as part of a patient’s medical record located in an EHR.

Data is seldom used from its source location in data science efforts. Data generally requires transportation to another location where it can be cleaned, explored, and anonymized when necessary, and then transported again into a final location before any analysis occurs. This need for transportation is the first indication that data science requires infrastructure in order to perform its basic functions. Transportation may involve negotiating access permissions, licensing, building infrastructures, establishing or following standards, scheduling, and setting up certifications, data usage agreements, and validations. Each data source requires a specific tool, skillset, process, and infrastructure to transport.
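The clean-then-anonymize step described above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the field names, the `SECRET_KEY` placeholder, and the functions `clean`, `pseudonymize`, and `transport` are hypothetical, and replacing an identifier with a keyed hash is pseudonymization, which on its own is unlikely to satisfy the requirements for de-identifying PHI.

```python
import hashlib
import hmac

# Hypothetical key; a real project would load this from a managed secret store.
SECRET_KEY = b"replace-with-a-managed-secret"


def clean(record):
    """Strip stray whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def pseudonymize(record, id_field="patient_id", drop_fields=("name",)):
    """Drop direct identifiers and replace the ID with a keyed hash."""
    out = {k: v for k, v in record.items() if k not in drop_fields}
    digest = hmac.new(
        SECRET_KEY, record[id_field].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    out[id_field] = digest[:16]  # shortened for readability in this sketch
    return out


def transport(source_records):
    """Clean and pseudonymize records before they reach the analysis location."""
    return [pseudonymize(clean(r)) for r in source_records]


raw = [
    {"patient_id": " p001 ", "name": "Jane Doe", "site": "north "},
    {"patient_id": "p002", "name": "John Roe", "site": "south"},
    {"patient_id": "p001", "name": "Jane Doe", "site": "north"},
]
staged = transport(raw)
for row in staged:
    print(row)
```

Because the same source identifier always maps to the same pseudonym, records for one patient can still be linked after transport without exposing the original ID.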

You likely have noticed one theme present in this post: data engineering requires unique solutions for each data set. Data engineering is just the beginning of the recognized challenges in undertaking a data science initiative.

Secure Data: Is Your Data Safe?

Data safety comes in part from adhering to regulations and following best practices. Data security can be complex and is easy to implement improperly. Ensuring secure data at rest and in transit is a special skill that often requires a security engineer when data science projects evolve past using small local data sets that do not contain Protected Health Information (PHI), financial, or strategically sensitive data. It is best to demonstrate an overabundance of caution and security, lest your data science project turn into a data breach with significant financial, legal, and business implications. Data security may be covered in a future post in this series. Some general guidelines for data security, though not comprehensively covering the topic, include:

Best Practices for PHI
Cloud Adoption Best Practices
Security Standards (Amazon AWS, but generally applicable)
Best Practices for Avoiding Code Injection of Web Apps
Principle of Least Privilege Philosophy
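As one small example of the code injection guidance listed above, the sketch below uses a parameterized query so that user-supplied text is treated as data rather than as part of the SQL statement. It uses Python's built-in `sqlite3` module with an in-memory database; the table, the column names, and the function `find_by_site` are assumptions for illustration, and this covers only one narrow aspect of securing an application.

```python
import sqlite3

# In-memory database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (patient_id TEXT, site TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [("p001", "north"), ("p002", "south")],
)


def find_by_site(connection, site):
    """Look up patient IDs by site using a bound parameter, not string formatting."""
    cursor = connection.execute(
        "SELECT patient_id FROM patients WHERE site = ?", (site,)
    )
    return [row[0] for row in cursor.fetchall()]


safe_result = find_by_site(conn, "north")

# A classic injection string is matched literally and finds no rows.
injection_attempt = find_by_site(conn, "north' OR '1'='1")

print(safe_result)
print(injection_attempt)
```

Had the query been built by concatenating the input into the SQL text, the second call could have returned every row in the table.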

Data Versioning: The Past, Present, and Future of a Data Set

Just like a text document or slide presentation, data sets change over time as they are transported, transformed, and used in data science experiments. To keep a record of every state of a data set, versions are kept starting with the initial source, continuing through each transformation, and ending with the final version. An extension of this concept is that each experimental analysis and interpretation of results should reference the versions of data used to create them. Data versioning can help support audits and explain how data science results were achieved in the event of an internal or external challenge of the results.
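One simple way to picture data versioning is to fingerprint each state of a data set with a content hash and record which version it was derived from. The sketch below is a toy illustration, not a substitute for a dedicated versioning tool; the `VersionLog` class, the `content_hash` function, the step labels, and the sample rows are all hypothetical.

```python
import hashlib
import json


def content_hash(rows):
    """Fingerprint a data set by hashing a canonical JSON serialization."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class VersionLog:
    """Keep an ordered lineage of data set versions."""

    def __init__(self):
        self.entries = []

    def record(self, rows, step):
        """Store the hash of this state, linked to the previous version."""
        parent = self.entries[-1]["version"] if self.entries else None
        entry = {"version": content_hash(rows), "parent": parent, "step": step}
        self.entries.append(entry)
        return entry["version"]


log = VersionLog()

source = [{"patient_id": "p001", "age": "54"}, {"patient_id": "p002", "age": "61"}]
v_source = log.record(source, step="initial source")

# A transformation: convert age from text to an integer.
transformed = [{**row, "age": int(row["age"])} for row in source]
v_transformed = log.record(transformed, step="cast age to integer")

# An experiment references the exact version of the data it used.
experiment = {"name": "hypothetical-analysis", "data_version": v_transformed}

for entry in log.entries:
    print(entry["step"], entry["version"][:12], entry["parent"])
```

With a log like this, anyone reviewing a result can trace the data behind it back through each transformation to the original source.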

Next Blog Post Topic

We have reviewed a few of the topics in data engineering in this blog post. In the next article, we will provide more detail about how data is acquired, transformed, and prepared for data science efforts, as well as made available to data science teams.

About David Hughes

David Hughes is the Principal Machine Learning Data Engineer for Octave Bioscience. He develops cloud-based architectures and solutions for surfacing clinical intelligence from complex medical data. He leverages his interest in graph-based data and population analytics to support data science efforts. David is using his experience leading clinical pathways initiatives in oncology to facilitate stakeholder engagement in the development of pathways in neurodegenerative diseases. With Octave, he is building a data-driven platform for improving patient experience, mitigating cost, and advancing health care delivery for patients and families.

About Octave Bioscience

The challenges for MS are significant, the issues are overwhelming, and the needs are mostly unmet. That is why Octave is creating a comprehensive, measurement-driven Care Management Platform for MS. Our team is developing novel measurement tools that feed into structured analytical data models to improve patient management decisions, create better outcomes, and lower costs. We are focused on neurodegenerative diseases, starting with MS.
