
Data Science for HCPs

Extraction, Transformation, and Loading: The ETL of Data Engineering


A question that data engineers and scientists often get is: “Why can’t we just use the data as is?” It would indeed be a wonderful world if data were to arrive in a standard format, ready to be used by data scientists, consumed by machine learning models, and plotted into eye-catching graphs. However, it is seldom the case that data can be used “out of the box.” In fact, a common misstep that sabotages many data science efforts early on is the failure to dedicate the proper infrastructure, data engineering team, and time allotment for data acquisition, preparation, and storage. The acronym often wielded in conversation by data scientists, “ETL,” refers to the theme of this post: extraction, transformation, and loading of data.

Data Starts With Extraction

 

The retrieval and extraction of medical data remain a laborious process, owing to the lack of integration between organizations and the constraints of limited resources. Health data can be retrieved from a variety of data sources, hardware, and software formats, and it often arrives in non-standard formats with challenging topology and accessibility.

 

To better illustrate these data structure obstacles, let’s take a look at a file we recently received from one of our partnering sites. The comma-separated values (CSV) file contained several columns, including zip code and pharmaceutical company name. Leading zeros in zip codes that started with a zero were dropped when the file was viewed in Microsoft Excel (for example, 08904 became 8904). Commas within pharmaceutical company names were falsely interpreted as delimiters, splitting a single cell’s contents across separate columns. When we attempted to ingest this data file, our process failed due to the easily overlooked issue of comma placement. This is only one of many hidden issues with source data, and such issues can turn into hours or days of root cause analysis.
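As a rough illustration of how such a file might be handled (the file path and column names below are hypothetical, and pandas is only one of many tools that could be used), reading the zip code column explicitly as text and quoting every field on output avoids both problems:

    import csv
    import pandas as pd

    # Read the partner file, forcing zip codes to be parsed as text so that
    # leading zeros (eg, 08904) are not silently dropped, as they would be
    # if the column were interpreted as an integer.
    df = pd.read_csv("partner_extract.csv", dtype={"zip_code": str})

    # When writing the cleaned file back out, quote every field so that
    # commas inside company names (eg, "EMD Serono, Inc.") stay within a
    # single column instead of being treated as delimiters.
    df.to_csv("partner_extract_clean.csv", index=False, quoting=csv.QUOTE_ALL)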

 

Data engineering during data acquisition, while simple in the scenario above, is a critical preliminary step in data pre-processing: it ensures the integrity of the data and thus the accuracy of the results produced from it. This process becomes more involved with increasingly intricate topologies, which is often the case with the multi-institutional biomedical “Big Data” that we work with in biotechnology. Dedicating a strong, appropriately resourced, and highly dynamic team of data engineers to this problem is therefore crucial.

 

Accessing critical health data can also be hindered by institutional usage agreements as well as data volume requirements. Barriers to sharing data among academics, health care institutions, researchers, vendors, and other stakeholders still remain. We will explore these concepts in an upcoming post.

Transformation: What it Takes to Wrangle Data

 

When a team receives messy data, they develop a workflow to wrangle the data into a usable shape. This section will explore the “T” in ETL. Often referred to as data pre-processing, transformation encompasses any steps taken to reformat or restructure the data for downstream analysis. Data wrangling frequently includes, but is not limited to:

 

  • Data cleaning or cleansing: to detect and correct corrupt data entries

     

  • Normalization, standardization, scaling, and calibration: using simple or complex approaches to adjust the scale and distribution of the data so that features are comparable (a minimal scaling sketch follows this list)

     

  • Feature engineering: transforming data into acceptable formats for modeling. This is an incredibly rich subject that goes beyond logarithmic and quadratic transformations; such measures will be explored in an upcoming blog post
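To make the normalization and standardization bullet concrete, here is a minimal sketch of one simple approach, z-score standardization, applied to a hypothetical numeric column; real pipelines may use more sophisticated methods:

    import pandas as pd

    # Hypothetical numeric feature, eg, a lab value on an arbitrary scale.
    df = pd.DataFrame({"lab_value": [4.1, 5.6, 3.9, 7.2, 5.0]})

    # Z-score standardization: center the column at 0 and scale it to unit
    # variance, storing the result in a new column rather than overwriting
    # the original measurements.
    mean = df["lab_value"].mean()
    std = df["lab_value"].std()
    df["lab_value_z"] = (df["lab_value"] - mean) / std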

     

There is an inherent tradeoff between storing additional formats of data and computing such transformations on the fly; some teams may therefore decide to carry out such transformations later in a data engineering pipeline. Complex transformation decisions often depend on a team’s skills and access to engineering resources. In the aforementioned example, the data for each column/field was transformed into a text format, which preserved leading zeros and kept commas in company names like "EMD Serono, Inc." from being treated as delimiters. The solution to this interplay between extraction and transformation is often not as simple. It may involve mathematical transformations, dropping columns or rows that are not meaningful, changing data types, or even creating new data columns derived from existing data (eg, inserting a new field into a CSV containing the output of a model that indicates whether a patient meets the definition of having multiple sclerosis, given their clinical history at a specific time point), as sketched below. Transformations include complex methodologies that will be covered in more detail in future articles.
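As a rough sketch of those operations (the column names, values, and the stand-in model function below are hypothetical and purely illustrative, not our actual pipeline):

    import pandas as pd

    # A toy frame standing in for a cleaned partner extract.
    df = pd.DataFrame({
        "internal_tracking_code": ["A1", "B2"],
        "visit_count": ["3", "5"],  # arrives as text
        "clinical_history": ["relapsing MS, 2 new lesions", "episodic migraine"],
    })

    def hypothetical_ms_model(history: str) -> bool:
        # Placeholder for a real classifier applied to clinical history.
        return "ms" in history.lower()

    df = df.drop(columns=["internal_tracking_code"])   # drop a non-meaningful column
    df["visit_count"] = df["visit_count"].astype(int)  # change a data type
    # Derive a new column containing the model output for each patient.
    df["meets_ms_definition"] = df["clinical_history"].apply(hypothetical_ms_model)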

 

Deriving new data, known as data or feature engineering, is another example of data transformation. A classic case is a data source that contains a patient’s birthdate. Many data science models require a patient’s age to derive the desired clinical knowledge, so the age is calculated from the birthdate; this is new data that was not in the original source but was inferred from it. A patient’s age is protected health information (PHI) that can be obfuscated by deriving an age “bin” (for example, 30-39); data binning is another example of derived data. Once the birthdate has been transformed into a categorical (or continuous) format, it is more readily included as a feature in a statistical model.
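A minimal sketch of that birthdate-to-age-bin derivation, using illustrative column names, bin edges, and labels:

    import pandas as pd

    df = pd.DataFrame({"birthdate": ["1985-06-12", "1991-02-03", "1958-11-30"]})

    # Derive an approximate age in whole years from the birthdate.
    birthdates = pd.to_datetime(df["birthdate"])
    df["age"] = (pd.Timestamp.today() - birthdates).dt.days // 365

    # Bin the derived age into categories so the exact age (PHI) is obfuscated.
    df["age_bin"] = pd.cut(
        df["age"],
        bins=[0, 30, 40, 50, 60, 120],
        labels=["0-29", "30-39", "40-49", "50-59", "60+"],
        right=False,
    )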

 

It may be a good time to remind you that data science is still science. Therefore, all transformations to data should be carefully and quantitatively documented in such a way as to allow for the recovery of the original data. The simplest way to do this is to never delete or overwrite the original copy of your data (whether that is a set of columns in a working table or a separate table/file entirely); this is referred to as the “single version of the truth.” Even if the data require extensive transformation before an analysis can proceed, the ability to revert to the original data is a safety net that every data scientist will, in the fullness of time, be glad to have. Transformed or derived data can live alongside the original data. It is also good practice to store original data in a location that is accessed infrequently, to lower the possibility that it will be overwritten or deleted by accident.
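One lightweight way to follow that practice (the directory layout and file names here are only illustrative) is to treat the raw file as read-only and write every transformed version to a separate location:

    import shutil
    from pathlib import Path

    raw = Path("data/raw/partner_extract.csv")            # original, never edited in place
    prepared = Path("data/prepared/partner_extract_clean.csv")

    # Keep the untouched original in an infrequently accessed "raw" area and
    # write all transformed or derived outputs alongside it, never over it.
    prepared.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(raw, prepared)  # downstream cleaning happens on the copy only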

Loading: Where Does Data Go After Data Engineering?

 

After initial processing and exploration, data acquisition and preparation are formalized into data pipelines, which incorporate the lessons learned during acquisition and clean-up and provide production-ready workflows for continuing to ingest and prepare data from a source for use by a data science effort. Data scientists are provided with documentation of the format, topology, transformations that were employed, and final location of the data for their use. This type of documentation is referred to as a “data dictionary.”
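A data dictionary can be as simple as a structured description of each prepared field; the entry below is a hypothetical example of what might be recorded for the derived age bin discussed earlier:

    # A minimal, hypothetical data dictionary entry for one prepared field.
    data_dictionary = {
        "age_bin": {
            "type": "categorical",
            "allowed_values": ["0-29", "30-39", "40-49", "50-59", "60+"],
            "source": "derived from birthdate in the partner extract",
            "transformations": "age computed from birthdate, then binned to obfuscate PHI",
            "location": "prepared patients table used by the data science team",
        },
    }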

 

Data engineering efforts do not stop there; they continue with whatever support a data science team needs to work effectively with the data and to develop and deliver solutions. Ongoing data engineering efforts may include refactoring the data pipeline in light of new discoveries about the data, designing data queries, transforming the data into other formats (such as graphs) for advanced analytics, building dashboards for the pipeline, creating AI/ML (artificial intelligence/machine learning) workflows that move data and run ML models, monitoring and expediting computational runtime, or any other support that a team requires. Collaboratively, data science and data engineering teams contribute to the data lifecycle, ranging from simple monitoring and updating to MLOps once data has been put into production.

 

Next Post

 

In our upcoming discussion, we will explore data privacy and security, a topic especially important to the practice of health care in an increasingly data-driven age. This article will address best practices for handling medical records and demystify the Health Insurance Portability and Accountability Act, as well as other regulatory considerations.


 

 

About David Hughes

David Hughes is the Principal Machine Learning Data Engineer for Octave Bioscience. He develops cloud-based architectures and solutions for surfacing clinical intelligence from complex medical data. He leverages his interest in graph-based data and population analytics to support data science efforts. David is using his experience leading clinical pathways initiatives in oncology to facilitate stakeholder engagement in the development of pathways in neurodegenerative diseases. With Octave, he is building a data-driven platform for improving patient experience, mitigating cost, and advancing health care delivery for patients and families.

 

About Octave Bioscience

The challenges of MS are significant, the issues are overwhelming, and the needs are mostly unmet. That is why Octave is creating a comprehensive, measurement-driven Care Management Platform for MS. Our team is developing novel measurement tools that feed into structured analytical data models to improve patient management decisions, create better outcomes, and lower costs. We are focused on neurodegenerative diseases, starting with MS.
