
Data Science for HCPs

The Guardrails of Data Science: Regulations and Certifications for Health Care

In this post we explore the guidelines and formal regulations intended to provide guardrails for data science efforts in health care. Guardrails often take the form of generally accepted standards or best practices, as presented by data scientists and acquired through rigorous scientific experiments. Regulations are the formalization of best practices; they set a standard for ensuring that data scientists, like clinical providers, do no harm to patients. In our exploration we will discover that the context under which health care data is used (e.g. performing health care operations vs. conducting scientific research) informs which regulations apply to a data science effort.

HIPAA: Privacy and Patient Trust

HIPAA stands for the Health Insurance Portability and Accountability Act. It was signed into law by President Bill Clinton in 1996, and its enforcement was strengthened by the HITECH Act of 2009. It was created to ensure that personally identifiable information (PII) and protected health information (PHI) collected by health care organizations are not disclosed without the patient’s consent or knowledge.

To search for patterns in health-related data and draw conclusions, data-driven entities need access to patients’ PHI, such as comorbidities, diagnoses, and historical medication records. PII, such as age, address, or demographics, is collected so this information can be included in analysis. The patient’s full name, which helps track patients as they move between insurers and allows longer periods of data to be collected, is a key factor, particularly in chronic conditions like multiple sclerosis (MS).

This information is necessary to perform proper analysis in health-related fields; in fact, many innovations and scientific advances would not have been possible without this structured annotation. For example, the estimation of cancer risk based on genetic tests has made great progress since we started collecting genetic data, enabling us to find similarities and discrepancies between different cancer subtypes and diagnoses.

To ensure compliance with the HIPAA Security Rule, access to PHI and other sensitive information is restricted to authorized personnel only. Typically, received data containing PII and PHI is preprocessed to “de-identify” it before it is made accessible to the personnel who need it for analysis. De-identification consists of matching different sources of data (e.g. MRIs, EMR, biomarkers, claims data, etc.) to the same patient, dropping all PII, and associating with these data sources a random and unique “patient_id” that allows the sources to be matched again if required. A patient’s various data sources are often mapped to the patient using a master patient index solution. Data lineage policies and practices track where data comes from, how it has been processed, who accesses it, and where it moves throughout its journey as part of a data science effort. This level of diligence is important to ensure there is no inappropriate disclosure of PHI, security breach, or violation of policy and procedure that would negatively impact a patient’s privacy.
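As a rough illustration, a de-identification step of this kind might look like the following minimal sketch in Python (the column names, the mpi_key field, and the crosswalk handling are hypothetical; a production system would use a vetted master patient index and keep the re-identification crosswalk in a separately secured store):

    import uuid
    import pandas as pd

    # Hypothetical direct identifiers; a real effort would enumerate
    # all identifiers required by the chosen de-identification method.
    PII_COLUMNS = ["full_name", "address", "date_of_birth", "ssn"]

    def deidentify(records: pd.DataFrame, crosswalk: dict) -> pd.DataFrame:
        """Replace PII with a random, unique patient_id.

        `crosswalk` maps a stable master-patient-index key to the random
        patient_id so that sources can be re-linked later if required.
        It must be stored separately under strict access controls.
        """
        out = records.copy()
        out["patient_id"] = out["mpi_key"].map(
            lambda key: crosswalk.setdefault(key, uuid.uuid4().hex)
        )
        # Drop the direct identifiers before analysts ever see the data.
        return out.drop(columns=PII_COLUMNS + ["mpi_key"])

The same random patient_id is assigned to every record sharing a master patient index key, so MRIs, EMR entries, biomarkers, and claims rows can still be joined without exposing who the patient is.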

The intent of PII protections is to mitigate and reduce risk, not to eliminate it entirely. De-identification and anonymization of data to enable its use in data science efforts is an example of regulatory risk mitigation. Cases such as Breyer v Germany demonstrate how difficult it can be to pin down what counts as PII and to eliminate risk entirely through regulation. This 2016 case, which originated in the German courts and was ultimately decided by the Court of Justice of the European Union, raised the question of whether a dynamically assigned IP (internet protocol) address, assigned whenever a user browses the Internet, should be considered PII. The answer, as in many cases, is that it depends on a complex combination of context and details. Technology continues to evolve with scientific progress, and safeguarding patient privacy will require ongoing diligence.

HIPAA noncompliance carries serious consequences at the state and federal level. Penalties are based on the level of negligence and can range from $100 to $50,000 per violation (or per record), with a maximum penalty of $1.5 million per year for violations of an identical provision. Violations can also carry criminal charges that can result in jail time.

Global Privacy: GDPR

The General Data Protection Regulation (GDPR) is the European Union’s standard for personal data protection. This regulation affects numerous aspects of the protection of each individual's data, including:

  • the types of data and processing allowed,
  • the conditions under which individuals must opt in,
  • the ability to have one's data removed,
  • increased transparency and accountability in data processing, especially regarding the sharing of data with third parties.

Whereas HIPAA’s focus is on protecting a person’s health data, GDPR emphasizes a person's ownership and control of their data. In particular, this includes the requirement that patients specifically opt into data collection, as well as the ability to have their data easily removed from companies to which they have previously given consent. The general takeaway from GDPR is that your system must allow customers to fully remove themselves with the same level of effort it took to onboard them. You must also disclose which third parties are using a customer's data, and for what purposes.
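As a sketch of what “removal with the same level of effort as onboarding” can mean in practice, consider a single erasure entry point that cascades over every store written to at onboarding (the store interface and its delete_records method are hypothetical placeholders, not any particular vendor's API):

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("erasure")

    def erase_customer(customer_id: str, stores: list) -> None:
        """Cascade a GDPR erasure request across every data store.

        Each store is assumed to expose delete_records(customer_id);
        the log makes the erasure itself auditable.
        """
        for store in stores:  # e.g. profile DB, analytics events, exports
            store.delete_records(customer_id)
            log.info("erased %s from %s", customer_id, type(store).__name__)

The design point is symmetry: if onboarding wrote to five systems, erasure must reach all five, which is only tractable if data flows are documented from the start.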

The points in GDPR regarding enhanced accountability, transparency, and consent are, while not without cost, generally things that fall into the category of good scientific practice and good science communication. The staff of a company, university, or hospital should be able to explain to a trial subject or an app user the goals, methodology, and benefits of a study or analysis in a way that is compelling enough to convince a member of the public.

The right to delete one’s data from an organization’s records raises a particularly interesting issue for model development from an auditing perspective. If someone’s data (or several someones’ data) is used to train a model for use in a clinical setting, and those individuals later delete their data from the organization performing that work, then that data is no longer available for the construction and training of future models. The model itself (i.e. the digital object that estimates a probability used to apply some clinical label) will, after training, testing, and validation, be stored for future use. Its creators will, however, no longer be able to train a subsequent model, whether an upgrade or an audit check, using exactly the same data.

This means that model builders need to both understand for themselves and communicate to the world the uncertainties associated with their models. For example, if the first version of a model estimates the probability that some patient has a particular medical disorder at 5.3±1.4%, and an updated model trained on a slightly (or even completely) different data set estimates that probability as 4.9±1.3%, those results are statistically consistent with one another. In fact, many machine learning models have an element of stochasticity in the grouping or ordering of samples as the model arrives at its final configuration; the resulting random changes to model parameters are small, but not zero. The concept of experimental and modeling uncertainty is vitally important and should be part of the way we think about clinical tests, and indeed many other types of statistical analysis (physical measurements, lab tests, political polls, etc.). This, too, can be filed under the heading of good scientific practice that should be adopted more broadly, regardless of any external regulatory burden.
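To make the consistency claim in the example above concrete, here is a quick check using standard error propagation (this assumes the two estimates are independent and roughly Gaussian, which is a simplification):

    import math

    # The two hypothetical model estimates from the text, in percent.
    p1, sigma1 = 5.3, 1.4  # first model version
    p2, sigma2 = 4.9, 1.3  # model retrained on different data

    # Uncertainty of the difference, assuming independent errors.
    sigma_diff = math.sqrt(sigma1**2 + sigma2**2)
    z = abs(p1 - p2) / sigma_diff
    print(f"difference = {abs(p1 - p2):.1f} +/- {sigma_diff:.1f} points, z = {z:.2f}")
    # Prints z of about 0.21, far below the usual ~2 threshold,
    # so the two estimates are statistically consistent.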

Is Cloud Computing Compliant or Safe?

Cloud providers, cloud services, and cloud data centers have taken the existing business model of renting physical space in a data center and abstracted away the physical layer, allowing only digital access to the resources you pay for. The largest vendors, Amazon Web Services, Microsoft Azure, and Google Cloud Platform, do the heavy lifting of obtaining security certifications and being regularly audited so you do not have to, and they provide business associate agreements (BAAs) upon request. The caveat: the fact that the underlying physical infrastructure and networking abide by all the certifications these vendors proudly display on their websites does not mean the workloads you run on them follow those standards. It is the responsibility of the cloud customer to take HIPAA-eligible services and follow the guidelines that make those services compliant.
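As a concrete example of that customer-side responsibility, consider storage on AWS: the platform's certifications cover the physical and network layers, but settings like encryption at rest and public access blocking are the customer's to configure. A minimal sketch with boto3 (the bucket name is a placeholder, and a real HIPAA workload also needs a BAA, access logging, and restrictive IAM policies):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-phi-bucket"  # placeholder name

    # Enforce KMS encryption at rest for all new objects by default.
    s3.put_bucket_encryption(
        Bucket=BUCKET,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
            ]
        },
    )

    # Block every form of public access to the bucket.
    s3.put_public_access_block(
        Bucket=BUCKET,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )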

Principle of Least Privilege (PoLP)

A philosophy proposing that individuals should only have access to the data necessary to conduct their work sounds reasonable, right? We believe PoLP is a foundational concept and a good approach to data governance and the data stewardship that data science teams play a role in. It is important to implement a PoLP guideline at the beginning of any effort involving patient data, since retrofitting it into a project can be difficult or impossible. Generally, data scientists and engineers should be limited to the data sets and individual data elements that enable their effort, without limiting innovation or clinical insights. Implementing PoLP can be technically difficult and requires an organization to establish policies and procedures that are supported by all stakeholders. A strong PoLP policy protects patients while concurrently enabling the development of clinical insights. An often overlooked aspect of PoLP is that it applies not only to the humans in a data science effort but also to the engineering pipelines, machine learning algorithms, data storage systems, and overall workflows that the effort employs.
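As a minimal sketch of the idea (the roles, grants, and data set names here are illustrative; real systems typically enforce this at the database, warehouse, or IAM layer rather than in application code):

    # Map each role, human or pipeline, to only the data it needs.
    GRANTS = {
        "analyst": {"deidentified_claims", "deidentified_mri_features"},
        "etl_pipeline": {"raw_intake", "deidentified_claims"},
        "clinician_dashboard": {"patient_summaries"},
    }

    def check_access(role: str, dataset: str) -> bool:
        """Deny by default; allow only explicitly granted data sets."""
        return dataset in GRANTS.get(role, set())

    assert check_access("analyst", "deidentified_claims")
    assert not check_access("analyst", "raw_intake")  # identified data stays out of reach

Note that pipelines and dashboards get entries of their own: services are subject to the same deny-by-default rule as people.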

Software as a Medical Device (SaMD)

The International Medical Device Regulators Forum (IMDRF) is a voluntary group of medical device regulators that works to harmonize the regulation of medical devices and products, safeguarding the use of medical technology across international jurisdictions. Software as a Medical Device (SaMD) is a framework from the IMDRF SaMD Working Group, chaired by the US FDA, which was first established in 2013. The framework’s intent is to provide guidelines for the safe use of software that performs medical purposes without being part of a hardware medical device, i.e. standalone software that guides or suggests medical care. The evolving nature of data science and software requires that SaMD likewise evolve. Current IMDRF working group activities address Artificial Intelligence Medical Devices (AIMDs), a Medical Device Cybersecurity Guide, and Personalized Medical Devices, which demonstrates the complexity of defining and regulating software that plays a role in the treatment of patients.

The SaMD risk categorization framework has been proposed to establish a common lexicon and approach to determining the level of risk that software in the health care environment poses to patients and the public. There are four risk categories, shown in Table 1 below.

Table 1. IMDRF SaMD risk categories, determined by the significance of the information the SaMD provides to the health care decision and by the state of the health care situation or condition.

State of situation    Treat or diagnose    Drive clinical management    Inform clinical management
Critical              IV                   III                          II
Serious               III                  II                           I
Non-serious           II                   I                            I
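Read as a lookup from (state of the health care situation, significance of the information) to a category, the framework can be encoded directly. This sketch simply restates Table 1; the lowercase keys are informal labels, not regulatory terminology:

    # Encode the IMDRF risk table: (situation, significance) -> category.
    SAMD_CATEGORY = {
        ("critical", "treat_or_diagnose"): "IV",
        ("critical", "drive_management"): "III",
        ("critical", "inform_management"): "II",
        ("serious", "treat_or_diagnose"): "III",
        ("serious", "drive_management"): "II",
        ("serious", "inform_management"): "I",
        ("non-serious", "treat_or_diagnose"): "II",
        ("non-serious", "drive_management"): "I",
        ("non-serious", "inform_management"): "I",
    }

    print(SAMD_CATEGORY[("serious", "drive_management")])  # -> II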

SaMD is evaluated using the Software as a Medical Device: Clinical Evaluation guide, which describes a SaMD application’s performance with regard to analytical and technical accuracy as well as clinical validation. Clinical assessment is complemented by guidelines for software quality and engineering standards in the SaMD: Application of Quality Management System (QMS) document. The SaMD regulatory pathway is captured in Figure 1 below, from the IMDRF documentation.

Figure 1. The SaMD regulatory pathway, from the IMDRF documentation.

In this approach, the FDA would expect a commitment from manufacturers on transparency and real-world performance monitoring for artificial intelligence and machine learning-based software as a medical device, as well as periodic updates to the FDA on what changes were implemented as part of the approved pre-specifications and the algorithm change protocol.

The proposed regulatory framework could enable the FDA and manufacturers to evaluate and monitor a software product from its premarket development to postmarket performance. This potential framework allows for the FDA’s regulatory oversight to embrace the iterative improvement power of artificial intelligence and machine learning-based software as a medical device, while assuring patient safety.

Other Data Science Standards and Frameworks

There are numerous existing standards that directly or indirectly regulate data science efforts. What follows is not a comprehensive list; each effort requires a review of the relevant regulations and consideration of evolving frameworks:

The Tradeoffs of Regulation vs Innovation

Innovation is inherently novel and uncertain, and therefore risky, while regulation implies control of such risks from new, untried products or services. Health care records, unlike credit cards or passwords, are especially sensitive because they cannot be canceled, changed, or reset in the event of a breach. Technological advances have led to increasingly stringent regulations because of the ability to identify patients from their digital footprint. Blinding demographic variables and not sharing patient information across institutions are measures often taken to safeguard privacy, but they limit data scientists’ ability to build precision models or sufficiently power studies. Putting securely engineered systems in place can be time-consuming and expensive.

Regulation is often perceived as hindering innovation, but it can also serve as an opportunity for organizations to be even more innovative while adhering to across-the-board guidelines. As with all data-driven medical solutions, regulatory agencies have been attempting to strike the best balance: allowing promising innovations into the medical marketplace, where they can be field-tested, while providing access to patients willing to accept the risk. Informed consent is a key part of understanding and communicating what terms a user is accepting when their data is used for business and research purposes. Having safeguards in place also ensures that every therapeutic, diagnostic, and device is medically necessary, with proven efficacy and safety.

Regardless of the size of your data-driven organization, it is always a good idea to proactively plan out your project: 1) explore the use cases that require handling PII/PHI, 2) categorize the risk and impact of scenarios, and 3) test the robustness and compliance of the system you have in place (whether manual or automated). Smaller, typically more dynamic organizations in proof-of-concept or beta-testing phases often want to work more nimbly, but poorly built prototypes can make it challenging to retroactively layer on encryption and security. Larger institutions with more patients and stakeholders are typically more risk-averse, and it is important to embed security checkpoints into development cycles rather than reacting to a breach.

As discussed in our last article, the COVID-19 pandemic has led to the scientific method unfolding live, with guidelines and recommendations communicated to the public as more samples are acquired. This public health pressure to innovate has led to unique privacy considerations around contact tracing and free genetic testing, so make sure to understand the terms of service and the level of risk mitigation whether you are a patient, provider, payer, or pharmaceutical company.


Next Article

In our upcoming blog posts we will explore our experience and thoughts about building machine learning models.


About David Hughes

David Hughes is the Principal Machine Learning Data Engineer for Octave Bioscience. He develops cloud-based architectures and solutions for surfacing clinical intelligence from complex medical data. He leverages his interest in graph based data and population analytics to support data science efforts. David is using his experience leading clinical pathways initiatives in oncology to facilitate stakeholder engagement in the development of pathways in neurodegenerative diseases. With Octave, he is building a data driven platform for improving patient experience, mitigating cost, and advancing health care delivery for patients and families.

About Octave Bioscience

The challenges for MS are significant, the issues are overwhelming, and the needs are mostly unmet. That is why Octave is creating a comprehensive, measurement driven Care Management Platform for MS. Our team is developing novel measurement tools that feed into structured analytical data models to improve patient management decisions, create better outcomes and lower costs. We are focused on neurodegenerative diseases starting with MS.

 

