What is the difference between a data lake and a data lakehouse for pharma?

A data lake is a raw storage layer: files in any format, schema-on-read, no built-in query optimization. A data lakehouse adds ACID transactions, schema enforcement, and SQL query capabilities directly on the lake storage, using open table formats such as Delta Lake or Apache Iceberg and a governance layer such as Databricks Unity Catalog. For pharma GMP use, the lakehouse architecture is strongly preferred because ACID transactions support the audit-trail and data integrity requirements of ALCOA+. Raw data lakes without transaction support cannot reliably enforce the 'Attributable, Legible, Contemporaneous, Original, Accurate' guarantees.

Does ALCOA+ apply to data in a pharma data lake?

Yes. EU GMP Chapter 4 (2023 revision) and FDA 21 CFR Part 211 require that all GMP data — including raw instrument data, processed results, and audit trails — meet ALCOA+ principles. For a data lake, this means: raw data is immutable once ingested (Original), every transformation is logged with user, timestamp, and rationale (Attributable, Contemporaneous), all versions of processed data are retained (Complete), and access controls prevent unauthorized modification (Accurate). Full ALCOA+ guidance is in our Data Integrity blueprint.

Should pharma data lakes be on-premise or cloud?

Both are viable with proper GMP qualification. On-premise offers data sovereignty and lower latency for OT-connected data ingestion; cloud (Azure, AWS, GCP) offers scalability, managed services, and lower infrastructure maintenance burden. The hybrid architecture — OT data aggregated on-premise via historian, then replicated to cloud lakehouse for analytics — is the most common 2025–2026 deployment pattern for regulated pharma sites. The cloud provider must be qualified as a GxP vendor (supplier qualification per GMP Chapter 7).

How does AVEVA PI connect to a pharma data lake?

AVEVA PI serves as the OT data aggregation layer — collecting, normalizing, and storing time-series data from SCADA, DCS, PLCs, and field instruments with GMP-compliant audit trails. The PI data lake connector (AVEVA DataHub or PI OLEDB Enterprise) enables periodic or near-real-time replication of PI tag data into the enterprise data lake. This preserves PI's role as the validated, audit-trailed source of truth for OT data while enabling AI/ML workloads on the lake that would be impractical to run directly against the operational PI server.

What data sources should be unified in a pharma data lake?

The five data domains that must be unified for pharma AI use cases are: OT/process data (historian — CPP time-series), MES/batch records (batch identifiers, recipe parameters, in-process checks, EBR records), LIMS/QC results (CQA measurements, specifications, OOS investigations), environmental monitoring (EMS/BMS — temperature, humidity, differential pressure, particle counts), and ERP/supply chain (raw material lot data, batch traceability, release status). Without all five domains linked at the batch level, AI models cannot be built on complete, representative training data.

What is the medallion architecture and why is it suited to pharma?

The medallion architecture organizes a data lake into three layers: Bronze (raw ingested data — immutable, ALCOA+ Original layer), Silver (cleaned, standardized, enriched data — business logic applied, audit-trailed transformations), and Gold (analytics-ready, aggregated datasets for specific use cases like AI model training or management reporting). For pharma, the Bronze layer is the regulatory archive — it is never modified after ingestion, providing the 'Original' and 'Accurate' ALCOA+ guarantee. The Silver layer applies GMP data transformations (unit conversions, outlier flagging) under change control. Gold supports AI teams without touching regulated data.

Does a pharma data lake require GxP validation?

Yes, if it stores or processes GxP data. The validation scope follows GAMP 5: infrastructure components (cloud platform, storage, orchestration) qualify as Category 1 (infrastructure) or Category 3 (non-configured software); lakehouse platform (Databricks, Azure Synapse) qualifies as Category 4 (configurable); custom data pipelines and transformation code qualify as Category 5 (bespoke). A risk-based validation approach allows lightweight qualification for infrastructure while focusing validation effort on the transformation pipelines that affect GxP data quality.

Pharma Data Lake Architecture: GMP-Compliant Design Guide

TL;DR: Every AI project in the N2 cluster (predictive maintenance, computer vision, digital twins, PAT/RTRT) depends on a single enabling infrastructure: a GMP-compliant data lake that unifies OT (historian), MES, LIMS, EMS, and ERP data at the batch level with ALCOA+ guarantees. This guide covers the medallion architecture, OT/IT data integration patterns, AVEVA PI connectivity, the cloud vs. on-premise decision, and the GAMP 5 validation scope for a pharma data lake.

Why Pharma AI Projects Fail at the Data Layer

Post-mortems of failed pharma AI pilots consistently cite the same root cause: insufficient data infrastructure. The pattern is predictable — a data science team builds a promising predictive maintenance or yield optimization model in a sandbox environment using a curated dataset from one production line. The model works. Then the team tries to scale it to three lines or integrate it into the MES workflow, and discovers that data from line 2 uses different tag naming conventions, LIMS results for some batches are missing batch IDs that match the historian tags, and the environmental monitoring data is stored in a standalone BMS system with no API. The project stalls.

The solution is not a better model. It is a unified, GMP-compliant data lake that solves the integration problem once, for all AI projects. The investment in data infrastructure is the highest-leverage investment a pharma site can make for its AI program — not because the lake itself does anything intelligent, but because it removes the data quality and access barrier that blocks every model from production.

Architecture Design: Medallion for Pharma

The medallion architecture (Bronze, Silver, and Gold layers) is a widely adopted pattern for pharma data lakes because its immutable Bronze layer maps directly onto the ALCOA+ "Original" requirement.

Bronze Layer — Raw Ingested Data: Raw data as received from source systems, written once and never modified. This layer is the regulatory archive. Every PI tag value, every MES batch record field, every LIMS result is stored exactly as it arrived, with the ingestion timestamp and source system identifier. No transformations, no enrichment, no filtering. Retention: match the longer of the product retention requirement (typically batch record retention ≥ 1 year post-expiry date, often 5–10 years for regulatory submissions) or the AI model training archive requirement (often 3–5 years of batch history for adequate model calibration). Technology: object storage (Azure Data Lake Storage Gen2, AWS S3, on-premise Ceph) with write-once access controls.
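The write-once behavior described for the Bronze layer can be sketched in a few lines. This is a minimal illustration, assuming an in-memory stand-in for object storage; `BronzeStore`, `ingest`, and the key layout are hypothetical names, not part of any storage product. A real deployment would rely on the storage platform's own immutability controls.

```python
import hashlib
from datetime import datetime, timezone


class BronzeStore:
    """In-memory stand-in for write-once object storage (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, record):
        # Write-once rule: an existing key can never be overwritten.
        if key in self._objects:
            raise PermissionError(f"bronze object {key!r} already exists")
        self._objects[key] = record

    def get(self, key):
        # Return a copy so callers cannot mutate the stored record.
        return dict(self._objects[key])


def ingest(store, source_system, record_id, raw_bytes, now=None):
    """Store the payload byte-for-byte, plus ingestion metadata."""
    ingested_at = now or datetime.now(timezone.utc)
    key = f"{source_system}/{record_id}"
    store.put(key, {
        "source_system": source_system,
        "ingested_at": ingested_at.isoformat(),
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "raw": raw_bytes,
    })
    return key
```

The stored SHA-256 digest lets a later integrity check confirm that the raw payload has not changed since ingestion.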

Silver Layer — Cleaned and Standardized Data: Transformations applied to Bronze data under change control: unit conversion (°F to °C, PSI to bar), tag name normalization across sites, batch ID reconciliation across MES and historian, outlier flagging (values outside physical plausibility bounds). Every transformation is logged with user, timestamp, transformation rule version, and input/output record identifiers. Technology: Apache Spark on Databricks or Azure Synapse for scalable transformation; Delta Lake format for ACID transaction support (critical for audit trail integrity).
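A Silver transformation with its audit entry might look like the sketch below. It assumes Bronze records arrive as dictionaries with `record_id`, `unit`, and `value` fields. The rule version string, the plausibility bounds, and the function name `to_silver` are illustrative assumptions; in production this logic would typically run in Spark against Delta tables.

```python
from datetime import datetime, timezone

RULE_VERSION = "silver-units-1.0"  # hypothetical rule version under change control

# Source unit -> (target unit, conversion function)
CONVERTERS = {
    "degF": ("degC", lambda f: (f - 32.0) * 5.0 / 9.0),
    "psi": ("bar", lambda p: p * 0.0689476),
}

# Hypothetical physical plausibility bounds per target unit
PLAUSIBLE_RANGE = {"degC": (-50.0, 200.0), "bar": (0.0, 50.0)}


def to_silver(bronze_record, user, audit_log, now=None):
    """Convert units, flag outliers, and append one audit entry."""
    unit, value = bronze_record["unit"], bronze_record["value"]
    if unit in CONVERTERS:
        unit, convert = CONVERTERS[unit]
        value = convert(value)
    low, high = PLAUSIBLE_RANGE.get(unit, (float("-inf"), float("inf")))
    silver = {
        "record_id": "silver-" + bronze_record["record_id"],
        "value": value,
        "unit": unit,
        "outlier": not (low <= value <= high),  # flagged, never dropped
    }
    audit_log.append({
        "user": user,
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "rule_version": RULE_VERSION,
        "input_record_id": bronze_record["record_id"],
        "output_record_id": silver["record_id"],
    })
    return silver
```

Outliers are flagged rather than removed, so the Silver record stays traceable to its Bronze input and the decision to exclude a value is left to downstream consumers.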

Gold Layer — Analytics-Ready Datasets: Aggregated, structured datasets designed for specific use cases: AI model training tables (one row per batch, columns = CPP statistics + CQA results), management reporting views (batch success rate by line, CPP trend charts), QC dashboard feeds. Gold layer data is derived from Silver — it can be regenerated if the derivation logic changes. Technology: Delta Lake tables or Azure Synapse dedicated SQL pools; Databricks Unity Catalog for data governance and access control.
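The "one row per batch" training table can be illustrated with a small aggregation. This sketch assumes Silver CPP rows are dictionaries with `batch_id`, `parameter`, and `value`, and that CQA results are keyed by batch ID; the field names and `build_training_table` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_training_table(silver_cpp, silver_cqa):
    """Return one feature row per batch that has both CPP and CQA data."""
    by_batch = defaultdict(lambda: defaultdict(list))
    for row in silver_cpp:
        by_batch[row["batch_id"]][row["parameter"]].append(row["value"])

    table = []
    for batch_id in sorted(by_batch):
        if batch_id not in silver_cqa:
            continue  # no CQA label, so not usable as a training point
        features = {"batch_id": batch_id}
        for parameter, values in sorted(by_batch[batch_id].items()):
            features[f"{parameter}_mean"] = mean(values)
            features[f"{parameter}_min"] = min(values)
            features[f"{parameter}_max"] = max(values)
        features.update(silver_cqa[batch_id])
        table.append(features)
    return table
```

Because the function reads only Silver inputs and keeps no state, the table can be regenerated whenever the derivation logic changes.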

For the data quality principles governing all three layers, see Data Integrity ALCOA+ →.

Data Source Integration Map

OT Data — Process Historian: AVEVA PI is the most common OT data aggregation layer in large pharma (see Data Historian: AVEVA PI vs OSS → for the selection decision). PI stores CPP time-series at tag level. The data lake integration uses AVEVA DataHub (cloud connector) or PI OLEDB Enterprise (on-premise query interface) to replicate PI tag data into the Bronze layer on a scheduled (hourly/daily) or event-triggered (batch completion) basis. Important: the PI server remains the validated, operational source of truth for OT data — the lake receives copies for analytics, not the primary operational record.
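Scheduled replication is commonly implemented as an incremental copy driven by a watermark. The sketch below shows only that pattern. `fetch_since` and `bronze_append` are hypothetical callables standing in for whichever historian interface and Bronze writer a site uses; no actual AVEVA API is shown or implied.

```python
def replicate_increment(fetch_since, bronze_append, watermark):
    """Copy records newer than `watermark` and return the new watermark.

    fetch_since(watermark) -> iterable of dicts with a "timestamp" key
    bronze_append(record)  -> writes one record to the Bronze layer
    """
    new_watermark = watermark
    for record in fetch_since(watermark):
        bronze_append(record)
        if record["timestamp"] > new_watermark:
            new_watermark = record["timestamp"]
    return new_watermark
```

Persisting the returned watermark between runs makes the job safe to repeat: a second run with no new source data copies nothing.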

MES/EBR Data: MES batch records contain the structured batch information that gives OT time-series its context: which product, which recipe, which production order, which operator, which deviations. MES integration uses REST APIs where the platform exposes them (preferred; check the interface options of systems such as Körber PAS-X, formerly Werum PAS-X, or Rockwell PharmaSuite) or database-level integration for older MES (SQL views replicated to the Bronze layer via change data capture). The batch ID must be normalized to match between MES and historian, which is the most common integration challenge in pharma data lake projects.
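Batch ID normalization usually comes down to mapping each system's spelling onto one canonical form. The sketch below assumes a purely hypothetical convention (a `B` prefix, a four-digit year, and a numeric sequence); real sites need a rule derived from their own MES and historian formats.

```python
import re

# Hypothetical convention: "B" + 4-digit year + sequence number
_BATCH_PATTERN = re.compile(r"^B(\d{4})0*(\d+)$")


def normalize_batch_id(raw_id):
    """Map variants like 'B-2025-00123' and 'b2025_123' to 'B2025-00123'."""
    compact = re.sub(r"[^A-Za-z0-9]", "", raw_id).upper()
    match = _BATCH_PATTERN.match(compact)
    if match is None:
        # Surface unrecognized IDs instead of guessing a mapping.
        raise ValueError(f"unrecognized batch ID format: {raw_id!r}")
    year, sequence = match.groups()
    return f"B{year}-{int(sequence):05d}"
```

Raising on unrecognized formats keeps bad IDs out of the Silver layer, so they can be reviewed instead of silently mismatched.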

LIMS/QC Data: CQA measurement results — the "output" data that AI models predict. LIMS integration follows the same REST API or database pattern as MES. Critical requirement: LIMS results must be linkable to the historian CPP data at batch level. A batch without linked LIMS results cannot be used as a training point for CQA prediction models.
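A simple linkage report makes the batch-level requirement measurable. This sketch assumes both systems already use normalized batch IDs; the function name and the output fields are illustrative.

```python
def lims_link_report(historian_batch_ids, lims_batch_ids):
    """Report which historian batches have linked LIMS results."""
    historian = set(historian_batch_ids)
    linked = historian & set(lims_batch_ids)
    completeness = 100.0 * len(linked) / len(historian) if historian else 0.0
    return {
        "linked": sorted(linked),
        "unlinked": sorted(historian - linked),  # unusable for CQA training
        "completeness_pct": completeness,
    }
```

For example, four historian batches with LIMS results for two of them yield a completeness of 50%, and the two unlinked batch IDs are listed for follow-up.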

Environmental Monitoring (EMS/BMS): Temperature, humidity, differential pressure, and particle count data from classified areas. This data is critical for computer vision QC (cleanroom conditions at time of inspection), predictive maintenance (HVAC performance correlation to equipment failure), and for GMP compliance documentation. EMS/BMS integration into the lake enables correlation analysis not possible when environmental data sits in a standalone monitoring system.
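Correlating environmental data with an event such as an inspection typically needs an "as-of" lookup: the most recent reading at or before the event time. A minimal sketch follows, assuming readings are `(timestamp, value)` tuples sorted by time; the five-minute staleness limit is an arbitrary illustrative default.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def reading_as_of(readings, moment, max_age=timedelta(minutes=5)):
    """Return the latest (timestamp, value) at or before `moment`.

    `readings` must be sorted by timestamp. Returns None when no reading
    exists within `max_age` before `moment`.
    """
    times = [t for t, _ in readings]
    index = bisect_right(times, moment)
    if index == 0:
        return None  # no reading at or before the requested time
    timestamp, value = readings[index - 1]
    if moment - timestamp > max_age:
        return None  # latest reading is too old to be representative
    return timestamp, value
```

Returning `None` for stale or missing readings avoids silently pairing an inspection with conditions recorded long before it.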

ERP/Supply Chain: Raw material lot data, batch traceability, release status. ERP integration enables material-to-batch lineage in the lake — essential for investigating raw material lot effects on CQA variation and for supply chain analytics.
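Material-to-batch lineage enables questions such as "does one raw material lot coincide with lower CQA values?". The sketch below assumes consumption records are `(batch_id, material_lot)` pairs and that one numeric CQA value is available per batch; both structures and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def cqa_by_material_lot(consumption, cqa_by_batch):
    """Return the mean CQA value of the batches that consumed each lot."""
    values = defaultdict(list)
    for batch_id, material_lot in consumption:
        if batch_id in cqa_by_batch:  # skip batches without a CQA result
            values[material_lot].append(cqa_by_batch[batch_id])
    return {lot: mean(lot_values) for lot, lot_values in sorted(values.items())}
```

A mean per lot is only a first screening step; a difference between lots would still need a proper statistical investigation.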

Technology Decision: AVEVA PI vs. Open-Source Historian

This is covered in depth in the dedicated Quick-Reference: Data Historian: AVEVA PI vs Open Source →. The summary for data lake architecture design:

AVEVA PI as the OT historian + cloud/on-premise lakehouse for the analytics layer is the most common architecture at large regulated pharma sites. The combination leverages PI's validated GMP compliance and deep OT connectivity while using modern lakehouse tools for the analytics workloads that PI is not designed for (large-scale ML training, cross-functional data joins, data science notebook environments).

Open-source time-series databases (InfluxDB, TimescaleDB) as the historian + open-source lakehouse (Apache Hudi, Delta Lake on MinIO) is viable for greenfield, cost-sensitive, or smaller sites. The validation investment for an open-source stack is equivalent to — often higher than — a commercial platform, because the qualification work must be done entirely by the site's team rather than leveraging vendor-supplied validation packages.

Cloud vs. On-Premise Decision Framework

Factor	Favors On-Premise	Favors Cloud
OT data latency requirements	Real-time OT analytics (<1s)	Batch analytics (hourly/daily acceptable)
Data sovereignty requirements	Strict national/contractual requirements (e.g., Vietnamese data localization rules, US ITAR)	No restrictions
IT maintenance capability	Limited cloud skills on site	Cloud team available
Scale	<50 TB data, <10 AI use cases	>50 TB, multiple sites, >10 use cases
GxP validation preference	Prefer physical infrastructure	Comfortable with cloud vendor qualification
Budget model	CapEx preferred	OpEx preferred

The hybrid model, with on-premise OT aggregation (PI server plus edge historian) and cloud-hosted Bronze/Silver/Gold lake layers, is the most common 2025–2026 architecture for mid-to-large pharma sites. Real-time OT collection and the primary operational record stay on-premise, which addresses latency concerns, and only replicated copies move to the cloud, so data sovereignty requirements can be handled by controlling what is replicated and to which region. Cloud scalability is then used for analytics workloads that are computationally intensive but latency-tolerant.

GxP Validation Scope

A pharma data lake that stores or processes GxP data requires validation under GAMP 5. The scope:

Infrastructure (GAMP Cat 1): Cloud platform, storage infrastructure, and network, qualified through supplier qualification (IQ only, based on vendor SOC 2 / ISO 27001 documentation).

Lakehouse Platform (GAMP Cat 4): Databricks, Azure Synapse, Snowflake — configurable software. IQ/OQ required. PQ against defined data quality KPIs (completeness %, latency, audit trail integrity). Vendor validation package review accelerates this significantly.

Data Pipelines / Transformation Code (GAMP Cat 5): Custom Spark jobs, Python ETL scripts, data quality check logic — bespoke software requiring full URS/FS/DS/IQ/OQ/PQ. This is the highest-effort validation component. Apply risk-based scoping: pipelines that handle GxP-critical data (CPP, CQA, batch records) receive full validation; pipelines handling only management reporting data receive lighter qualification.

AI/ML Models Trained on Lake Data (GAMP Cat 4–5): Models trained using lake data are validated separately per the AI validation framework — see GAMP 5 Validation for AI/ML →.
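The scoping rules above can be captured as a small lookup so that every lake component gets a consistent deliverable list. The Category 1, 4, and 5 deliverables follow the text. The reduced set for reporting-only pipelines is a hypothetical placeholder, since the appropriate "lighter qualification" depends on each site's risk assessment.

```python
# Deliverables per GAMP 5 category, as described in the scope above
DELIVERABLES = {
    1: ["IQ"],
    4: ["IQ", "OQ", "PQ"],
    5: ["URS", "FS", "DS", "IQ", "OQ", "PQ"],
}

GXP_CRITICAL = {"CPP", "CQA", "batch_record"}

# Hypothetical reduced set for Category 5 pipelines with no GxP-critical data
LIGHT_CAT5 = ["URS", "IQ", "OQ"]


def validation_plan(component):
    """Return the deliverable list for one lake component."""
    category = component["gamp_category"]
    handles_gxp = bool(set(component.get("data_types", [])) & GXP_CRITICAL)
    if category == 5 and not handles_gxp:
        return list(LIGHT_CAT5)
    return list(DELIVERABLES[category])
```

For instance, a Spark job that transforms CPP data maps to the full URS/FS/DS/IQ/OQ/PQ set, while a pipeline feeding only management reports maps to the reduced set.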

Vietnam Context

Vietnamese pharmaceutical manufacturers face a specific data lake implementation challenge: legacy OT infrastructure with no historian connectivity means the Bronze layer cannot receive clean, timestamped OT data from day one. The practical approach for Vietnamese sites is a phased implementation. Phase 1 deploys edge-to-historian connectivity (IIoT gateways on priority equipment, feeding a validated AVEVA PI or open-source historian); this is the N3 cluster topic. Phase 2 deploys the Bronze lake layer using historian data as the OT source. Phase 3 integrates MES/LIMS data. Phase 4 activates AI use cases on the unified dataset. The PVCFC energy management case study indicates that Phase 1 (OT connectivity) can be completed in 6–12 months for a brownfield industrial site. A similar timeline is a reasonable planning assumption for pharma OT connectivity, extended as needed for GMP instrumentation qualification. For the full ISA-95 architecture context that defines where the data lake sits in the system hierarchy, see Architecture Overview →.

References

C&F SA — Data Lakes in Pharma: 5 Tips for Success: https://candf.com/our-insights/articles/data-lake-in-the-pharmaceutical-industry-5-things-to-keep-in-mind-to-get-it-right/
Splashlake — Pharma/Biopharma Lab Data Integration: https://www.splashlake.com/industries/pharmaceutical
PharmTech — Hybrid Cloud Architecture in Pharma: https://www.pharmtech.com/view/hybrid-cloud-architecture-in-pharmaceutical-development-and-manufacturing-a-strategic-imperative-for-life-sciences
Intuition Labs — Private LLM Deployment in Pharma: https://intuitionlabs.ai/articles/private-llm-pharma-compliance-architecture
EU GMP Chapter 4 (2023 revision) — data governance: https://health.ec.europa.eu/consultations/stakeholders-consultation-eudralex-volume-4-good-manufacturing-practice-guidelines-chapter-4-annex_en
UMH App — Historians vs Open-Source Databases: https://learn.umh.app/blog/historians-vs-open-source-databases-which-is-better/
AVEVA PI System: https://www.aveva.com/en/products/aveva-pi-system/
ISPE GAMP Guide: Artificial Intelligence (July 2025): https://ispe.org/publications/guidance-documents/gamp-guide-artificial-intelligence
FDA 21 CFR Part 211 — electronic records in manufacturing: https://www.ecfr.gov/current/title-21/chapter-I/subchapter-C/part-211

Implementation checklist

Apply step by step to ensure GMP compliance and stable operation.

TYPE 2 — Expert synthesis based on industry-standard GMP guidelines, regulatory publications and real-world pharmaceutical automation deployments in Vietnam and Southeast Asia. Transparency note: This resource reflects the author's professional experience and publicly available regulatory guidance. Readers should verify specific requirements with their qualified regulatory consultants.