Published Jan 17, 2026 ⦁ 5 min read
Complete Guide: Notebook-to-Pipeline Data Engineering

Data engineering is rapidly evolving, and professionals in the field need to adapt to modern tools and methodologies to stay ahead. In a tutorial presented at PyData Boston 2025, Gilberto Hernandez, a seasoned developer advocate, walked through the entire process of building a data pipeline - from ingesting raw data to delivering polished data products - using Python and modern cloud technologies like Snowflake. This article breaks down the key concepts, technical approaches, and insights shared during the session. Whether you're an aspiring data engineer or a mid-career tech professional looking to level up, this guide provides actionable steps to bridge the gap between theory and real-world application.

Introduction: Why Data Engineering Matters

Data engineering is the backbone of any data-driven organization. It involves creating pipelines that automate the transfer, transformation, and delivery of data, enabling meaningful analysis. As modern cloud platforms revolutionize how data is processed and shared, data engineers must embrace new paradigms such as automation, incremental updates, and integration with AI models.

This article explains a practical approach to building an end-to-end data pipeline. The focus is on ingestion, transformation, and delivery (ITD) - a framework that simplifies understanding and implementing data pipelines. By the end, you’ll gain a clear roadmap for creating efficient, scalable pipelines while leveraging cutting-edge technologies like dynamic tables and semantic models to support advanced workflows.

The ITD Framework: Simplifying Data Pipelines

Before diving into the hands-on portion, Hernandez outlined a foundational framework for understanding data pipelines:

  1. Ingestion (I): Loading raw, unprocessed data into a data platform.
  2. Transformation (T): Converting raw data into meaningful formats using tools like SQL or Python.
  3. Delivery (D): Delivering analyzed data to end-users or systems, often as a self-service data product.

This ITD framework helps data engineers conceptualize pipelines at a high level, regardless of platform-specific implementations. For example, whether you use Snowflake, Databricks, or another cloud platform, the principles of ingestion, transformation, and delivery remain constant.

Building a Data Pipeline: The Step-by-Step Process

The live tutorial showcased a practical example where participants built a pipeline for a fictional company, Tasty Bites. This company runs hundreds of food trucks globally and relies on data engineers to create systems that allow analysts to query sales data efficiently. Below is a detailed breakdown of the steps:

1. Setting Up the Environment

Tools and Requirements:

Hernandez offered two options for setting up the environment:

  • Option A: Use a local development setup with Jupyter Notebooks.
  • Option B: Use Snowflake's built-in Notebook interface for a streamlined, no-installation-required setup.

Creating the Connection:

The first step involves connecting your development environment to Snowflake, the cloud data platform used in this tutorial. By setting up a session object in Python, you can programmatically access and interact with your Snowflake account.

from snowflake.snowpark import Session

# Credentials for the target Snowflake account
connection_parameters = {
    "account": "<your_account_identifier>",
    "user": "<your_username>",
    "password": "<your_password>"
}
session = Session.builder.configs(connection_parameters).create()
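
Once the session is created, you can verify the connection with a quick query. This sanity check isn't part of the original walkthrough, just a convenient way to confirm the credentials work:

# Confirm the session is connected and see which account, user, and role are active
print(session.sql("SELECT CURRENT_ACCOUNT(), CURRENT_USER(), CURRENT_ROLE()").collect())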

2. Defining the Raw Data Architecture

Creating the Database and Schema:

Data objects were organized into a database (tasty_bites_db) with two schemas:

  • Raw Schema: To hold the unprocessed data pulled from CSV files.
  • Analytics Schema: For dynamic tables and downstream data products.
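
Creating these objects first requires a handle to the account and the target database. The snippet below is a minimal sketch using the Snowflake Python API (snowflake.core); the database name tasty_bites_db comes from the tutorial, but the exact calls are an assumption about how the objects were created:

from snowflake.core import Root
from snowflake.core.database import Database
from snowflake.core.schema import Schema

# Entry point to the Snowflake Python API, reusing the Snowpark session from earlier
root = Root(session)

# Create the database that will hold both schemas
database = root.databases.create(Database(name="tasty_bites_db"))

With the database in place, the two schemas can be created:
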
# Create the raw and analytics schemas inside tasty_bites_db
raw_schema = database.schemas.create(Schema(name="raw"))
analytics_schema = database.schemas.create(Schema(name="analytics"))

Defining Raw Tables:

The pipeline's first tier involves creating three tables to hold raw data:

  1. Order Header: Contains basic details about each order.
  2. Order Detail: Expands on order header with item-level granularity.
  3. Menu: Lists the products sold by Tasty Bites.

Each table was defined programmatically in Python, specifying column names and their data types.

from snowflake.core.table import Table, TableColumn

# Define the order_header table in the raw schema, specifying columns and types
order_header_table = raw_schema.tables.create(
    Table(
        name="order_header",
        columns=[
            TableColumn(name="order_id", datatype="int"),
            TableColumn(name="order_date", datatype="timestamp_ntz"),
            # Other columns...
        ],
    )
)
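
The order_detail and menu tables follow the same pattern. The sketch below shows one way to define them in a loop; the column names and types here are illustrative rather than taken from the source:

# Illustrative column definitions for the remaining raw tables
raw_table_specs = {
    "order_detail": [
        TableColumn(name="order_detail_id", datatype="int"),
        TableColumn(name="order_id", datatype="int"),
        TableColumn(name="menu_item_id", datatype="int"),
        TableColumn(name="quantity", datatype="int"),
    ],
    "menu": [
        TableColumn(name="menu_item_id", datatype="int"),
        TableColumn(name="menu_item_name", datatype="string"),
        TableColumn(name="sale_price_usd", datatype="number"),
    ],
}

# Create each table in the raw schema
for table_name, columns in raw_table_specs.items():
    raw_schema.tables.create(Table(name=table_name, columns=columns))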

3. Loading Data into Tables

Once the raw tables were created, the next step was ingesting data from external cloud storage (AWS S3). Using Snowflake's COPY INTO command, about a billion rows of data (split into multiple CSV files) were ingested into the raw tables.

copy_command = """
COPY INTO raw.order_header
FROM @my_stage/order_header/
FILE_FORMAT = (TYPE = 'CSV')
"""
# Execute the load; collect() runs the statement on Snowflake
session.sql(copy_command).collect()

The COPY INTO command supports high-speed ingestion and incremental loading, making it ideal for large datasets.
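
Since all three raw tables are loaded the same way, the ingestion step can be written as a short loop. This sketch assumes the stage folders mirror the table names, which may differ from the exact layout used in the tutorial:

# Load each raw table from its corresponding folder on the stage
for table_name in ["order_header", "order_detail", "menu"]:
    session.sql(f"""
        COPY INTO raw.{table_name}
        FROM @my_stage/{table_name}/
        FILE_FORMAT = (TYPE = 'CSV')
    """).collect()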

4. Transforming Data with Dynamic Tables

Understanding Dynamic Tables:

Dynamic tables are declarative objects in Snowflake that automatically process incremental changes from upstream tables. Unlike traditional ETL workflows that require manual orchestration, dynamic tables refresh themselves based on a defined target lag (for example, at most every 5 minutes or every 12 hours).

Building Tiered Transformations:

  • Tier 1 Tables: Add enriched dimensions to raw data.
    • For example, the orders_enriched table combines temporal information with raw order data.
  • Tier 2 Tables: Create fact tables that consolidate insights like revenue calculations.
    • Example: The order_fact table aggregates data for analysis.
  • Tier 3 Tables: Generate the final tables containing daily business metrics and product performance metrics.

# Declaratively create a Tier 1 dynamic table that refreshes at most every 12 hours.
# Dynamic tables need a target lag and a warehouse; the warehouse name here is illustrative.
session.sql("""
    CREATE OR REPLACE DYNAMIC TABLE analytics.orders_enriched
        TARGET_LAG = '12 hours'
        WAREHOUSE = tasty_bites_wh
    AS
    SELECT
        order_id,
        DATE(order_date) AS order_date
        -- Additional transformations...
    FROM raw.order_header
""").collect()
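
Higher tiers are built the same way, with each dynamic table selecting from the one below it so that changes propagate automatically through the chain. The sketch below shows a simple daily-metrics table on top of orders_enriched; the table and column names are illustrative rather than the exact ones from the tutorial, which also folds in order detail and menu data for revenue:

# A downstream dynamic table that aggregates the enriched orders by day.
# Because it reads from another dynamic table, refreshes cascade automatically.
session.sql("""
    CREATE OR REPLACE DYNAMIC TABLE analytics.daily_order_metrics
        TARGET_LAG = '12 hours'
        WAREHOUSE = tasty_bites_wh
    AS
    SELECT
        order_date,
        COUNT(order_id) AS total_orders
    FROM analytics.orders_enriched
    GROUP BY order_date
""").collect()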

5. Delivering Data Products

The final stage involves delivering analyzed data to end-users. In this case, the pipeline produced two key outputs:

  1. Daily Business Metrics: High-level metrics like total sales.
  2. Product Performance Metrics: Insights into top-performing products.

These tables were delivered to analysts and integrated into an AI-powered agent.
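
For analysts working directly in SQL, these outputs behave like any other table. The sketch below shows the kind of query the product-performance table is meant to answer; the table and column names here are illustrative, not taken from the source:

# Example of a self-service query against the delivered data product
top_products = session.sql("""
    SELECT menu_item_name, SUM(revenue) AS total_revenue
    FROM analytics.product_performance_metrics
    GROUP BY menu_item_name
    ORDER BY total_revenue DESC
    LIMIT 10
""").collect()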

Enhancing with AI: Building Semantic Models and Agents

To enable self-service analytics, the pipeline included a semantic model. This model provided a structured description of the data, allowing an AI agent to understand the schema and field relationships. The agent could then respond to natural language queries, such as:

"What are the top 10 products by revenue?"

How It Works:

  1. Semantic Model: A YAML-based representation of the tables and their structure.
  2. AI Agent: Built on Snowflake Intelligence, this agent leverages the semantic model to dynamically query data and present insights visually.

The result is a user-friendly, conversational interface where non-technical stakeholders can extract actionable insights without writing SQL.

Key Takeaways

  • Understand the ITD Framework: Data pipelines involve three key components - ingestion, transformation, and delivery.
  • Leverage Dynamic Tables: Automate transformations and incremental updates using declarative dynamic tables.
  • Choose the Right Tools: Snowflake offers powerful options for building pipelines with Python, SQL, and cloud-native features.
  • Focus on Data Products: Deliver polished, self-service data outputs that empower analysts.
  • Incorporate AI: Semantic models and intelligent agents enable natural language querying and improve data accessibility.

Conclusion

Building modern data pipelines requires a combination of technical expertise, strategic design, and the right tools. By following the ITD framework and leveraging dynamic tables, you can create pipelines that are scalable, efficient, and adaptable to changing requirements. The integration of semantic models and AI agents further enhances the utility of your data products, making them accessible to a broader audience. Whether you're new to data engineering or looking to deepen your skills, this approach provides a solid foundation for success in today's data-driven world.

Source: "Gilberto Hernandez - Notebook to Pipeline: Hands-On Data Engineering w Python - PyData Boston 2025" - PyData, YouTube, Dec 15, 2025 - https://www.youtube.com/watch?v=Rj4_atYG3MY