
Data Lake Modeling: Compressing 100 TBs to 5 TBs with Parquet
In the rapidly growing fields of data engineering and AI, efficiency and scalability are critical. As data continues to balloon into the petabyte range, professionals must adopt advanced modeling techniques to ensure data remains usable, compact, and performant. One such approach - data lake modeling using Parquet’s compression capabilities - can drastically reduce storage requirements while enabling efficient query processing.
This article dives deep into the intricacies of data modeling, focusing on techniques such as cumulative table design, leveraging Parquet file compression, and finding the balance between usability and compactness. Whether you're a data engineer, AI specialist, or aspiring professional looking to elevate your skills, this guide will provide actionable insights into achieving optimal data architecture.
The Importance of Data Modeling
Data modeling is foundational to managing and analyzing massive datasets. It enables data engineers to structure raw information into purposeful formats that can serve diverse consumers, such as analysts, machine learning models, and executives. However, the way data is modeled impacts more than usability - it also dictates scalability and performance.
As a case in point, companies like Facebook, Airbnb, and Netflix have adopted sophisticated modeling strategies to handle enormous datasets. Drawing from these real-world experiences, we’ll explore how to model data effectively while reducing storage footprints.
Understanding Dimensions in Data Modeling
A pivotal concept in data modeling is the dimension. Dimensions describe attributes of entities, creating a structure for organizing data. For example:
- Fixed Dimensions: These never change, such as user IDs, birthdays, or sign-up dates. They are simple to model and require minimal processing.
- Slowly Changing Dimensions: These change over time, such as a user’s favorite food or preferences. Modeling these dimensions requires more complexity and has a significant impact on storage and usability.
The key takeaway? Modeling dimensions thoughtfully is not just a technical exercise but also an empathetic one. You must understand how your data consumers - whether analysts, executives, or machine learning models - will interact with the data. Without doing so, even the most compact data designs can become impractical.
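As a toy illustration (the field names here are hypothetical), a single user record might carry both kinds of dimensions: fixed attributes as plain scalar fields, and a slowly changing attribute as an ordered history of dated values. This is only one common way to keep a changing attribute queryable, not the only one.
```python
# Hypothetical user record mixing both kinds of dimensions.
user = {
    # Fixed dimensions: set once, never change.
    "user_id": 42,
    "signup_date": "2021-03-14",
    # Slowly changing dimension: the full history of values, each tagged
    # with the date it took effect, kept as an ordered array.
    "favorite_food": [
        {"value": "pizza", "effective_date": "2021-03-14"},
        {"value": "sushi", "effective_date": "2023-07-01"},
    ],
}

# The current value is simply the last entry in the history.
current_favorite = user["favorite_food"][-1]["value"]
print(current_favorite)  # sushi
```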
Cumulative Table Design: A Game-Changer for Efficiency
One of the standout techniques in data lake modeling is cumulative table design. Instead of storing daily snapshots of data in separate partitions, cumulative tables aggregate all historical information into a single row per entity. This is particularly useful for tracking metrics such as user activity over time.
How It Works
- Raw Snapshots: Start with daily snapshots of your data (e.g., user activity on January 1, January 2, etc.).
- Full Outer Joins: Combine these snapshots using full outer joins to capture changes across days.
- Arrays for Historical Context: Store historical values (e.g., daily activity status) as arrays in a single row. This compresses multiple rows into one, reducing storage requirements.
For example, consider tracking a user’s activity across seven days. In a traditional approach, you would store seven rows - one for each day. Using a cumulative table design, you store one row with an array capturing all seven days’ activity. This format minimizes storage size while enabling quick queries for trends over time.
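The pattern can be sketched in a few lines of PySpark. The table and column names below are hypothetical, and a real pipeline would read yesterday's cumulative partition and today's snapshot from storage rather than building them inline; treat this as a sketch of the join-and-append step, not a production job.
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").appName("cumulative-sketch").getOrCreate()

# Hypothetical inputs: yesterday's cumulative table (one row per user, with
# the activity history as an array) and today's raw snapshot (one row per
# user who showed up today).
cumulative_yesterday = spark.createDataFrame(
    [(1, ["active", "inactive"]), (2, ["active", "active"])],
    ["user_id", "activity_history"],
)
snapshot_today = spark.createDataFrame(
    [(1, "active"), (3, "active")],
    ["user_id", "today_status"],
)

# Today's entry: the snapshot status, or "inactive" if the user is absent today.
todays_entry = F.array(F.coalesce(F.col("today_status"), F.lit("inactive")))

# Full outer join keeps users seen on either side; brand-new users (null
# history) start a fresh array, everyone else appends to their history.
cumulative_today = (
    cumulative_yesterday
    .join(snapshot_today, on="user_id", how="full_outer")
    .select(
        "user_id",
        F.when(F.col("activity_history").isNull(), todays_entry)
         .otherwise(F.concat(F.col("activity_history"), todays_entry))
         .alias("activity_history"),
    )
)

cumulative_today.show(truncate=False)
```
Running this daily turns yesterday's output into today's input, which is exactly what makes the design compact: history accumulates inside the array instead of spreading across partitions.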
Benefits
- Massive Storage Reduction: By eliminating redundant data, cumulative tables can reduce storage needs by more than 50%.
- Faster Queries: Instead of querying multiple partitions, you can retrieve all necessary data from a single row.
- Historical Insights Without Shuffling: Since historical data is bundled into one row, there’s no need for expensive shuffling operations during queries.
Challenges
Although efficient, cumulative table design has drawbacks:
- Sequential Backfills: Backfilling historical data (e.g., eight years of it) must be done one day at a time, because each day's cumulative row depends on the previous day's output. This can be time-consuming for data engineers working on large-scale systems (see the sketch after this list).
- PII Management: Since historical data is carried forward in cumulative tables, ensuring compliance with privacy regulations (e.g., GDPR) requires additional effort to remove inactive or deleted users.
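To make the sequential constraint concrete, here is a hedged sketch of the backfill loop. The helpers read_snapshot, build_cumulative, and write_cumulative are hypothetical stand-ins for real I/O and for the join-and-append logic shown earlier; the point is that each day's output feeds the next day's computation, so the days cannot be fanned out in parallel the way independent daily partitions can.
```python
from datetime import date, timedelta


def backfill_cumulative(start: date, end: date, read_snapshot, build_cumulative, write_cumulative):
    """Backfill a cumulative table one day at a time.

    Each day's cumulative state depends on the previous day's state, so
    the loop is inherently sequential.
    """
    previous = None  # no history exists before the first day
    current_day = start
    while current_day <= end:
        snapshot = read_snapshot(current_day)            # raw daily snapshot
        previous = build_cumulative(previous, snapshot)  # fold it into the running history
        write_cumulative(current_day, previous)          # persist this day's cumulative state
        current_day += timedelta(days=1)
```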
Harnessing Parquet Compression for Data Optimization

When it comes to compact storage, Parquet files and their run-length encoding (RLE) compression shine. Parquet’s ability to reduce redundant data, particularly for repeated values, can lead to substantial file size reductions.
How Parquet’s Compression Works
- Run-Length Encoding: If a column contains repeated values (e.g., user names), Parquet replaces duplicate entries with a single value and a repetition count. For example:
  Original: [John, John, John, Mary, Mary, Alice]
  Compressed: [John (x3), Mary (x2), Alice]
- Sorting for Efficiency: Keeping similar rows together maximizes compression. However, post-processing operations like joins can disrupt this sorting, leading to larger file sizes.
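A rough way to see both effects on disk is to write the same low-cardinality column twice with pyarrow, once with its runs kept together and once shuffled, and compare the resulting Parquet file sizes. The exact savings depend on the data, the encodings pyarrow picks, and the compression codec, so treat the numbers as illustrative only.
```python
import os
import random

import pyarrow as pa
import pyarrow.parquet as pq

# A low-cardinality column with long runs of repeated values:
# 1,000,000 rows drawn from only three user names.
names = ["John"] * 500_000 + ["Mary"] * 300_000 + ["Alice"] * 200_000

shuffled = names[:]
random.shuffle(shuffled)

# Same values, different ordering.
sorted_table = pa.table({"user_name": names})
shuffled_table = pa.table({"user_name": shuffled})

pq.write_table(sorted_table, "sorted.parquet")
pq.write_table(shuffled_table, "shuffled.parquet")

print("sorted:  ", os.path.getsize("sorted.parquet"), "bytes")
print("shuffled:", os.path.getsize("shuffled.parquet"), "bytes")

# The sorted file is typically much smaller: long runs of identical values
# compress well under Parquet's dictionary and run-length encodings, while
# shuffling the rows (as a join can effectively do) destroys those runs.
```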
The Trade-off with Cardinality
A common challenge is temporal cardinality explosions, which occur when adding a time dimension (e.g., daily availability for a year). For example:
- A dataset with 10 million Airbnb listings expands to 3.65 billion rows when modeling daily availability for one year.
- Using arrays instead of one row per day for time-series data keeps the dataset compact while preserving historical context, as shown in the sketch below.
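Here is a minimal pandas sketch of the idea, with hypothetical column names: collapse the exploded one-row-per-listing-per-day table into one row per listing whose daily flags live in an ordered array. At Airbnb scale the same reshaping would be done in a distributed engine, but the shape of the result is the same.
```python
import pandas as pd

# Hypothetical long-format table: one row per listing per day.
long_rows = pd.DataFrame(
    {
        "listing_id": [101, 101, 101, 102, 102, 102],
        "day": ["2024-01-01", "2024-01-02", "2024-01-03"] * 2,
        "available": [True, False, True, True, True, False],
    }
)

# Collapse to one row per listing, with the daily flags as an ordered array.
# 10 million listings x 365 days (3.65 billion rows) becomes 10 million rows,
# each holding a 365-element availability array.
compact = (
    long_rows.sort_values(["listing_id", "day"])
    .groupby("listing_id")["available"]
    .agg(list)
    .reset_index()
    .rename(columns={"available": "availability"})
)
print(compact)
```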
Balancing Compactness and Usability
As data engineers, it’s tempting to prioritize compact data formats to save storage and cloud costs. However, this can compromise usability, particularly for non-technical data consumers such as analysts or business executives.
Compact vs. Usable Tables
- Most Compact Tables: Use blobs, arrays, and custom codecs to minimize data size. These are ideal for online systems where latency is critical, but they require technical expertise to decode.
- Most Usable Tables: Use simple data types like strings, integers, and booleans. These are easy to query but can take up significantly more space.
- Middle Ground: Use structured data types (arrays, maps, structs) to balance compactness with usability. This approach is well suited to master datasets consumed by both engineers and analysts; a schema sketch follows this list.
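One way to picture that middle ground is a Parquet schema that mixes plain scalar columns with nested types. The field names below are hypothetical and exist only to show how arrays, maps, and structs can sit alongside analyst-friendly columns in a single table.
```python
import pyarrow as pa

# Hypothetical "middle ground" schema for a listings master dataset: scalar
# columns stay easy to query, while nested types keep time-series history
# compact in a single row per listing.
listing_schema = pa.schema(
    [
        ("listing_id", pa.int64()),                     # simple, analyst-friendly
        ("host_name", pa.string()),                     # simple, analyst-friendly
        ("availability", pa.list_(pa.bool_())),         # array: one flag per day
        ("nightly_price_by_day", pa.map_(pa.string(), pa.float64())),  # map: date -> price
        (
            "location",
            pa.struct([("latitude", pa.float64()), ("longitude", pa.float64())]),
        ),                                              # struct: related fields grouped
    ]
)
print(listing_schema)
```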
Key Considerations
- Know Your Audience: Analysts prefer easy-to-query tables, while engineers can work with more complex formats.
- Optimize for Downstream Efficiency: If downstream consumers need to manipulate data (e.g., join tables), ensure your design preserves compression and avoids unnecessary shuffling.
Key Takeaways
- Dimensional Data Modeling: Fixed dimensions are simple to model, but slowly changing dimensions require advanced techniques for scalability.
- Cumulative Table Design: Reduces storage and accelerates queries by aggregating historical data into arrays. However, sequential backfills can be time-intensive.
- Parquet Compression: Run-length encoding significantly reduces file sizes, enabling efficient storage of large datasets.
- Trade-offs in Design: Compactness saves storage but can hinder usability, especially for non-technical users like analysts. Consider using structured data types for a balanced approach.
- Know Your Consumers: Tailor data models to your users’ needs, whether they require technical flexibility or simple query interfaces.
- Avoid Temporal Cardinality Explosions: Use arrays or nested data structures to model high-cardinality dimensions, such as daily availability for listings.
- Future-Proof Designs: Maintain compliance with privacy regulations by filtering out inactive or deleted users from cumulative datasets.
Conclusion
Data lake modeling is both a science and an art. By leveraging techniques like cumulative table design and Parquet’s compression, you can create systems that are not only scalable but also efficient and user-friendly. The ultimate goal is to strike a balance between compactness and usability, ensuring that your data serves all consumers - from analysts to AI models - while maintaining performance at scale.
For data engineers and AI specialists, mastering these techniques is not just a technical skill but a career-defining capability. The next time you encounter a 100-terabyte dataset, challenge yourself to compress it down to 5 TB - without losing usability. The right data modeling practices will make it possible.
Source: "Data Lake Modeling: 100 TBs into 5 TBs at Airbnb with Parquet + Run Length Encoding - DataExpert.io" - Data with Zach, YouTube, May 3, 2024 - https://www.youtube.com/watch?v=7JbCVXmJ1bs
