In this article, I explain what a data lake is, why we need one, and the process involved in building one.
What is an Amazon Data Lake?
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
Data from different sources is transported into Amazon S3 (where the data lake is created) using various tools.
The data is then crawled to understand what it is, how it can be useful, and what metadata can be created for cataloguing purposes.
The curated data is then ready for use by different applications, departments, or personas.
An Amazon data lake can hold all forms and types of data from any data source in its raw format and is scalable to any extent.
What are Structured and Unstructured Data?
Structured Data - As its name implies, structured data is highly organized and fits neatly within the fixed fields and columns of a relational database. A few examples of structured data are names, phone numbers, zip/pin codes, dates, credit card numbers, locations, etc.
Unstructured Data - Unstructured data does not have a pre-defined data model; in other words, it cannot fit into a relational database. A few examples of unstructured data are free text, video files, audio files, mobile activity, social media posts, satellite images, surveillance images, etc.
What are the data sources?
The data can typically come from multiple sources depending upon the type of business: databases, video streams, social media clicks, audio files, etc.
Why do we need a Data lake?
Data lakes can collect any form of data from anywhere within an enterprise's numerous data sources and silos, from revenue numbers to social media streams and anything in between.
Data lakes reduce the effort needed to analyze or process the same data set for different purposes by different applications.
Data lakes keep the whole operation cost efficient, with the ability to scale up storage and compute capacities as required, and independent of each other.
Data lake creation process
A data lake can be created in three simple stages, as described below. In AWS, the data lake is created as separate S3 buckets for each of the three zones. AWS Glue is used to crawl the data in the S3 buckets and catalogue it accordingly across all buckets (Raw, Staging, and Processed).
Landing Zone: This is the area where all the raw data comes in from all the different sources within the enterprise. This zone is strictly meant for data ingestion, and no modelling or extraction should be done at this stage.
Curation Zone: Here's where you get to play with the data (Staging bucket). The entire extract-transform-load (ETL) process takes place at this stage: the data is crawled to understand what it is and how it might be useful, metadata is created, and different modelling techniques are applied to find potential uses.
Production Zone: This is where your data is ready to be consumed by different applications, or to be accessed by different personas (Processed bucket).
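As a concrete sketch of these conventions, the three zones can be modelled as separate S3 buckets, with landing-zone keys partitioned by source and ingest date. The bucket-naming scheme and key layout here are hypothetical choices for illustration, not an AWS standard:

```python
from datetime import date

# Hypothetical convention: one bucket per zone, named "<prefix>-<zone>".
ZONES = ("landing", "curation", "production")

def zone_buckets(prefix):
    """Return a zone -> bucket-name mapping for the three data lake zones."""
    return {zone: f"{prefix}-{zone}" for zone in ZONES}

def landing_key(source, filename, day=None):
    """Build a landing-zone object key partitioned by source system and date."""
    day = day or date.today()
    return f"{source}/year={day.year}/month={day.month:02d}/day={day.day:02d}/{filename}"

buckets = zone_buckets("acme-datalake")
print(buckets["landing"])                                  # acme-datalake-landing
print(landing_key("crm", "orders.csv", date(2023, 5, 1)))  # crm/year=2023/month=05/day=01/orders.csv
```

Date-style partitioning like this keeps raw data organized per source and makes later crawling and lifecycle rules easier to scope by prefix.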
Amazon Data lake Architecture
Tools that can be used for Data Ingestion
AWS Direct Connect: Establish a dedicated connection between your premises or data centre and the AWS cloud for secure data ingestion. With an industry-standard 802.1Q VLAN, AWS Direct Connect offers a more consistent network connection for transmitting data from your on-premises systems to your data lake.
S3 Transfer Acceleration: Another quick way to speed up data ingestion into an S3 bucket is to use Amazon S3 Transfer Acceleration. With this, your data is transferred to the nearest of the globally spread-out edge locations, and then routed to your S3 bucket via an optimized and secure pathway.
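Once acceleration is enabled on a bucket (for example via boto3's put_bucket_accelerate_configuration), uploads go through a dedicated accelerate endpoint rather than the regular regional one. A small sketch of how that endpoint is derived from the bucket name (the bucket name is hypothetical):

```python
# S3 Transfer Acceleration uses a dedicated per-bucket endpoint.
# The bucket must have acceleration enabled first, e.g. with boto3:
#   s3.put_bucket_accelerate_configuration(
#       Bucket=bucket, AccelerateConfiguration={"Status": "Enabled"})

def accelerate_endpoint(bucket, dualstack=False):
    """Return the Transfer Acceleration endpoint URL for a bucket."""
    host = ("s3-accelerate.dualstack.amazonaws.com"
            if dualstack else "s3-accelerate.amazonaws.com")
    return f"https://{bucket}.{host}"

print(accelerate_endpoint("my-datalake-landing"))
# https://my-datalake-landing.s3-accelerate.amazonaws.com
```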
AWS Snowball: You can securely transfer huge volumes of data onto the AWS cloud with AWS Snowball. It’s designed for large-scale data transport and is one-fifth of the cost of transferring data via high-speed internet. It’s a great option for transferring voluminous data assets like genomics, analytics, image or video repositories.
Amazon Kinesis: Equipped to handle massive amounts of streaming data, Amazon Kinesis can ingest, process and analyze real-time data streams. The entire infrastructure is managed by AWS, so it's highly efficient and cost-effective.
Kinesis Data Streams: Ingest real-time data streams into AWS from different sources and create arbitrary binary data streams that are on multiple availability zones by default.
Kinesis Data Firehose: You can capture, transform, and quickly load data onto Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service with Kinesis Data Firehose. The AWS-managed system auto-scales to match your data throughput, and can batch, process, and encrypt data to minimize storage costs.
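To illustrate the batching side of this, Firehose's PutRecordBatch call accepts at most 500 records per request, so events are commonly serialized as newline-delimited JSON and chunked before sending. A minimal sketch (the helper name is ours; only the record shape and batch limit come from the Firehose API):

```python
import json

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch limit per request

def to_firehose_batches(events, batch_size=MAX_RECORDS_PER_BATCH):
    """Serialize events as newline-delimited JSON records and split them
    into chunks sized for Firehose's PutRecordBatch call."""
    records = [{"Data": (json.dumps(e) + "\n").encode()} for e in events]
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

batches = to_firehose_batches([{"id": i} for i in range(1200)])
print(len(batches))  # 3 batches: 500 + 500 + 200 records
# Each batch could then be sent with:
#   firehose.put_record_batch(DeliveryStreamName=stream, Records=batch)
```

The trailing newline on each record matters in practice: it keeps objects delivered to S3 readable as one JSON document per line.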
Kinesis Data Analytics: One of the easiest ways to analyze streaming data, Kinesis Data Analytics can pick any streaming source, analyze it, and push the results out to another data stream or to Firehose.
Tools that can be used for storage in the data lake
Storage - Amazon S3: One of the most widely used cloud storage solutions, Amazon S3 is perfect for data storage in the landing zone. S3 is a region-level, multi-availability-zone storage option. It's a highly scalable object storage solution offering 99.999999999% durability. Capacity aside, Amazon S3 suits a data lake because it allows you to set a lifecycle for data to move through different storage classes:
Amazon S3 Standard: to store hot data that is being immediately used across different enterprise applications
Amazon S3 Standard-Infrequent Access: to hold warm data that is accessed less frequently across the enterprise but needs to be retrieved rapidly whenever required.
Amazon S3 Glacier: to archive cold data at a very low cost compared to on-premises storage.
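The hot/warm/cold tiering above can be codified as an S3 lifecycle rule. A minimal sketch in the rule shape accepted by put_bucket_lifecycle_configuration; the 30/90-day thresholds and the prefix are arbitrary illustrative choices, not AWS defaults:

```python
def lifecycle_rule(prefix, ia_days=30, glacier_days=90):
    """Build one S3 lifecycle rule that transitions objects under `prefix`
    to Standard-IA after `ia_days` and to Glacier after `glacier_days`."""
    return {
        "ID": f"tier-{prefix or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
        ],
    }

rule = lifecycle_rule("raw/")
# The rule could then be applied with:
#   s3.put_bucket_lifecycle_configuration(
#       Bucket=bucket, LifecycleConfiguration={"Rules": [rule]})
print(rule["Transitions"])
```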
Tools that can be used for data movement
On-Premise data movement
AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Services
Real-time data movement
AWS IoT Core, AWS Kinesis Data Firehose, AWS Kinesis Data Streams, AWS Kinesis Video Streams
Because information in the data lake is in the raw format, it can be queried and utilized for multiple different purposes, by different applications. But to make that possible, usable metadata that reflects technical and business meaning also has to be stored alongside the data. This means you need to have a process to extract metadata, and properly catalogue it.
The metadata contains information on the data format, security classification (sensitive, confidential, etc.), additional tags (source of origin, department, ownership), and more. This allows different applications, and even data scientists running statistical models, to know what is stored in the data lake.
The typical cataloguing process involves Lambda functions written to extract metadata, which are triggered every time an object enters Amazon S3. This metadata is stored in a SQL database and uploaded to Amazon Elasticsearch Service to make it available for search.
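A minimal sketch of such a Lambda-style extractor, working on the standard S3 "ObjectCreated" event shape. The field choices are illustrative; a real handler would enrich this and write the result to the database and search index:

```python
import os

def extract_metadata(event):
    """Pull basic technical metadata out of an S3 ObjectCreated event."""
    entries = []
    for rec in event.get("Records", []):
        key = rec["s3"]["object"]["key"]
        entries.append({
            "bucket": rec["s3"]["bucket"]["name"],
            "key": key,
            "size_bytes": rec["s3"]["object"].get("size"),
            # Infer a rough format from the file extension.
            "format": os.path.splitext(key)[1].lstrip(".").lower() or "unknown",
        })
    return entries

# Example S3 event as delivered to a Lambda handler (abridged).
sample_event = {"Records": [{"s3": {
    "bucket": {"name": "lake-landing"},
    "object": {"key": "crm/orders.CSV", "size": 1024},
}}]}
print(extract_metadata(sample_event))
```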
AWS Glue is an Amazon solution that can manage this data cataloguing process and automate the extract-transform-load (ETL) pipeline. The solution runs on Apache Spark and maintains Hive-compatible metadata stores. Here's how it works:
Define crawlers to scan data coming into S3 and populate the metadata catalogue. You can schedule this scanning to run at a set frequency or to trigger on every event.
Define the ETL pipeline, and AWS Glue will generate the ETL code in Python.
Once the ETL job is set up, AWS Glue manages its execution on a Spark cluster infrastructure, and you are charged only when the job runs.
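The crawler definition in step 1 maps onto Glue's create_crawler call. A sketch of assembling its arguments (the crawler name, IAM role, database, paths, and schedule below are hypothetical):

```python
def crawler_config(name, role_arn, database, s3_paths, schedule=None):
    """Build the keyword arguments for glue_client.create_crawler().
    `schedule` is an optional cron expression, e.g. "cron(0 2 * * ? *)"
    for a nightly run; omit it to run the crawler on demand."""
    cfg = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": p} for p in s3_paths]},
    }
    if schedule:
        cfg["Schedule"] = schedule
    return cfg

cfg = crawler_config(
    "lake-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "lake_db",
    ["s3://lake-raw/", "s3://lake-staging/"],
    schedule="cron(0 2 * * ? *)",
)
# The config could then be applied with: glue.create_crawler(**cfg)
print(cfg["Targets"])
```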
The AWS Glue catalogue lives outside your data processing engines and keeps the metadata decoupled, so different processing engines can simultaneously query the metadata for their individual use cases. You can expose the metadata through an API layer using Amazon API Gateway and route all catalogue queries through it.
Production Zone - Serve Processed Data
With processing complete, the data lake is ready to push data out to all necessary applications and stakeholders: legacy applications, data warehouses, BI applications, and dashboards. It can be accessed by analysts, data scientists, business users, and other automation and engagement platforms.
I hope this article has been useful for understanding AWS data lakes from an overall perspective. More details on data lakes can be found at Amazon.
Please share your comments on this article, and pass it along to your network if you like the content.