A data lake is a centralized repository that lets you store all of your structured and unstructured data at any scale. You can store data as-is, without having to structure it first, and run different types of analytics on it: dashboards and visualizations, big data processing, real-time analytics, and machine learning to guide better decisions.
The amount of data in the world is growing drastically: the global datasphere is estimated to reach 175 zettabytes by 2025, and around 90% of it is unstructured or semi-structured. There are many solutions for storing and processing structured data, but when you need to handle data in any form, whether structured, semi-structured, or unstructured, a data lake comes into the picture.
A data lake keeps data in its native formats and handles the three Vs of big data (volume, velocity, and variety) while providing tools for analyzing, querying, and processing it. Data lakes remove the typical restrictions of a data warehouse by offering virtually unlimited space, unrestricted file sizes, schema-on-read, and multiple ways to access data, including programmatic access, SQL-like queries, and REST calls.
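To make schema-on-read concrete, here is a minimal Python sketch (the field names and records are made up for illustration): records land in the lake as raw JSON strings with no schema enforced at write time, and a schema is applied only when the data is read.

```python
import json

# Raw events land in the lake as-is; no schema is enforced at write time.
raw_events = [
    '{"device_id": 1, "temp": 21.5}',
    '{"device_id": 2, "temp": 19.0, "humidity": 40}',  # extra field is fine
    '{"user": "alice", "action": "login"}',            # different shape entirely
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: keep only the requested fields,
    filling missing ones with None instead of rejecting the record."""
    rows = []
    for line in lines:
        obj = json.loads(line)
        rows.append({f: obj.get(f) for f in fields})
    return rows

# Two different "views" over the same raw data.
sensor_view = read_with_schema(raw_events, ["device_id", "temp", "humidity"])
audit_view = read_with_schema(raw_events, ["user", "action"])
```

Notice that the same raw files serve both views; the schema lives in the query, not in the storage.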
Why You Need a Data Lake
If you are facing any of the issues below, you should definitely consider building an AWS data lake:
- Your business has too many data stores and no single source of truth, making it difficult to fetch data from multiple sources.
- Your data is growing day by day, and you are spending too much on storage.
- The structure of your data varies a lot, for example user audit data, IoT device data, logs, and image galleries.
- Data analytics on big data is slow.
By now it should be clear whether a data lake is what your organisation needs!
Now we have to choose the right data lake architecture!
Data Lake Architecture
Many companies use cloud storage services such as Google Cloud Storage and Amazon S3, or a distributed file system such as Apache Hadoop. For this blog, I'm going to use an AWS data lake as the example, since it fits within the Free Tier plan.
Data Lake Layers
A data lake can have any number of layers; it's not a product or a tool, it's a process. For our use case, we will discuss the following layers.
Let's go through each of them one by one. I know this will be the exciting part, as no one really wants to sit through the theory.
Hands-on AWS Data lake
First, we will ingest our raw data as-is into the data lake. For this use case we will use PostgreSQL table records and ingest all of them into the data lake. Let's create some big data in a Postgres table.
We will create a towns table with columns id, code, article, and name, and use the generate_series function to create 100k rows of random data.
```sql
CREATE TABLE towns (
  id SERIAL UNIQUE NOT NULL,
  code VARCHAR(10) NOT NULL, -- not unique
  article TEXT,
  name TEXT NOT NULL         -- not unique
);

INSERT INTO towns (code, article, name)
SELECT
  left(md5(i::text), 10),
  md5(random()::text),
  md5(random()::text)
FROM generate_series(1, 100000) s(i);
```
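If you don't have a Postgres instance handy, roughly equivalent synthetic rows can be generated in plain Python. This is just a local stand-in mirroring the SQL above, not part of the tutorial; the deterministic `article-{i}` / `name-{i}` seeds replace Postgres's `md5(random()::text)`.

```python
import hashlib

def md5_hex(s: str) -> str:
    """Hex digest of md5, matching Postgres's md5() function."""
    return hashlib.md5(s.encode()).hexdigest()

def generate_towns(n: int):
    """Mirror the SQL: code = first 10 chars of md5(i); article and
    name are md5 digests (seeded deterministically here for testing)."""
    rows = []
    for i in range(1, n + 1):
        rows.append({
            "id": i,
            "code": md5_hex(str(i))[:10],
            "article": md5_hex(f"article-{i}"),  # stand-in for md5(random()::text)
            "name": md5_hex(f"name-{i}"),
        })
    return rows

towns = generate_towns(1000)  # bump to 100_000 to match the tutorial
```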
The data looks something like this:
Now our data is ready, and all we need to do is ingest it into the data lake. First, we need to create a database in the AWS data lake to store all the structured and unstructured data.
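Before uploading, it helps to decide how objects will be laid out in the bucket. A common convention is Hive-style partition paths such as `raw/towns/dt=2024-01-01/part-00000.csv`. The sketch below builds such keys and serializes rows to CSV in memory; the table name, date, and bucket are assumptions for illustration, and the actual upload (e.g. boto3's `put_object`) is left as a comment so the snippet runs locally without AWS credentials.

```python
import csv
import io
from datetime import date

def rows_to_csv(rows, fields):
    """Serialize dict rows to an in-memory CSV string with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def partition_key(table: str, dt: date, part: int) -> str:
    """Hive-style partitioned object key, e.g. raw/towns/dt=2024-01-01/part-00000.csv."""
    return f"raw/{table}/dt={dt.isoformat()}/part-{part:05d}.csv"

# Hypothetical sample rows shaped like the towns table above.
rows = [
    {"id": 1, "code": "c4ca4238a0", "article": "a1", "name": "n1"},
    {"id": 2, "code": "c81e728d9d", "article": "a2", "name": "n2"},
]
key = partition_key("towns", date(2024, 1, 1), 0)
body = rows_to_csv(rows, ["id", "code", "article", "name"])

# With boto3 installed and credentials configured, the upload would be:
# import boto3
# boto3.client("s3").put_object(Bucket="my-datalake-bucket", Key=key, Body=body)
```

Partitioning by date like this lets query engines prune objects by prefix instead of scanning the whole bucket.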
For the complete hands-on walkthrough, please check out the step-by-step tutorial on my blogging website, Progress Story.
You can choose any other data lake provider, like Azure or GCP; that won't make much difference. The only thing that matters is which storage engine you use. In my case I used Amazon S3, because it is a hell of a lot cheaper than the other options out there.
I hope this has helped you, or that it will. If you want to discuss this or anything tech-related, you can contact me here or via the Contact Page. If you are interested in becoming part of Progress Story, please reach out to me or check out the Create Blog page.
See you next time! Peace Out ✌️