Azure Data Lake is a Microsoft service built to simplify big data storage and analytics. It stores vast amounts of data in its original format for processing and analysis, which makes data management simpler for developers, data scientists, and analysts. Azure Data Lake integrates with other Azure services and addresses the productivity and scalability challenges organizations face with big data.
A familiar open-source example of this kind of platform is Apache Hadoop.
A data lake is a large, centralized repository for storing vast amounts of raw data in its original format for later use by data engineers. A wide range of structured, semi-structured, and unstructured data can be stored in its native form for processing and in-depth analysis. Data lakes provide effectively unlimited storage, with no restrictions on file size or on how the data is accessed (programmatic access, SQL-like queries, and REST calls are all supported). They also support metadata extraction, indexing, formatting and conversion, segregation, augmentation, aggregation, and cross-linking.
In April 2015, Microsoft announced the Azure Data Lake service for enterprise customers. With Data Lake, Microsoft moved beyond a basic storage platform to a fully realized platform for distributed analytics, including cluster support for HDInsight.
Built on YARN and HDFS, Azure Data Lake is a large central storage repository based on Apache Hadoop. It is an alternative to enterprise data silos and holds massive amounts of data in its original format. Data Lake in Azure can store and analyze large volumes of varied data at varying speeds. It is agnostic about the source and purpose of the data; it simply provides a common repository for deep analytics.
1. Azure Data Lake Store:
Data Lake Store is a hyper-scale repository for big data analytics workloads. It lets users store data of any size and format, such as social media content, relational database exports, and logs. It provides unlimited storage for structured and unstructured data without restrictions: an individual file can be petabytes in size, and there is no retention policy. In effect, it acts as a Hadoop Distributed File System (HDFS) for the cloud.
Service Integration for Data Lake Store
Microsoft is planning to introduce integration with Microsoft's Revolution R Enterprise, the Hortonworks, Cloudera, and MapR distributions, and Hadoop projects such as Spark, Storm, and HBase.
Data Lake Store supports POSIX-style permissions exposed through WebHDFS-compatible REST APIs. The WebHDFS protocol supports all HDFS operations, such as reading, writing, accessing block locations, and configuring replication factors. In addition, WebHDFS can use the full bandwidth of the Hadoop cluster for streaming data.
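To make the WebHDFS-compatible surface concrete, here is a minimal Python sketch that lists a folder in a Data Lake Store (Gen1) account over REST. The account name, path, and bearer token are placeholders; in practice the token is obtained from Azure Active Directory first.

```python
import requests

# Hypothetical Data Lake Store (Gen1) account, folder, and AAD token --
# replace all three with real values.
ACCOUNT = "mystore"
PATH = "Samples/Data"
ACCESS_TOKEN = "<aad-access-token>"

# LISTSTATUS is a standard WebHDFS operation; Data Lake Store exposes it
# through its WebHDFS-compatible REST endpoint.
url = f"https://{ACCOUNT}.azuredatalakestore.net/webhdfs/v1/{PATH}?op=LISTSTATUS"
response = requests.get(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
response.raise_for_status()

# The response mirrors the WebHDFS FileStatuses structure.
for entry in response.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"], entry["length"])
```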
A new file system, the Azure Data Lake Filesystem (adl://), was introduced for accessing the repository directly. Applications and systems that can use the new file system gain additional flexibility and performance over WebHDFS. Systems that are not compatible with it can continue to use the WebHDFS-compatible APIs.
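For comparison, the short PySpark sketch below reads a file through the adl:// scheme instead of WebHDFS. It assumes a Spark environment (for example, an HDInsight cluster) that already has the Data Lake Store connector and credentials configured; the store name and file path are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes the cluster (e.g. HDInsight) was attached to the Data Lake Store
# account when it was created, so adl:// paths resolve without extra config.
spark = SparkSession.builder.appName("adl-read-example").getOrCreate()

# Read a tab-separated sample file directly through the adl:// filesystem.
searchlog = spark.read.csv(
    "adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv",
    sep="\t",
    inferSchema=True,
)
searchlog.show(10)
```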
2. Azure Data Lake Analytics:
Azure Data Lake Analytics is Microsoft's data lake analytics offering. It is an in-depth analytics service that lets users write business logic for data processing. Its most important feature is the ability to process unstructured data by applying schema-on-read logic, which imposes a structure on the data as it is retrieved from its source. The data source can be Data Lake Store or Azure Storage.
It supports the U-SQL language, which allows users to run custom logic and user-defined functions and provides fine-grained control and scalability over jobs. Data Lake Analytics executes a U-SQL job as a batch script, with data retrieved in rowset format. If the source data is in files, U-SQL schematizes the data on extraction.
3. Azure HDInsight:
HDInsight is a fully managed Hadoop cluster service that supports a wide range of analytics engines, including Spark, Storm, and HBase. It is designed to take advantage of Data Lake Store to maximize security, scalability, and throughput, and it supports managed clusters on both Linux and Windows.
U-SQL
U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale.
U-SQL can process unstructured data by applying schema on read and inserting custom logic. Each query produces a rowset, and a rowset can be assigned to a variable.
The EXTRACT keyword reads data from a file and defines the schema on read. The OUTPUT statement writes data from a rowset to a file. Both statements use Azure Data Lake file paths.
Example: adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv
Example Script:
@searchlog =
    // Column list follows the standard SearchLog.tsv sample schema.
    EXTRACT UserId int, Start DateTime, Region string, Query string,
            Duration int?, Urls string, ClickedUrls string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();
OUTPUT @searchlog
TO "/output/SearchLog-first-u-sql.csv"
USING Outputters.Csv();
This script reads the source file SearchLog.tsv, schematizes it on extraction, and writes the rowset back out to a file called SearchLog-first-u-sql.csv.
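One way to submit a script like this programmatically is through the Data Lake Analytics job client in the (older) Azure Python SDK, roughly as sketched below. The service principal details, account name, and script file name are placeholders, and the exact calls depend on the SDK version, so treat this as an outline rather than a drop-in sample.

```python
import uuid

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datalake.analytics.job import DataLakeAnalyticsJobManagementClient
from azure.mgmt.datalake.analytics.job.models import JobInformation, USqlJobProperties

# Placeholder AAD service principal; it needs access to the ADLA account.
credentials = ServicePrincipalCredentials(
    client_id="<app-id>", secret="<app-secret>", tenant="<tenant-id>"
)

# Placeholder Data Lake Analytics account name.
adla_account = "myadlaaccount"
job_client = DataLakeAnalyticsJobManagementClient(
    credentials, "azuredatalakeanalytics.net"
)

# Load the U-SQL script shown above and submit it as a batch job.
with open("SearchLog-first-u-sql.usql") as f:
    script = f.read()

job_id = str(uuid.uuid4())
job_client.job.create(
    adla_account,
    job_id,
    JobInformation(name="SearchLog sample", type="USql",
                   properties=USqlJobProperties(script=script)),
)
print("Submitted U-SQL job", job_id)
```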
Azure Data Lake is built on top of Apache Hadoop and uses Apache YARN for resource management. It is Microsoft's implementation of the HDFS file system in the cloud. Azure Data Lake is a completely cloud-based solution that does not require any hardware or servers on the user's end, and it can be scaled according to need.
- The Azure Storage API and the Hadoop Distributed File System are compatible with Data Lake.
- Data Lake integrates with Azure Active Directory for security and authentication.
- Data Lake is designed for very low latency and near real-time analytics, such as web analytics, IoT analytics, and sensor data processing.
- Data can be gathered from sources such as social media, website and app logs, and devices and sensors, and stored in near-original format.
| | Data Warehouse | Data Lake |
| --- | --- | --- |
| Data | Structured and processed | Structured, semi-structured, and unstructured |
| Processing | Schema on write | Schema on read |
| Storage | Expensive | Low cost |
| Agility | Less agile, fixed configuration | Highly agile, fully configurable |
| Security | Mature | Mature |
| Users | Business professionals | Data scientists |
Data Lake uses Azure Active Directory to authenticate users and enforce policies. Authorization and access control are handled separately in Data Lake, using the following settings:

- Network isolation: firewalls and a trusted IP address range can be defined so that only those clients can access Data Lake.
- Data protection: the Transport Layer Security (TLS) protocol secures data over the network.
- Auditing: diagnostic and audit logs can be viewed in the Azure portal.
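As a small illustration of AAD-backed access from code, the azure-datalake-store Python package can exchange service principal credentials for a token and then operate on the store. The tenant, application, secret, and store name below are placeholders, and the service principal is assumed to already have permissions on the account.

```python
from azure.datalake.store import core, lib

# Placeholder service principal; it must already be granted access to the
# Data Lake Store account through AAD and the folder/file ACLs.
token = lib.auth(
    tenant_id="<tenant-id>",
    client_id="<app-id>",
    client_secret="<app-secret>",
)

# Connect to a hypothetical account named 'mystore' and do simple operations.
adl = core.AzureDLFileSystem(token, store_name="mystore")
print(adl.ls("/Samples/Data"))                            # list a folder
adl.put("SearchLog.tsv", "/Samples/Data/SearchLog.tsv")   # upload a local file
```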
Built on Azure Blob storage, Azure Data Lake Storage Gen2 offers capabilities such as file system semantics, directory- and file-level security, low-cost tiered storage, high availability/disaster recovery, and scalability. Its feature set combines the best of Azure Blob storage and Azure Data Lake Storage Gen1.
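To show how Gen2's file system semantics look in code, the sketch below uses the azure-storage-file-datalake Python package, which models file systems, directories, and files as first-class objects. The account URL, key, and paths are placeholders for an account with the hierarchical namespace enabled.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder ADLS Gen2 account (hierarchical namespace enabled) and key.
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>",
)

# Create a file system (container) and a nested directory inside it.
fs = service.create_file_system("raw")
fs.create_directory("logs/2020")

# Upload a local file to a path inside the directory.
file_client = fs.get_file_client("logs/2020/searchlog.tsv")
with open("SearchLog.tsv", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```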
Data Lake Store pricing:
Pay-as-you-go
| Usage | Price/Month |
| --- | --- |
| First 100 TB | Rs. 2.58 per GB |
| Next 100 TB to 1,000 TB | Rs. 2.52 per GB |
| Next 1,000 TB to 5,000 TB | Rs. 2.45 per GB |
| Over 5,000 TB | Custom (contact Microsoft) |
Monthly commitment packages
| Committed Capacity | Price/Month | Savings over pay-as-you-go |
| --- | --- | --- |
| 1 TB | Rs. 2,313.37 | 12% |
| 10 TB | Rs. 21,150.80 | 19% |
| 100 TB | Rs. 1,91,679.13 | 27% |
| 500 TB | Rs. 8,79,080.13 | 31% |
| 1,000 TB | Rs. 17,18,502.50 | 33% |
| Over 1,000 TB | Custom (contact Microsoft) | |
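As a quick sanity check on the commitment discounts (assuming 1 TB = 1,024 GB), 1 TB at the pay-as-you-go rate of Rs. 2.58 per GB works out to roughly Rs. 2,642 per month, so the Rs. 2,313.37 committed price corresponds to the listed savings of about 12%.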
Transaction pricing:

| Usage | Price |
| --- | --- |
| Write operations (per 10,000) | Rs. 3.31 |
| Read operations (per 10,000) | Rs. 0.27 |
| Delete operations | Free |
| Transaction size limit | No limit |
Data Lake Analytics pricing:
Pay-as-you-go
| Usage | Price |
| --- | --- |
| Analytics Unit | Rs. 132.20/hour |
Monthly Committed Price
| Included Analytics Unit Hours | Price/Month | Savings over pay-as-you-go |
| --- | --- | --- |
| 100 | Rs. 6,610 | 50% |
| 500 | Rs. 29,744 | 55% |
| 1,000 | Rs. 52,877 | 60% |
| 5,000 | Rs. 2,37,947 | 64% |
| 10,000 | Rs. 4,29,626 | 67% |
| 50,000 | Rs. 19,16,792 | 71% |
| 1,00,000 | Rs. 34,37,005 | 74% |
| > 1,00,000 | Custom (contact Microsoft) | |
The Microsoft Azure Data Lake architecture helps data scientists, engineers, and analysts solve much of their big data dilemma. This scalable cloud data lake offers a single storage structure for multiple analytics projects of different sizes.