Processing billions of records using Azure Data Factory, Data Lake and Databricks

Introduction:

In this blog, we will learn how to process billions of records using Azure Data Factory, Data Lake and Databricks.

Each step mentioned here will be explained in detail in subsequent articles.

Problem Statement:

We have billions of audit records in Azure Table Storage and want to generate a high-level report so that the management team can make important decisions.

We tried different approaches to solve the problem.

Attempt with Approach 1:

Export the Azure Table Storage data to a CSV file and build the report in Excel.
Limitation:
An Excel worksheet holds at most 1,048,576 rows, far fewer than the billions of records we needed to analyze.

Attempt with Approach 2:

Migrate the data to SQL Server and process the billions of records there.
Limitation:
Our SQL Server instance could not handle processing at this scale; even a simple query took 2.5 to 3 hours or longer to execute.

Attempt with Approach 3:

Connect Power BI directly to Azure Table Storage.
Limitation:
Power BI does not support DirectQuery on Azure Table Storage. It tried to load the complete dataset onto the local machine, which is not feasible at this scale.

Finally, we found a solution that processes billions of records using Azure Data Lake, Data Factory and Databricks.

Below is the high-level diagram depicting the approach to solving this critical business requirement.


Let’s walk through the step-by-step process of configuring the various services and enabling communication between the resources to achieve the desired business solution.

Steps:

1. Create a Resource Group.
2. Create an Azure Data Lake account.
3. Create an Azure Data Factory.
4. Transfer the data from Table Storage to Azure Data Lake using Azure Data Factory.
5. Create an Azure AD application for Azure Databricks.
6. Assign the Contributor and Storage Blob Data Contributor roles to the registered Azure AD application at the subscription level.
7. Create an Azure Databricks service.
8. Connect Azure Data Lake to Azure Databricks using a notebook.
9. Connect Power BI to Azure Databricks for better visualization.
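
To make steps 5 to 8 more concrete, here is a minimal sketch of how a Databricks notebook might authenticate to the data lake with the registered Azure AD application. It assumes the Data Lake account is ADLS Gen2 (which the Storage Blob Data Contributor role suggests) and that access goes through the ABFS driver with OAuth client credentials. The helper names, the storage account name "mydatalake", the container "audit" and the bracketed placeholders are all hypothetical; the actual setup will be covered in the later articles and may differ.

```python
from typing import Dict


def adls_oauth_conf(storage_account: str, client_id: str,
                    client_secret: str, tenant_id: str) -> Dict[str, str]:
    """Build the Spark settings for service-principal (OAuth) access to ADLS Gen2.

    Hypothetical helper: it only assembles the configuration keys; nothing is
    sent to Azure until the settings are applied to a Spark session.
    """
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def abfss_path(container: str, storage_account: str, relative_path: str) -> str:
    """Build an abfss:// URI for a path inside a Data Lake container."""
    clean = relative_path.lstrip("/")
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{clean}"


# Placeholder values (assumptions); in practice the secret should come from a
# secret store rather than being written into the notebook.
conf = adls_oauth_conf("mydatalake", "<application-id>",
                       "<client-secret>", "<tenant-id>")
source = abfss_path("audit", "mydatalake", "/export/")

# Inside a Databricks notebook, where `spark` is predefined, the settings
# would then be applied and the exported data read roughly like this:
#
#   for key, value in conf.items():
#       spark.conf.set(key, value)
#   df = spark.read.csv(source, header=True)
```

Keeping the configuration in a plain function makes it easy to check the keys before running anything on a cluster, and the same dictionary can be reused if the data lake is mounted instead of accessed directly.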

This approach works for other sources as well, for example SQL Database, Cosmos DB, or CSV files.

In this series of blog posts, we will see how to configure each of the above steps. The next post starts by explaining how to create a resource group.
