Processing billions of records using Azure Data Factory, Data Lake and Databricks

Introduction:

In this blog, we will learn how to process billions of records using Azure Data Factory, Data Lake and Databricks.

Pre-requisites:

A user with the Contributor role on the Azure subscription.
Click here to learn how to assign a role to the user.

Problem Statement:

We have billions of audit records in Azure Table Storage and want to generate a high-level report so that the management team can make important decisions.

We tried different approaches to solve the problem.

Failed Attempt 1:

Export the Azure Table Storage data to a CSV file and build the report in Excel.
Limitation:
An Excel worksheet cannot hold more than 1,048,576 rows (about 1 million).
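
To put the limit in numbers, the short sketch below estimates how many full worksheets an export would need. The figure of one billion records is an assumed round number for illustration (the exact record count is not stated here), and `worksheets_needed` is a hypothetical helper.

```python
# Maximum number of rows in a single Excel worksheet (.xlsx format).
EXCEL_MAX_ROWS = 1_048_576


def worksheets_needed(record_count: int, header_rows: int = 1) -> int:
    """Return how many worksheets are needed to hold record_count rows,
    reserving header_rows rows per sheet for column headers."""
    usable_rows = EXCEL_MAX_ROWS - header_rows
    # Integer ceiling division avoids floating-point rounding.
    return -(-record_count // usable_rows)


# Assumed figure for illustration only: one billion audit records.
print(worksheets_needed(1_000_000_000))
```

Under that assumption, the export would fill roughly 954 worksheets, which is why reporting in Excel was not practical.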

Failed Attempt 2:

Migrate the data to SQL Server and process the billions of records there.
Limitation:
Even a simple query did not finish within 2-3 hours. Our SQL Server setup could not handle processing at this volume.

Failed Attempt 3:

Connect Power BI directly to Azure Table Storage.
Limitation:
Power BI does not support DirectQuery on Azure Table Storage, so it tried to load the complete dataset onto the local machine. In practice, loading a dataset of this size is not feasible.

Finally, we found a solution: process the data using Azure Data Lake, Data Factory and Databricks.
Below is the high-level diagram of the approach to this complex business scenario.


Let’s go through the process step by step: how to configure each service and how the resources communicate with each other to deliver the business solution.

Steps:

1. Create a Resource Group.
2. Create an Azure Data Lake account.
3. Create an Azure Data Factory.
4. Transfer the data from Table Storage to Azure Data Lake using Azure Data Factory.
5. Create an Azure AD application for Azure Databricks.
6. Assign the Contributor and Storage Blob Data Contributor roles to the registered Azure AD application at the subscription level.
7. Create an Azure Databricks service.
8. Connect Azure Data Lake to Azure Databricks using a notebook.
9. Connect Power BI to Azure Databricks for better visualization.
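
As a rough illustration of step 8, the sketch below builds the Spark settings a Databricks notebook could use to reach the lake with the Azure AD application from step 5. It assumes an Azure Data Lake Storage Gen2 account and service principal (OAuth client credentials) authentication, which may differ from your setup. The account name, IDs and secret are placeholders, and `build_adls_oauth_conf` is a hypothetical helper, not part of any SDK.

```python
def build_adls_oauth_conf(storage_account: str, client_id: str,
                          client_secret: str, tenant_id: str) -> dict:
    """Build Spark settings for OAuth access to ADLS Gen2 (assumed setup)."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# Placeholder values; in practice the secret should come from a secret store.
conf = build_adls_oauth_conf(
    "mydatalake", "<application-id>", "<client-secret>", "<tenant-id>"
)

# Inside a Databricks notebook, where `spark` is predefined, the settings
# could be applied and the data read like this (assuming step 4 wrote CSV):
#
# for key, value in conf.items():
#     spark.conf.set(key, value)
# df = spark.read.csv(
#     "abfss://<container>@mydatalake.dfs.core.windows.net/<path>",
#     header=True,
# )
```

Keeping the settings in one helper makes it easier to reuse the same connection details across notebooks.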

This approach works for other sources as well, for example SQL Database, Cosmos DB or CSV files.
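
To illustrate why the approach carries over, here is a hedged sketch of a Data Factory copy activity definition expressed as a Python dictionary, where only the source type changes between stores. The dataset names are placeholders, `copy_activity` is a hypothetical helper, and the source and sink type names reflect our understanding of the Data Factory connectors; verify them against the connector documentation before relying on them.

```python
import json

# Copy-activity source types per store (assumed; verify against the
# Data Factory connector documentation before use).
SOURCE_TYPES = {
    "table_storage": "AzureTableSource",
    "sql_database": "AzureSqlSource",
    "cosmos_db": "CosmosDbSqlApiSource",
    "csv": "DelimitedTextSource",
}


def copy_activity(source_kind: str, input_dataset: str,
                  output_dataset: str) -> dict:
    """Build a copy activity that lands data from source_kind as
    delimited text in the data lake (sink type is an assumption)."""
    return {
        "name": f"Copy_{source_kind}_to_data_lake",
        "type": "Copy",
        "inputs": [{"referenceName": input_dataset,
                    "type": "DatasetReference"}],
        "outputs": [{"referenceName": output_dataset,
                     "type": "DatasetReference"}],
        "typeProperties": {
            "source": {"type": SOURCE_TYPES[source_kind]},
            "sink": {"type": "DelimitedTextSink"},
        },
    }


# Placeholder dataset names for illustration.
activity = copy_activity("table_storage", "AuditTableDataset",
                         "DataLakeCsvDataset")
print(json.dumps(activity, indent=2))
```

Swapping `"table_storage"` for `"sql_database"` or `"cosmos_db"` changes only the source block; the rest of the pipeline stays the same.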

In the next blog, we will explain how to create a resource group. The other parts of this series will be released very soon.
