In this blog, we will learn how to process billions of records using Azure Data Factory, Azure Data Lake, and Azure Databricks.
Prerequisite: a user with the Contributor role on the Azure subscription.
Click here to learn how to assign a role to the user.
We have billions of audit records in Azure Table Storage and want to generate a high-level report so that the management team can make important decisions.
We tried different approaches to solve the problem.
Failed Attempt 1:
Export the Azure Table Storage data to a CSV file and build the report in Excel.
Excel cannot hold more than about 1 million rows (1,048,576 per worksheet).
Failed Attempt 2:
Migrate the data to SQL Server to process the billions of records.
We ran a simple query, and it did not finish within 2-3 hours. SQL Server could not handle data processing at this scale.
Failed Attempt 3:
Connect Power BI directly to Azure Table Storage.
Power BI does not support DirectQuery on Azure Table Storage. It tried to load the complete dataset onto the local machine. In practice, it is not feasible to load a dataset of this size.
Finally, we found a solution: processing billions of records using Azure Data Lake, Azure Data Factory, and Azure Databricks.
Below is a high-level diagram of the approach to this complex business scenario.
Let's walk through the process step by step: how to configure each service and how the resources communicate to deliver the business solution.
1. Create a Resource Group.
2. Create an Azure Data Lake account.
3. Create an Azure Data Factory.
4. Transfer the data from Table Storage to Azure Data Lake using Azure Data Factory.
5. Create an Azure AD application for Azure Databricks.
6. Assign the Contributor and Storage Blob Data Contributor roles to the registered Azure AD application at the subscription level.
7. Create an Azure Databricks service.
8. Connect Azure Data Lake to Azure Databricks using a notebook.
9. Connect Power BI to Azure Databricks for better visualization.
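To make step 8 a little more concrete, here is a minimal sketch of how a Databricks notebook might authenticate to the Data Lake with the Azure AD application from steps 5 and 6. It assumes the Data Lake account is Azure Data Lake Storage Gen2 (this post does not say which generation is used), and every value such as "mydatalake", "audit", and the credentials is a hypothetical placeholder. The helper functions are our own illustration, not part of any library; only the configuration keys come from the Hadoop ABFS driver.

```python
def adls_oauth_conf(storage_account: str, client_id: str,
                    client_secret: str, tenant_id: str) -> dict:
    """Build Spark settings for service-principal (OAuth) access.

    Assumption: the account is ADLS Gen2, reached through the ABFS driver.
    """
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def abfss_path(container: str, storage_account: str, path: str) -> str:
    """Return the abfss:// URI for a path inside a Data Lake container."""
    return (f"abfss://{container}@{storage_account}"
            f".dfs.core.windows.net/{path.lstrip('/')}")


# In a Databricks notebook, `spark` is predefined, so the settings could be
# applied like this (placeholder values, kept as comments so the sketch
# also runs outside Databricks):
#
#   conf = adls_oauth_conf("mydatalake", "<application-id>",
#                          "<client-secret>", "<directory-id>")
#   for key, value in conf.items():
#       spark.conf.set(key, value)
#   df = spark.read.csv(abfss_path("audit", "mydatalake", "export/"),
#                       header=True)
```

In a real workspace the client secret would normally come from a Databricks secret scope rather than being typed into the notebook.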
This approach works for other sources as well, for example SQL Database, Cosmos DB, or CSV files.
In the next blog, we will explain how to create a resource group. The other parts of this series will be released soon.