Big Data Practices and Case Studies


6. Taking a Structured Approach

DataLake Solutions recommends an agile, iterative implementation methodology for Big Data Analytics initiatives. These techniques deliver incremental results focused on business requirements rather than following a traditional SDLC application-development cycle.

We suggest starting small: identify the requirements in the business case, meet them, and then gradually expand the project toward the broader business strategy.

Big Data Analytics projects are executed in phases:

Ingest – Phase 1

This is the first step of a Big Data implementation, in which the data is acquired. Data from different sources is consolidated onto a single platform. Sources include:

  • Structured data acquired from sources like ERP systems, point-of-sale transactions, CRM, general ledger transactions, call records and call center transactions
  • Unstructured data from sources like logs, websites, emails, photos, audio, video, RSS feeds, PDF/A documents, social media comments (Facebook, Twitter, Pinterest etc.) and XML

In this phase the focus is on understanding the sources of data and how the collected data is ingested to get a unified view of the aggregated data from the different sources.
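As a rough illustration of this ingest step, the sketch below pulls records from two hypothetical sources (a CRM export and a web log) into one common schema. All field names and values are made-up examples, not part of any specific platform:

```python
# Hypothetical sketch: unifying records from two different sources
# into one common schema, giving a single aggregated view.

crm_rows = [
    {"customer_id": 101, "name": "Alice"},
    {"customer_id": 102, "name": "Bob"},
]

web_log_lines = [
    "101 2016-03-01 /checkout",
    "103 2016-03-01 /signup",
]

def normalize_crm(row):
    # Map a structured CRM record onto the unified schema.
    return {"source": "crm", "customer_id": row["customer_id"], "event": "profile"}

def normalize_log(line):
    # Parse an unstructured log line onto the same schema.
    cid, date, path = line.split()
    return {"source": "weblog", "customer_id": int(cid), "event": path}

unified = [normalize_crm(r) for r in crm_rows] + \
          [normalize_log(l) for l in web_log_lines]
print(len(unified))  # 4 records, now on a single platform
```

In a real ingest pipeline the same idea applies at scale: each source gets an adapter that maps its records onto one shared schema before the data lands on the platform.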

Transform – Phase 2

In this phase the ingested data from Phase 1 is

  • Filtered
  • Organized
  • Integrated
  • Cleansed
  • Normalized
  • Aggregated
  • Transformed

This phase manipulates the data multiple times (moving, splitting, sorting, translating, pivoting and more) to ensure that the collected data is accurate. The data is then validated against data quality rules.
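The steps above can be sketched as a tiny transform pipeline. This is an illustrative toy, assuming made-up field names and a single data-quality rule (sales must be present):

```python
# Hypothetical Phase 2 sketch: cleanse, filter, normalize and aggregate
# raw ingested records. Fields and values are illustrative assumptions.

raw = [
    {"region": " North ", "sales": "100"},
    {"region": "north",   "sales": "250"},
    {"region": "south",   "sales": None},   # fails the data-quality rule
    {"region": "South",   "sales": "75"},
]

def cleanse(rec):
    # Normalize casing/whitespace and cast types.
    return {"region": rec["region"].strip().lower(), "sales": int(rec["sales"])}

# Filter: validate against a simple data-quality rule before cleansing.
valid = [cleanse(r) for r in raw if r["sales"] is not None]

# Aggregate: total sales per normalized region.
totals = {}
for rec in valid:
    totals[rec["region"]] = totals.get(rec["region"], 0) + rec["sales"]

print(totals)  # {'north': 350, 'south': 75}
```

Production pipelines do the same operations with dedicated ETL tools, but the order of concerns (validate, cleanse, normalize, then aggregate) is the same.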

Analyze – Phase 3

This phase is all about finding the value in the data. Using data mining algorithms, we uncover insights from disparate data sets. The most commonly used data mining algorithms are:

  • Decision Trees
  • Naïve Bayes
  • Association Rules
  • Logistic Regression
  • Kohonen Networks
  • K-Means Clustering
  • O-Cluster
  • Neural Networks
  • Survival Analysis

For example, banks use logistic regression to predict the probability of default on a loan by analyzing the loan applicants' details, and eCommerce businesses can use association rules to analyze their data sets and make better offers to customers in real time.
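The banking example can be sketched as follows. Note that the coefficients below are invented for illustration, not a trained model, and the applicant features (debt ratio, late payments) are assumptions:

```python
# Hypothetical sketch: scoring a loan applicant with logistic regression.
# P(default) = sigmoid(b0 + b1*debt_ratio + b2*late_payments)
# The coefficients are made-up illustrations, not a fitted model.
import math

b0, b1, b2 = -3.0, 4.0, 0.8

def default_probability(debt_ratio, late_payments):
    z = b0 + b1 * debt_ratio + b2 * late_payments
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function

low_risk = default_probability(debt_ratio=0.1, late_payments=0)
high_risk = default_probability(debt_ratio=0.6, late_payments=3)
print(round(low_risk, 3), round(high_risk, 3))
```

In practice the coefficients are learned from historical loan outcomes, and the resulting probability feeds an approve/decline or pricing decision.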

When using the Apache Hadoop environment, model code is written in Python, HQL (Hive Query Language) or Java. Commonly used tools for mining patterns in larger data sets are R and Apache Mahout.

To address enterprise issues across functions and business units, data is acquired from new and varied sources. It therefore becomes important to build models that can be scaled up without inconsistencies. This phase incorporates key business inputs from the business process users as specific variables while analyzing the data.

Visualize – Phase 4

In this last phase, the insights gained from the workflow are presented visually. The data is represented visually after:

  • Cleansing the data
  • Filtering the data and
  • Replicating the application logic

Visualization of the data includes graphs, tables, scatter plots, diagrams and maps.

7. Right Tools and Technology for the Requirements

With so many Big Data tools and technology options available in the market, it can be difficult and confusing to pick the right one for your needs. One size doesn't fit all. If an enterprise has clear expectations of its technical requirements, it becomes easy to evaluate the tools offered by different vendors by comparing their strengths and weaknesses.

A brief overview of different options available to evaluate the tools:

Apache Hadoop:

This popular framework is designed to process terabytes and petabytes of structured and unstructured data. For faster processing, Hadoop breaks large workloads into small data blocks distributed across a cluster of commodity hardware.

The four key components of the Hadoop platform are:

  • Hadoop Distributed File System (HDFS) – a scalable storage layer that can store data across multiple machines without requiring a predefined organizational structure
  • MapReduce – a software programming model with a parallel data processing engine for processing large data sets
  • YARN – a resource management framework used for scheduling jobs and managing resources for distributed applications
  • Hadoop Common – the libraries and utilities used by the other Hadoop modules
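To make the MapReduce component concrete, here is a minimal in-memory sketch of the classic word-count example. A real Hadoop job distributes these same map, shuffle and reduce steps across the cluster; this toy runs them in one process:

```python
# Minimal illustration of the MapReduce programming model (word count).
# Real Hadoop jobs distribute these steps across many machines.
from collections import defaultdict

documents = ["big data big insights", "big value"]

# Map: emit (word, 1) pairs from each input split.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 3, 'data': 1, 'insights': 1, 'value': 1}
```

The appeal of the model is that the map and reduce functions contain no distribution logic; the framework handles splitting the input, shuffling intermediate pairs and scheduling the work.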

Apart from Apache Hadoop itself, some vendors also provide Hadoop distribution platforms, such as Hortonworks, Cloudera, Amazon Elastic MapReduce (EMR) and MapR. These distributions come packaged with additional software components, training, tools and documentation.

Big vendors like SAS, Teradata, SAP, IBM and Microsoft sell commercially supported big data suites. Increasingly, these vendors also support integration with Hadoop distribution platforms and are adding Hadoop-based solutions to their existing portfolios.

Today, more and more big vendors in the market are adopting the Hadoop framework because it is less expensive. Most enterprises prefer a hybrid approach, selecting tools based on their maturity and unique capabilities.

To make a decision, you need to understand the strengths, weaknesses and potential tradeoffs of the tools. It is advisable to survey the market for tools offered by the big vendors with unique capabilities; this will help the enterprise pick the right tools for its technology requirements. It is also recommended to analyze the cost benefit of the enterprise versions and commercial support offered by different vendors.

8. Early Data Governance

With the velocity, variety and volume of data growing exponentially (streaming data, diverse data formats, large-scale cloud infrastructures), concerns about the security and privacy of data have magnified. To gain a competitive edge, enterprises are adopting Big Data innovations. It has become mandatory for organizations to build resilience against threats to their data security and to realign their policies, governance and privacy practices, especially when dealing with the privacy laws of other countries and continents.

To manage risks, organizations have to revisit their frameworks and governance structures. This will help them make informed decisions by identifying and assessing risks in time.

To protect the privacy and security of their information, organizations have to include a data governance strategy in their planning process, one that also covers the evaluation of pilots.


Big Data Analytics is one of the most sought-after technologies in organizations today. Organizations are keen to embrace this technology, take advantage of the analytics and gain a competitive edge in the market. The principles discussed here will help organizations identify the right Big Data Analytics initiatives for their business.


About Datalake Solutions

DataLake Solutions is a solution-oriented firm that has been serving organizations in India with work-optimizing products and services since January 2016, and is based in the USA. DataLake has an experienced and dedicated team equipped with the knowledge, skills and expertise of the modern era to fulfil business requirements, and this is rewarded with customer delight.

Read more