Data Lakes recommends using agile and iterative implementation techniques as the project methodology for a Big Data Analytics initiative. These techniques deliver quick results focused on business requirements, in contrast to traditional SDLC-style application development.
We suggest starting small: identify the requirements in the business case, meet them, and then gradually expand the project towards the broader business strategy.
Big Data Analytics projects are executed in phases.
This is the first step in a Big Data implementation, where the data is acquired. Data from different sources is brought onto a single platform. Sources like
In this phase, the focus is on understanding the sources of data and how the collected data is ingested to give a unified view of the data aggregated from those sources.
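As a rough illustration of ingesting data into a unified view, the sketch below reads two differently shaped sources and merges them on a shared key. The source names, field names and sample values are all hypothetical, and a real ingestion layer would read from files, databases or streams rather than inline strings.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for two different source systems.
CRM_CSV = "customer_id,full_name,country\n101,Customer One,IN\n102,Customer Two,US\n"
WEB_JSON = '[{"id": 102, "name": "Customer Two", "visits": 7}, {"id": 103, "name": "Customer Three", "visits": 2}]'


def read_crm(text):
    """Map CSV rows from the (assumed) CRM export onto a shared schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "customer_id": int(row["customer_id"]),
            "name": row["full_name"],
            "country": row["country"],
            "source": "crm",
        }


def read_web(text):
    """Map JSON items from the (assumed) web analytics feed onto the same schema."""
    for item in json.loads(text):
        yield {
            "customer_id": item["id"],
            "name": item["name"],
            "visits": item["visits"],
            "source": "web",
        }


def unify(*streams):
    """Merge records that share a customer_id into one aggregated record."""
    unified = {}
    for stream in streams:
        for record in stream:
            entry = unified.setdefault(record["customer_id"], {"sources": []})
            # Remember where each piece of data came from, then merge the fields.
            entry["sources"].append(record.pop("source"))
            entry.update(record)
    return unified


unified_view = unify(read_crm(CRM_CSV), read_web(WEB_JSON))
for customer_id in sorted(unified_view):
    print(customer_id, unified_view[customer_id])
```

Keeping a `sources` list on every merged record is one simple way to preserve lineage, which helps when data quality questions come up in later phases.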
In this phase the ingested data from Phase 1 is
This phase manipulates the data multiple times (moving, splitting, sorting, translating, pivoting and more) to ensure that the collected data is accurate. The data is then validated against the data quality rules.
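The following minimal sketch shows one way the translate, validate and sort steps could fit together. The record layout and the three quality rules are assumptions chosen for illustration; real rules would come from the business requirements.

```python
from datetime import datetime

# Hypothetical ingested records; the field names are illustrative only.
raw_records = [
    {"order_id": "A-3", "amount": " 120.50 ", "order_date": "2023-03-01"},
    {"order_id": "A-1", "amount": "75", "order_date": "2023-01-15"},
    {"order_id": "", "amount": "40", "order_date": "2023-02-10"},
    {"order_id": "A-2", "amount": "-5", "order_date": "2023-02-30"},
]


def translate(record):
    """Translate raw strings into typed values; unparseable values become None."""
    try:
        amount = float(record["amount"].strip())
    except ValueError:
        amount = None
    try:
        order_date = datetime.strptime(record["order_date"], "%Y-%m-%d").date()
    except ValueError:
        order_date = None
    return {
        "order_id": record["order_id"].strip(),
        "amount": amount,
        "order_date": order_date,
    }


# Assumed data quality rules: each maps a rule name to a pass/fail check.
QUALITY_RULES = {
    "order_id is present": lambda r: bool(r["order_id"]),
    "amount is non-negative": lambda r: r["amount"] is not None and r["amount"] >= 0,
    "order_date is a valid date": lambda r: r["order_date"] is not None,
}


def validate(record):
    """Return the names of all rules the record fails."""
    return [name for name, rule in QUALITY_RULES.items() if not rule(record)]


# Split the translated records into clean and rejected sets.
clean, rejected = [], []
for rec in map(translate, raw_records):
    failures = validate(rec)
    if failures:
        rejected.append((rec, failures))
    else:
        clean.append(rec)

# Sort the clean records chronologically for downstream processing.
clean.sort(key=lambda r: r["order_date"])

print("clean:", [r["order_id"] for r in clean])
for rec, failures in rejected:
    print("rejected:", rec["order_id"] or "<missing id>", failures)
```

Rejected records are kept together with the rules they failed instead of being dropped silently, so data quality problems can be traced back to their source.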
This phase is all about finding the value of the data. Using data mining algorithms, we draw insights from the disparate data sets. The most commonly used data mining algorithms are
For example, banks can use Logistic Regression to predict the probability that a loan will default by analyzing the loan applicants' details. Similarly, eCommerce businesses can use Association Rules to analyze their data sets and plan better offers for customers in real time.
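To make the loan example concrete, here is a minimal sketch of logistic regression fitted with plain gradient descent. The two features (debt-to-income ratio and number of late payments), the toy data and the hyperparameters are all assumptions made for illustration; a production model would use a tested library and far more data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to z
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def default_probability(x, weights, bias):
    """Predicted probability that an applicant with features x defaults."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical applicants: [debt-to-income ratio, number of late payments].
applicants = [[0.1, 0], [0.2, 0], [0.3, 1], [0.6, 2], [0.7, 3], [0.8, 4]]
defaulted = [0, 0, 0, 1, 1, 1]

weights, bias = fit_logistic(applicants, defaulted)
low_risk = default_probability([0.15, 0], weights, bias)
high_risk = default_probability([0.75, 3], weights, bias)
print(f"low-risk applicant:  {low_risk:.3f}")
print(f"high-risk applicant: {high_risk:.3f}")
```

The model outputs a probability, not a yes/no answer, so the bank still has to choose a cut-off that reflects its own risk appetite.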
In an Apache Hadoop environment, model code is written in Python, HQL (Hive Query Language) or Java. Commonly used tools for mining patterns in larger data sets are R and Apache Mahout.
To address enterprise issues across functions and business units, data is acquired from new and different sources. It therefore becomes important to build models that can be scaled up without inconsistencies. This phase also turns the key inputs gathered from business process users into specific variables used when analyzing the data.
In this last phase, the insights gained from the workflow are presented visually. The data is represented visually after
Visualization of the data includes graphs, tables, scatter plots, diagrams and maps.
With so many Big Data tools and technology options on the market, picking the right one for your needs can be difficult and confusing. One size doesn't fit all. If the enterprise has clear expectations about its technical requirements, it becomes much easier to evaluate the tools offered by different vendors by comparing their strengths and weaknesses.
A brief overview of the options available when evaluating tools:
This popular framework is designed to process terabytes and petabytes of structured and unstructured data. For faster processing, Hadoop breaks large workloads down into small data blocks distributed across a cluster of commodity hardware.
Four key components of the Hadoop platform are:
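The idea of splitting a workload into blocks, processing each block independently and then combining the partial results can be sketched in a few lines. This is only a single-machine analogy in the MapReduce style, with threads standing in for cluster nodes; it does not use any Hadoop API, and the log lines are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_blocks(lines, block_size):
    """Break the input into fixed-size blocks, as a stand-in for data blocks."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: count the words in one block, independently of the others."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical input; a real job would read from a distributed file system.
log_lines = [
    "error disk full",
    "info job started",
    "error network timeout",
    "info job finished",
    "warn disk usage high",
]

blocks = split_into_blocks(log_lines, block_size=2)

# Worker threads play the role of cluster nodes, one block per task.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(map_block, blocks))

total = reduce(reduce_counts, partial_counts, Counter())
print(total.most_common(4))
```

Because each block is processed without reference to the others, adding more workers (or, on a real cluster, more nodes) lets the same job handle more data.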
Apart from Apache Hadoop itself, some vendors provide Hadoop distribution platforms, such as Hortonworks, Cloudera, Amazon Elastic MapReduce (EMR) and MapR. These distributions come packaged with additional software components, tools, training and documentation.
Big vendors like SAS, Teradata, SAP, IBM and Microsoft sell commercial big data suites. However, these vendors do not support integration with Hadoop distribution platforms, nor do they support building Hadoop solutions into their existing portfolios.
Today, more and more big vendors are moving to the Hadoop framework because it is less expensive. Most enterprises prefer a hybrid approach, selecting tools according to their maturity and unique capabilities.
To make a decision, you need to understand the strengths, weaknesses and potential trade-offs of each tool. It is therefore advisable to survey the market for tools from big vendors that offer unique capabilities; this helps the enterprise pick the right tools for its technology requirements. It is also recommended to compare the costs and benefits of the enterprise versions and commercial support offered by different vendors.
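One simple way to structure such a comparison is a weighted scoring matrix, sketched below. The criteria, weights, tool names and 1-to-5 scores are placeholders invented for the example; each enterprise would substitute its own requirements and its own assessment of real products.

```python
# Hypothetical criteria and weights (must sum to 1.0); adjust to your priorities.
WEIGHTS = {
    "capability_fit": 0.4,
    "maturity": 0.2,
    "license_cost": 0.2,        # higher score = cheaper
    "commercial_support": 0.2,
}

# Placeholder tools with made-up scores on a 1 (poor) to 5 (excellent) scale.
CANDIDATES = {
    "Tool A": {"capability_fit": 4, "maturity": 5, "license_cost": 2, "commercial_support": 5},
    "Tool B": {"capability_fit": 5, "maturity": 3, "license_cost": 4, "commercial_support": 4},
    "Tool C": {"capability_fit": 3, "maturity": 4, "license_cost": 5, "commercial_support": 2},
}


def weighted_score(scores, weights):
    """Combine per-criterion scores into a single weighted total."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights should sum to 1.0"

totals = {name: weighted_score(scores, WEIGHTS) for name, scores in CANDIDATES.items()}
ranking = sorted(totals, key=totals.get, reverse=True)

for name in ranking:
    print(f"{name}: {totals[name]:.2f}")
```

The totals are only as good as the scores and weights behind them, so the matrix is best treated as a way to make trade-offs explicit and discussable, not as an automatic decision.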
As the velocity, variety and volume of data grow exponentially, driven by streaming data, diverse data formats and large-scale cloud infrastructures, concerns about data security and privacy have grown with them. Enterprises are adopting Big Data innovations to gain a competitive edge. As they do so, organizations must build resilience against threats to their data security and realign their policies, governance and privacy practices, particularly when dealing with the privacy laws of other countries and continents.
To manage risk, organizations have to revisit their frameworks and governance structures. This helps them make informed decisions by identifying and assessing risks in time.
To protect the privacy and security of information, organizations should include a data governance strategy in their planning process, and that strategy should also cover the evaluation of pilots.
Big Data Analytics is among the most sought-after technologies in organizations today. Organizations want to embrace it to take advantage of analytics and gain a competitive edge in the market. The principles discussed here can help them identify the right Big Data Analytics initiatives for their business.