Tag Archives: map reduce

Hadoop and Data Science

Hadoop is more used for Massive Parallel processing MPP architecture.

new MPP platform which can scaleout to petabyte database hadoop which is open source community(around apache, vendor agnostic framework in MPP), can help in faster precessing of heavy loads. Mapreduce can be used for further customisation.

hadoop can help roles CTO  : log analysis of huge data of suppose application logging millions of transaction data .

CMO: targetted offering from social data, target advertisements and customer offerings.

CFO : on using predictive analytics to find toxicity of Loan or mortage from social data of prespects.

As datawarehousing and BI in Technology driven Company people report to CTO only.But it getting pervasive..so user load in BI System increase leading to efficient processing through system like hadoop of social data.

hadoop can help in near realtime analysis of customer like customer click stream real-time analysis,(realtime changing customer interest  can be checked over portal ).

Can bring paradigm shift in Next generation enterprise EDW,SOA(hadoop).  Mapreduce in data virtualitzation.In  cloud we have  (platform,Infrastructure,software).

mahout : Framework for machine learning for analyzing huge data and predictive analytic on it. Open source framework support for Mapreduce.Real time analytic helps in figuring trend very early from customer perspective hence adoption level should be high in customer Relationship management modules so it growth of Salesforce.com depicts.

HDFS: is suited for batch processing.

HBase: for but near realtime

casendra : optimized real tim e distributed environment.

Hr Analytics: There are  high degree of silos: cycle through lots survey data :–> prepare report –> generalized problem  –> find solutions for generalized data . Data from perspective of application, application as perspective of data.

BI help us in getting single version of truth about structure data but unstructured data is where Hadoop helps. Hadoop can process: (structureed,un-structured, timeline etc..across enteripse) data.from service oriented Architeture we need to move from SOA towards  SOBA Service oriented business Architecture.SOBAs are applications composed of services in a declarative manner .The SOA Programming Model specifications include the Service Component Architecture (SCA) to simplify the development of creating business services and Service Data Objects (SDO) for accessing data residing in multiple locations and formats.Moving towards data driven application architectures.Rather than application arranged around data have to otherwise application arranged around data. 

Architect view point: 1. people and process as overlay of technology. Expose data trough service oriented data access.  Hadoop helps in processing power in MDM, quality, integrating data outside enterprise.

utility Industry:Is the first industry to adopt Cloud services with smart metering. Which can give smart input to user about load in network rather then calling services provider user is self aware..Its like Oracle brought this concept of Self service applications.

I am going to refine matter further put some more example and ilustrations if time permits..
Read More details at another blog:

Data Integration , map Reduce algorithm , virtualisation relation and trends

In year 2011 This reply i did to a discussion. would later structure it into proper article. 

As of 2010 data virtualization had begun to advance ETL processing. The application of data virtualization to ETL allowed solving the most common ETL tasks of data migration and application integration for multiple dispersed data sources. So-called Virtual ETL operates with the abstracted representation of the objects or entities gathered from the variety of relational, semi-structured and unstructured data sources. ETL tools can leverage object-oriented modeling and work with entities’ representations persistently stored in a centrally located hub-and-spoke architecture. Such a collection that contains representations of the entities or objects gathered from the data sources for ETL processing is called a metadata repository and it can reside in memory[1] or be made persistent. By using a persistent metadata repository, ETL tools can transition from one-time projects to persistent middleware, performing data harmonization and data profiling consistently and in near-real time.

– More then colmunar databases i see probalistic databases : link:http://en.wikipedia.org/wiki/Probabilistic_database

probabilistic database is an uncertain database in which the possible worlds have associated probabilities. Probabilistic database management systems are currently an active area of research. “While there are currently no commercial probabilistic database systems, several research prototypes exist…”[1]

Probabilistic databases distinguish between the logical data model and the physical representation of the data much like relational databases do in the ANSI-SPARC Architecture. In probabilistic databases this is even more crucial since such databases have to represent very large numbers of possible worlds, often exponential in the size of one world (a classical database), succinctly.

For Bigdata analysis the software which is getting popular today is IBM big data analytics
I am writing about this too..already written some possible case study where and how to implement.
Understanding Big data PDF attached.
There are lot of other vendors which are also moving in products for cloud computing..in next release on SSIS hadoop feed will be available as source.
– Microstraegy and informatica already have it.
– this whole concept is based on mapreduce algorithm from google..There are online tutorials on mapreduce.(ppt attached)

Without a doubt, data analytics have a powerful new tool with the “map/reduce” development model, which has recently surged in popularity as open source solutions such as Hadoop have helped raise awareness.

Tool: You may be surprised to learn that the map/reduce pattern dates back to pioneering work in the 1980s which originally demonstrated the power of data parallel computing. Having proven its value to accelerate “time to insight,” map/reduce takes many forms and is now being offered in several competing frameworks.

If you are interested in adopting map/reduce within your organization, why not choose the easiest and best performing solution? ScaleOut StateServer’s in-memory data grid offers important advantages, such as industry-leading map/reduce performance and an extremely easy to use programming model that minimizes development time.

Here’s how ScaleOut map/reduce can give your data analysis the ideal map/reduce framework:

Industry-Leading Performance

  • ScaleOut StateServer’s in-memory data grids provide extremely fast data access for map/reduce. This avoids the overhead of staging data from disk and keeps the network from becoming a bottleneck.
  • ScaleOut StateServer eliminates unnecessary data motion by load-balancing the distributed data grid and accessing data in place. This gives your map/reduce consistently fast data access.
  • Automatic parallel speed-up takes full advantage of all servers, processors, and cores.
  • Integrated, easy-to-use APIs enable on-demand analytics; there’s no need to wait for batch jobs.