Tag Archives: Mahout

Collaboration Management System relation to Analytics and data Science

Collaboration tools integrated offering (course grain integration using ) integration tools like TIBCO, Oracle BPEL, : Components to be integrated:
1. Content management system CMS  (SharePoint, Joomla, drupal) and
2. Document Management system like (liferay, Document-um, IBM file-net) can be integrated using flexible integration tools.

3. Communication platform like Windows Communication Foundation ,IBM lotus notes integrated with mail client and Social network like Facebook using Facebook API, LinkedIn API, twitter API ,skype API to direct plugin as well as data Analysis of Social networking platform unstructured data captured of the collaboration for the project discussion.
soft-phone using Skype offering recording conversation facility for later use.


Oracle Web centre:
4. Integrated Project specific Wikki/Sharepoint/other CMS pages integrated with PMO site Artefacts, Enterprise Architecture Artefacts.
5. seamless integration to Enterprise Search using Endeca or Microsoft FAST for discovery of document, information, answers from indexed,tagged repository of data.
6. Structured and Unstructured data : hosted on Hadoop clusters using Map-reduce algorithm to Analyse data, consolidate data using Hadoop Hive, HBase and mining to discover hidden information using data mining library in Mahout for unstructured data.
Structured data kept in RDBMS clusters like RAC rapid application clusters.

7. Integrated with Domain specific Enterprise resource planning ERP packages the communication, collaboration,Discovery, Search layer.
8. All integrated with mesh up architecture providing real-time information maps of resource located and information of nearest help.
9. messaging and communication layer integrated with all on-line company software.
10.Process Orchestration and integration Using Business Process Management tool BPM tool, PEGA BPM, Jboss BPM , windows workflow foundation depending landscape used.
11. Private cloud integration using Oracle cloud , Microsoft Azure, Eucalyptus, open Nebula integrated with web API other web platform landscape.
12. Integrated BI system with real time information access by tools like TIBCO spotfire which can analyse real time data flowing between integrated systems.
Data centre API and virtualisation plaform can also throw in data for analysis to hadoop cluster.
External links for reference: http://www.sap.com/index.epx
AP XI: http://help.sap.com/saphelp_nw04/helpdata/en/9b/821140d72dc442e10000000a1550b0/content.htm

Oracle Web centre: http://www.oracle.com/technetwork/middleware/webcenter/suite/overview/index.html

CMS: http://www.joomla.org/,http://www.liferay.com/http://www-03.ibm.com/software/products/us/en/filecontmana/
Hadoop: http://hadoop.apache.org/

Map reduce: http://hadoop.apache.org/docs/stable/mapred_tutorial.html
acebook API: https://developers.facebook.com/docs/reference/apis/
inkedin API: http://developer.linkedin.com/apis
witter API: https://dev.twitter.com/

Use Case: Big data Analytics: Cisco,Machine to Machine,image processing,economics

The 5V volume, variety, velocity,value,variability Story:

Datawarehouses maintain data loaded from operational databases using Extract Transform Load ETL tools like informatica, datastage, Teradata ETL utilities etc…
Data is extracted from operational store (contains daily operational tactical information) in regular intervals defined by load cycles. Delta or Incremental load or full load is taken to datwarehouse containing Fact and dimension tables which are modeled on STAR (around 3NF )or SNOWFLAKE schema.
During business Analysis we come to know what is granularity at which we need to maintain data. Like (Country,product, month) may be one granularity and (State,product group,day) may be requirement for different client. It depends on key drivers what level do we need to analyse business.

There many databases which are specially made for datawarehouse requirement of low level indexing, bit map indexes, high parallel load using multiple partition clause for Select(during Analysis), insert( during load). data warehouses are optimized for those requirements.
For Analytic we require data should be at lowest level of granularity.But for normal DataWarehouses its maintained at a level of granularity as desired by business requirements as discussed above.
for Data characterized by 3V volume, velocity and variety of cloud traditional datawarehouses are not able to accommodate high volume of suppose video traffic, social networking data. RDBMS engine can load limited data to do analysis.. even if it does with large not of programs like triggers, constraints, relations etc many background processes running in background makes it slow also sometime formalizing in strict table format may be difficult that’s when data is dumped as blog in column of table. But all this slows up data read and writes. even is data is partitioned.
Since advent of Hadoop distributed data file system. data can be inserted into files and maintained using unlimited Hadoop clusters which are working parallel and execution is controlled byMap Reduce algorithm . Hence cloud file based distributed cluster databases proprietary to social networking needs like Cassandra used by facebook etc have mushroomed.Apache hadoop ecosystem have created Hive (datawarehouse)

With Apache Hadoop Mahout Analytic Engine for real time data with high 3V data Analysis is made possible.  Ecosystem has evolved to full circle Pig: data flow language,Zookeeper coordination services, Hama for massive scientific computation,

HIPI: Hadoop Image processing Interface library made large scale image processing using hadoop clusters possible.

Realtime data is where all data of future is moving towards is getting traction with large server data logs to be analysed which made Cisco Acquired Truviso Rela time data Analytics http://www.cisco.com/web/about/ac49/ac0/ac1/ac259/truviso.html

Analytic being this of action: see Example:

with innovation in hadoop ecosystem spanning every direction.. Even changes started happening in other side of cloud stack of vmware acquiring nicira. With huge peta byte of data being generated there is no way but to exponentially parallelism data processing using map reduce algorithms.
There is huge data out yet to generated with IPV6 making possible array of devices to unique IP addresses. Machine to Machine (M2M) interactions log and huge growth in video . image data from vast array of camera lying every nuke and corner of world. Data with a such epic proportions cannot be loaded and kept in RDBMS engine even for structured data and for unstructured data. Only Analytic can be used to predict behavior or agents oriented computing directing you towards your target search. Bigdatawhich technology like Apache Hadoop,Hive,HBase,Mahout, Pig, Cassandra, etc…as discussed above will make huge difference.

Some of the technology to some extent remain Vendor Locked, proprietory but Hadoop is actually completely open leading the the utilization across multiple projects. Every product have data Analysis have support to Hadoop. New libraries are added almost everyday. Map and reduce cycles are turning product architecture upside down. 3V (variety, volume,velocity) of data is increasing each day. Each day a new variety comes up, and new speed or velocity of data level broken, records of volume is broken.
The intuitive interfaces to analyse the data for business Intelligence system is changing to adjust such dynamism  since we cannot look at every bit of data not even every changing data we need to our attention directed to more critical bit of data out of heap of peta-byte data generated by huge array of devices , sensors and social media. What directs us to critical bit ? As given example
or Hedge funds use hedgehog language provided by :
such processing can be achieved using Hadoop or map-reduce algorithm. There are plethora of tools and technology which are make development process fast. New companies are coming  from ecosystem which are developing tools and IDE to make transition to this new development  easy and fast.

When market gets commodatizatied as it hits plateu of marginal gains of first mover advantage the ability to execute becomes critical. What Big data changes is cross Analysis kind of first mover validation before actually moving. Here speed of execution will become more critical. As production function Innovation givesreturns in multiple. so the differentiate or die or Analyse and Execute feedback as quick and move faster is market…

This will make cloud computing development tools faster to develop with crowd sourcing, big data and social Analytic feedback.





Cloud Computing relation to Business Intelligence and Datawarehousing
Read :

Cloud Computing and Unstructured Data Analysis Using
Apache Hadoop Hive
Also it compares Architecture of 2 Popular BI Tools.

Cloud Data warehouse Architecture:

Future of BI
No one can predict future but these are directions where it moving in BI.

Bigdata,cloud , business Intelligence and Analytics

There huge amount of data being generated by BigData Chractersized by 3V (Variety,Volume,Velocity) of different variety (audio, video, text, ) huge volumes (large video feeds, audio feeds etc), and velocity ( rapid change in data , and rapid changes in new delta data being large than existing data each day…) Like facebook keep special software which keep latest data feeds posts on first layer storage server Memcached (memory caching) server bandwidth so that its not clogged and fetched quickly and posted in real time speed the old archive data stored not in front storage servers but second layer of the servers.
Bigdata 3V characteristic data likewise stored in huge (Storage Area Network) SAN of cloud storage can be controlled by IAAS (infrastucture as service) component software like Eucalyptus to create public or private cloud. PAAS (platform as service) provide platform API to control package and integrate to other components using code. while SAAS provide seamless Integration.
Now Bigdata stored in cloud can analyzed using hardtop clusters using business Intelligence and Analytic Software.
Datawahouse DW: in RDBMS database to in Hadoop Hive. Using ETL tools (like Informatica, datastage , SSIS) data can be fetched operational systems into data ware house either Hive  for unstructured data or RDBMS for more structured data.

BI over cloud DW: BI can create very user friendly intuitive reports by giving user access to layer of SQL generating software layer called semantic layer which can generate SQL queries on fly depending on what user drag and drop. This like noSQL and HIVE help in analyzing unstructured data faster like data of social media long text, sentences, video feeds.At same time due to parallelism in Hadoop clusters and use of map reduce algorithm the calculations and processing can be lot quicker..which is fulling the Entry of Hadoop and cloud there.
Analytics and data mining is expension to BI. The social media data mostly being unstructured and hence cannot be analysed without categorization and hence quantification then running other algorithm for analysis..hence Analytics is the only way to get meaning from terabyte of data being populated in social media sites each day.

Even simple assumptions like test of hypothesis cannot be done with analytics on the vast unstructured data without using Analytics. Analytics differentiate itself from datawarehouse as it require much lower granularity data..or like base/raw data..which is were traditional warehouses differ. some provide a workaround by having a staging datawarehouse but still  data storage here has limits and its only possible for structured data. So traditional datawarehouse solution is not fit in new 3V data analysis. here new Hadoop take position with Hive and HBase and noSQL and mining with mahout.
Look At the other Article: For practical tool based Case Study on Social Media Analytics

Data Integration , map Reduce algorithm , virtualisation relation and trends

In year 2011 This reply i did to a discussion. would later structure it into proper article. 

As of 2010 data virtualization had begun to advance ETL processing. The application of data virtualization to ETL allowed solving the most common ETL tasks of data migration and application integration for multiple dispersed data sources. So-called Virtual ETL operates with the abstracted representation of the objects or entities gathered from the variety of relational, semi-structured and unstructured data sources. ETL tools can leverage object-oriented modeling and work with entities’ representations persistently stored in a centrally located hub-and-spoke architecture. Such a collection that contains representations of the entities or objects gathered from the data sources for ETL processing is called a metadata repository and it can reside in memory[1] or be made persistent. By using a persistent metadata repository, ETL tools can transition from one-time projects to persistent middleware, performing data harmonization and data profiling consistently and in near-real time.

– More then colmunar databases i see probalistic databases : link:http://en.wikipedia.org/wiki/Probabilistic_database

probabilistic database is an uncertain database in which the possible worlds have associated probabilities. Probabilistic database management systems are currently an active area of research. “While there are currently no commercial probabilistic database systems, several research prototypes exist…”[1]

Probabilistic databases distinguish between the logical data model and the physical representation of the data much like relational databases do in the ANSI-SPARC Architecture. In probabilistic databases this is even more crucial since such databases have to represent very large numbers of possible worlds, often exponential in the size of one world (a classical database), succinctly.

For Bigdata analysis the software which is getting popular today is IBM big data analytics
I am writing about this too..already written some possible case study where and how to implement.
Understanding Big data PDF attached.
There are lot of other vendors which are also moving in products for cloud computing..in next release on SSIS hadoop feed will be available as source.
– Microstraegy and informatica already have it.
– this whole concept is based on mapreduce algorithm from google..There are online tutorials on mapreduce.(ppt attached)

Without a doubt, data analytics have a powerful new tool with the “map/reduce” development model, which has recently surged in popularity as open source solutions such as Hadoop have helped raise awareness.

Tool: You may be surprised to learn that the map/reduce pattern dates back to pioneering work in the 1980s which originally demonstrated the power of data parallel computing. Having proven its value to accelerate “time to insight,” map/reduce takes many forms and is now being offered in several competing frameworks.

If you are interested in adopting map/reduce within your organization, why not choose the easiest and best performing solution? ScaleOut StateServer’s in-memory data grid offers important advantages, such as industry-leading map/reduce performance and an extremely easy to use programming model that minimizes development time.

Here’s how ScaleOut map/reduce can give your data analysis the ideal map/reduce framework:

Industry-Leading Performance

  • ScaleOut StateServer’s in-memory data grids provide extremely fast data access for map/reduce. This avoids the overhead of staging data from disk and keeps the network from becoming a bottleneck.
  • ScaleOut StateServer eliminates unnecessary data motion by load-balancing the distributed data grid and accessing data in place. This gives your map/reduce consistently fast data access.
  • Automatic parallel speed-up takes full advantage of all servers, processors, and cores.
  • Integrated, easy-to-use APIs enable on-demand analytics; there’s no need to wait for batch jobs.