Cassandra Data modelling best practise
Cassandra DB clustering other configuration
Big data: noSQL: cassandra data modelling best practises
Big data: noSQL vendor cassandra architecture
Data Science – Analytics Case Study Retail, finance,security
Data Science relation to big data, and Analytic case study
Mining patterns using the ROC curve: use cases in economics and biometrics
Receiver operating characteristic (ROC) curves are used in data mining and machine learning. From the area under the ROC curve you can calculate the Gini coefficient. I have made an Excel template
example to show how it is calculated.
If AUC is the area under the curve, then
G = 2 × AUC − 1
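As a rough illustration of the formula above, here is a minimal Python sketch. It assumes the ROC curve is given as a list of (false positive rate, true positive rate) points and estimates the area with the trapezoid rule. The function names and the sample points are made up for illustration; they are not taken from the Excel template.

```python
def auc_trapezoid(roc_points):
    """Estimate the area under a ROC curve with the trapezoid rule.

    roc_points: iterable of (false_positive_rate, true_positive_rate) pairs.
    """
    pts = sorted(roc_points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Area of the trapezoid between two neighbouring ROC points.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def gini_from_auc(auc):
    """Gini coefficient from the area under the ROC curve: G = 2*AUC - 1."""
    return 2.0 * auc - 1.0


# Hypothetical ROC points, for illustration only.
sample_points = [(0.0, 0.0), (0.1, 0.6), (0.4, 0.9), (1.0, 1.0)]
sample_auc = auc_trapezoid(sample_points)
print("AUC:", round(sample_auc, 3), "Gini:", round(gini_from_auc(sample_auc), 3))
```

A classifier no better than chance has AUC = 0.5 and therefore G = 0, while a perfect classifier has AUC = 1 and G = 1.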
The Gini coefficient is one of the most watched coefficients in economics these days:
I wrote an article comparing different countries of the world using the available data.
The Gini coefficient derived from AUC has some component of noise, which has raised the question of better measures. Those used in machine learning include DeltaP or informedness and the Matthews correlation coefficient; each is suited to its own field. Informedness = 1 shows perfect performance, while −1 represents perverse or negative performance. In economics, a Gini coefficient of zero shows perfect equality.
So measures keep improving. There is no final result, and there cannot be one: as our understanding increases we arrive at better measures, and change is constant. What is truth today was mystery or magic to earlier generations and will be a kind of half-truth for the future. The subjects are interconnected all the same; the branching of knowledge areas has been going on for the last 250 years. Earlier there was no engineering, and in the time of Socrates everything came under philosophy. Socrates rightly said that you cannot say anything with absolute certainty. But you can make an informed decision, and that is what informedness quantifies: how informed your decisions are.
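To make the two measures mentioned above concrete, here is a small Python sketch that computes them from the four cells of a binary confusion matrix. It uses the standard definitions (informedness = true positive rate + true negative rate − 1, and the usual Matthews correlation coefficient formula). The function names and the counts are assumptions for illustration.

```python
import math


def informedness(tp, fp, fn, tn):
    """Informedness = sensitivity + specificity - 1, ranging from -1 to 1."""
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return true_positive_rate + true_negative_rate - 1.0


def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: report 0 when any row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts, for illustration only.
print(informedness(tp=40, fp=10, fn=10, tn=40))
print(matthews_corrcoef(tp=40, fp=10, fn=10, tn=40))
```

Both measures give 1 for a perfect classifier, 0 for one that guesses at random on balanced data, and −1 for one that is wrong every time.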
See a case from Biometrics:
Data Science, Master Data Management, Hadoop and Informatica
MDM: What does it do?
MDM seeks to ensure that an organization does not use multiple (potentially inconsistent) versions of the same master data in different parts of its operations, which can occur in large organizations. For example, CRM, DW/BI, Sales, Production, and Finance each have their own way of representing things.
There are a lot of products in the MDM space. Those with a good market presence include:
TIBCO Collaborative Information Manager, a leader among information collaboration tools:
– works to standardize master data across ERP, CRM, DW, and PLM
– cleansing and aggregation
– distributes ownership to the natural business users of the data (Sales, Logistics, Finance, HR, Publishing)
– automated business processes to collaborate on maintaining the information asset and the data governance policy
– built-in data models that can be extended (industry templates, validation rules)
– built-in processes to manage change, eliminating confusion and establishing a clear audit and governance trail for reporting
– syncs the relevant subset of information to downstream applications, trading partners, and exchanges; uses SOA to pass data as a web service to composite applications
IBM InfoSphere MDM Server
This list is still incomplete; I will continue to add to it.
Product detail (informatica.com)
Source: (http://www.biia.com/wp-content/uploads/2012/01/White-Paper-1601_big_data_wp.pdf)
Short notes below are taken from the source, plus my comments on them.
Informatica MDM capabilities:
Informatica 9.1 supplies master data management (MDM) and data quality technologies to
enable your organization to achieve better business outcomes by delivering authoritative, trusted data to business processes, applications, and analytics, regardless of the diversity or scope of Big
Data.
Single platform for all MDM architectural styles and data domains
Universal MDM capabilities
in Informatica 9.1 enable your organization to manage, consolidate, and reconcile all master
data, no matter its type or location, in a single, unified solution. Universal MDM is defined by four
characteristics:
• Multi-domain: Master data on customers, suppliers, products, assets, and locations can be managed, consolidated, and accessed.
• Multi-style: A flexible solution may be used in any style: registry, analytical, transactional, or
co-existence.
• Multi-deployment: The solution may be used as a single-instance hub, or in federated, cloud, or service architectures.
• Multi-use: The MDM solution interoperates seamlessly with data integration and data quality technologies as part of a single platform.
Universal MDM eliminates the risk of standalone, single MDM instances—in effect, a set of data silos meant to solve problems with other data silos.
• Flexibly adapt to different data architectures and changing business needs
• Start small in a single domain and extend the solution to other enterprise domains, using any style
• Cost-effectively reuse skill sets and data logic by repurposing the MDM solution
“No data is discarded anymore!
U.S. xPress leverages a large scale of transaction data and a diversity of interaction data, now extended
to perform big data processing like Hadoop with Informatica 9.1. We assess driver performance with image files and pick up
customer behaviors from texts by customer service reps. U.S. xPress saved millions of dollars per year by reducing fuels and optimizing
routes augmenting our enterprise data with sensor, meter, RFID tags, and geospatial data.” Tim Leonard, Chief Technology Officer, U.S. xPress
Source: Big Data Unleashed: Turning Big Data into Big Opportunities with the Informatica 9.1 Platform.
Reusable data quality policies across all project types
Interoperability among the MDM, data quality, and data integration capabilities in Informatica 9.1 ensures that data quality rules can
be reused and applied to all data throughout an implementation lifecycle, across both MDM and data integration projects (see Figure 3).
• Seamlessly and efficiently apply data quality rules regardless of project type, improving data accuracy
• Maximize reuse of skills and resources while increasing ROI on existing investments
• Centrally author, implement, and maintain data quality rules within source applications and propagate downstream
Proactive data quality assurance
Informatica 9.1 delivers technology that enables both business and IT users to proactively monitor and profile data as it becomes available, from
internal applications or external Big Data sources. You can continuously check for completeness, conformity, and anomalies and receive alerts via multiple channels when data quality issues are
found.
• Receive “early warnings” and proactively identify and correct data quality problems before they happen
• Prevent data quality problems from affecting downstream applications and business processes
• Shorten testing cycles by as much as 80 percent
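To give a feel for what checking for completeness, conformity, and anomalies can mean in practice, here is a minimal, product-independent Python sketch. It is not Informatica's API; the field names, the email pattern, and the amount range are assumptions chosen only for illustration.

```python
import re

# Deliberately simple pattern; real conformity rules would be stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_record(record, required_fields, amount_range):
    """Return a list of data quality issues found in one record."""
    issues = []

    # Completeness: every required field must be present and non-blank.
    for field in required_fields:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            issues.append("completeness: missing " + field)

    # Conformity: the email, if given, must match the expected format.
    email = record.get("email")
    if email and not EMAIL_PATTERN.match(email):
        issues.append("conformity: malformed email")

    # Anomaly: the amount, if given, must fall inside the expected range.
    low, high = amount_range
    amount = record.get("amount")
    if amount is not None and not (low <= amount <= high):
        issues.append("anomaly: amount out of range")

    return issues


# Hypothetical incoming record, for illustration only.
incoming = {"name": "", "email": "jones-at-example", "amount": 250000}
for issue in check_record(incoming, ["name", "email"], (0, 10000)):
    print(issue)
```

Running a check like this as records arrive, rather than after they are loaded, is what lets problems be flagged before they reach downstream applications.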
Putting Authoritative and Trustworthy Data to Work
The diversity and complexity of Big Data can worsen the data quality problems that exist in
many organizations. Standalone, ad hoc data quality tools are ill equipped to handle large-scale
streams from multiple sources and cannot generate the reliable, accurate data that enterprises
need. Bad data inevitably means bad business. In fact, according to a CIO Insight report, 46
percent of survey respondents say they’ve made an inaccurate business decision based on bad or
outdated data.
MDM and data quality are prerequisites for making the most of the Big Data opportunity. Here are
two examples:
• Using social media data to attract and retain customers: For some organizations, tapping
social media data to enrich customer profiles can be putting the cart before the horse. Many
companies lack a single, complete view of their customers, ranging from reliable and consistent
names and contact information to the products and services in place. Customer data is
often fragmented across CRM, ERP, marketing automation, service, and other applications.
Informatica 9.1 MDM and data quality enable you to build a complete customer profile from
multiple sources. With that authoritative view in place, you’re poised to augment it with the
intelligence you glean from social media.
• Data-driven response to business issues: Let’s say you’re a Fortune 500 manufacturer and
a supplier informs you that a part it sold you is faulty and needs to be replaced. You need
answers fast to critical questions: In which products did we use the faulty part? Which
customers bought those products and where are they? Do we have substitute parts in stock?
Do we have an alternate supplier?
But the answers are sprawled across multiple domains of your enterprise—your procurement
system, CRM, inventory, ERP, maybe others in multiple countries. How can you respond swiftly
and precisely to a problem that could escalate into a business crisis? Business issues often
span multiple domains, exerting a domino effect across the enterprise and confounding
an easy solution. Addressing them depends on seamlessly orchestrating interdependent
processes—and the data that drives them.
With the universal MDM capabilities in Informatica 9.1, our manufacturer could quickly locate
reliable, authoritative master data to answer its pressing business questions, regardless of
where the data resided or whether multiple MDM styles and deployments were in place.
Self-Service
Big Data’s value is limited if the business depends on IT to deliver it. Informatica 9.1 enables your
organization to go beyond business/IT collaboration to empower business analysts, data stewards,
and project owners to do more themselves without IT involvement, with the following capabilities:
Analysts and data stewards can assume a greater role in
defining specifications, promoting a better understanding of the data, and improving productivity
for business and IT.
• Empower business users to access data based on business terms and semantic metadata
• Accelerate data integration projects through reuse, automation, and collaboration
• Minimize errors and ensure consistency by accurately translating business requirements into
data integration mappings and quality rules
Application-aware accelerators for project owners:
These empower project owners to rapidly understand and access data for data
warehousing, data migration, test data management, and other projects. Project owners can
source business entities within applications instead of specifying individual tables that require
deep knowledge of the data models and relational schemas.
• Reduce data integration project delivery time
• Ensure data is complete and maintains referential integrity
• Adapt to meet business-specific and compliance requirements
Informatica 9.1 introduces complex event processing (CEP) technology into data quality and
integration monitoring to alert business users and IT of issues in real time. For instance, it will notify an analyst if a data quality key performance indicator exceeds a threshold, or if integration processes differ from the norm by a predefined percentage.
• Enable business users to define monitoring criteria by using prebuilt templates
• Alert business users on data quality and integration issues as they arise
• Identify and correct problems before they impact performance and operational systems
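The two alert conditions described above (a data quality KPI exceeding a threshold, and an integration process differing from the norm by a predefined percentage) can be sketched in a few lines of Python. This is only an illustration of the rules, not the product's CEP engine; the function names, metric names, and numbers are assumptions.

```python
def kpi_alert(kpi_name, value, threshold):
    """Return an alert message if a data quality KPI exceeds its threshold."""
    if value > threshold:
        return "ALERT: {} = {} exceeds threshold {}".format(kpi_name, value, threshold)
    return None


def deviation_alert(process_name, observed, baseline, max_deviation_pct):
    """Return an alert message if a process metric deviates from its baseline
    by more than the predefined percentage."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a percentage")
    deviation_pct = abs(observed - baseline) / abs(baseline) * 100.0
    if deviation_pct > max_deviation_pct:
        return "ALERT: {} deviates {:.1f}% from the norm".format(process_name, deviation_pct)
    return None


# Hypothetical readings, for illustration only.
alerts = [
    kpi_alert("duplicate_rate_pct", 7.5, 5.0),
    deviation_alert("nightly_load_minutes", 130, 100, 20),
]
for alert in alerts:
    if alert is not None:
        print(alert)
```

A real event-processing engine would evaluate rules like these continuously over a stream of events; the sketch only shows the comparison each rule makes.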
• Speeding and strengthening business effectiveness: Informatica 9.1 makes “MDM-aware”
everyday business applications such as Salesforce.com, Oracle, Siebel, SAP for CRM, ERP, and
others by presenting reconciled master data directly within those applications. For example,
Informatica’s MDM solution will advise a salesperson creating a new account for “John Jones”
that a customer named Jonathan Jones, with the same address, already exists. Through
the Salesforce interface, the user can access complete, reliable customer information that
Informatica MDM has consolidated from disparate applications.
She can see the products and services that John has in place and that he follows her
company’s Twitter tweets and is a Facebook fan. She has visibility into his household and
business relationships and can make relevant cross-sell offers. In both B2B and B2C scenarios,
MDM-aware applications spare the sales force from hunting for data or engaging IT while
substantially increasing productivity.
• Giving business users a hands-on role in data integration and quality: Long delays and
high costs are typical when the business attempts to communicate data specifications to
IT in spreadsheets. Part of the problem has been the lack of tools that promote business/IT
collaboration and make data integration and quality accessible to the business user.
As Big Data unfolds, Informatica 9.1 gives analysts and data stewards a hands-on role. Let’s
say your company has acquired a competitor and needs to migrate and merge new Big Data
into your operational systems. A data steward can browse a data quality scorecard and identify
anomalies in how certain customers were identified and share a sample specification with IT.
Once validated, the steward can propagate the specification across affected applications. A
role-based interface also enables the steward to view data integration logic in semantic terms
and create data integration mappings that can be readily understood and reused by other
business users or IT.
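The "John Jones" versus "Jonathan Jones" example earlier in this section is essentially a record-matching problem. The Python sketch below shows one simple way such a check could work, assuming a match means the same normalized address plus a similar name. It uses difflib from the standard library; the threshold, field names, and records are assumptions, and this is not how Informatica MDM actually matches records.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def name_similarity(name_a, name_b):
    """Similarity between two names as a ratio from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()


def find_probable_duplicates(new_record, master_records, min_name_similarity=0.7):
    """Return master records with the same address and a similar name."""
    matches = []
    for existing in master_records:
        same_address = normalize(existing["address"]) == normalize(new_record["address"])
        similar_name = name_similarity(existing["name"], new_record["name"]) >= min_name_similarity
        if same_address and similar_name:
            matches.append(existing)
    return matches


# Hypothetical master data, for illustration only.
master = [
    {"name": "Jonathan Jones", "address": "12 High Street"},
    {"name": "Mary Smith", "address": "4 Mill Lane"},
]
new_account = {"name": "John Jones", "address": "12  high street"}
for match in find_probable_duplicates(new_account, master):
    print("Possible existing customer:", match["name"])
```

The 0.7 cut-off is arbitrary; a production matcher would tune it and usually combine several attributes rather than relying on name and address alone.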
Collaboration Management System: Relation to Analytics and Data Science
An integrated collaboration offering (coarse-grained integration) using integration tools like TIBCO and Oracle BPEL. Components to be integrated:
1. Content management system (CMS) such as SharePoint, Joomla, or Drupal, and
2. Document management system such as Liferay, Documentum, or IBM FileNet, which can be integrated using flexible integration tools.
3. Communication platform such as Windows Communication Foundation or IBM Lotus Notes, integrated with the mail client and with social networks through the Facebook API, LinkedIn API, Twitter API, and Skype API, both as direct plug-ins and for data analysis of the unstructured social networking data captured from the project's collaboration discussions.
Soft-phone using Skype, offering a conversation recording facility for later use.
http://sandyclassic.wordpress.com/2013/06/19/how-to-do-social-media-analysis/
Oracle WebCenter:
http://sandyclassic.wordpress.com/2011/11/04/new-social-computing-war-oracle-web-centre/
4. Project-specific wiki/SharePoint/other CMS pages integrated with PMO site artefacts and Enterprise Architecture artefacts.
5. Seamless integration with enterprise search using Endeca or Microsoft FAST for discovery of documents, information, and answers from an indexed, tagged repository of data.
6. Structured and unstructured data: unstructured data is hosted on Hadoop clusters, using MapReduce to analyse it, Hive and HBase to consolidate it, and the data mining library in Mahout to discover hidden information.
Structured data is kept in RDBMS clusters such as Oracle RAC (Real Application Clusters).
http://sandyclassic.wordpress.com/2011/10/19/hadoop-its-relation-to-new-architecture-enterprise-datawarehouse/
http://sandyclassic.wordpress.com/2013/07/02/data-warehousing-business-intelligence-and-cloud-computing/
7. The communication, collaboration, discovery, and search layers integrated with domain-specific enterprise resource planning (ERP) packages.
8. All integrated with a mashup architecture providing real-time information maps of where resources are located and information on the nearest help.
9. Messaging and communication layer integrated with all online company software.
10. Process orchestration and integration using a business process management (BPM) tool such as PEGA BPM, JBoss BPM, or Windows Workflow Foundation, depending on the landscape used.
11. Private cloud integration using Oracle Cloud, Microsoft Azure, Eucalyptus, or OpenNebula, integrated with web APIs and the rest of the web platform landscape.
http://sandyclassic.wordpress.com/2011/10/20/infrastructure-as-service-iaas-offerings-and-tools-in-market-trends/
12. Integrated BI system with real-time information access through tools like TIBCO Spotfire, which can analyse real-time data flowing between the integrated systems.
Data centre APIs and virtualisation platforms can also feed data into the Hadoop cluster for analysis.
External links for reference:
http://www.sap.com/index.epx
http://www.oracle.com
http://www.tibco.com/
http://spotfire.tibco.com/
http://scn.sap.com/thread/1228659
SAP XI: http://help.sap.com/saphelp_nw04/helpdata/en/9b/821140d72dc442e10000000a1550b0/content.htm
Oracle WebCenter: http://www.oracle.com/technetwork/middleware/webcenter/suite/overview/index.html
CMS: http://www.joomla.org/, http://www.liferay.com/, http://www-03.ibm.com/software/products/us/en/filecontmana/
Hadoop: http://hadoop.apache.org/
Map reduce: http://hadoop.apache.org/docs/stable/mapred_tutorial.html
facebook API: https://developers.facebook.com/docs/reference/apis/
Linkedin API: http://developer.linkedin.com/apis
Twitter API: https://dev.twitter.com/