Unit 5
Data Warehousing and Data Mining
(Data Visualization and Overall Perspective)
Aggregation
Data aggregation is the process of collecting information from different sources and presenting it in a summarized format so that business analysts can perform statistical analyses of business schemes. The information gathered from the various data sources is condensed into a summary suitable for data analysis.
Data aggregation plays a vital role in finance, product, operations, and marketing strategies in any business organization. Aggregated data held in the data warehouse makes it possible to answer analytical queries over large data sets efficiently, as the small sketch below illustrates.
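Below is a minimal sketch of aggregation using pandas; the sales data, column names, and chosen statistics are hypothetical illustrations, not a prescribed schema.

```python
import pandas as pd

# Hypothetical detail rows collected from several sources.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "amount":  [120.0, 75.5, 200.0, 50.0, 310.0],
})

# Aggregation: many detail rows become one summary row per region.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)
```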
How Data Aggregation Works
Historical information
Historical data, in a broad context,
is data collected
about past events and circumstances pertaining to a particular subject.
By definition, historical data includes
most data generated either manually or automatically within an
enterprise.
OLAP Functions and Tools
- The OLAP server collects data from multiple data sources, including relational databases and data warehouses.
- Extract, transform, and load (ETL) tools then clean, aggregate, precalculate, and store the data in an OLAP cube according to the number of dimensions specified (a rough sketch of this idea follows the lists below).
- Business analysts use OLAP tools to query and generate reports from the multidimensional data in the OLAP cube.

OLAP tools typically offer:
- an intuitive, easy-to-use interface;
- support for complex calculations;
- a multidimensional view of the data;
- time intelligence.
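As a rough illustration of the cube idea, the sketch below precomputes aggregates over two hypothetical dimensions with pandas; a real OLAP server does far more, so treat this only as an analogy.

```python
import pandas as pd

# Hypothetical fact table: two dimensions (year, region), one measure (sales).
facts = pd.DataFrame({
    "year":   [2022, 2022, 2023, 2023, 2023],
    "region": ["East", "West", "East", "West", "West"],
    "sales":  [100, 150, 120, 180, 60],
})

# Precompute an aggregate "cube" over the dimensions, roughly what the
# ETL step does before loading an OLAP cube.
cube = facts.pivot_table(values="sales", index="year",
                         columns="region", aggfunc="sum", fill_value=0)
print(cube)            # the full two-dimensional view
print(cube.loc[2023])  # a "slice" along the year dimension
```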
Important OLAP requirements include:
1. Transparency: The technology, the underlying information repository, the computing operations, and the heterogeneous nature of the source data should be totally transparent to users. Such transparency helps to improve the efficiency and productivity of the users.
2. Accessibility: The tool should provide access only to the data actually required to perform the particular analysis, and present a single, coherent, and consistent view to clients. The OLAP operations should sit between data sources (e.g., data warehouses) and an OLAP front end.
3. Consistent Reporting Performance: Users should not experience any significant degradation in reporting performance as the number of dimensions or the size of the database increases.
OLAP Servers
Online Analytical Processing (OLAP) refers to a set of software tools used for data analysis in order to make business decisions. OLAP provides a platform for gaining insights from data retrieved from multiple database systems at the same time. Many Business Intelligence (BI) applications rely on OLAP technology.
Types of OLAP servers:
The three major types of OLAP servers are as follows:
- ROLAP
- MOLAP
- HOLAP
Relational OLAP (ROLAP):
Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a relational database, where both the base data and the dimension tables are stored as relational tables. ROLAP servers bridge the gap between the relational back-end server and the client's front-end tools: they store and manage warehouse data using an RDBMS, and OLAP middleware fills in the gaps. A small relational sketch appears after the lists below.
Benefits:
- It is compatible with data warehouses and OLTP systems.
- The data size limit of ROLAP is that of the underlying RDBMS, so ROLAP itself does not restrict the amount of data that can be stored.
Limitations:
- SQL functionality is constrained.
- It is difficult to keep aggregate tables up to date.
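The sketch below shows the ROLAP idea in miniature with Python's built-in sqlite3 module: dimensions and facts live in relational tables, and an analytical question becomes a join plus GROUP BY. The schema and data are hypothetical.

```python
import sqlite3

# A tiny star schema: one dimension table, one fact table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO fact_sales  VALUES (1, 9.99), (1, 4.50), (2, 59.00);
""")

# Aggregate the facts by a dimension attribute.
query = """
    SELECT d.category, SUM(f.amount) AS total
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category
"""
for category, total in con.execute(query):
    print(category, total)
```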
Multidimensional OLAP (MOLAP):
Multidimensional On-Line Analytical Processing (MOLAP) supports multidimensional views of data. Storage utilization in multidimensional data stores may be low if the data set is sparse.
MOLAP relies on the random-access capability of arrays: dimension instances determine the array elements, and the data or measured value associated with each cell is typically stored in the corresponding array element. The multidimensional array is typically stored in MOLAP in a linearized form. An array-based sketch appears after the lists below.
Benefits:
- Suitable for slicing and dicing operations.
- Outperforms ROLAP when data is dense.
- Capable of performing complex calculations.
Limitations:
- It is difficult to change the dimensions without re-aggregating.
- Since all calculations are performed when the cube is built, a large amount of data cannot be stored in the cube itself.
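Since MOLAP stores the cube as a multidimensional array, NumPy gives a convenient way to picture it. The dimensions and sizes below are hypothetical.

```python
import numpy as np

# A toy cube: 2 years x 3 regions x 4 quarters, one measure per cell.
cube = np.arange(24, dtype=float).reshape(2, 3, 4)

slice_year1 = cube[1]                   # "slice": fix one dimension
dice = cube[:, 0:2, 0:2]                # "dice": sub-ranges on several dimensions
rollup_by_year = cube.sum(axis=(1, 2))  # aggregate away two dimensions

# Internally the array is held in linearized form; ravel() exposes it.
print(rollup_by_year, cube.ravel()[:6])
```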
Hybrid OLAP (HOLAP):
Hybrid On-Line Analytical Processing (HOLAP) combines ROLAP and MOLAP, offering greater scalability than ROLAP and faster computation than MOLAP. HOLAP servers are capable of storing large amounts of detailed data: on the one hand they benefit from ROLAP's greater scalability, and on the other they make use of cube technology for faster performance on summary-type information. Because detailed data is stored in a relational database, the cubes are smaller than in MOLAP.
Benefits:
- HOLAP combines the benefits of MOLAP and ROLAP.
- It provides quick access at all aggregation levels.
Limitations:
- Because it supports both MOLAP and ROLAP servers, the HOLAP architecture is extremely complex.
- There is a greater likelihood of overlap, particularly between their functionalities.
Data Mining Interface
A data mining interface facilitates the exploration and extraction of actionable insights from large datasets using data mining techniques.
1. Query Interface:
- Provides users with tools to define and execute data mining queries against the data warehouse.
- Supports various query languages or graphical interfaces for defining mining tasks.
2. Data Exploration Tools:
- Enables users to explore data visually and interactively to identify patterns, trends, and anomalies.
- Includes features such as data visualization, clustering, classification, and association rule discovery.
3. Model Building and Evaluation:
- Allows users to build predictive models using machine learning algorithms and evaluate their performance.
- Provides tools for model training, testing, and validation using techniques like cross-validation (a minimal sketch appears after this list).
4. Integration with BI Tools:
- Integrates with business intelligence (BI) tools and dashboards to visualize and present data mining results.
- Enables users to incorporate predictive insights into decision-making processes.
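To make the model-building step concrete, here is a minimal sketch with scikit-learn on a bundled toy dataset; the model choice and parameters are arbitrary illustrations, not a prescribed toolchain.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Build a predictive model and estimate its accuracy with 5-fold
# cross-validation, as described above.
X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```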
Security
Data warehouse security is crucial for protecting sensitive information and ensuring compliance with regulatory requirements. Here are key security measures:
1. Access Control:
- Implement role-based access control (RBAC) to restrict access to data based on users' roles and responsibilities.
- Enforce strong authentication mechanisms, such as multi-factor authentication (MFA), to prevent unauthorized access.
2. Data Encryption:
- Encrypt data at rest and in transit to prevent unauthorized access or interception.
- Use encryption techniques such as SSL/TLS for network communication and encryption algorithms for data storage.
3. Auditing and Monitoring:
- Implement auditing and logging mechanisms to track user activities and changes to data.
- Monitor access patterns and detect suspicious behaviour to prevent security breaches.
4. Data Masking and Anonymization:
- Mask sensitive data to anonymize personally identifiable information (PII) and protect privacy.
- Replace sensitive data with pseudonymized or randomized values to ensure confidentiality (a small sketch follows this list).
5. Compliance and Governance:
- Ensure compliance with regulations such as GDPR, HIPAA, and PCI-DSS by implementing data governance policies and controls.
- Conduct regular security assessments and audits to identify vulnerabilities and ensure adherence to security standards.
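As one possible pseudonymization approach, the sketch below replaces a PII value with a keyed hash (HMAC-SHA256). The key name and truncation length are hypothetical choices; the point is that equal inputs map to equal tokens, so joins still work, while the original value cannot be recovered without the key.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-regularly"  # hypothetical key, stored outside the warehouse

def pseudonymize(value: str) -> str:
    """Replace a sensitive value with a stable, keyed pseudonym."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

print(pseudonymize("alice@example.com"))  # a token is stored instead of the email
```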
Backup and Recovery
Backup and recovery processes are essential for data warehouse reliability and resilience. Here is how they are managed:
1. Regular Backups:
- Schedule regular backups of the data warehouse to ensure data availability in case of data loss or corruption.
- Implement full, incremental, or differential backup strategies based on recovery requirements.
2. Redundant Storage:
- Store backup copies of data in redundant storage locations, such as cloud storage or off-site data centers.
- Ensure data redundancy and fault tolerance to mitigate the risk of data loss due to hardware failures or disasters.
3. Point-in-Time Recovery:
- Maintain transaction logs or incremental backups to facilitate point-in-time recovery to a specific moment in the past (a toy illustration follows this list).
- Enable rollback or recovery to restore the data warehouse to a consistent state after data corruption or accidental changes.
4. Disaster Recovery Planning:
- Develop and test disaster recovery plans to ensure business continuity in the event of catastrophic failures or natural disasters.
- Establish procedures for failover, data restoration, and system recovery to minimize downtime and data loss.
5. Automated Backup Solutions:
- Use automated backup solutions and backup scheduling tools to streamline backup and recovery processes.
- Monitor backup jobs and receive alerts for any failures or anomalies to ensure timely resolution.
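The toy sketch below illustrates the point-in-time recovery idea from item 3: start from the last full backup and replay the transaction log up to, but not past, the target moment. The log format is hypothetical.

```python
# State captured at the last full backup.
backup = {"balance": 100}

# Transaction log entries: (timestamp, key, new_value), in time order.
log = [
    (1, "balance", 120),
    (2, "balance", 90),
    (3, "balance", 0),   # accidental bad update we want to undo
]

def restore(backup, log, until_ts):
    """Rebuild the state as it was at time until_ts."""
    state = dict(backup)
    for ts, key, value in log:
        if ts > until_ts:
            break        # stop before the unwanted change
        state[key] = value
    return state

print(restore(backup, log, until_ts=2))  # {'balance': 90}
```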
Tuning the Data Warehouse
The data warehouse can be tuned to improve performance. Data loading is the entry point into the system, and it provides the first opportunity to improve performance: if checks are performed on the data, either before or after it is loaded, they have a direct effect on the capacity and performance of the system.
For example, if the data consists of telephone call records, each call can be checked for a valid customer identifier; if the data is sales information, each sale can be checked for a valid product identifier, as in the sketch below.
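A minimal version of such a pre-load check might look like the following; the field names and the set of valid identifiers are hypothetical.

```python
# Records are loaded only if their customer identifier is known;
# the rest are routed to error handling, as described above.
valid_customers = {"C001", "C002", "C003"}

records = [
    {"customer_id": "C001", "duration": 120},
    {"customer_id": "C999", "duration": 45},   # unknown customer
]

clean, rejected = [], []
for rec in records:
    (clean if rec["customer_id"] in valid_customers else rejected).append(rec)

print(len(clean), "records to load;", len(rejected), "rejected")
```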
- Tune the business rules.
- Tune the data design.
- Tune the application design.
- Tune the logical structure of the database.
- Tune the database operations.
- Tune the access paths.
- Tune I/O and physical structure.
- Tune resource contention.
- Tune the underlying platform.
Testing the Data Warehouse
Testing is very important for data warehouse systems: it validates the data and makes the system work correctly and efficiently. There are three basic levels of testing performed on a data warehouse, as follows (a small unit-test sketch appears after the list):
- Unit Testing – This type of testing is performed at the developer's end. In unit testing, each unit/component of the modules is tested separately: each module of the whole data warehouse, i.e. every program, SQL script, procedure, and Unix shell script, is validated and tested.
- Integration Testing – In this type of testing, the various individual units/modules of the application are brought together, or combined, and then tested against a number of inputs. It is performed to detect faults in integrated modules and to test whether the various components perform well after integration.
- System Testing – System testing validates and tests the whole data warehouse application. It is performed by the technical testing team after the developers' team has completed unit testing, and its main purpose is to check whether the entire system works correctly as a whole.
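As an illustration of unit testing a single transformation, the sketch below tests a hypothetical phone-number normalizer; it can be run with pytest.

```python
def normalize_phone(raw: str) -> str:
    """Strip punctuation so phone numbers compare consistently (example rule)."""
    return "".join(ch for ch in raw if ch.isdigit())

def test_normalize_phone():
    # Each ETL unit is validated in isolation against known inputs.
    assert normalize_phone("(040) 123-4567") == "0401234567"
    assert normalize_phone("0401234567") == "0401234567"
```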
Challenges of data warehouse testing are:
- Data selection from multiple sources, and the analysis that follows, pose a great challenge.
- Because of the volume and complexity of the data, certain testing strategies are time-consuming.
- ETL testing requires strong SQL (including Hive SQL) skills, which poses challenges for testers with limited SQL skills.
- Redundant data in the data warehouse.
- Inconsistent and inaccurate reports.
ETL testing is performed in five stages:
- Identifying data sources and requirements.
- Data acquisition.
- Implementing business logic and dimensional modeling.
- Building and populating data.
- Building reports.
Warehousing Applications and Recent Trends
#1. Green Cloud Data Warehousing
Cloud data warehousing has increasingly become a part of energy-reduction plans for businesses. It is encouraging to see a green data warehousing trend making its way onto the scene while we are tackling climate change.
Cloud data centres operate with energy efficiencies well above industry averages. In 2021, companies in the EU reported an 80% energy consumption reduction following the migration of their enterprise data to SaaS storage providers.
For a marketing agency, switching to a green data warehousing solution such as a cloud data warehouse can bring multiple benefits beyond energy savings: it is beneficial for the planet, and it reduces the carbon footprint of the agency.
#2. Outsourcing Data Management
Businesses are increasingly outsourcing data-management work, including:
- Data warehousing and automation,
- Encoding,
- Compilation,
- Auditing,
- System architecture design, and more.
#3. Data Warehousing AI Solutions
As the volume of data being processed is forecast to spike, businesses increasingly need to offload data operations to faster, machine-learning-enabled AI systems.
With trend and pattern analysis capabilities improving by leaps and
bounds, businesses that integrate AI into their data warehousing solutions can
reduce their operational costs and perform data operations more efficiently.
#4. Virtual Data Warehousing
Virtual data warehouses are sets of separate databases that can be
queried simultaneously by means of middleware. Virtual data warehousing is
trending because it’s cost-effective and can be deployed faster than physical
solutions.
By forgoing physical data replication, virtualization can improve operating speeds and reduce operating costs by a factor of as much as four.
#5. In-Memory Computing
In-memory computing uses clusters of servers to pool total available RAM
and CPU power. This design distributes data handling and processing
tasks across the cluster for radically enhanced speed and scalability.
#6. In-Database Analytics
In-database analytics refers to a method of analysis that processes data within its storage site, a database or data warehouse. In-database analytics is built into the storage architecture and removes the need to transfer data to a separate analytics application.
Types of Warehousing Applications
Host-Based Data Warehouses
- Host-based mainframe warehouses reside on a high-volume database, supported by robust and reliable high-capacity structures such as IBM System/390, UNISYS, and Data General Sequent systems, and by databases such as Sybase, Oracle, Informix, and DB2.
- Host-based LAN data warehouses, where data delivery can be handled either centrally or from the workgroup environment. The size of the data warehouse database depends on the platform.
Before embarking on designing, building, and implementing such a warehouse, the following points should be considered:
- Such databases generally have very high volumes of data storage.
- Such warehouses may require support for both MVS-based and client-based report and query facilities.
- These warehouses have complicated source systems.
- Such systems need continuous maintenance, since they must also be used for mission-critical objectives.
To make building such data warehouses successful, the following phases are generally followed:
- Unload Phase: selecting and scrubbing the operational data.
- Transform Phase: translating the data into an appropriate form and describing the rules for accessing and storing it.
- Load Phase: moving the records directly into DB2 tables, or into a particular file for moving them into another database or a non-MVS warehouse.
Host-Based (UNIX) Data Warehouses
Oracle and Informix RDBMSs provide the facilities for such data warehouses. Both of these databases can extract information from MVS-based databases as well as a large number of other UNIX-based databases. These warehouses follow the same stages as the host-based MVS data warehouses. Data from different network servers can also be used, since file-attribute consistency is common across the inter-network.
LAN-Based Workgroup Data Warehouses
A LAN-based workgroup warehouse is an integrated structure for building and maintaining a data warehouse in a LAN environment. Such a warehouse can extract information from a variety of sources and support multiple LAN-based warehouses; commonly chosen warehouse databases include the DB2 family, Oracle, Sybase, and Informix. Other databases that can also be included, though infrequently, are IMS, VSAM, flat files, MVS, and VH.
Host-Based Single-Stage (LAN) Data Warehouses
In a LAN-based data warehouse, data delivery can be handled either centrally or from the workgroup environment, so business groups can process the data they need without burdening centralized IT resources, enjoying the autonomy of their own data mart without compromising overall data integrity and security in the enterprise.
Web Mining
Web mining is the application of data mining techniques to automatically discover and extract information from web documents and services. The main purpose of web mining is to discover useful information from the World Wide Web and its usage patterns. Combining the required methods from data mining, machine learning, artificial intelligence, statistics, and information retrieval, web mining is an interdisciplinary field. Analysing user behaviour and website traffic is one basic example of web mining (a small usage-mining sketch appears after the list of applications below).
Applications
of Web Mining
Web mining is
the process of discovering patterns, structures, and relationships in web data.
It involves using data mining techniques to analyse web data and extract
valuable insights. The applications of web mining are wide-ranging and include:
- E-commerce: Web mining can be used to analyse customer
behaviour on e-commerce websites. This information can be used to improve
the user experience and increase sales by recommending products based on
customer preferences.
- Search engine optimization: Web mining can be used to analyse search
engine queries and search engine results pages (SERPs).
- Fraud detection: Web mining can be used to detect fraudulent
activity on websites. This information can be used to prevent financial
fraud, identity theft, and other types of online fraud.
- Sentiment analysis: Web mining can be used to analyse social
media data and extract sentiment from posts, comments, and reviews.
- Web content analysis: Web mining can be used to analyse web content
and extract valuable information such as keywords, topics, and themes.
- Customer service: Web mining can be used to analyse customer
service interactions on websites and social media platforms.
- Healthcare: Web mining can be used to analyse
health-related websites and extract valuable information about diseases,
treatments, and medications.
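As a tiny example of the usage mining mentioned above, the sketch below counts the most-visited pages from hypothetical access-log lines in Common Log Format.

```python
from collections import Counter

log_lines = [
    '1.2.3.4 - - [01/Jan/2024] "GET /home HTTP/1.1" 200 512',
    '1.2.3.4 - - [01/Jan/2024] "GET /products HTTP/1.1" 200 1024',
    '5.6.7.8 - - [01/Jan/2024] "GET /home HTTP/1.1" 200 512',
]

# The request line sits between the first pair of double quotes;
# its second token is the requested path.
pages = Counter(line.split('"')[1].split()[1] for line in log_lines)
print(pages.most_common(2))  # [('/home', 2), ('/products', 1)]
```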
Process of Web Mining
Web mining generally proceeds in four steps: resource finding (collecting web documents or logs), information selection and preprocessing (cleaning and transforming the raw data), generalization (discovering patterns with data mining or machine learning techniques), and analysis (validating and interpreting the discovered patterns).
Spatial Mining and Temporal Mining
Spatial data mining is the process of discovering interesting, previously unknown, but potentially useful patterns from spatial databases. In spatial data mining, analysts use geographical or spatial information to produce business intelligence or other results. Challenges in spatial data mining include identifying patterns or finding objects that are relevant to the research project. A toy hotspot sketch follows this paragraph.
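The sketch below hints at one simple spatial technique, hotspot detection by grid binning; the coordinates and grid size are hypothetical.

```python
from collections import Counter

# Hypothetical (latitude, longitude) observations.
points = [(52.51, 13.40), (52.52, 13.41), (52.51, 13.39), (48.85, 2.35)]

def cell(lat, lon):
    # Round to one decimal degree: a coarse grid of roughly 10 km cells.
    return (round(lat, 1), round(lon, 1))

counts = Counter(cell(lat, lon) for lat, lon in points)
print(counts.most_common(1))  # the densest cell is a candidate hotspot
```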
Advantages of Spatial Data Mining
- Insight into geographical patterns: spatial statistics enables organizations and researchers to identify area-related trends and features that would otherwise remain undetected.
- Better decision making: spatial data mining supports wiser decisions in areas such as urban planning, environmental management, and logistics resource allocation.
- Enhanced visualization: data collected at different spatial levels can be presented on maps, giving a better view of trends and patterns.
Disadvantages of Spatial Data Mining
- Complexity of data: the raw data may be huge and intricate, requiring complex algorithms and substantial computing facilities for processing.
- Data inaccuracy: mistakes can be made while collecting the data, such as wrong geographical coordinates for a place, and wrong conclusions can be drawn as a result.
Temporal Data Mining
Temporal data mining refers to the extraction of simple, non-trivial, and potentially useful abstract information from large collections of temporal data. It is concerned with the analysis of temporal data and with finding temporal patterns and regularities in sets of temporal data. The tasks of temporal data mining are:
- Data Characterization and Comparison
- Cluster Analysis
- Classification
- Association rules
- Prediction and Trend Analysis
- Pattern Analysis
Advantages of Temporal Data Mining
- Trend identification: it helps identify temporal factors such as seasonality, cycles, and trends, as well as long-term shifts.
- Forecasting: it can be used predictively to help an organization prepare for future outcomes or trends.
- Anomaly detection: temporal data mining can identify sharp changes in otherwise regular patterns, which may point to either an event or a problem (see the sketch after this list).
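A small sketch of the anomaly-detection idea: flag points that deviate sharply from a rolling mean. The series and threshold are synthetic examples.

```python
import pandas as pd

s = pd.Series([10, 11, 10, 12, 11, 40, 11, 10])  # one obvious spike

# Compare each point to the local (3-point, centered) rolling mean.
rolling = s.rolling(window=3, center=True).mean()
anomalies = s[(s - rolling).abs() > 10]
print(anomalies)  # flags index 5 (value 40)
```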
Disadvantages of Temporal Data Mining
- Handling data complexity: time-series data often has dependencies between time points, which makes analysis very difficult.
- Requires large historical data: temporal data mining may need a large amount of historical data to find useful patterns, which is not always available.
Difference Between Spatial and Temporal Data Mining
| Spatial data mining | Temporal data mining |
| --- | --- |
| It requires space. | It requires time. |
| Spatial mining is the extraction of knowledge, spatial relationships, and interesting measures that are not explicitly stored in the spatial database. | Temporal mining is the extraction of knowledge about the occurrence of events, whether they follow cyclic, random, or seasonal variations, etc. |
| It deals with spatial (location, geo-referenced) data. | It deals with implicit or explicit temporal content, from large quantities of data. |
| Spatial databases store spatial objects represented by spatial data types and the spatial associations among such objects. | Temporal data mining comprises the subject itself as well as its utilization across fields. |
| It includes finding characteristic rules, discriminant rules, association rules, evaluation rules, etc. | It aims at mining new and unknown knowledge that takes the temporal aspects of the data into account. |
| It is a method of identifying unusual and unexplored, but useful, patterns from spatial databases. | It extracts useful knowledge from temporal data. |
| Examples: determining hotspots, unusual locations. | Example: the association rule "any person who buys a car also buys a steering lock" becomes, with a temporal aspect, "any person who buys a car also buys a steering lock after that". |