Unit 5: Data Warehousing and Data Mining


(Data Visualization and Overall Perspective)

 

Aggregation

Data aggregation is the process of collecting information from different sources and presenting it in a summarized format so that business analysts can perform statistical analyses of business schemes. Information gathered from various data sources is condensed into a summary suitable for data analysis.

Data aggregation plays a vital role in finance, product, operations, and marketing strategies in any business organization. Aggregated data stored in the data warehouse enables analysts to answer a wide range of queries over the underlying data sets.

How Data Aggregation Works

Data aggregation is needed when a dataset contains information that is not useful for analysis in its raw form. During aggregation, the datasets are summarized into significant information, which helps attain desirable outcomes and improves the user experience. Data aggregation produces standard measures such as sum, average, and count. The collected, summarized data helps business analysts perform demographic studies of customers and their behaviour.
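The sum/average/count measures described above can be sketched in a few lines of Python. The records, segment names, and amounts below are purely illustrative:

```python
# Minimal sketch of data aggregation: summarizing raw records into
# sum, average, and count per customer segment (illustrative data).
from collections import defaultdict
from statistics import mean

# Hypothetical raw records gathered from different sources
records = [
    {"segment": "retail", "amount": 120.0},
    {"segment": "retail", "amount": 80.0},
    {"segment": "wholesale", "amount": 500.0},
]

def aggregate(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row["segment"]].append(row["amount"])
    # Summarize each group into the common aggregate measures
    return {
        seg: {"sum": sum(vals), "avg": mean(vals), "count": len(vals)}
        for seg, vals in groups.items()
    }

summary = aggregate(records)
print(summary["retail"])  # {'sum': 200.0, 'avg': 100.0, 'count': 2}
```

In practice this grouping is done by the warehouse engine (e.g. SQL `GROUP BY`), but the principle is the same.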

Historical information

Historical data, in a broad context, is data collected about past events and circumstances pertaining to a particular subject.

By definition, historical data includes most data generated either manually or automatically within an enterprise. 

 

OLAP Functions and Tools

·        The OLAP server collects data from multiple data sources, including relational databases and data warehouses.

·        Then, extract, transform, and load (ETL) tools clean, aggregate, precalculate, and store the data in an OLAP cube according to the number of dimensions specified.

·        Business analysts use OLAP tools to query and generate reports from the multidimensional data in the OLAP cube.

·        OLAP tools have an intuitive, easy-to-use interface;

·        OLAP supports complex calculations;

·        Data is presented in a multidimensional manner;

·        Time intelligence is supported.
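The multidimensional view mentioned above can be sketched with a toy cube: here just a Python mapping from (product, region, quarter) coordinates to a sales measure, with a slice and a roll-up operation (all names and numbers are illustrative):

```python
# Sketch: an OLAP cube as a mapping from dimension coordinates to a
# measure, plus two classic operations: slice and roll-up.
cube = {
    ("laptop", "north", "Q1"): 10,
    ("laptop", "south", "Q1"): 7,
    ("phone",  "north", "Q1"): 5,
    ("laptop", "north", "Q2"): 12,
}

def slice_(cube, quarter):
    # Slice: fix one dimension (quarter) and keep the remaining two
    return {(p, r): v for (p, r, q), v in cube.items() if q == quarter}

def rollup_product(cube):
    # Roll-up: aggregate away region and quarter, keeping product totals
    totals = {}
    for (p, _, _), v in cube.items():
        totals[p] = totals.get(p, 0) + v
    return totals

print(slice_(cube, "Q1"))
print(rollup_product(cube))  # {'laptop': 29, 'phone': 5}
```

A real OLAP server precomputes many such aggregates so that queries at any aggregation level return quickly.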

1. Transparency: Make the technology, underlying information repository, computing operations, and the dissimilar nature of source data totally transparent to users. Such transparency helps to improve the efficiency and productivity of the users.

2. Accessibility: It provides access only to the data that is actually required to perform the particular analysis, presenting a single, coherent, and consistent view to clients. The OLAP layer should sit between the data sources (e.g., data warehouses) and the OLAP front end.

3. Consistent Reporting Performance: To make sure that the users do not feel any significant degradation in reporting performance as the number of dimensions or the size of the database increases.

OLAP Server

Online Analytical Processing (OLAP) refers to a set of software tools used for data analysis in order to make business decisions. OLAP provides a platform for gaining insights from data retrieved from multiple database systems at the same time. Many business intelligence (BI) applications rely on OLAP technology.

Type of OLAP servers:

The three major types of OLAP servers are as follows:

·         ROLAP

·         MOLAP 

·         HOLAP

Relational OLAP (ROLAP):

Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a relational database, where both the base data and dimension tables are stored as relational tables. ROLAP servers bridge the gap between the relational back-end server and the client’s front-end tools: they store and manage warehouse data using an RDBMS, with OLAP middleware filling in the gaps.

Benefits:

·         It is compatible with data warehouses and OLTP systems.

·         The data size limitation of ROLAP technology is determined by the underlying RDBMS. As a result, ROLAP does not limit the amount of data that can be stored.

Limitations:

·         SQL functionality is constrained.

·         It’s difficult to keep aggregate tables up to date.



Multidimensional OLAP (MOLAP):

Multidimensional On-Line Analytical Processing (MOLAP) supports multidimensional views of data. Storage utilization in multidimensional data stores may be low if the data set is sparse.

It is based on the random-access capability of arrays. Dimension instances determine array elements, and the data or measured value associated with each cell is typically stored in the corresponding array element. In MOLAP, the multidimensional array is typically stored in a linearized form.
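The linearization just described can be made concrete: each cell's dimension indices map to a single offset in a flat array. The sketch below assumes row-major order (as in C); a MOLAP engine may use a different layout:

```python
# Sketch of how a MOLAP store can linearize a multidimensional array:
# a cell's dimension indices map to one offset in a flat array
# (row-major order assumed here).
dims = (4, 3, 2)  # e.g. 4 products x 3 regions x 2 quarters

def linear_offset(indices, dims):
    offset = 0
    for idx, size in zip(indices, dims):
        offset = offset * size + idx
    return offset

# Cell (product=2, region=1, quarter=0): 2*(3*2) + 1*2 + 0 = 14
print(linear_offset((2, 1, 0), dims))  # 14
```

This is why dense cubes are fast to address (one multiplication chain per lookup) but sparse cubes waste storage: every possible coordinate reserves a slot whether or not it holds data.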

Benefits:

·         Suitable for slicing and dicing operations.

·         Outperforms ROLAP when data is dense.

·         Capable of performing complex calculations.

Limitations:

·         It is difficult to change the dimensions without re-aggregating.

·         Since all calculations are performed when the cube is built, a large amount of data cannot be stored in the cube itself.








Hybrid OLAP (HOLAP):

Hybrid On-Line Analytical Processing (HOLAP) combines ROLAP and MOLAP, offering the greater scalability of ROLAP and the faster computation of MOLAP. HOLAP servers are capable of storing large amounts of detailed data. On one hand, HOLAP benefits from ROLAP’s scalability; on the other, it uses cube technology for faster performance on summary-type information. Because detailed data is stored in a relational database, HOLAP cubes are smaller than MOLAP cubes.

Benefits:

·         HOLAP combines the benefits of MOLAP and ROLAP.

·         Provide quick access at all aggregation levels.

Limitations

·         Because it supports both MOLAP and ROLAP servers, HOLAP architecture is extremely complex.

·         There is a greater likelihood of overlap, particularly in their functionalities.


Data Mining Interface

A data mining interface facilitates the exploration and extraction of actionable insights from large datasets using data mining techniques.

1.     Query Interface:

·         Provides users with tools to define and execute data mining queries against the data warehouse.

·         Supports various query languages or graphical interfaces for defining mining tasks.

2.     Data Exploration Tools:

·         Enables users to explore data visually and interactively to identify patterns, trends, and anomalies.

·         Includes features such as data visualization, clustering, classification, and association rule discovery.

3.     Model Building and Evaluation:

·         Allows users to build predictive models using machine learning algorithms and evaluate their performance.

·         Provides tools for model training, testing, and validation using techniques like cross-validation.

4.     Integration with BI Tools:

·         Integrates with business intelligence (BI) tools and dashboards to visualize and present data mining results.

·         Enables users to incorporate predictive insights into decision-making processes.
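The model-evaluation step mentioned above (point 3) usually relies on cross-validation. A minimal sketch of a plain k-fold split, written in pure Python so no particular ML library is assumed:

```python
# Hedged sketch of k-fold cross-validation splitting: the dataset's
# indices are partitioned into k folds; each fold serves once as the
# test set while the rest form the training set.
def kfold_indices(n, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
print(len(folds))   # 5
print(folds[0][1])  # [0, 1]
```

A real data mining interface would wrap this behind its model-training tools; the point here is only the train/test rotation itself.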

Security

Data warehouse security is crucial for protecting sensitive information and ensuring compliance with regulatory requirements. Here are key security measures:

1.     Access Control:

·         Implement role-based access control (RBAC) to restrict access to data based on users’ roles and responsibilities.

·         Enforce strong authentication mechanisms, such as multi-factor authentication (MFA), to prevent unauthorized access.

2.     Data Encryption:

·         Encrypt data at rest and in transit to prevent unauthorized access or interception.

·         Use encryption techniques such as SSL/TLS for network communication and encryption algorithms for data storage.

3.     Auditing and Monitoring:

·         Implement auditing and logging mechanisms to track user activities and changes to data.

·         Monitor access patterns and detect suspicious behaviour to prevent security breaches.

4.     Data Masking and Anonymization:

·         Mask sensitive data to anonymize personally identifiable information (PII) and protect privacy.

·         Replace sensitive data with pseudonymized or randomized values to ensure confidentiality.

5.     Compliance and Governance:

·         Ensure compliance with regulations such as GDPR, HIPAA, and PCI-DSS by implementing data governance policies and controls.

·         Conduct regular security assessments and audits to identify vulnerabilities and ensure adherence to security standards.
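The role-based access control from point 1 can be sketched as a two-level lookup: users map to roles, roles map to permissions, and access is granted only through a role. All user, role, and permission names below are illustrative:

```python
# Illustrative RBAC check: users never hold permissions directly;
# access flows only through their assigned role.
ROLE_PERMISSIONS = {
    "analyst": {"read_sales", "read_marketing"},
    "admin":   {"read_sales", "read_marketing", "write_schema"},
}
USER_ROLES = {"alice": "analyst", "bob": "admin"}

def can_access(user, permission):
    role = USER_ROLES.get(user)
    return permission in ROLE_PERMISSIONS.get(role, set())

print(can_access("alice", "read_sales"))    # True
print(can_access("alice", "write_schema"))  # False
```

Production systems add groups, hierarchies, and per-object grants, but the restriction principle is the same.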

Backup and Recovery

Backup and recovery processes are essential for data warehouse reliability and resilience. Here’s how it’s managed:

1.     Regular Backups:

·         Schedule regular backups of the data warehouse to ensure data availability in case of data loss or corruption.

·         Implement full, incremental, or differential backup strategies based on recovery requirements.

 

 

2.     Redundant Storage:

·         Store backup copies of data in redundant storage locations, such as cloud storage or off-site data centers.

·         Ensure data redundancy and fault tolerance to mitigate the risk of data loss due to hardware failures or disasters.

3.     Point-in-Time Recovery:

·         Maintain transaction logs or incremental backups to facilitate point-in-time recovery to a specific moment in the past.

·         Enable rollback or recovery to restore the data warehouse to a consistent state after data corruption or accidental changes.

4.     Disaster Recovery Planning:

·         Develop and test disaster recovery plans to ensure business continuity in the event of catastrophic failures or natural disasters.

·         Establish procedures for failover, data restoration, and system recovery to minimize downtime and data loss.

5.     Automated Backup Solutions:

·         Use automated backup solutions and backup scheduling tools to streamline backup and recovery processes.

·         Monitor backup jobs and receive alerts for any failures or anomalies to ensure timely resolution.
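The incremental strategy from point 1 can be sketched in a few lines: copy only the files whose modification time is newer than the last backup. Directory names and the mtime-based criterion are illustrative; real tools also track checksums and transaction logs:

```python
# Sketch of an incremental backup: copy only files modified since the
# last backup timestamp (a simplification of real backup tooling).
import os
import shutil

def incremental_backup(src_dir, dst_dir, last_backup_time):
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in os.listdir(src_dir):
        path = os.path.join(src_dir, name)
        # Only regular files changed after the last backup are copied
        if os.path.isfile(path) and os.path.getmtime(path) > last_backup_time:
            shutil.copy2(path, os.path.join(dst_dir, name))
            copied.append(name)
    return copied
```

A full backup is the same loop without the timestamp filter; a differential backup keeps the timestamp fixed at the last *full* backup rather than the last backup of any kind.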

Tuning the Data Warehouse

The data warehouse can be tuned to improve performance. The data load is the entry point into the system, and it provides the first opportunity to improve performance. Checks performed on the data, either before or after it is loaded, have a direct effect on the capacity and performance of the system.

For example, if the data is telephone call records, it can be checked whether each call has a valid customer identifier. If the data is sales information, it can be checked whether the commodity being sold has a valid product identifier.

 

 

  • Tune the business rules.
  • Tune the data design.
  • Tune the application design.
  • Tune the logical structure of the database.
  • Tune the database operations.
  • Tune the access paths.
  • Tune I/O and physical structure.
  • Tune resource contention.
  • Tune the underlying platform.

Testing Data warehouse

Testing is very important for data warehouse systems: it validates the data and ensures the system works correctly and efficiently.
There are three basic levels of testing performed on a data warehouse, which are as follows:

  1. Unit Testing –
    This type of testing is performed at the developer’s end. In unit testing, each unit/component of the modules is tested separately. Each module of the whole data warehouse — program, SQL script, procedure, Unix shell script — is validated and tested.
  2. Integration Testing –
    In this type of testing, the various individual units/modules of the application are combined and then tested against a number of inputs. It is performed to detect faults in the integrated modules and to check whether the various components perform well after integration.
  3. System Testing –
    System testing validates and tests the whole data warehouse application. It is performed by the technical testing team after the developers complete unit testing, and its main purpose is to check whether the entire system works together correctly.

Challenges of data warehouse testing are :

  • Data selection from multiple sources, and the analysis that follows, poses a great challenge.
  • Because of the volume and complexity of the data, certain testing strategies are time-consuming.
  • ETL testing requires strong SQL skills, which poses a challenge for testers with limited SQL experience.
  • Redundant data in the data warehouse.
  • Inconsistent and inaccurate reports.

ETL testing is performed in five stages :

  • Identifying data sources and requirements.
  • Data acquisition.
  • Implement business logic and dimensional modeling.
  • Build and populate data.
  • Build reports.
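One of the most common ETL tests is a reconciliation check between source and target after a load. A hedged sketch, with row counts and totals as the (illustrative) checks:

```python
# Sketch of a basic data-warehouse test: reconciling row counts and
# amount totals between source and target after an ETL run.
def reconcile(source_rows, target_rows, amount_key="amount"):
    return {
        "row_count_match": len(source_rows) == len(target_rows),
        "total_match": sum(r[amount_key] for r in source_rows)
                       == sum(r[amount_key] for r in target_rows),
    }

src = [{"amount": 10}, {"amount": 20}]
tgt = [{"amount": 10}, {"amount": 20}]
print(reconcile(src, tgt))  # {'row_count_match': True, 'total_match': True}
```

Real ETL test suites extend this with per-column checksums, referential-integrity checks, and duplicate detection.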

 

 

Warehousing application and Recent Trends

#1. Green Data Warehousing

Cloud data warehousing has increasingly become a part of energy reduction plans for businesses. It’s great to see a green data warehousing trend making its way onto the scene while we are tackling climate change.

Cloud data centres operate with energy efficiencies well above industry averages. In 2021, companies in the EU reported an 80% energy consumption reduction following the migration of their enterprise data to SaaS storage providers.

For a marketing agency, switching to a green data warehousing solution such as a cloud data warehouse brings benefits beyond energy savings: it is good for the planet, and it reduces the agency’s carbon footprint.

#2. Outsourcing Data Management

Outsourcing data management operations is becoming a trend amongst marketing agencies looking to elevate their operations. Commonly outsourced operations include:
  • Data warehousing and automation
  • Encoding,
  • Compilation,
  • Auditing,
  • System architecture design, and more.

#3. Data Warehousing AI Solutions


[Figure: projected growth of the global volume of data being processed]

As the growth of the volume of data being processed is forecasted to spike, businesses increasingly need to offload data operations to faster machine learning-enabled AI systems.

With trend and pattern analysis capabilities improving by leaps and bounds, businesses that integrate AI into their data warehousing solutions can reduce their operational costs and perform data operations more efficiently.

#4. Virtual Data Warehousing

Virtual data warehouses are sets of separate databases that can be queried simultaneously by means of middleware. Virtual data warehousing is trending because it’s cost-effective and can be deployed faster than physical solutions.

By forgoing physical data replication, virtualization can improve operating speeds and reduce operating costs by as much as four times.

#5. In-Memory Computing

In-memory computing uses clusters of servers to pool total available RAM and CPU power. This design distributes data handling and processing tasks across the cluster for radically enhanced speed and scalability.

 

 

#6. In-Database Analytics


 

In-database analytics refers to a method of analysis that processes data within its storage site — a database or data warehouse. In-database analytics is built into the storage architecture and replaces the use of separate applications and the data transfers they require.

 

 

Types of Warehousing Application

Host-Based Data Warehouses

 

  • Host-based mainframe warehouses reside on high-volume databases, supported by robust, reliable, high-capacity systems such as IBM System/390, Unisys, and Data General/Sequent systems, and by databases such as Sybase, Oracle, Informix, and DB2.
  • Host-based LAN data warehouses, where data delivery can be handled either centrally or from the workgroup environment. The size of the warehouse database depends on the platform.

Types of Data Warehouses

Before embarking on designing, building, and implementing such a warehouse, the following characteristics should be considered:

  1. Such databases generally have very high volumes of data storage.
  2. Such warehouses may require support for both MVS and customer-based report and query facilities.
  3. These warehouses have complicated source systems.
  4. Such systems need continuous maintenance, since they are also used for mission-critical objectives.

To make such data warehouses building successful, the following phases are generally followed:

  1. Unload Phase: Selecting and scrubbing the operational data.
  2. Transform Phase: Translating the data into an appropriate form and defining the rules for accessing and storing it.
  3. Load Phase: Moving the records directly into DB2 tables, or into a particular file for moving into another database or non-MVS warehouse.
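The three phases above can be sketched as plain functions. The field names and the list standing in for a target table are illustrative, not part of any real DB2 workflow:

```python
# Sketch of the unload/transform/load phases: select and scrub, reshape
# to the warehouse's form, then write to the target (a list stands in
# for the warehouse table here).
def unload(operational_rows):
    # Select and scrub: drop rows with a missing customer identifier
    return [r for r in operational_rows if r.get("cust_id")]

def transform(rows):
    # Translate into the warehouse's form: normalize names, add a key
    return [{"key": r["cust_id"], "name": r["name"].strip().title()}
            for r in rows]

def load(rows, target):
    target.extend(rows)
    return len(rows)

source = [{"cust_id": 1, "name": " alice "}, {"cust_id": None, "name": "x"}]
warehouse_table = []
load(transform(unload(source)), warehouse_table)
print(warehouse_table)  # [{'key': 1, 'name': 'Alice'}]
```

The scrubbing rule here (drop rows without a customer id) mirrors the validity checks discussed in the tuning section above.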

Host-Based (UNIX) Data Warehouses

Oracle and Informix RDBMSs provide the facilities for such data warehouses. Both databases can extract information from MVS-based databases as well as from a large number of other UNIX-based databases. These warehouses follow the same stages as host-based MVS data warehouses. Data from different network servers can also be used, since file attribute consistency is common across the inter-network.

LAN-Based Workgroup Data Warehouses

A LAN-based workgroup warehouse is an integrated structure for building and maintaining a data warehouse in a LAN environment. In this warehouse, we can extract information from a variety of sources and support multiple LAN-based warehouses. Commonly chosen warehouse databases include the DB2 family, Oracle, Sybase, and Informix. Other databases that can also be included, though infrequently, are IMS, VSAM, flat files, MVS, and VH.


Host-Based Single Stage (LAN) Data Warehouses

In a LAN-based data warehouse, data delivery can be handled either centrally or from the workgroup environment, so business groups can process their data as needed without burdening centralized IT resources, enjoying the autonomy of their own data mart without compromising overall data integrity and security in the enterprise.


 

 

Web Mining

Web mining is the application of data mining techniques to automatically discover and extract information from web documents and services. Its main purpose is to discover useful information from the World Wide Web and its usage patterns. Combining methods from data mining, machine learning, artificial intelligence, statistics, and information retrieval, web mining is an interdisciplinary field. Analysing user behaviour and website traffic is one basic example of web mining.

Applications of Web Mining

Web mining is the process of discovering patterns, structures, and relationships in web data. It involves using data mining techniques to analyse web data and extract valuable insights. The applications of web mining are wide-ranging and include:

  • E-commerce: Web mining can be used to analyse customer behaviour on e-commerce websites. This information can be used to improve the user experience and increase sales by recommending products based on customer preferences.
  • Search engine optimization: Web mining can be used to analyse search engine queries and search engine results pages (SERPs).
  • Fraud detection: Web mining can be used to detect fraudulent activity on websites. This information can be used to prevent financial fraud, identity theft, and other types of online fraud.
  • Sentiment analysis: Web mining can be used to analyse social media data and extract sentiment from posts, comments, and reviews.
  • Web content analysis: Web mining can be used to analyse web content and extract valuable information such as keywords, topics, and themes.
  • Customer service: Web mining can be used to analyse customer service interactions on websites and social media platforms.
  • Healthcare: Web mining can be used to analyse health-related websites and extract valuable information about diseases, treatments, and medications.
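The user-behaviour and traffic analysis mentioned above (web usage mining) can be sketched by counting page visits in access-log lines. The log format here is heavily simplified and purely illustrative:

```python
# Sketch of basic web usage mining: counting page visits from
# simplified access-log lines to surface traffic patterns.
from collections import Counter

log_lines = [
    "10.0.0.1 GET /products",
    "10.0.0.2 GET /products",
    "10.0.0.1 GET /cart",
]

def page_visits(lines):
    # The requested path is the last whitespace-separated field
    return Counter(line.split()[-1] for line in lines)

print(page_visits(log_lines).most_common(1))  # [('/products', 2)]
```

Real usage mining goes further: reconstructing sessions per visitor, mining navigation paths, and feeding recommendations, but frequency counting is the first step.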

Process of Web Mining

[Figure: the web mining process]

Spatial Mining and Temporal Mining

Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from spatial databases. In spatial data mining, analysts use geographical or spatial information to produce business intelligence or other results. Challenges include identifying patterns or finding objects that are relevant to the research project.

Advantages of Spatial Data Mining

  • Insight into geographical patterns: Spatial statistics helps identify features that would otherwise remain undetected, enabling organizations and researchers to identify location-related trends.
  • Better decision making: Spatial data mining is applicable in areas such as urban planning, environmental management, and logistics, helping organizations make wiser decisions.
  • Enhanced visualization: Data collected at different spatial levels can be presented on maps, giving a better view of trends and patterns.

Disadvantages of Spatial Data Mining

  • Complexity of data: The raw data may be huge and intricate, requiring complex algorithms and substantial computing resources for processing.
  • Data inaccuracy: Mistakes made while collecting the data, such as wrong geographical coordinates, can lead to wrong conclusions.
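A crude form of the hotspot detection that spatial mining performs can be sketched by binning point locations into a grid and finding the densest cell. The coordinates and cell size below are illustrative; real systems use proper spatial indexes and statistics:

```python
# Sketch of a simple spatial pattern: bin point locations into a grid
# and report the densest cell (a crude "hotspot" detector).
from collections import Counter

points = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.9), (0.12, 0.22)]

def hotspot(points, cell=0.5):
    # Map each point to its grid cell, then count points per cell
    cells = Counter((int(x // cell), int(y // cell)) for x, y in points)
    return cells.most_common(1)[0]

print(hotspot(points))  # ((0, 0), 3)
```

Dedicated methods (kernel density estimation, spatial scan statistics) refine this idea, but grid binning captures the essence of finding where events cluster in space.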

Temporal Data Mining

Temporal data mining refers to the extraction of simple, non-trivial, and potentially useful abstract information from large collections of temporal data. It is concerned with the analysis of temporal data and with finding temporal patterns and regularities in sets of temporal data. Tasks of temporal data mining are:

  • Data Characterization and Comparison
  • Cluster Analysis
  • Classification
  • Association rules
  • Prediction and Trend Analysis
  • Pattern Analysis

Advantages of Temporal Data Mining

  • Trend identification: Temporal mining helps identify temporal factors such as seasonality, cycles, and long-term shifts.
  • Forecasting: It can be used predictively to help an organization prepare for future outcomes or trends.
  • Anomaly detection: A strength of temporal data mining is its ability to identify sharp changes in otherwise regular patterns, which may point to an event or a problem.

Disadvantages of Temporal Data Mining

  • Handling data complexity: Time-series data often carries dependencies between time points, which makes analysis difficult.
  • Requires large historical data: Temporal data mining may need a large amount of historical data to find useful patterns, which is not always available.
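The trend and anomaly-detection tasks above can be sketched with a moving average: smooth the series, then flag points that deviate sharply from the recent trend. The series, window, and threshold are illustrative choices:

```python
# Sketch of temporal anomaly detection: flag points that deviate
# sharply from the moving average of the preceding window.
def anomalies(series, window=3, threshold=10.0):
    flagged = []
    for i in range(window, len(series)):
        # Average of the `window` points immediately before point i
        avg = sum(series[i - window:i]) / window
        if abs(series[i] - avg) > threshold:
            flagged.append(i)
    return flagged

series = [10, 11, 10, 12, 30, 11, 10]
print(anomalies(series))  # [4] — the spike at value 30
```

Production systems use more robust baselines (exponential smoothing, seasonal decomposition), since a single spike also distorts the windows that follow it, but the compare-against-recent-trend idea is the same.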

Difference Between Spatial and Temporal Data Mining

  • Dimension: Spatial mining operates over space; temporal mining operates over time.
  • Definition: Spatial mining is the extraction of knowledge, spatial relationships, and interesting measures that are not explicitly stored in a spatial database. Temporal mining is the extraction of knowledge about the occurrence of events — whether they follow cyclic, random, or seasonal variations, etc.
  • Data: Spatial mining deals with spatial (location, geo-referenced) data. Temporal mining deals with implicit or explicit temporal content in large quantities of data.
  • Scope: Spatial databases store spatial objects represented by spatial data types and the spatial associations among such objects. Temporal data mining comprises the subject as well as its utilization across fields.
  • Goal: Spatial mining includes finding characteristic rules, discriminant rules, association rules, evaluation rules, etc. Temporal mining aims at new and unknown knowledge that takes the temporal aspects of the data into account.
  • Output: Spatial mining identifies unusual, unexplored, but useful patterns from spatial databases. Temporal mining extracts useful knowledge from temporal data.
  • Examples: Spatial — determining hotspots or unusual locations. Temporal — the association rule “any person who buys a car also buys a steering lock” becomes, with a temporal aspect, “any person who buys a car also buys a steering lock after that.”

 

