Unit 1 - Data Warehousing and Data Mining (BCS058)

 


Definition: Data warehousing refers to the process of collecting, storing, and managing large amounts of structured and sometimes semi-structured data from multiple sources in a centralized repository. This repository, known as a data warehouse, is designed to facilitate efficient querying, analysis, and reporting of data. It supports decision-making by allowing organizations to store historical data and perform complex queries on it to generate insights, trends, and reports.

Data Warehousing Components:

1. Data Sources

·         These are the various systems and databases from which the data is extracted. They can be:

·         Operational databases (e.g., relational databases, transactional systems)

·         External data sources (e.g., third-party data providers)

·         Legacy systems

2. ETL Process (Extract, Transform, Load)

·         Extract: Data is extracted from various data sources.

·         Transform: The extracted data is cleaned, validated, and transformed into a consistent format suitable for analysis.

·         Load: The transformed data is then loaded into the data warehouse.

This process ensures that the data is integrated and stored in a format that supports querying and analysis.
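
As a minimal sketch of this flow (assuming a hypothetical CSV export sales.csv, a SQLite warehouse file warehouse.db, and a staging table sales_raw), the three stages can be written as three small Python functions:

import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a source system export (here, a CSV file).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean, validate, and convert rows into a consistent format.
    cleaned = []
    for r in rows:
        if r["amount"]:  # simple validation: skip rows with a missing amount
            cleaned.append((r["date"].strip(),
                            r["product"].strip().upper(),
                            float(r["amount"])))
    return cleaned

def load(rows, db_path):
    # Load: insert the transformed rows into the warehouse repository.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales_raw "
                "(date TEXT, product TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_raw VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")), "warehouse.db")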

3. Data Storage (Data Warehouse Repository)

This is the central repository where the processed data is stored. It is designed to handle large volumes of data and provide fast querying capabilities.

·         Fact tables: Store quantitative data for analysis (e.g., sales, revenue).

·         Dimension tables: Store descriptive information related to facts (e.g., customer, product).

4. Metadata

·         Metadata is "data about data." It provides information about the data stored in the warehouse, such as data definitions, mappings, source system details, transformation rules, and relationships between tables. It helps users understand the data structure and lineage for effective analysis.

5. Data Marts

·         Data marts are smaller, specialized subsets of the data warehouse, designed to serve the needs of specific business units or departments (e.g., marketing, finance). They allow for focused, quick access to data relevant to specific areas without querying the entire data warehouse.

6. OLAP (Online Analytical Processing)

·         OLAP tools are used to perform multidimensional analysis of the data stored in the warehouse. They allow users to slice, dice, drill down, and pivot data to explore different perspectives and insights.

 

7. Data Warehouse Access Tools

 

·         Query tools: Allow users to run ad-hoc queries and retrieve data.

·         Reporting tools: Generate reports based on predefined templates and criteria.

·         Business Intelligence (BI) tools: Provide advanced analysis, visualizations, and dashboards.

Building a Data Warehouse:

Building a data warehouse involves designing a system that aggregates, organizes, and stores data from various sources for efficient querying, reporting, and analysis. Here are the main steps to building a data warehouse:

1. Requirements Gathering

·         Business Objectives: Understand the key business needs and goals, such as reporting, analysis, or decision-making.

·         Data Sources: Identify the source systems (databases, APIs, flat files, etc.) that will provide the raw data for the warehouse.

·         User Requirements: Determine what data the end-users need, how frequently they need it, and in what format (e.g., dashboards, reports).

2. Data Modeling

·         Conceptual Design: Define the high-level structure of the data warehouse. This usually involves creating an entity-relationship (ER) diagram to show the data flow.

·         Dimensional Modeling: Design the warehouse using star or snowflake schemas. These schemas consist of:

·         Fact Tables: Store quantitative data (e.g., sales, revenue).

·         Dimension Tables: Store descriptive information (e.g., product, time, geography) that provides context to the facts.

·         Normalization vs. Denormalization: While source systems might be normalized for efficiency, the warehouse design typically favors denormalization for faster querying.

3. ETL Process (Extract, Transform, Load)

·         Extract: Pull data from the source systems. This can be done in batch (e.g., once a day) or in real-time (e.g., streaming).

·         Transform: Clean and standardize the data. This includes:

Removing duplicates

Handling missing values

Conforming data formats

Aggregating or calculating derived metrics

·         Load: Insert the transformed data into the warehouse. This is often done incrementally to handle new or changed data without duplicating existing data.
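
As a hedged sketch of the Transform step with pandas (the file orders_extract.csv and the columns order_id, order_date, amount, qty, and product_id are assumptions for illustration), each cleaning rule listed above maps to roughly one line:

import pandas as pd

df = pd.read_csv("orders_extract.csv")               # output of the Extract step

df = df.drop_duplicates(subset="order_id")           # removing duplicates
df["amount"] = df["amount"].fillna(0.0)              # handling missing values
df["order_date"] = pd.to_datetime(df["order_date"])  # conforming data formats
df["revenue"] = df["amount"] * df["qty"]             # calculating a derived metric

# Aggregating to the grain stored in the warehouse: daily revenue per product.
daily = (df.groupby([df["order_date"].dt.date, "product_id"])["revenue"]
           .sum()
           .reset_index())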

4. Database Selection and Infrastructure

·         Choose a Database System: Common options include:

Relational Databases (e.g., PostgreSQL, MySQL)

Columnar Databases (e.g., Amazon Redshift, Google BigQuery, Snowflake)

Data Lakes (e.g., Amazon S3, Azure Data Lake)

·         On-premise vs. Cloud: Decide whether to host the warehouse on-premise or in the cloud. Cloud platforms like AWS, Azure, or Google Cloud offer scalability, ease of management, and integration with many services.

 

 

5. Data Integration and Governance

·         Data Integration: Ensure that data from disparate sources is integrated meaningfully. This includes handling different formats and timestamps, and ensuring consistent identifiers (e.g., customer IDs).

·         Data Quality: Implement data validation, error logging, and monitoring tools to ensure data integrity.

·         Data Governance: Establish rules and procedures to manage the lifecycle of the data, including access control, data lineage, privacy policies (e.g., GDPR compliance).

 

Warehouse Database:

A Warehouse Database is a structured collection of data used to manage and track various operations within a warehouse. It supports functions like inventory management, order processing, shipping, receiving, and overall logistics. It’s commonly part of a Warehouse Management System (WMS).

Key Components of a Warehouse Database:

1. Inventory Management:

·         Tracks the stock levels of products in the warehouse.

·         Monitors product quantities, stock location, and inventory movement (inbound/outbound).

2. Order Processing:

·         Handles customer orders, from receiving orders to shipping the products.

·         Stores details about order status, fulfilment times, and any backorders.

3. Shipping and Receiving:

·         Manages incoming shipments (goods coming into the warehouse) and outgoing shipments (dispatches to customers).

·         Keeps track of delivery dates, carrier information, and shipment status.

·         Supports coordination between warehouse, suppliers, and customers.

4. Locations and Bins:

·         Stores information on the layout of the warehouse, including aisles, shelves, racks, and bins.

·         Optimizes picking paths for warehouse staff to reduce retrieval time and enhance efficiency.

5. Product Details:

·         Includes information like product name, SKU (Stock Keeping Unit), dimensions, weight, and category.

·         May include additional attributes like expiry dates, lot numbers, or batch tracking, particularly for industries like food or pharmaceuticals.

6. Employee and Role Management:

 

·         Contains employee details and their roles in the warehouse.

·         Manages permissions for specific tasks (e.g., certain employees can only access inventory, while others can handle shipping).

7. Supplier Information:

·         Tracks data on vendors and suppliers.

·         Stores information such as lead times, supplier reliability, and payment terms.

Common Tables in a Warehouse Database:

·         Products: Contains product details like SKU, description, category, and unit price.

·         Inventory: Tracks stock levels, product locations, lot numbers, and inventory status (available, reserved, in transit).

·         Orders: Includes details on customer orders, order status (pending, fulfilled, shipped), and associated products.

·         Suppliers: Stores data on suppliers, purchase orders, and delivery schedules.

·         Shipping: Holds information about shipments, delivery addresses, and carrier tracking.

·         Employees: Manages employee roles, tasks, and login credentials for the WMS.
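
As an illustration, two of these tables might be declared as follows using SQLite from Python (the exact columns are assumptions, not a prescribed design):

import sqlite3

con = sqlite3.connect("wms.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS products (
    sku         TEXT PRIMARY KEY,   -- Stock Keeping Unit
    description TEXT,
    category    TEXT,
    unit_price  REAL
);
CREATE TABLE IF NOT EXISTS inventory (
    sku    TEXT REFERENCES products(sku),
    bin    TEXT,                    -- storage location within the warehouse
    qty    INTEGER,
    status TEXT                     -- available, reserved, in transit
);
""")
con.close()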

 

Mapping the Data Warehouse to a Multiprocessor Architecture:

1. Relational database technology for the data warehouse

·         Linear speed-up: refers to the ability to increase the number of processors in order to reduce response time.

·         Linear scale-up: refers to the ability to provide the same performance on the same requests as the database size increases.

1.1 Types of parallelism

I. Inter-query parallelism:

Different server threads or processes handle multiple requests at the same time.

II. Intra-query parallelism:

·         This form of parallelism decomposes a serial SQL query into lower-level operations such as scan, join, and sort. These lower-level operations are then executed concurrently, in parallel.

Intra-query parallelism can be done in either of two ways:

·         Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.

·         Vertical parallelism: this occurs among different tasks. All query components such as scan, join, and sort are executed in parallel in a pipelined fashion; in other words, the output from one task becomes an input to another task.

III. Data partitioning:

Data partitioning is the key component for effective parallel execution of database operations. Partitioning can be done randomly or intelligently.

·         Random partitioning: includes random data striping across multiple disks on a single server.

·         Intelligent partitioning: assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks. The various intelligent partitioning schemes include:

·         Hash partitioning: a hash algorithm is used to calculate the partition number based on the value of the partitioning key for each row.

·         Key range partitioning: rows are placed and located in the partitions according to the value of the partitioning key. That is, all rows with key values from A to K are in partition 1, L to T in partition 2, and so on.

·         Schema partitioning: an entire table is placed on one disk; another table is placed on a different disk, and so on. This is useful for small reference tables.

·         User-defined partitioning: allows a table to be partitioned on the basis of a user-defined expression. (The hash and key range schemes are sketched in code below.)
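
A minimal Python sketch of the hash and key range schemes (the partition count and key ranges are arbitrary examples; real DBMSs use stable, documented hash functions rather than Python's built-in hash):

def hash_partition(key, num_partitions):
    # Hash partitioning: the partition number is computed from the key.
    # (Python's hash() is salted per process; illustration only.)
    return hash(key) % num_partitions

def key_range_partition(key):
    # Key range partitioning: keys starting with A-K go to partition 1,
    # L-T to partition 2, and everything else to partition 3.
    first = key[0].upper()
    if "A" <= first <= "K":
        return 1
    if "L" <= first <= "T":
        return 2
    return 3

print(hash_partition("customer_42", 4))  # some value in 0..3
print(key_range_partition("Mitchell"))   # 2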

2. Database architectures for parallel processing

There are three DBMS software architecture styles for parallel processing:

Shared memory or shared everything Architecture

Shared disk architecture

Shared nothing architecture

2.1 Shared Memory (or Shared Everything) Architecture:

It has the following characteristics:

Multiple processing units (PUs) share memory.

Each PU has full access to all shared memory through a common bus.

Communication between nodes occurs via shared memory.

Performance is limited by the bandwidth of the memory bus.

Disadvantage

Scalability is limited by bus bandwidth and latency, and by available memory.

 



 

2.2 Shared Disk Architecture

Shared disk systems are typically loosely coupled. Such systems have the following characteristics:

Each node consists of one or more PUs and associated memory.

Memory is not shared between nodes.

Communication occurs over a common high-speed bus.

Each node has access to the same disks and other resources.

Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.

Advantages

Shared disk systems permit high availability. All data is accessible even if one node fails.

These systems have the concept of one database, which is an advantage over shared nothing systems.

Shared disk systems provide for incremental growth.

Disadvantages

Inter-node synchronization is required, involving distributed lock manager (DLM) overhead and a greater dependency on the high-speed interconnect.

If the workload is not partitioned well, there may be high synchronization overhead.

  


2.3 Shared Nothing Architecture (Distributed)

Shared nothing systems are typically loosely coupled. In shared nothing systems, only one CPU is connected to a given disk; if a table or database is located on that disk, access depends entirely on the PU that owns it. Shared nothing systems are concerned with access to disks, not access to memory.

Advantages for parallel processing

Shared nothing systems provide for incremental growth.

System growth is practically unlimited.

Massively parallel processing (MPP) systems are good for read-only databases and decision support applications.

Failure is local: if one node fails, the others stay up.

Disadvantages for parallel processing

More coordination is required.

More overhead is required for a process working on a disk belonging to another node.

 

 


 

Multidimensional Data Model

A Multidimensional Data Model is a key concept in data warehousing and OLAP (Online Analytical Processing), designed to organize data for efficient querying and reporting. It represents data as a cube structure, where each dimension corresponds to a different perspective or attribute of the data, allowing complex queries to be broken down into simpler ones.

 

Here are the main components of a multidimensional data model:



 

1. Facts:

Fact tables contain the numerical data (metrics) that businesses are interested in, such as sales revenue, quantity sold, or profit.

Fact tables are typically made up of:

Measures: The quantitative values (e.g., sales, profit, revenue).

Foreign keys: Links to the dimension tables that provide context for the facts.

 

2. Dimensions

Dimension tables store descriptive information that defines the context for facts (e.g., time, product, geography).

Each dimension table contains attributes that provide more detail. For example, a product dimension could have attributes like product name, category, and brand.

3. Schema Types

Star Schema:

The simplest form of the multidimensional model.

Consists of a central fact table linked to dimension tables.

The fact table has foreign keys connecting to dimension tables.

Snowflake Schema:

A more normalized version of the star schema.

Dimension tables are broken down into smaller tables to reduce data redundancy.

It is more complex but can save storage space by normalizing dimensions.

Galaxy Schema (Fact Constellation):

A combination of multiple star schemas.

Involves more than one fact table sharing dimension tables, useful for handling complex systems.

4. OLAP Operations

Multidimensional models allow for a variety of operations to explore the data cube:

Slice: Fixes one dimension to a single value, filtering the cube (e.g., sales in 2023).

Dice: Selects a subset of data by specifying ranges across dimensions (e.g., sales in 2023 for Product A in Region X).

Drill Down: Provides a more detailed view (e.g., from quarterly to monthly sales).

Roll Up: Aggregates data (e.g., from monthly to yearly sales).

Pivot (Rotate): Rotates the data cube for a different perspective.

Example of a Multidimensional Model:

Consider a sales data warehouse with:

 

A Fact Table that tracks sales, with metrics like sales_amount and units_sold.

Dimensions like Time, Product, and Store:

Time dimension might have attributes like year, quarter, month, and day.

Product dimension might have product_name, category, and brand.

Store dimension might have store_name, city, and region.

 Data Cube:

A data cube is used to represent data along various dimensions. For example, if you're analysing sales data, the dimensions could be time, region, and product category. Each cell in the cube contains aggregated data, such as total sales, for a combination of those dimension values.

 

Dimensions: These are perspectives or entities with respect to which an organization wants to keep records. Examples are time, geography, and product.

Facts: These are numerical measures or quantities, like sales, profit, or revenue.

Cells: Each cell in the data cube represents the aggregation (sum, average, etc.) of the data at the intersection of dimensions.

For instance, a 3D data cube could represent:

Time    | Region | Product   | Sales (fact)
Q1 2024 | North  | Product A | $5000
Q1 2024 | South  | Product B | $7000

 

Operations on Data Cubes:

Roll-up: Aggregating data by climbing up the hierarchy (e.g., from city to region).

Drill-down: Breaking data down into finer levels of detail (e.g., from year to quarter to month).

Slice: Selecting a single dimension of the cube, producing a sub-cube.

Dice: Selecting two or more dimensions, creating a smaller cube.
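
These operations can be mimicked on the example above with pandas (the DataFrame hard-codes the two illustrative rows from the table):

import pandas as pd

cube = pd.DataFrame({
    "time":    ["Q1 2024", "Q1 2024"],
    "region":  ["North", "South"],
    "product": ["Product A", "Product B"],
    "sales":   [5000, 7000],
})

# Roll-up: aggregate away the region dimension.
rollup = cube.groupby(["time", "product"], as_index=False)["sales"].sum()

# Slice: fix one dimension to a single value.
sliced = cube[cube["time"] == "Q1 2024"]

# Dice: restrict two or more dimensions at once.
diced = cube[(cube["region"] == "North") & (cube["product"] == "Product A")]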

 

 

Schema Design:

Schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse is also required to maintain a schema. A database uses the relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema. This section discusses the schemas used in a data warehouse.

Star Schema

·         Each dimension in a star schema is represented with only one dimension table.

·         This dimension table contains the set of attributes.

·         Consider, for example, the sales data of a company with respect to four dimensions, namely time, item, branch, and location.

·         There is a fact table at the center. It contains the keys to each of the four dimensions.

·         The fact table also contains the attributes, namely dollars sold and units sold.

Note − Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.
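
A sketch of this star schema as SQLite DDL issued from Python (only the location dimension is written out; the time, item, and branch dimension tables are analogous, and the column types are assumptions):

import sqlite3

con = sqlite3.connect("star.db")
con.executescript("""
CREATE TABLE location (            -- dimension table
    location_key      INTEGER PRIMARY KEY,
    street            TEXT,
    city              TEXT,
    province_or_state TEXT,
    country           TEXT
);
CREATE TABLE sales (               -- fact table at the center
    time_key     INTEGER,
    item_key     INTEGER,
    branch_key   INTEGER,
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL,             -- measure
    units_sold   INTEGER           -- measure
);
""")
con.close()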

Snowflake Schema

·         Some dimension tables in the Snowflake schema are normalized.

·         The normalization splits up the data into additional tables.

·         Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables. Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.

·         The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.

Note − Due to normalization in the Snowflake schema, redundancy is reduced; the schema therefore becomes easier to maintain and saves storage space.


Fact Constellation Schema:

·         A fact constellation has multiple fact tables. It is also known as galaxy schema.

·         Consider, for example, two fact tables, namely sales and shipping.

·         The sales fact table is the same as that in the star schema.

·         The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.

·         The shipping fact table also contains two measures, namely dollars sold and units sold.

·         It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Concept:

Schema design is a crucial aspect of database and data warehouse systems, especially in the context of organizing data for storage, retrieval, and efficient querying. Here are some fundamental concepts in schema design:

 

1. Entities and Attributes:

Entity: An entity represents a real-world object or concept (e.g., a customer, product, or order). In relational databases, entities are often mapped to tables.

Attributes: Attributes represent the properties or characteristics of an entity. For example, for a customer entity, attributes could be customer ID, name, and contact information.

2. Primary Key (PK):

A primary key is a unique identifier for each record in a table. It ensures that each row in a table is uniquely identifiable (e.g., a customer ID or order ID).

3. Foreign Key (FK):

A foreign key establishes a relationship between two tables. It refers to the primary key in another table, creating a link between records (e.g., linking an order to a customer by storing the customer ID in the order table).

4. Normalization:

Normalization is the process of organizing data in such a way that redundancy is minimized. It involves dividing large tables into smaller, related tables. The aim is to avoid anomalies in data operations (insert, update, delete).

1NF: Eliminate duplicate columns and ensure each cell contains only atomic (indivisible) values.

2NF: Ensure that all non-key attributes are fully dependent on the primary key.

3NF: Ensure that no non-key attribute depends on another non-key attribute.
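
As a small worked example (with invented data), consider an orders table that also stores customer_city; the city depends on customer_name, a non-key attribute, so the design violates 3NF. Splitting the table fixes the update anomaly:

# Violates 3NF: customer_city depends on customer_name, not on order_id.
orders_flat = [
    {"order_id": 1, "customer_name": "Asha", "customer_city": "Pune", "amount": 500},
    {"order_id": 2, "customer_name": "Asha", "customer_city": "Pune", "amount": 300},
]

# 3NF: move customer attributes into their own table, keyed by customer_id.
customers = {101: {"name": "Asha", "city": "Pune"}}
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 500},
    {"order_id": 2, "customer_id": 101, "amount": 300},
]
# Updating Asha's city now touches one row instead of every order.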

 

 

5. Denormalization:

Denormalization is the opposite of normalization. It involves combining related tables to improve query performance, even if it introduces some redundancy. Denormalization is often used in data warehouse schema designs to optimize read-heavy operations.

6. Relationships:

One-to-One: Each record in Table A relates to one record in Table B (e.g., each person has one passport).

One-to-Many: A single record in Table A can relate to multiple records in Table B (e.g., a customer can place multiple orders).

Many-to-Many: Multiple records in Table A can relate to multiple records in Table B (e.g., a student can enroll in multiple courses, and a course can have multiple students). A junction table is often used to implement this relationship, as sketched below.
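
A sketch of such a junction table in SQLite from Python (the table and column names are illustrative):

import sqlite3

con = sqlite3.connect("school.db")
con.executescript("""
CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE courses  (course_id  INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE enrollments (         -- junction table for the many-to-many link
    student_id INTEGER REFERENCES students(student_id),
    course_id  INTEGER REFERENCES courses(course_id),
    PRIMARY KEY (student_id, course_id)
);
""")
con.close()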

7. Indexes:

Indexes are used to speed up the retrieval of records from a database. They create a data structure that makes it quicker to find rows in a table based on the values of one or more columns.
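
For example, continuing the hypothetical warehouse database sketch from earlier, an index on the inventory table's sku column would speed up stock lookups:

import sqlite3

con = sqlite3.connect("wms.db")
# Speeds up queries such as: SELECT * FROM inventory WHERE sku = ?
con.execute("CREATE INDEX IF NOT EXISTS idx_inventory_sku ON inventory (sku)")
con.commit()
con.close()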

 
