Unit 1: Data Warehousing and Data Mining (BCS058)
Definition: Data warehousing refers to the process of collecting, storing, and managing large amounts of structured and sometimes semi-structured data from multiple sources in a centralized repository. This repository, known as a data warehouse, is designed to facilitate efficient querying, analysis, and reporting of data. It supports decision-making by allowing organizations to store historical data and perform complex queries on it to generate insights, trends, and reports.
Data Warehousing Components:
1. Data Sources
· These are the various systems and databases from which data is extracted. They can be:
· Operational databases (e.g., relational databases, transactional systems)
· External data sources (e.g., third-party data providers)
· Legacy systems
2. ETL Process (Extract, Transform, Load)
· Extract: Data is extracted from the various data sources.
· Transform: The extracted data is cleaned, validated, and transformed into a consistent format suitable for analysis.
· Load: The transformed data is then loaded into the data warehouse.
This process ensures that the data is integrated and stored in a format that supports querying and analysis.
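The following is a minimal, illustrative sketch of this pipeline in Python, using sqlite3 as a stand-in for both the source system and the warehouse; the table and column names (raw_sales, fact_sales, order_id, product, amount) are assumptions for demonstration, not part of any particular product.

import sqlite3

def extract(source_conn):
    # Extract: pull raw rows from an operational source table.
    return source_conn.execute(
        "SELECT order_id, product, amount FROM raw_sales"
    ).fetchall()

def transform(rows):
    # Transform: skip rows missing an amount and normalize
    # product names to a consistent case.
    cleaned = []
    for order_id, product, amount in rows:
        if amount is None:
            continue
        cleaned.append((order_id, product.strip().upper(), float(amount)))
    return cleaned

def load(warehouse_conn, rows):
    # Load: insert the transformed rows into the warehouse fact table.
    warehouse_conn.executemany(
        "INSERT INTO fact_sales (order_id, product, amount) VALUES (?, ?, ?)",
        rows,
    )
    warehouse_conn.commit()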
3. Data Storage (Data Warehouse Repository)
This is the central repository where the processed data is stored. It is designed to handle large volumes of data and provide fast querying capabilities.
· Fact tables: Store quantitative data for analysis (e.g., sales, revenue).
· Dimension tables: Store descriptive information related to facts (e.g., customer, product).
4. Metadata
· Metadata is "data about data." It provides information about the data stored in the warehouse, such as data definitions, mappings, source system details, transformation rules, and relationships between tables. It helps users understand the data structure and lineage for effective analysis.
5. Data Marts
· Data marts are smaller, specialized subsets of the data warehouse, designed to serve the needs of specific business units or departments (e.g., marketing, finance). They allow for focused, quick access to data relevant to specific areas without querying the entire data warehouse.
6. OLAP (Online Analytical Processing)
· OLAP tools are used to perform multidimensional analysis of the data stored in the warehouse. They allow users to slice, dice, drill down, and pivot data to explore different perspectives and insights.
7. Data Warehouse Access Tools
· Query tools: Allow users to run ad-hoc queries and retrieve data.
· Reporting tools: Generate reports based on predefined templates and criteria.
· Business Intelligence (BI) tools: Provide advanced analysis, visualizations, and dashboards.
Building a Data Warehouse:
Building a data warehouse involves designing a system that aggregates, organizes, and stores data from various sources for efficient querying, reporting, and analysis. Here are the main steps to building a data warehouse:
1. Requirements Gathering
· Business Objectives: Understand the key business needs and goals, such as reporting, analysis, or decision-making.
· Data Sources: Identify the source systems (databases, APIs, flat files, etc.) that will provide the raw data for the warehouse.
· User Requirements: Determine what data the end-users need, how frequently they need it, and in what format (e.g., dashboards, reports).
2. Data Modeling
· Conceptual Design: Define the high-level structure of the data warehouse. This usually involves creating an entity-relationship (ER) diagram to show the data flow.
· Dimensional Modeling: Design the warehouse using star or snowflake schemas. These schemas consist of:
· Fact Tables: Store quantitative data (e.g., sales, revenue).
· Dimension Tables: Store descriptive information (e.g., product, time, geography) that provides context to the facts.
· Normalization vs. Denormalization: While source systems might be normalized for efficiency, the warehouse design typically favors denormalization for faster querying.
3. ETL Process (Extract, Transform, Load)
· Extract: Pull data from the source systems. This can be done in batch (e.g., once a day) or in real time (e.g., streaming).
· Transform: Clean and standardize the data. This includes removing duplicates, handling missing values, conforming data formats, and aggregating or calculating derived metrics.
· Load: Insert the transformed data into the warehouse. This is often done incrementally to handle new or changed data without duplicating existing data.
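As an illustration of these transform steps, here is a small sketch using pandas (an assumed tooling choice; the column names order_id, price, quantity, and order_date are hypothetical):

import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset="order_id")           # remove duplicates
    df = df.dropna(subset=["price"])                     # drop rows missing a price
    df["quantity"] = df["quantity"].fillna(0)            # default missing quantities
    df["order_date"] = pd.to_datetime(df["order_date"])  # conform date formats
    df["revenue"] = df["price"] * df["quantity"]         # derived metric
    return df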
4. Database Selection and Infrastructure
· Choose a Database System: Common options include:
Relational databases (e.g., PostgreSQL, MySQL)
Columnar databases (e.g., Amazon Redshift, Google BigQuery, Snowflake)
Data lakes (e.g., Amazon S3, Azure Data Lake)
· On-premise vs. Cloud: Decide whether to host the warehouse on-premise or in the cloud. Cloud platforms like AWS, Azure, or Google Cloud offer scalability, ease of management, and integration with many services.
5. Data Integration and Governance
· Data Integration: Ensure that data from disparate sources is integrated meaningfully. This includes handling different formats and timestamps and ensuring consistent identifiers (e.g., customer ID).
· Data Quality: Implement data validation, error logging, and monitoring tools to ensure data integrity.
· Data Governance: Establish rules and procedures to manage the lifecycle of the data, including access control, data lineage, and privacy policies (e.g., GDPR compliance).
Warehouse Database:
A warehouse database is a structured collection of data used to manage and track various operations within a warehouse. It supports functions like inventory management, order processing, shipping, receiving, and overall logistics. It is commonly part of a Warehouse Management System (WMS).
Key Components of a Warehouse Database:
1. Inventory Management:
· Tracks the stock levels of products in the warehouse.
· Monitors product quantities, stock location, and inventory movement (inbound/outbound).
2. Order Processing:
· Handles customer orders, from receiving orders to shipping the products.
· Stores details about order status, fulfilment times, and any backorders.
3. Shipping and Receiving:
· Manages incoming shipments (goods coming into the warehouse) and outgoing shipments (dispatches to customers).
· Keeps track of delivery dates, carrier information, and shipment status.
· Supports coordination between the warehouse, suppliers, and customers.
4. Locations and Bins:
· Stores information on the layout of the warehouse, including aisles, shelves, racks, and bins.
· Optimizes picking paths for warehouse staff to reduce retrieval time and enhance efficiency.
5. Product Details:
· Includes information like product name, SKU (Stock Keeping Unit), dimensions, weight, and category.
· May include additional attributes like expiry dates, lot numbers, or batch tracking, particularly for industries like food or pharmaceuticals.
6. Employee and Role Management:
· Contains employee details and their roles in the warehouse.
· Manages permissions for specific tasks (e.g., certain employees can only access inventory, while others can handle shipping).
7. Supplier Information:
· Tracks data on vendors and suppliers.
· Stores information such as lead times, supplier reliability, and payment terms.
Common Tables in a Warehouse Database:
· Products: Contains product details like SKU, description, category, and unit price.
· Inventory: Tracks stock levels, product locations, lot numbers, and inventory status (available, reserved, in transit).
· Orders: Includes details on customer orders, order status (pending, fulfilled, shipped), and associated products.
· Suppliers: Stores data on suppliers, purchase orders, and delivery schedules.
· Shipping: Holds information about shipments, delivery addresses, and carrier tracking.
· Employees: Manages employee roles, tasks, and login credentials for the WMS.
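A minimal sketch of a few of these tables as SQL, executed here via Python's sqlite3 (all table and column choices are illustrative, not a prescribed WMS design):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (
    sku         TEXT PRIMARY KEY,
    description TEXT,
    category    TEXT,
    unit_price  REAL
);
CREATE TABLE inventory (
    sku      TEXT REFERENCES products(sku),
    location TEXT,     -- bin or shelf identifier
    quantity INTEGER,
    status   TEXT      -- available, reserved, in transit
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    sku      TEXT REFERENCES products(sku),
    quantity INTEGER,
    status   TEXT      -- pending, fulfilled, shipped
);
""")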
Mapping the Data Warehouse to a Multiprocessor Architecture:
1. Relational database technology for data warehouses
· Linear speed-up: refers to the ability to reduce response time proportionally by adding processors (e.g., doubling the processors ideally halves the response time for the same workload).
· Linear scale-up: refers to the ability to provide the same performance on the same requests as the database size increases (e.g., doubling both the processors and the data volume ideally keeps response time constant).
1.1 Types of parallelism
I. Inter-query parallelism: Different server threads or processes handle multiple requests at the same time.
II. Intra-query parallelism:
· This form of parallelism decomposes a serial SQL query into lower-level operations such as scan, join, and sort. These lower-level operations are then executed in parallel.
· Intra-query parallelism can be done in either of two ways:
· Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data (a sketch follows below).
· Vertical parallelism: this occurs among different tasks. All query components such as scan, join, and sort are executed in parallel in a pipelined fashion; in other words, an output from one task becomes an input to another task.
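As a rough illustration of horizontal parallelism, the sketch below stripes invented rows across four partitions and scans them concurrently with Python's multiprocessing module; the data and the filter condition are made up for demonstration:

from multiprocessing import Pool

def scan_partition(rows):
    # Scan one partition: keep rows whose amount exceeds 100.
    return [r for r in rows if r["amount"] > 100]

if __name__ == "__main__":
    data = [{"id": i, "amount": i * 7 % 250} for i in range(1000)]
    partitions = [data[i::4] for i in range(4)]   # stripe rows across 4 partitions
    with Pool(processes=4) as pool:
        results = pool.map(scan_partition, partitions)  # concurrent scans
    matches = [row for part in results for row in part]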
III. Data partitioning:
Data partitioning is the key component for effective parallel execution of database operations. Partitioning can be done randomly or intelligently.
· Random partitioning: includes random data striping across multiple disks on a single server.
· Intelligent partitioning: assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks. The various intelligent partitioning schemes include:
· Hash partitioning: a hash algorithm is used to calculate the partition number based on the value of the partitioning key for each row (see the sketch after this list).
· Key range partitioning: rows are placed and located in partitions according to the value of the partitioning key. That is, all rows with key values from A to K are in partition 1, L to T in partition 2, and so on.
· Schema partitioning: an entire table is placed on one disk, another table is placed on a different disk, etc. This is useful for small reference tables.
· User-defined partitioning: allows a table to be partitioned on the basis of a user-defined expression.
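The sketch below illustrates hash and key range partitioning in Python; the partition count and the key ranges are arbitrary assumptions:

import zlib

def hash_partition(key: str, num_partitions: int = 4) -> int:
    # Partition number computed from a stable hash of the partitioning key.
    return zlib.crc32(key.encode()) % num_partitions

def key_range_partition(key: str) -> int:
    # Keys starting A-K go to partition 0, L-T to 1, the rest to 2.
    first = key[0].upper()
    if "A" <= first <= "K":
        return 0
    if "L" <= first <= "T":
        return 1
    return 2

customers = ["Alice", "Mona", "Victor"]
print({c: hash_partition(c) for c in customers})
print({c: key_range_partition(c) for c in customers})  # {'Alice': 0, 'Mona': 1, 'Victor': 2}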
2. Database architectures for parallel processing
There are three DBMS software architecture styles for parallel processing:
· Shared memory (or shared everything) architecture
· Shared disk architecture
· Shared nothing architecture
2.1 Shared Memory (or Shared Everything) Architecture:
It has the following characteristics:
· Multiple processing units (PUs) share memory.
· Each PU has full access to all shared memory through a common bus.
· Communication between nodes occurs via shared memory.
· Performance is limited by the bandwidth of the memory bus.
Disadvantage
· Scalability is limited by bus bandwidth and latency, and by available memory.
2.2 Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems have the following characteristics:
· Each node consists of one or more PUs and associated memory.
· Memory is not shared between nodes.
· Communication occurs over a common high-speed bus.
· Each node has access to the same disks and other resources.
· The bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.
Advantages
· Shared disk systems permit high availability. All data is accessible even if one node dies.
· These systems have the concept of one database, which is an advantage over shared nothing systems.
· Shared disk systems provide for incremental growth.
Disadvantages
· Inter-node synchronization is required, involving distributed lock manager (DLM) overhead and greater dependency on the high-speed interconnect.
· If the workload is not partitioned well, there may be high synchronization overhead.
2.3 Shared Nothing Architecture (Distributed)
Shared nothing systems are typically loosely coupled. In shared nothing systems, only one CPU is connected to a given disk; if a table or database is located on that disk, it can be accessed only through that node. Shared nothing systems are concerned with access to disks, not access to memory.
Advantages for parallel processing
· Shared nothing systems provide for incremental growth.
· System growth is practically unlimited.
· MPPs (massively parallel processing systems) are good for read-only databases and decision support applications.
· Failure is local: if one node fails, the others stay up.
Disadvantages for parallel processing
· More coordination is required.
· More overhead is required for a process working on a disk belonging to another node.
Multidimensional Data Model
A multidimensional data model is a key concept in data warehousing and OLAP (Online Analytical Processing), designed to organize data for efficient querying and reporting. It represents data as a cube structure, where each dimension corresponds to a different perspective or attribute of the data, allowing complex queries to be broken down into simpler ones.
Here are the main components of a multidimensional data model:
1. Facts
· Fact tables contain the numerical data (metrics) that businesses are interested in, such as sales revenue, quantity sold, or profit.
· Fact tables are typically made up of:
· Measures: the quantitative values (e.g., sales, profit, revenue).
· Foreign keys: links to the dimension tables that provide context for the facts.
2. Dimensions
· Dimension tables store descriptive information that defines the context for facts (e.g., time, product, geography).
· Each dimension table contains attributes that provide more detail. For example, a product dimension could have attributes like product name, category, and brand.
3. Schema Types
Star Schema:
· The simplest form of the multidimensional model.
· Consists of a central fact table linked to dimension tables.
· The fact table has foreign keys connecting to dimension tables.
Snowflake Schema:
· A more normalized version of the star schema.
· Dimension tables are broken down into smaller tables to reduce data redundancy.
· It is more complex but can save storage space by normalizing dimensions.
Galaxy Schema (Fact Constellation):
· A combination of multiple star schemas.
· Involves more than one fact table sharing dimension tables, useful for handling complex systems.
4. OLAP Operations
Multidimensional models allow for a variety of operations to explore the data cube:
· Slice: selects a single value of one dimension to filter the cube (e.g., sales in 2023).
· Dice: selects a subset of data by specifying ranges across dimensions (e.g., sales in 2023 for Product A in Region X).
· Drill down: provides a more detailed view (e.g., from quarterly to monthly sales).
· Roll up: aggregates data (e.g., from monthly to yearly sales).
· Pivot (rotate): rotates the data cube for a different perspective.
Example of a Multidimensional Model:
Consider a sales data warehouse with:
· A fact table that tracks sales, with metrics like sales_amount and units_sold.
· Dimensions like Time, Product, and Store:
· The Time dimension might have attributes like year, quarter, month, and day.
· The Product dimension might have product_name, category, and brand.
· The Store dimension might have store_name, city, and region.
Data Cube:
A data cube is used to represent data along various dimensions. For example, if you're analysing sales data, the dimensions could be time, region, and product category. Each cell in the cube contains aggregated data, such as total sales, for a combination of those dimension values.
· Dimensions: perspectives or entities with respect to which an organization wants to keep records. Examples are time, geography, and product.
· Facts: numerical measures or quantities, like sales, profit, or revenue.
· Cells: each cell in the data cube represents the aggregation (sum, average, etc.) of the data at the intersection of dimensions.
For instance, a 3D data cube could represent:
Time    | Region | Product   | Sales (fact)
Q1 2024 | North  | Product A | $5000
Q1 2024 | South  | Product B | $7000
Operations on Data Cubes:
· Roll-up: aggregating data by climbing up a hierarchy (e.g., from city to region).
· Drill-down: breaking data down into finer levels of detail (e.g., from year to quarter to month).
· Slice: selecting a single value of one dimension, producing a sub-cube.
· Dice: selecting values on two or more dimensions, creating a smaller cube.
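A small sketch of these cube operations using pandas (an assumed tool; the data mirrors the illustrative table above):

import pandas as pd

sales = pd.DataFrame({
    "time":    ["Q1 2024", "Q1 2024", "Q2 2024", "Q2 2024"],
    "region":  ["North", "South", "North", "South"],
    "product": ["Product A", "Product B", "Product A", "Product B"],
    "sales":   [5000, 7000, 6000, 8000],
})

# The cube as a pivot table: time x region, total sales in each cell.
cube = sales.pivot_table(values="sales", index="time",
                         columns="region", aggfunc="sum")

slice_q1 = sales[sales["time"] == "Q1 2024"]             # slice: one time value
dice = sales[(sales["time"] == "Q1 2024") &
             (sales["product"] == "Product A")]          # dice: two dimensions
roll_up = sales.groupby("region")["sales"].sum()         # roll-up over time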
Schema Design:
A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse is also required to maintain a schema. A database uses the relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema. This section discusses the schemas used in a data warehouse.
Star Schema
· Each dimension in a star schema is represented with only one dimension table.
· This dimension table contains the set of attributes.
· Consider, for example, the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
· There is a fact table at the center. It contains the keys to each of the four dimensions.
· The fact table also contains the measures, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.
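Below is a minimal sketch of this star schema as SQL DDL, executed via Python's sqlite3; the column lists are abbreviated assumptions based on the example above:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);

-- Central fact table: keys to all four dimensions plus the measures.
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    branch_key   INTEGER REFERENCES dim_branch(branch_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")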
Snowflake Schema
· Some dimension tables in the snowflake schema are normalized.
· The normalization splits up the data into additional tables.
· Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely item and supplier. The item dimension table now contains the attributes item_key, item_name, type, brand, and supplier_key.
· The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, redundancy is reduced; the schema therefore becomes easier to maintain and saves storage space.
Fact Constellation Schema:
· A fact constellation has multiple fact tables. It is also known as a galaxy schema.
· Consider, for example, two fact tables, namely sales and shipping.
· The sales fact table is the same as that in the star schema.
· The shipping fact table has five dimension keys, namely item_key, time_key, shipper_key, from_location, and to_location.
· The shipping fact table also contains two measures, namely dollars cost and units shipped.
· It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
Concepts:
Schema design is a crucial aspect of database and data warehouse systems, especially in the context of organizing data for storage, retrieval, and efficient querying. Here are some fundamental concepts in schema design:
1. Entities and Attributes:
· Entity: an entity represents a real-world object or concept (e.g., a customer, product, or order). In relational databases, entities are often mapped to tables.
· Attributes: attributes represent the properties or characteristics of an entity. For example, for a customer entity, attributes could be customer ID, name, and contact information.
2. Primary Key (PK):
A primary key is a unique identifier for each record in a table. It ensures that each row in a table is uniquely identifiable (e.g., a customer ID or order ID).
3. Foreign Key (FK):
A foreign key establishes a relationship between two tables. It refers to the primary key in another table, creating a link between records (e.g., linking an order to a customer by storing the customer ID in the order table).
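A brief sketch of both kinds of key in SQL via Python's sqlite3 (table and column names are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # make SQLite enforce foreign keys
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,       -- primary key: unique per customer
    name        TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id)   -- foreign key
);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Asha')")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # valid: customer 1 exists
# Inserting an order for customer 2 would raise an IntegrityError: no such customer.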
4. Normalization:
Normalization is the process of organizing data in such a way that redundancy is minimized. It involves dividing large tables into smaller, related tables. The aim is to avoid anomalies in data operations (insert, update, delete).
· 1NF: eliminate duplicate columns and ensure each cell contains only atomic (indivisible) values.
· 2NF: ensure that all non-key attributes are fully dependent on the primary key.
· 3NF: ensure that no non-key attribute depends on another non-key attribute.
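As a small invented example of a 3NF fix: if city is determined by zip_code, then city depends on a non-key attribute and should move to its own table.

# Invented data: city depends on zip_code (transitive dependency, violates 3NF).
orders = [
    {"order_id": 1, "customer": "Asha", "zip_code": "10001", "city": "New York"},
    {"order_id": 2, "customer": "Ben",  "zip_code": "10001", "city": "New York"},
]

# 3NF split: keep zip_code in orders, move city to a locations table.
locations = {row["zip_code"]: row["city"] for row in orders}   # one row per zip
orders_3nf = [{k: v for k, v in row.items() if k != "city"} for row in orders]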
5. Denormalization:
Denormalization is the opposite of normalization. It involves combining related tables to improve query performance, even if it introduces some redundancy. Denormalization is often used in data warehouse schema designs to optimize read-heavy operations.
6. Relationships:
· One-to-one: each record in Table A relates to one record in Table B (e.g., each person has one passport).
· One-to-many: a single record in Table A can relate to multiple records in Table B (e.g., a customer can place multiple orders).
· Many-to-many: multiple records in Table A can relate to multiple records in Table B (e.g., a student can enroll in multiple courses, and a course can have multiple students). A junction table is often used to implement this relationship.
7. Indexes:
Indexes are used to speed up the retrieval of records from a database. They create a data structure that makes it quicker to find rows in a table based on the values of one or more columns.
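For instance, creating an index via Python's sqlite3 (names are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
# The index speeds up queries such as:
#   SELECT * FROM orders WHERE customer_id = ?;
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")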