Unit 2

Unit 2

Data warehouse Process and technology(DWM)

Warehousing Strategy

A warehouse strategy is a plan that outlines how you will manage your warehouse for optimal efficiency. It’s important to note that a warehouse plan is not the same thing as a strategy. A well-crafted warehouse strategy focuses on how well your warehouse operates within given benchmarks (e.g., storage density, fulfilment speed, inventory accuracy, etc.)

Outline your data stack: including ETL process, data pipeline tools, and data science pipeline
Define data lifecycle requirements
Determine actions you expect to take on the data
Sketch out dashboards you expect to build with the data
Build your data analytics team
Explore data warehouse solutions

Efficiently providing systematic, contextual data to the business intelligence tool of an organization, the data warehouses can find out more practical business strategies.

Business User: Business users or customers need a data warehouse to look at summarized data from the past.Since these people are coming from a non-technical background also, the data may be represented to them in an uncomplicated way.
Maintains consistency: Data warehouses are programmed in such a way that they can be applied in a regular formatto all collected data from differentsources, which makes it effortlessfor company decision-makers to analyse and share data insights By standardizing thedata, the risk of errorin interpretation is also reduced and improves overall accuracy.
Store historical data: DataWarehouses are also used to store historical data that means, the time variabledata from the past and this input can be used for various purposes.
Make strategic decisions: Data warehouses contribute to making better strategic decisions. Some businessstrategies may be depending upon the data stored within the datawarehouses.
High response time: Data warehouse has got to be prepared for somewhat sudden masses and type of queries thatdemands a major degree of flexibility and fast latency.

Data Warehousing Process

There are five notable stages of data warehousing, starting with defining the business goals and how they are measured.

1.Establishing Business Objectives

Every business has objectives, and the first step in the data warehousing process is clearly determining those business objectives.example, delivering the best telecommunication services in the industry. Every department should have objectives that contribute to the larger company goal.

2.Collecting And Analysing Information

The next stage will collect and analyse the information to put into the data warehouse. Each business department will create a summary and analytical reports that include information about various business processes.

3.Developing A Data Framework

storing data in the warehouse needs to be done in an organized and efficient manner. The best way to do that is to develop a conceptual data model. A data framework will ensure that the data you hold in the warehouse is complete. The framework can take a long time to construct, but it makes the data warehousing process relatively easy.

4.Moving The Data

Every business has databases and backups, which will be the primary source of information. You will have to ensure the data is complete from the source or program it for completion before transfer.

The key is to ensure any reports generated from the data warehouse match the reports from the data source.

5.Plan Implementation

The final stage will be implementing the plan. Once you move the data and successfully track its importation into the data warehouse, you must ensure that the data fits well into the predetermined structure or framework. If it is a large project, it will include developing phases and schedules for data delivery.

Warehouse Planning and Implementation

The warehouse planning is to minimise high warehousing costs which arise, for example, from low-capacity utilisation, outdated warehouse technology or complex organisation. A further aim can be to increase the degree of automation. Implementing a data warehouse involves several critical steps, each contributing to the overall success of the data architect’s project.

1. Requirement Gathering and Analysis

Identifying the data sources and the type of data to be collected.
Determining the business goals that the data warehouse should support.
Engaging stakeholders, including business users, data engineers, and database administrators, to gather requirements

2. Designing the Data Warehouse Architecture

A well-planned data warehouse architecture is crucial for efficient data storage and efficient data retrieval.

Data Modeling: Designing the schema, which defines how data is organized within the warehouse. Common models include star schema, snowflake schema, and data vault modeling.
ETL Processes: Planning how data will be extracted from source systems, transformed to fit the warehouse schema, and loaded into the warehouse.
Data Storage Solutions: Selecting appropriate technologies for data storage, such as relational databases or big data platforms.

3.Data Cleansing and Validation

· Data Cleansing: Identifying and correcting errors or inconsistencies in the data.

· Data Validation: Ensuring that the data meets the predefined quality criteria and business rules

4. Implementing Data Security and Compliance

Data security and compliance are paramount in any such data warehousing system or project.

· Data Encryption: Protecting data at rest and in transit through encryption.

· Access Controls: Implementing role-based access to restrict who can access or modify data.

· Compliance: Ensuring that the data warehouse complies with relevant regulations and standards.

5.Testing and Quality Assurance

· User Acceptance Testing (UAT): Engaging end-users to test the system and ensure it meets their needs.

· Quality Assurance (QA): Testing the system for data accuracy, query performance, and security.

6.Deployment and Maintenance

The data warehouse has passed all tests, it is ready for deployment. Ongoing maintenance is crucial to ensure all the data the system remains reliable and efficient.

· Monitoring: Continuously monitoring the data warehouse for performance issues or data inconsistencies.

· Upgrades and Scalability: Updating the system to handle increased data volumes or new business requirements.

The client/server computing model and data warehousing

The client/server computing model plays a significant role in the architecture of data warehousing systems, providing a framework for distributing processing tasks between clients (end-users) and servers (data warehouse infrastructure). Here’s how the client/server model relates to data warehousing:

Computing Model

In the client/server computing model, computing tasks are divided between clients and servers:

1. Client:

· Refers to the end-user devices, such as desktop computers, laptops, tablets, or smartphones.

· Executes user applications and interfaces with the user.

· Requests data and services from servers.

2. Server:

· Refers to powerful computers or server clusters that provide services to clients.

· Handles data storage, processing, and management tasks.

· Responds to client requests by providing data or executing operations.

Role in Data Warehousing

In the context of data warehousing, the client/server model is used to facilitate access to and utilization of the data warehouse by end-users:

1. Client-

2. Side Applications:

· Business intelligence (BI) tools, reporting applications, and analytical software installed on client devices.

· Allows users to interact with the data warehouse, query data, generate reports, and visualize insights.

3. Server-Side Components:

· Data warehouse servers and associated infrastructure.

· Stores and manages large volumes of structured and sometimes unstructured data.

· Executes complex data processing tasks, including ETL (Extract, Transform, Load), data aggregation, and query optimization.

Interaction in Data Warehousing

The client/server model facilitates the interaction between end-users and the data warehouse:

1. Data Retrieval:

· Clients send queries to the data warehouse servers to retrieve specific datasets or perform analyses.

· Servers process these queries, accessing the required data from storage, and returning the results to the clients.

2. Data Presentation:

· Servers provide processed data to clients in a format suitable for analysis or visualization.

· Clients use BI tools or reporting applications to present the data to end-users in the form of charts, graphs, tables, or dashboards.

3. Data Manipulation:

· Clients can manipulate and analyse data locally using client-side applications.

· Servers may also provide services for advanced data processing, such as machine learning or predictive analytics, depending on the architecture.

Benefits

The client/server computing model offers several benefits in the context of data warehousing:

1. Scalability:

· Enables scaling of both client and server components independently based on demand.

· Additional clients can be added without impacting server performance, and vice versa.

2. Centralized Management:

· Centralizes data storage and management on servers, simplifying administration and ensuring data consistency.

3. Resource Utilization:

· Distributes processing tasks between clients and servers, optimizing resource utilization and performance.

4. Flexibility:

· Allows users to access data warehouse resources from various client devices and locations, providing flexibility in data access and analysis.

Distributed DBMS implementations in data warehousing

A distributed database is essentially a database that is dispersed across numerous sites, i.e., on various computers or over a network of computers, and is not restricted to a single system. A distributed database system is spread across several locations with distinct physical components.

Types:

1. Homogeneous Database: A homogeneous database stores data uniformly across all locations. All sites utilize the same operating system, database management system, and data structures. They are therefore simple to handle.

2. Heterogeneous Database: With a heterogeneous distributed database, many locations may employ various software and schema, which may cause issues with queries and transactions. They could even employ separate database data models. Translations are therefore necessary for communication across various sites.

Data may be stored on several places in two ways using distributed data storage:

Replication - With this strategy, every aspect of the connection is redundantly kept at two or more locations. It is a completely redundant database if the entire database is accessible from every location. Systems preserve copies of the data as a result of replication. This has advantages since it makes more data accessible at many locations. Moreover, query requests can now be handled in parallel.
Fragmentation - In this method, the relationships are broken up into smaller pieces and each fragment is kept in the many locations where it is needed. To ensure there is no data loss, the pieces must be created in a way that allows for the reconstruction of the original relation. As fragmentation doesn't result in duplicate data, consistency is not a concern.

Uses for distributed databases

The corporate management information system makes use of it.

Multimedia apps utilize it.

Used in hotel chains, military command systems, etc.

The production control system also makes use of it

Characteristics of distributed databases

Distributed databases are logically connected to one another when they are part of a collection, and they frequently form a single logical database. Data is physically stored across several sites and is separately handled in distributed databases.

Place unrelated

Spread-out query processing

The administration of distributed transactions

Independent of hardware

Network independent of operating systems

Transparency of transactions

DBMS unrelated

Designing Data Warehouse Schemas

Designing an effective data warehouse schema involves understanding business processes, defining dimension and fact tables, and ensuring data integrity.

Steps in Schema Design

Identify Business Processes: Determine the key processes and measures to be analysed.
Define Dimension Tables: Create tables for descriptive attributes.
Create Fact Table: Centralize measures and link to dimension tables.
Normalize Data: For snowflake schema, further break down dimension tables to reduce redundancy.
Ensure Data Integrity: Use primary and foreign keys to maintain relationships.

Benefits of Data Warehouse Schemas

Efficient Querying: Well-designed schemas enhance query performance.
Data Integration: Facilitates combining data from multiple sources.
Reduced Redundancy: Normalization in snowflake schema lowers data duplication.
Enhanced Data Analysis: Supports complex queries and detailed analysis.

Challenges in Data Warehouse Schemas

Complexity: Designing and maintaining complex schemas can be challenging.
Storage Space: Snowflake and fact constellation schemas may require more storage.
Query Performance: Multiple joins in snowflake schema can impact performance.

Best Practices for Data Warehouse Schema Design

Understand Business Needs: Tailor the schema to business requirements.
Balance Normalization: Find a balance between reducing redundancy and maintaining performance.
Ensure Scalability: Design schemas that can grow with the business.
Optimize Query Performance: Use indexing and partitioning to enhance performance.