Unit 3: Data Warehousing and Data Mining
Data Mining
Definition and functionalities
Data mining functions are used to specify the kinds of patterns, trends, or correlations to be found in data mining tasks. Broadly, data mining activities can be divided into two categories:
1. Descriptive Data Mining: This category of data mining is concerned with finding patterns and relationships in the data that can provide insight into its underlying structure. Descriptive data mining is often used to summarize or explore the data.
Cluster analysis: This technique is used to identify groups of data points that share similar characteristics. Clustering can be used for segmentation, anomaly detection, and summarization (a brief sketch appears after the techniques below).
Association rule mining:
This technique is used to identify
relationships between variables in the data. It can be used to discover
co-occurring events or to identify patterns in transaction data.
Visualization:
This technique is used to represent the
data in a visual format that can help users to identify patterns or trends that
may not be apparent in the raw data.
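To make cluster analysis concrete, here is a minimal sketch using scikit-learn's KMeans on synthetic data; the two features, the three segments, and the choice of k = 3 are illustrative assumptions, not part of any particular application.

# Minimal cluster-analysis sketch with scikit-learn's KMeans.
# The synthetic data and the choice of k=3 are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three artificial customer segments in a 2-D feature space
# (e.g., annual spend vs. visit frequency, already scaled).
data = np.vstack([
    rng.normal(loc=[1.0, 1.0], scale=0.2, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.2, size=(50, 2)),
    rng.normal(loc=[1.0, 5.0], scale=0.2, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print("Labels of first 5 points:", kmeans.labels_[:5])
print("Cluster centres:\n", kmeans.cluster_centers_)

Points assigned to the same label form one segment; points far from every centre are candidate anomalies, which is how clustering supports anomaly detection.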
2. Predictive Data Mining: This
category of data mining is concerned with developing models that can predict
future behaviour or outcomes based on historical data. Predictive data mining
is often used for classification or regression tasks.
Decision trees: This
technique is used to create a model that can predict the value of a target
variable based on the values of several input variables. Decision trees are
often used for classification tasks.
Neural networks: This
technique is used to create a model that can learn to recognize patterns in the
data. Neural networks are often used for image recognition, speech recognition,
and natural language processing.
Regression analysis: This
technique is used to create a model that can predict the value of a target
variable based on the values of several input variables. Regression analysis is
often used for prediction tasks.
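As a concrete illustration of predictive mining, the sketch below fits a linear regression with scikit-learn; the advertising-spend-versus-sales data and all variable names are made up for the example.

# Minimal predictive-mining sketch: linear regression with scikit-learn.
# The synthetic advertising-vs-sales data is an illustrative assumption.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
ad_spend = rng.uniform(0, 100, size=(100, 1))              # input variable
sales = 3.5 * ad_spend[:, 0] + 20 + rng.normal(0, 5, 100)  # noisy target

model = LinearRegression().fit(ad_spend, sales)
print("Coefficient:", model.coef_[0])    # should be close to 3.5
print("Intercept:", model.intercept_)    # should be close to 20
print("Prediction for spend=50:", model.predict([[50.0]])[0])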
Both descriptive and predictive
data mining techniques are important for
gaining insights and making better decisions. Descriptive data mining can be
used to explore the data and identify patterns, while predictive data mining
can be used to make predictions based on those patterns. Together, these
techniques can help organizations to understand their data and make informed
decisions based on that understanding.
Data Mining Functionality:
1. Class/Concept Descriptions: Data can be associated with classes or concepts. It is often useful to describe individual classes and concepts in summarized, concise, and precise terms. Such descriptions of a class or a concept are called class/concept descriptions.
- Data Characterization: This refers to summarizing the general characteristics or features of the class under study. The output of data characterization can be presented in various forms, including pie charts, bar charts, curves, and multidimensional data cubes.
Example: To study the characteristics of software products whose sales increased by 10% in the previous year, or to summarize the characteristics of customers who spend more than $5,000 a year at All Electronics. The result is a general profile of those customers, such as that they are 40-50 years old, employed, and have excellent credit ratings.
- Data Discrimination: This compares the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.
Example: We may want to compare two groups of customers: those who shop for computer products regularly and those who rarely shop for such products (less than 3 times a year). The resulting description provides a general comparative profile of those customers, such as: 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university degree, whereas 60% of the customers who infrequently buy such products are either seniors or youth and have no university degree.
2. Mining Frequent Patterns,
Associations, and Correlations: Frequent
patterns are nothing but things that are found to be most common in the data.
There are different kinds of frequencies that can be observed in the dataset.
- Frequent itemset: This refers to a set of items that are frequently seen together, e.g., milk and sugar.
- Frequent Subsequence: This refers to a sequence of patterns that occurs frequently, such as purchasing a phone followed by a back cover.
- Frequent
Substructure: It refers to the
different kinds of data structures such as trees and graphs that may be
combined with the itemset or subsequence.
Association Analysis: This is the process of uncovering relationships among data and determining association rules. It is a way of discovering relationships between various items.
Example: Suppose we want to know which items are frequently purchased together. An example of such a rule, mined from a transactional database, is:
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
where X is a variable representing a customer.
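The support and confidence of such a rule can be computed directly from transaction counts. Below is a minimal sketch over a hypothetical list of market-basket transactions; real miners such as Apriori or FP-growth search over all candidate rules, but the counting logic is the same.

# Support and confidence for the rule {computer} => {software},
# computed over a hypothetical (made-up) list of transactions.
transactions = [
    {"computer", "software", "mouse"},
    {"computer", "printer"},
    {"software"},
    {"computer", "software"},
    {"printer", "paper"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n              # P(computer and software)
confidence = both / antecedent  # P(software | computer)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")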
Data Preprocessing
Data preprocessing is an important step
in the data mining process. It refers to the cleaning, transforming, and
integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more
suitable for the specific data mining task.
Steps of Data Preprocessing
Data preprocessing is an important step
in the data mining process that involves cleaning and transforming raw data to
make it suitable for analysis. Some common steps in data preprocessing include:
- Data
Cleaning: This involves
identifying and correcting errors or inconsistencies in the data, such as
missing values, outliers, and duplicates. Various techniques can be used
for data cleaning, such as imputation, removal, and transformation.
- Data
Integration: This involves
combining data from multiple sources to create a unified dataset. Data
integration can be challenging as it requires handling data with different
formats, structures, and semantics. Techniques such as record linkage and
data fusion can be used for data integration.
- Data
Transformation: This involves
converting the data into a suitable format for analysis. Common techniques
used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range,
while standardization is used to transform the data to have zero mean and
unit variance. Discretization is used to convert continuous data into
discrete categories.
- Data
Reduction: This involves
reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as
feature selection and feature extraction. Feature selection involves
selecting a subset of relevant features from the dataset, while feature
extraction involves transforming the data into a lower-dimensional space
while preserving the important information.
- Data
Discretization: This involves
dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning
algorithms that require categorical data. Discretization can be achieved
through techniques such as equal width binning, equal frequency binning,
and clustering.
- Data
Normalization: This involves
scaling the data to a common range, such as between 0 and 1 or -1 and 1.
Normalization is often used to handle data with different units and
scales. Common normalization techniques include min-max normalization,
z-score normalization, and decimal scaling.
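As a concrete illustration of the normalization step, here is a minimal sketch on a made-up numeric column; min-max normalization and z-score standardization are computed by hand so the formulas stay visible.

# Min-max normalization and z-score standardization on made-up data.
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: scale to the range [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: zero mean and unit variance.
z_score = (values - values.mean()) / values.std()

print("min-max :", np.round(min_max, 3))
print("z-score :", np.round(z_score, 3))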
Data preprocessing plays a crucial role
in ensuring the quality of data and the accuracy of the analysis results. The
specific steps involved in data preprocessing may vary depending on the nature
of the data and the analysis goals.
By performing these steps, the data
mining process becomes more efficient and the results become more accurate.
Preprocessing in Data Mining
1. Data Cleaning: The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling missing data, noisy data, etc.
- Missing Data: This situation arises when some values are missing from the dataset. It can be handled in various ways. Some of them are:
- Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
- Fill the missing values: There are various ways to do this task. You can choose to fill the missing values manually, by the attribute mean, or by the most probable value.
- Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
- Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and various smoothing methods are then applied. Each segment is handled separately: one can replace all data in a segment by its mean, or boundary values can be used to complete the task (see the sketch after this list).
- Regression: Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
- Clustering: This approach groups similar data into clusters. The outliers may go undetected, or they will fall outside the clusters.
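Here is a minimal data-cleaning sketch on a made-up price column: it fills a missing value with the attribute mean and then smooths the sorted data by equal-size bin means, as in the binning method described above.

# Fill a missing value with the attribute mean, then smooth sorted
# data by bin means (equal-size bins). The price data is made up.
import numpy as np

prices = np.array([4.0, 8.0, np.nan, 15.0, 21.0, 21.0, 24.0, 25.0, 28.0])

# 1. Missing data: replace NaN with the attribute mean.
prices[np.isnan(prices)] = np.nanmean(prices)

# 2. Binning: sort, split into 3 equal-size bins, replace by bin means.
bins = np.array_split(np.sort(prices), 3)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print("smoothed:", np.round(smoothed, 2))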
2. Data Transformation: This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:
- Normalization: It
is done in order to scale the data values in a specified range (-1.0 to
1.0 or 0.0 to 1.0)
- Attribute
Selection: In this strategy, new
attributes are constructed from the given set of attributes to help the
mining process.
- Discretization: This
is done to replace the raw values of numeric attribute by interval levels
or conceptual levels.
- Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
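A short sketch of the last two transformations, assuming made-up ages and a hypothetical city-to-country mapping:

# Discretization of a numeric attribute plus a simple concept-hierarchy
# mapping (city -> country). The data and the mapping are made up.
import pandas as pd

df = pd.DataFrame({
    "age": [15, 22, 37, 45, 63, 78],
    "city": ["Kathmandu", "Delhi", "Tokyo", "Pokhara", "Mumbai", "Osaka"],
})

# Discretization: replace raw ages by interval labels.
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 40, 60, 100],
                         labels=["youth", "young adult", "middle-aged", "senior"])

# Concept hierarchy generation: climb from city up to country.
city_to_country = {"Kathmandu": "Nepal", "Pokhara": "Nepal",
                   "Delhi": "India", "Mumbai": "India",
                   "Tokyo": "Japan", "Osaka": "Japan"}
df["country"] = df["city"].map(city_to_country)
print(df)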
3. Data Reduction: Data
reduction is a crucial step in the data mining process that involves reducing
the size of the dataset while preserving the important information. This is done
to improve the efficiency of data analysis and to avoid overfitting of the
model.
- Feature
Selection: This involves selecting
a subset of relevant features from the dataset. Feature selection is often
performed to remove irrelevant or redundant features from the dataset. It
can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).
- Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the important information. Feature extraction is often used when the original features are high-dimensional and complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF); a brief sketch follows this list.
- Sampling: This
involves selecting a subset of data points from the dataset. Sampling is
often used to reduce the size of the dataset while preserving the
important information. It can be done using techniques such as random
sampling, stratified sampling, and systematic sampling.
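As a brief sketch of feature extraction, the example below projects a made-up 4-feature dataset onto 2 principal components with scikit-learn's PCA; the random data and the choice of 2 components are illustrative assumptions.

# Dimensionality reduction with PCA: project 4 features onto 2
# principal components. The random data is an illustrative assumption.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))                    # 100 samples, 4 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # make one feature redundant

pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)
print("Reduced shape:", X_reduced.shape)          # (100, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)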
Decision Tree
A decision tree is a
flowchart-like structure used to make decisions or predictions. It consists of
nodes representing decisions or tests on attributes, branches representing the
outcome of these decisions, and leaf nodes representing final outcomes or
predictions. Each internal node corresponds to a test on an attribute, each
branch corresponds to the result of the test, and each leaf node corresponds to
a class label or a continuous value.
Structure of a Decision Tree
- Root
Node: Represents the entire dataset and the initial decision to be made.
- Internal
Nodes: Represent decisions or tests on attributes. Each internal node has
one or more branches.
- Branches:
Represent the outcome of a decision or test, leading to another node.
- Leaf
Nodes: Represent the final decision or prediction. No further splits occur
at these nodes.
How Decision Trees Work
The process of creating a decision tree
involves:
- Selecting
the Best Attribute: Using a metric like Gini impurity, entropy, or information
gain, the best attribute to split the data is selected.
- Splitting
the Dataset: The dataset is split into subsets based on the selected
attribute.
- Repeating
the Process: The process is repeated recursively for each subset, creating
a new internal node or leaf node until a stopping criterion is met (e.g.,
all instances in a node belong to the same class or a predefined depth is
reached).
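To make this process concrete, here is a minimal sketch that trains a decision tree classifier on scikit-learn's built-in iris dataset; the dataset choice and the hyperparameter values are illustrative assumptions.

# Train a decision tree classifier on the iris dataset (scikit-learn).
# The dataset and max_depth value are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# criterion="entropy" selects splits by information gain.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              random_state=0).fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))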
Metrics for Splitting
- Gini
Impurity: Measures the likelihood of an
incorrect classification of a new instance if it was randomly classified
according to the distribution of classes in the dataset.
- Gini = 1 − ∑ᵢ₌₁ⁿ (pᵢ)², where pᵢ is the probability of an instance being classified into a particular class.
- Entropy:
Measures the amount of uncertainty or impurity in the dataset.
- Entropy = −∑ᵢ₌₁ⁿ pᵢ log₂(pᵢ), where pᵢ is the probability of an instance being classified into a particular class.
- Information
Gain: Measures the reduction in entropy
or Gini impurity after a dataset is split on an attribute.
- Information Gain = Entropy(parent) − ∑ᵢ₌₁ⁿ (|Dᵢ| / |D|) · Entropy(Dᵢ), where Dᵢ is the i-th subset of D after splitting by an attribute.
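These metrics are easy to compute by hand. The sketch below evaluates Gini impurity and entropy for a made-up set of class labels, then the information gain of a hypothetical split:

# Gini impurity, entropy, and information gain for made-up labels.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array(["yes"] * 6 + ["no"] * 4)   # 6 yes, 4 no
left, right = parent[:5], parent[5:]          # a hypothetical split

info_gain = entropy(parent) - (len(left) / len(parent) * entropy(left)
                               + len(right) / len(parent) * entropy(right))
print(f"Gini(parent)    = {gini(parent):.3f}")         # 0.480
print(f"Entropy(parent) = {entropy(parent):.3f}")      # 0.971
print(f"Information gain of split = {info_gain:.3f}")  # 0.610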
Advantages of Decision Trees
- Simplicity
and Interpretability: Decision trees are easy to understand and interpret.
The visual representation closely mirrors human decision-making processes.
- Versatility:
Can be used for both classification and regression tasks.
- No
Need for Feature Scaling: Decision trees do not require normalization or
scaling of the data.
- Handles
Non-linear Relationships: Capable of capturing non-linear relationships
between features and target variables.
Disadvantages of Decision Trees
- Overfitting:
Decision trees can easily overfit the training data, especially if they
are deep with many nodes.
- Instability:
Small variations in the data can result in a completely different tree
being generated.
- Bias
towards Features with More Levels: Features with more levels can dominate
the tree structure.
Pruning
To overcome overfitting,
pruning techniques are used. Pruning reduces the size of the tree by
removing nodes that provide little power in classifying instances. There are
two main types of pruning:
- Pre-pruning
(Early Stopping): Stops the tree from growing once it meets certain
criteria (e.g., maximum depth, minimum number of samples per leaf).
- Post-pruning:
Removes branches from a fully grown tree that do not provide significant
power.
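Both styles are available in common libraries. A minimal sketch with scikit-learn, where max_depth acts as pre-pruning and ccp_alpha enables cost-complexity post-pruning (the dataset and parameter values are illustrative assumptions):

# Pre-pruning (max_depth) vs. cost-complexity post-pruning (ccp_alpha)
# on scikit-learn decision trees. Parameter values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pre = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, model in [("unpruned", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name:11s} nodes={model.tree_.node_count:3d} "
          f"test accuracy={model.score(X_test, y_test):.3f}")

The pruned trees are typically much smaller than the unpruned one at little or no cost in test accuracy.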
Applications of Decision Trees
- Business
Decision Making: Used in strategic planning and resource allocation.
- Healthcare:
Assists in diagnosing diseases and suggesting treatment plans.
- Finance:
Helps in credit scoring and risk assessment.
- Marketing:
Used to segment customers and predict customer behaviour.