1) For this week, you will be working through the steps of an affinity diagram. Choose one of the following problem statements:
– Power outages cause downtime
– Malicious code causes systems to crash and production loss
– Hardware failure causes data loss on the database server
Once you pick a statement, generate ideas and brainstorm based on this article: https://asq.org/quality-resources/affinity
For your peer responses, pick 2 and group the ideas based on step 3.

2) Assignment (Similarity and distance measures): Review the Chapter 2 recording and the Chapter 2 text, and answer the following question. Submit the answer in a Word document. (Note: this is NOT a group assignment.)
Compute the Hamming distance and the Jaccard similarity between the following two binary vectors:
x = 0101010001
y = 0100011000

chap2_data.pptx


Data Mining: Data

Lecture Notes for Chapter 2

Introduction to Data Mining, 2nd Edition

by

Tan, Steinbach, Karpatne, Kumar

01/22/2018


Outline

Attributes and Objects

Types of Data

Data Quality

Similarity and Distance

Data Preprocessing


What is Data?

Collection of data objects and their attributes

An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature

A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

In the table below, columns are attributes and rows are objects.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A More Complete View of Data

Data may have parts

The different parts of the data may have

relationships

More generally, data may have structure

Data can be incomplete

We will discuss this in more detail later


Attribute Values

Attribute values are numbers or symbols assigned to an attribute for a particular object

Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
◆ Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
◆ Example: Attribute values for ID and age are integers
◆ But properties of attribute values can be different


Measurement of Length

The way you measure an attribute may not match the attribute's properties.

[Figure: objects A to E measured on two scales, with values (5, 7, 8, 10, 15) and (1, 2, 3, 4, 5). One scale preserves only the ordering property of length; the other preserves the ordering and additivity properties of length.]

Types of Attributes

There are different types of attributes

– Nominal
◆ Examples: ID numbers, eye color, zip codes
– Ordinal
◆ Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
– Interval
◆ Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
◆ Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values

The type of an attribute depends on which of the

following properties/operations it possesses:

– Distinctness: = ≠
– Order: < >
– Differences are meaningful: + -
– Ratios are meaningful: * /

– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful differences
– Ratio attribute: all 4 properties/operations


Difference Between Ratio and Interval

Is it physically meaningful to say that a temperature of 10° is twice that of 5° on

– the Celsius scale?

– the Fahrenheit scale?

– the Kelvin scale?

Consider measuring the height above average

– If Bill’s height is three inches above average and

Bob’s height is six inches above average, then would

we say that Bob is twice as tall as Bill?

– Is this situation analogous to that of temperature?


Attribute Type: Description, Examples, Operations

Categorical (Qualitative)

Nominal
– Description: Nominal attribute values only distinguish. (=, ≠)
– Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
– Operations: mode, entropy, contingency correlation, χ² test

Ordinal
– Description: Ordinal attribute values also order objects. (<, >)
– Examples: hardness of minerals, {good, better, best}, grades, street numbers
– Operations: median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative)

Interval
– Description: For interval attributes, differences between values are meaningful. (+, –)
– Examples: calendar dates, temperature in Celsius or Fahrenheit
– Operations: mean, standard deviation, Pearson’s correlation, t and F tests

Ratio
– Description: For ratio variables, both differences and ratios are meaningful. (*, /)
– Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current
– Operations: geometric mean, harmonic mean, percent variation

This categorization of attributes is due to S. S. Stevens

Attribute Type: Transformation, Comments

Categorical (Qualitative)

Nominal
– Transformation: Any permutation of values
– Comments: If all employee ID numbers were reassigned, would it make any difference?

Ordinal
– Transformation: An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function
– Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative)

Interval
– Transformation: new_value = a * old_value + b where a and b are constants
– Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio
– Transformation: new_value = a * old_value
– Comments: Length can be measured in meters or feet.

This categorization of attributes is due to S. S. Stevens
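
As an illustrative sketch (the helper names and sample values are assumptions, not taken from the slides), the following Python shows why the interval transformation new_value = a * old_value + b keeps ratios of differences intact but not ratios of values, whereas the ratio transformation new_value = a * old_value keeps both.

```python
def celsius_to_fahrenheit(c):
    # Interval-scale transformation: new_value = a * old_value + b
    return 1.8 * c + 32.0

def meters_to_feet(m):
    # Ratio-scale transformation: new_value = a * old_value
    return 3.28084 * m

# Hypothetical sample values chosen only for illustration.
c_low, c_mid, c_high = 5.0, 10.0, 20.0
f_low, f_mid, f_high = (celsius_to_fahrenheit(c) for c in (c_low, c_mid, c_high))

# Ratios of values change under the affine map, so "10 degrees is twice
# 5 degrees" does not survive a change from Celsius to Fahrenheit.
ratio_celsius = c_mid / c_low
ratio_fahrenheit = f_mid / f_low

# Ratios of differences are preserved by the affine map.
diff_ratio_celsius = (c_high - c_mid) / (c_mid - c_low)
diff_ratio_fahrenheit = (f_high - f_mid) / (f_mid - f_low)

# For a ratio attribute such as length, the ratio of values is preserved.
length_ratio_m = 6.0 / 3.0
length_ratio_ft = meters_to_feet(6.0) / meters_to_feet(3.0)

print(ratio_celsius, ratio_fahrenheit)
print(diff_ratio_celsius, diff_ratio_fahrenheit)
print(length_ratio_m, length_ratio_ft)
```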

Discrete and Continuous Attributes

Discrete Attribute

– Has only a finite or countably infinite set of values

– Examples: zip codes, counts, or the set of words in a

collection of documents

– Often represented as integer variables.

– Note: binary attributes are a special case of discrete

attributes

Continuous Attribute

– Has real numbers as attribute values

– Examples: temperature, height, or weight.

– Practically, real values can only be measured and

represented using a finite number of digits.

– Continuous attributes are typically represented as floating-point variables.


Asymmetric Attributes

Only presence (a non-zero attribute value) is regarded as important
◆ Words present in documents
◆ Items present in customer transactions

If we met a friend in the grocery store would we ever say the following?
“I see our purchases are very similar since we didn’t buy most of the same things.”

We need two asymmetric binary attributes to represent one ordinary binary attribute
– Association analysis uses asymmetric attributes

Asymmetric attributes typically arise from objects that are sets


Some Extensions and Critiques

Velleman, Paul F., and Leland Wilkinson. “Nominal, ordinal, interval, and ratio typologies are misleading.” The American Statistician 47, no. 1 (1993): 65-72.

Mosteller, Frederick, and John W. Tukey. “Data analysis and regression. A second course in statistics.” Addison-Wesley Series in Behavioral Science: Quantitative Methods, Reading, Mass.: Addison-Wesley, 1977.

Chrisman, Nicholas R. “Rethinking levels of measurement for cartography.” Cartography and Geographic Information Systems 25, no. 4 (1998): 231-242.


Critiques

Incomplete

– Asymmetric binary

– Cyclical

– Multivariate

– Partially ordered

– Partial membership

– Relationships between the data

Real data is approximate and noisy

– This can complicate recognition of the proper attribute type

– Treating one attribute type as another may be approximately

correct


Critiques …

Not a good guide for statistical analysis
– May unnecessarily restrict operations and results
◆ Statistical analysis is often approximate
◆ Thus, for example, using interval analysis for ordinal values may be justified
– Transformations are common but don’t preserve scales
◆ Can transform data to a new scale with better statistical properties
◆ Many statistical analyses depend only on the distribution


More Complicated Examples

ID numbers

– Nominal, ordinal, or interval?

Number of cylinders in an automobile engine

– Nominal, ordinal, or ratio?

Biased Scale

– Interval or Ratio


Key Messages for Attribute Types

The types of operations you choose should be

“meaningful” for the type of data you have

– Distinctness, order, meaningful intervals, and meaningful ratios

are only four properties of data

– The data type you see – often numbers or strings – may not

capture all the properties or may suggest properties that are not

there

– Analysis may depend on these other properties of the data

◆ Many statistical analyses depend only on the distribution

– Many times what is meaningful is measured by statistical

significance

– But in the end, what is meaningful is measured by the domain


Types of data sets

Record

– Data Matrix

– Document Data

– Transaction Data

Graph

– World Wide Web

– Molecular Structures

Ordered

– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data


Important Characteristics of Data

– Dimensionality (number of attributes)
◆ High dimensional data brings a number of challenges
– Sparsity
◆ Only presence counts
– Resolution
◆ Patterns depend on the scale
– Size
◆ Type of analysis may depend on size of data


Record Data

Data that consists of a collection of records, each

of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes


Data Matrix

If data objects have the same fixed set of numeric

attributes, then the data objects can be thought of as

points in a multi-dimensional space, where each

dimension represents a distinct attribute

Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1


Document Data

Each document becomes a ‘term’ vector

– Each term is a component (attribute) of the vector

– The value of each component is the number of times

the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
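
A minimal sketch of how such a term vector could be built, assuming simple lowercase whitespace tokenization. The function name `term_vector` and the toy sentence are hypothetical; only the vocabulary comes from the table.

```python
from collections import Counter

# Vocabulary in the same order as the table columns.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur, so absent terms map to 0.
    return [counts[term] for term in vocabulary]

# Hypothetical toy document, not one of the documents in the table.
doc = "team play play game game game score"
vector = term_vector(doc)
print(vector)
```

Real text would need punctuation handling and stemming before counting; those steps are left out here to keep the example short.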


Transaction Data

A special type of record data, where

– Each record (transaction) involves a set of items.

– For example, consider a grocery store. The set of

products purchased by a customer during one

shopping trip constitute a transaction, while the

individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
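
One way to work with transaction data is to turn each transaction into a vector of asymmetric binary attributes, one per item. The sketch below assumes an alphabetical item order and uses hypothetical names (`to_binary_vector`, `binary_rows`); the transactions are the five listed above.

```python
# Transactions from the table (TID -> set of items).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Fixed item order so every transaction maps to a vector of the same length.
items = sorted(set().union(*transactions.values()))

def to_binary_vector(item_set, item_order=items):
    """1 if the item is present in the transaction, 0 otherwise."""
    return [1 if item in item_set else 0 for item in item_order]

binary_rows = {tid: to_binary_vector(s) for tid, s in transactions.items()}

print(items)
for tid, row in binary_rows.items():
    print(tid, row)
```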


Graph Data

Examples: Generic graph, a molecule, and webpages

[Figures: a generic graph, and Benzene Molecule: C6H6]


Ordered Data

Sequences of transactions

[Figure: sequences of transactions, with labels marking the Items/Events and an element of the sequence]


Ordered Data

Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC

CGCAGGGCCCGCCCCGCGCCGTC

GAGAAGGGCCCGCCTGGCGGGCG

GGGGGAGGCGGGGCCGCCCGAGC

CCAACCGAGTCCGACCAGGTGCC

CCCTCTGCTCGGCCTAGACCTGA

GCTCATTAGGCGGCAGCGGACAG

GCCAAGTAGAACACGCGAAGCGC

TGGGCTGCCTGCTGCGACCAGGG


Ordered Data

Spatio-Temporal Data

[Figure: Average Monthly Temperature of land and ocean]


Data Quality

Poor data quality negatively affects many data processing

efforts

“The most important point is that poor data quality is an unfolding disaster. Poor data quality costs the typical company at least ten percent (10%) of revenue; twenty percent (20%) is probably a better estimate.”
Thomas C. Redman, DM Review, August 2004

Data mining example: a classification model for detecting

people who are loan risks is built using poor data

– Some credit-worthy candidates are denied loans

– More loans are given to individuals that default


Data Quality …

What kinds of data quality problems?

How can we detect problems with the data?

What can we do about these problems?

Examples of data quality problems:

– Noise and outliers
– Missing values
– Duplicate data
– Wrong data


Noise

For objects, noise is an extraneous object

For attributes, noise refers to modification of original values

– Examples: distortion of a person’s voice when talking on a poor phone and “snow” on a television screen

[Figure panels: Two Sine Waves; Two Sine Waves + Noise]

Outliers

Outliers are data objects with characteristics that

are considerably different than most of the other

data objects in the data set

– Case 1: Outliers are noise that interferes with data analysis
– Case 2: Outliers are the goal of our analysis
◆ Credit card fraud
◆ Intrusion detection

Causes?


Missing Values

Reasons for missing values

– Information is not collected

(e.g., people decline to give their age and weight)

– Attributes may not be applicable to all cases

(e.g., annual income is not applicable to children)

Handling missing values

– Eliminate data objects or variables

– Estimate missing values
◆ Example: time series of temperature
◆ Example: census results
– Ignore the missing value during analysis
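
The three handling strategies above can be sketched as follows. The readings are hypothetical, `None` is an assumed marker for a missing value, and mean imputation is just one simple way to estimate; it is not prescribed by the slides.

```python
# Hypothetical temperature readings; None marks a missing value.
readings = [21.0, None, 23.0, 22.0, None, 20.0]

# Strategy 1: eliminate data objects that have a missing value.
complete_only = [r for r in readings if r is not None]

# Strategy 2: estimate missing values, here with the mean of the observed
# values (interpolating between neighbors is a common alternative for time series).
mean_observed = sum(complete_only) / len(complete_only)
imputed = [mean_observed if r is None else r for r in readings]

# Strategy 3: ignore missing values during analysis, i.e. compute a statistic
# over the observed values only and leave the data set unchanged.
def mean_ignoring_missing(values):
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

print(complete_only)
print(imputed)
print(mean_ignoring_missing(readings))
```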


Missing Values …

Missing completely at random (MCAR)

– Missingness of a value is independent of attributes

– Fill in values based on the attribute

– Analysis may be unbiased overall

Missing at Random (MAR)

– Missingness is related to other variables

– Fill in values based on other values

– Almost always produces a bias in the analysis

Missing Not at Random (MNAR)

– Missingness is related to unobserved measurements

– Informative or non-ignorable missingness

Not possible to know the situation from the data


Duplicate Data

Data set may include data objects that are

duplicates, or almost duplicates of one another

– Major issue when merging data from heterogeneous

sources

Examples:

– Same person with multiple email addresses

Data cleaning

– Process of dealing with duplicate data issues

When should duplicate data not be removed?


Similarity and Dissimilarity Measures

Similarity measure

– Numerical measure of how alike two data objects are.

– Is higher when objects are more alike.

– Often falls in the range [0,1]

Dissimilarity measure

– Numerical measure of how different two data objects

are

– Lower when objects are more alike

– Minimum dissimilarity is often 0

– Upper limit varies

Proximity refers to a similarity or dissimilarity


Similarity/Dissimilarity for Simple Attributes

The following table shows the similarity and dissimilarity

between two objects, x and y, with respect to a single, simple

attribute.
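
As a hedged illustration for a single attribute, the sketch below uses one commonly used set of definitions. These formulas and function names are assumptions made for illustration and are not taken from the slide's table.

```python
def nominal_dissimilarity(x, y):
    # Assumed definition: 0 if the values match, 1 otherwise.
    return 0 if x == y else 1

def ordinal_dissimilarity(x, y, n_levels):
    # Assumed definition: values are mapped to integers 0..n_levels-1
    # and d = |x - y| / (n_levels - 1), so d lies in [0, 1].
    return abs(x - y) / (n_levels - 1)

def interval_dissimilarity(x, y):
    # Assumed definition: d = |x - y|.
    return abs(x - y)

def similarity_bounded(d):
    # For dissimilarities already in [0, 1]: s = 1 - d.
    return 1 - d

def similarity_unbounded(d):
    # One of several possible conversions for unbounded d: s = 1 / (1 + d).
    return 1.0 / (1.0 + d)

# Hypothetical usage: {poor, fair, good} mapped to {0, 1, 2}.
d_ord = ordinal_dissimilarity(0, 2, n_levels=3)
print(d_ord, similarity_bounded(d_ord))
print(interval_dissimilarity(3.0, 7.5), similarity_unbounded(4.5))
```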


Euclidean Distance

\[ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} \]

where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

Standardization is necessary if scales differ.
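
To illustrate why standardization matters, the sketch below compares Euclidean distances before and after z-score standardization. The attribute values and function names are hypothetical, and the population standard deviation is an assumed choice.

```python
import math

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation.

    Assumes the values are not all identical (otherwise std would be 0).
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def euclidean(x, y):
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

# Hypothetical attributes measured on very different scales.
income = [125000.0, 100000.0, 70000.0]   # dollars
age = [30.0, 45.0, 60.0]                 # years

raw_points = list(zip(income, age))
std_points = list(zip(standardize(income), standardize(age)))

raw_d = euclidean(raw_points[0], raw_points[1])   # dominated by income
std_d = euclidean(std_points[0], std_points[1])   # both attributes contribute

print(raw_d, std_d)
```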


Euclidean Distance

[Figure: scatter plot of the points p1, p2, p3, p4]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
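
The distance matrix for the four points can be recomputed with a short Python sketch. The dictionary layout and names are illustrative choices; the coordinates are those of p1 to p4.

```python
import math

# Coordinates of the four points from the slide.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(x, y):
    """Square root of the summed squared differences over all attributes."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

names = list(points)
# Round to three decimals to match the precision of the slide's matrix.
distance_matrix = [[round(euclidean(points[a], points[b]), 3) for b in names]
                   for a in names]

for name, row in zip(names, distance_matrix):
    print(name, row)
```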


Minkowski Distance

Minkowski Distance is a generalization of Euclidean Distance

\[ d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} |x_k - y_k|^{r} \right)^{1/r} \]

where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.


Minkowski Distance: Examples

r = 1. City block (Manhattan, taxicab, L1 norm) distance.

– A common example of this is the Hamming distance, which

is just the number of bits that are different between two

binary vectors

r = 2. Euclidean distance

r → ∞. “supremum” (L_max norm, L_∞ norm) distance.

– This is the maximum difference between any component of

the vectors

Do not confuse r with n, i.e., all these distances are

defined for all numbers of dimensions.


Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0

Distance Matrix
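
The three matrices can be reproduced with one Minkowski function. This is a minimal sketch with illustrative names; using `math.inf` to select the supremum distance is a convention assumed here.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance; r = math.inf gives the supremum (L_max) distance."""
    diffs = [abs(xk - yk) for xk, yk in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

# Coordinates of the four points from the slide.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def distance_matrix(r):
    names = list(points)
    return [[round(minkowski(points[a], points[b], r), 3) for b in names]
            for a in names]

l1 = distance_matrix(1)            # city block distance
l2 = distance_matrix(2)            # Euclidean distance
l_inf = distance_matrix(math.inf)  # supremum distance

for label, matrix in (("L1", l1), ("L2", l2), ("L_inf", l_inf)):
    print(label, matrix)
```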


Mahalanobis Distance

\[ \mathrm{mahalanobis}(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})^{T} \Sigma^{-1} (\mathbf{x} - \mathbf{y}) \]

Σ is the covariance matrix

For red points, the Euclidean distance is 14.7, Mahalanobis distance is 6.


Mahalanobis Distance

Covariance Matrix:

\[ \Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix} \]

A: (0.5, 0.5)

B: (0, 1)

C: (1.5, 1.5)

Mahal(A,B) = 5

Mahal(A,C) = 4
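
The two values can be checked with a small sketch that follows the slide's formula (x − y)^T Σ^{-1} (x − y), with no square root. The hand-written 2x2 inverse and the function names are illustrative choices, not part of the slides.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis(x, y, cov):
    """(x - y)^T Sigma^{-1} (x - y) for 2-dimensional points."""
    inv = inverse_2x2(cov)
    diff = [xi - yi for xi, yi in zip(x, y)]
    # Row vector times matrix, then dot product with the difference again.
    tmp = [sum(diff[i] * inv[i][j] for i in range(2)) for j in range(2)]
    return sum(tmp[j] * diff[j] for j in range(2))

sigma = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

mahal_ab = mahalanobis(A, B, sigma)
mahal_ac = mahalanobis(A, C, sigma)

print(round(mahal_ab, 6), round(mahal_ac, 6))
```

Although C is farther from A than B in Euclidean terms, its Mahalanobis value is smaller because the difference A − C lies along the direction of positive covariance.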


Common Properties of a Distance

Distances, such as the Euclidean distance,

have some well known properties.

1. d(x, y) ≥ 0 for all x and y and d(x, y) = 0 only if x = y. (Positive definiteness)

2. d(x, y) = d(y, x) for all x and y. (Symmetry)

3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)

where d(x, y) is the distance (dissimilarity) between

points (data objects), x and y.

A distance that satisfies these properties is a

metric
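
As a rough sketch, the three properties can be tested on a finite sample of points. The function name, the tolerance, and the use of squared Euclidean distance as a contrasting non-metric are assumptions for illustration; passing the check on a sample does not prove that a distance is a metric.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    # Used only as a contrast: it violates the triangle inequality.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def satisfies_metric_properties(d, points, tol=1e-12):
    """Look for violations of the three properties among the given points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                        # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):         # zero distance exactly for identical points
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # triangle inequality
            return False
    return True

# Sample points (the same p1..p4 coordinates used in the distance examples).
sample = [(0, 2), (2, 0), (3, 1), (5, 1)]

print(satisfies_metric_properties(euclidean, sample))
print(satisfies_metric_properties(squared_euclidean, sample))
```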


Common Properties of a Similarity

Similarities also have some well known properties.

1. s(x, y) = 1 (or maximum similarity) only if x = y.

2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects), x and y.


Similarity Between Binary Vectors

Common situation is that objects, p and q, have only

binary attributes

Compute similarities using the following quantities

f01 = the number of attributes where p was 0 and q was 1

f10 = the number of attributes where p was 1 and q was 0

f00 = the number of attributes where p was 0 and q was 0

f11 = the number of attributes where p was 1 and q was 1

Simple Matching and Jaccard Coefficients

SMC = number of matches / number of attributes

= (f11 + f00) / (f01 + f10 + f11 + f00)

J = number of 11 matches / number of non-zero attributes

= (f11) / (f01 + f10 + f11)
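
The counts and coefficients above can be computed directly. The sketch below represents binary vectors as strings of 0s and 1s and also derives the Hamming distance as f01 + f10. The function names are illustrative, and returning 0 for the Jaccard coefficient of two all-zero vectors is an assumed convention.

```python
def binary_counts(p, q):
    """Return (f01, f10, f00, f11) for two equal-length binary strings."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(p, q))
    return (pairs.count(("0", "1")), pairs.count(("1", "0")),
            pairs.count(("0", "0")), pairs.count(("1", "1")))

def smc(p, q):
    f01, f10, f00, f11 = binary_counts(p, q)
    return (f11 + f00) / (f01 + f10 + f11 + f00)

def jaccard(p, q):
    f01, f10, f00, f11 = binary_counts(p, q)
    denominator = f01 + f10 + f11
    # Assumed convention: two all-zero vectors get similarity 0.
    return f11 / denominator if denominator else 0.0

def hamming(p, q):
    # Number of positions where the two vectors differ.
    f01, f10, _, _ = binary_counts(p, q)
    return f01 + f10

# The two vectors from the assignment question.
x, y = "0101010001", "0100011000"
print(binary_counts(x, y))
print(hamming(x, y), jaccard(x, y), smc(x, y))
```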


SMC versus Jaccard: Example

x= 1000000000

y= 0000001001

f01 = 2 (the number of attributes w …
