Data Science Final Review
This is my note for the class “Data Science”
Outline
Overview
Sources of Big Data
- Individuals
- Business process
- Sensors
Big data led to the explosion of ML: Cloud -> Big Data -> AI
Data Mining
Introduction to data
Procedure of Data Mining
- Obtain the data
- Decide your goal
- Data cleaning
- Feature Selection
- Apply mining method
- Decide what and how to output
- Interpret your results
Data
- attribute: a property or characteristic of an object
- object: row, tuple, record, point, … a collection of attributes to describe
- Types of attribute:
- Nominal: names only, no order
- Ordinal: values have an order relationship
- Interval: differences between values (intervals) are meaningful
- Ratio: ratios between values are meaningful
| | Distinctness | Order | Addition | Multiplication |
|---|---|---|---|---|
| Nominal | O | | | |
| Ordinal | O | O | | |
| Interval | O | O | O | |
| Ratio | O | O | O | O |
- data matrix: with the same set of attributes
- document data: each term is an attribute; record the number of occurrences of each term
- transaction data: each record involves a set of items
- data quality:
- Noise, outliers
- Missing values
- Duplicate data
Data preprocessing
- Aggregation: combine attributes
- Sampling: choose a representative subset of the data
- Dimensionality Reduction: avoid curse of dimensionality
- Feature subset selection: another way to reduce dimensionality of data
- Feature creation
- Discretization and Binarization
- Attribute Transformation
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse, and density and distance between points become less meaningful.
- Euclidean Distance \(dist = \sqrt{\sum_{k=1}^{n} {(p_k-q_k)^2}}\)
- Minkowski Distance \(dist = ({\sum_{k=1}^{n} {|p_k-q_k|^r}})^\frac{1}{r}\)
- Cosine Similarity \(cos(d_1,d_2)=\frac{<d_1,d_2>}{|d_1||d_2|}\)
- Correlation \(corr(x, y)=\frac{S_{xy}}{S_x S_y}\)

| | Euclidean Distance | Cosine Similarity | Correlation |
|---|---|---|---|
| Invariant to Scaling | no | yes | yes |
| Invariant to Translation | no | no | yes |
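A quick numpy sketch of these three measures on two toy vectors (the values of `p` and `q` are arbitrary):

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 7.0])

# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((p - q) ** 2))

# Cosine similarity: inner product divided by the product of the norms
cosine = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

# Correlation: covariance divided by the product of the standard deviations
correlation = np.corrcoef(p, q)[0, 1]

print(euclidean, cosine, correlation)
```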
Association
- support count $\sigma$: frequency of an itemset
- support $s$: fraction of transactions that contain the itemset
- frequent itemset: itemset whose support >= minsup
- association rule: X -> Y
- support: s(X,Y)
- confidence: s(X,Y)/s(X)
Mining association rules
- Frequent Itemset Generation
- Rule Generation
Apriori
- with minsup, iteratively derive candidate itemsets $C_k$ and frequent itemsets $L_k$, starting from $C_1$ and $L_1$ (a sketch follows at the end of this section)
- Maximal Frequent Itemset: no superset is frequent
- Closed Frequent Itemset: no superset has the same support
- independence criterion: if confidence(x -> y) = P(y), then x and y are statistically independent and the rule is not interesting
- Lift = $\frac{P(Y|X)}{P(Y)}$
- Interest = $\frac{P(X,Y)}{P(X)P(Y)}$
- PS = $P(X,Y)-P(X)P(Y)$
- $\phi$-coefficient = $\frac{P(X,Y)-P(X)P(Y)}{\sqrt{P(X)[1-P(X)]P(Y)[1-P(Y)]}}$
- Simpson’s paradox: observed relationship in data may be influenced by the presence of other confounding factors
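A rough Python sketch of Apriori-style frequent itemset generation, as referenced above (the toy `transactions` and the `minsup` value are made up for illustration):

```python
transactions = [{"bread", "milk"}, {"bread", "diaper", "beer"},
                {"milk", "diaper", "beer"}, {"bread", "milk", "diaper"}]
minsup = 0.5  # minimum support threshold (fraction of transactions)

def support(itemset):
    # fraction of transactions that contain the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]

k = 1
while frequent[-1]:
    # candidate generation: join frequent k-itemsets into (k+1)-itemsets
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
    # prune candidates whose support is below minsup
    frequent.append({c for c in candidates if support(c) >= minsup})
    k += 1

for level in frequent:
    for itemset in level:
        print(set(itemset), support(itemset))
```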
Databases
- Database Management System (DBMS): the software system that manages organized data
- Data Warehouse: consolidates batches of data from many DBMSs
- Data Lake: all data, structured or unstructured
Relational Database Model
Use mathematical relation as its building block
- NULL value:
- exist but don’t know
- don’t know whether exist
- don’t apply to the tuple
- Constraints
- Key constraints:
- all tuples in a relation must also be distinct
- superkey (SK): a set of attributes for which no two distinct tuples can have the same values
- one primary key is chosen from the candidate keys to identify tuples in a relation
- Constraints on NULL: specify whether NULL is permitted
- Entity Integrity Constraints: no primary key value can be NULL
- Referential Integrity Constraints: maintain the consistency among tuples
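A small sqlite3 sketch of these constraints in action (the `dept` and `employee` tables are hypothetical examples):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Entity integrity: primary key; constraint on NULL: NOT NULL
conn.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
# Referential integrity: employee.dept_id must reference an existing dept
conn.execute("""CREATE TABLE employee (
    emp_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    dept_id INTEGER REFERENCES dept(dept_id))""")

conn.execute("INSERT INTO dept VALUES (1, 'Research')")
conn.execute("INSERT INTO employee VALUES (1, 'Alice', 1)")     # OK
try:
    conn.execute("INSERT INTO employee VALUES (2, 'Bob', 99)")  # violates referential integrity
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```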
Probability and Statistics
- Sample space: the collection of all possible outcomes
- Event space($\sigma$-field)
- $S\in F$
- if $A\in F$ then $A^c\in F$
- if $A_j\in F$ for $j \geq{1}$, then $\bigcup_{j=1}^{\infty}A_j \in F$
- Random Variable: a function from the sample space into the set of real numbers
- Kolmogorov’s axioms
- $P(A) \geq 0$ for any event $A$
- $P(S) = 1$
- $P(A_1\cup A_2) = P(A_1)+P(A_2)$ for mutually exclusive events $A_1$ and $A_2$
- Statistic: a function of a random sample, free of $\theta$
- Estimator: a specially designed statistic to guess $\theta$
- Hypothesis testing: test whether the characteristics of some statistic fulfills certain condition
- joint pdf of a random sample: $f(x_1,x_2,\dots,x_n;\theta) = \prod_{i=1}^{n}f(x_i;\theta)$
- Statistical inference: a statement based on sample information about the population
- estimation
- hypothesis testing
- prediction
- Loss
- squared error loss (L2 loss): $L2(\theta,\hat \theta) = (\hat \theta- \theta)^2$
- absolute error loss (L1 loss): $L1(\theta,\hat \theta) = |\hat \theta- \theta|$
Estimation and Hypothesis Testing
Methods of finding estimators
- Moment Method
- population moment: the kth moment of random variable X about 0 is \(E(X^k) = \int_{-\infty}^{\infty}x^k f(x;\theta)\,dx\)
- the kth sample moment about 0 is \(M_k = \frac{\sum_{i=1}^{n}X_i^k}{n}\)
- Maximum Likelihood Method
- Likelihood function \(L(\theta) = \prod_{i=1}^{n}f(x_i;\theta)\)
- take the log-likelihood \(\ln L(\theta)\)
- find $\theta$ that maximizes $L$ by solving \(\frac{d\ln L(\theta)}{d\theta} = 0\) (a numerical sketch follows after the Bayesian method below)
- Bayesian Method
  - find the posterior pdf $f(\theta \mid x_1,\dots,x_n)$
  - with squared error loss, use the posterior mean $E(\theta \mid x_1,\dots,x_n)$
  - with absolute error loss, use the posterior median $m$
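A minimal numpy/scipy sketch comparing the moment method with numerical maximum likelihood for an exponential sample (the simulated data and true rate are illustrative; for the exponential both estimators reduce to $1/\bar{x}$):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)   # true rate theta = 1/2

# Moment method: E[X] = 1/theta, so match the first sample moment
theta_mom = 1.0 / x.mean()

# Maximum likelihood: maximize ln L(theta) = n*ln(theta) - theta*sum(x)
def neg_log_likelihood(theta):
    return -(len(x) * np.log(theta) - theta * x.sum())

theta_mle = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method="bounded").x

print(theta_mom, theta_mle)   # both should be close to 0.5
```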
Goodness of estimators
- Unbiasedness: $E(\hat \theta) = \theta$
- Efficiency: the estimator with smaller $Var(\hat \theta)$ is more efficient
- Uniform Minimum Variance Unbiasedness
- Sufficiency
- Consistency
Interval Estimation
- interval estimator: a pair of statistics $L$ and $U$ such that $\theta$ falls in $[L,U]$
- confidence interval: $P(L \leq \theta \leq U) = 1-\alpha$
- pivotal quantity: a function of a random sample, whose probability distribution is independent of the parameter $\theta$
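A short sketch of a $1-\alpha$ confidence interval for a normal mean with known $\sigma$, built from the pivotal quantity $Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$ (the simulated data is illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma, n, alpha = 2.0, 100, 0.05
x = rng.normal(loc=5.0, scale=sigma, size=n)   # true mean is 5.0

# The pivot's distribution does not depend on mu, so we can invert it
z = norm.ppf(1 - alpha / 2)
half_width = z * sigma / np.sqrt(n)
L, U = x.mean() - half_width, x.mean() + half_width

print(f"{1 - alpha:.0%} CI for the mean: [{L:.3f}, {U:.3f}]")
```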
Hypothesis Testing
- statistical hypothesis: a conjecture about the distribution of a population
- critical region: the subset of the sample space for which $H_0$ is rejected and the alternative hypothesis is accepted
| situations | $H_0$ is true | $H_0$ is false |
|---|---|---|
| accept $H_0$ | correct | Type II error |
| reject $H_0$ | Type I error | correct |
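A small one-sample t-test sketch with scipy (the data and the null value $\mu_0 = 5$ are made up); we reject $H_0$ when the p-value falls below the chosen significance level $\alpha$:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(2)
x = rng.normal(loc=5.4, scale=1.0, size=30)   # sample drawn with a shifted mean, so we expect rejection

# H0: mu = 5  vs  H1: mu != 5
t_stat, p_value = ttest_1samp(x, popmean=5.0)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```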
Dimension Reduction
- Feature Selection: a subset of the original features
- Feature Projection: a linear combination of the original features
- Manifold Modeling: a non-linear combination of the original features
Principal Component Analysis (PCA)
PCA is a feature projection method. When a dataset has many attributes, we want to pick out those with large variance and small covariance (correlation); my personal understanding is that these are the attributes that differ significantly from the other attributes.
Covariance matrix:
\[S_x = \frac{1}{n-1}XX^T\]
- The element $S_{ii}$ is the variance of attribute $i$.
- The element $S_{ij}$ is the covariance of attributes $i$ and $j$.
Transform it into a diagonal matrix:
\[S_x = \begin{bmatrix}d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_m \end{bmatrix}\]
Assuming there are $m$ attributes: find the line with the smallest squared distance to the data (this line has the largest variance), then find $m$ such lines in mutually orthogonal directions. These lines are the principal components, and we can pick the ones with the largest variance to project onto for dimensionality reduction.
Singular Value Decomposition
We can achieve PCA by performing SVD first. \(A=U\Sigma V^*\)
- U: m*m
- $\Sigma$: m*n
- V: n*n
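A minimal numpy sketch of PCA via SVD on a centered data matrix (the random data and the choice of 2 components are arbitrary; here rows are objects and columns are attributes, so the covariance convention is transposed relative to the $XX^T$ form above):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))          # 100 objects, 5 attributes (rows = objects)

Xc = X - X.mean(axis=0)                # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Principal components are the rows of Vt; variances come from the singular values
explained_variance = S**2 / (len(X) - 1)
k = 2
X_reduced = Xc @ Vt[:k].T              # project onto the top-k components

print(explained_variance)
print(X_reduced.shape)                 # (100, 2)
```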
Linear Discriminant Analysis (LDA)
For 2 classes:
\[S_1 = \sum_{x \in X_1}(x-u_1)(x-u_1)^T, \quad S_2 = \sum_{x \in X_2}(x-u_2)(x-u_2)^T\]
\[S_W = S_1 + S_2, \quad S_B = (u_1-u_2)(u_1-u_2)^T\]
Objective function:
\[J(W) = \frac{W^T S_B W}{W^T S_W W}\]
Fisher's linear discriminant:
\[S_W^{-1} S_B W = \lambda W\]
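A rough 2-class numpy sketch of Fisher's linear discriminant (the random class data is illustrative); instead of solving the eigenproblem it uses the equivalent closed form $W \propto S_W^{-1}(u_1-u_2)$:

```python
import numpy as np

rng = np.random.default_rng(4)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # class 1 samples
X2 = rng.normal(loc=[3, 2], scale=1.0, size=(50, 2))   # class 2 samples

u1, u2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - u1).T @ (X1 - u1)            # within-class scatter of class 1
S2 = (X2 - u2).T @ (X2 - u2)            # within-class scatter of class 2
S_W = S1 + S2

# Fisher's direction: maximizes between-class over within-class scatter
W = np.linalg.solve(S_W, u1 - u2)

# Project both classes onto W; the 1-D projections should separate well
print((X1 @ W).mean(), (X2 @ W).mean())
```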
Feature Selection
Naive Approach
Measurement Function
Select features with large E or T
- Entropy: $E=-\sum_{i=1}^{N}\sum_{j=1}^{N}\left(s_{ij}\log(s_{ij})+(1-s_{ij})\log(1-s_{ij})\right)$
- T-statistic: $T(f_j)=\frac{|u_j^{1}-u_j^{-1}|}{\sqrt{\frac{(\delta_j^{1})^2}{n_{1}}+\frac{(\delta_j^{-1})^2}{n_{-1}}}}$
Framework
- Filter approach: select features, evaluate with statistics-criteria, done
- Wrapper approach: select features, build model to evaluate, done
- Embedded approach: build a model to select features and evaluate
Simulated Annealing
Particle Swarm Optimization
Introduction to Time Series
ARMA
ARIMA
prediction error: $e_{t+k|t} = Y_{t+k} - \hat Y_{t+k}$
OLAP, Classification
- On-Line Analytical Processing: use a multidimensional array representation
- slicing
- dicing
- roll-up: aggregate up into coarser groups
- drill-down: break a group out into finer detail
Classification
Tree Induction
- GINI index \(GINI(t) = 1-\Sigma_{j}[p(j|t)]^2\)
- Splitting \(GINI_split(t) = \Sigma_{i=1}^{k}\frac{n_i}{n}GINI(i)\)
- Entropy \(Entropy(t) = -\Sigma_jp(j|t)log(p(j|t))\)
- Splitting \(GAIN_{split} = Entropy(p)-\Sigma_{i=1}^{k}\frac{n_i}{n}Entropy(i)\)
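A tiny Python sketch of the GINI index and the split criterion (the class counts are made-up examples):

```python
import numpy as np

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p**2)

def gini_split(children):
    # weighted average of the child nodes' GINI values
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [7, 5]                     # 7 of class A, 5 of class B at the parent node
split = [[6, 1], [1, 4]]            # class counts in the two child nodes
print(gini(parent), gini_split(split))
```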
More on Classification
Rule-based
- Easy to interpret and generate
- Can classify new instances rapidly
k-Nearest Neighbors
- Euclidean Distance
- top-k nearest
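A bare-bones k-nearest-neighbor sketch (the toy training data and `k=3` are arbitrary):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from x to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # majority vote among the labels of the top-k nearest neighbors
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))   # expected "B"
```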
Bayes Classifier
\[P(C|A) = \frac{P(A|C)P(C)}{P(A)}\]
Naive Bayes Classifier
$$P(C \mid A_1,A_2,\dots,A_n) = \frac{P(A_1,A_2,\dots,A_n \mid C)P(C)}{P(A_1,A_2,\dots,A_n)}$$
$$P(A_1,A_2,\dots,A_n \mid C)=P(A_1 \mid C)P(A_2 \mid C)\cdots P(A_n \mid C)$$
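A hand-rolled Naive Bayes sketch on made-up categorical records, applying the conditional-independence assumption above (no smoothing, purely for illustration):

```python
# toy records: (Outlook, Windy, Play)
data = [("sunny", "no", "yes"), ("sunny", "yes", "no"),
        ("rain", "no", "yes"), ("rain", "yes", "no"),
        ("sunny", "no", "yes"), ("rain", "no", "yes")]

classes = {r[-1] for r in data}
prior = {c: sum(r[-1] == c for r in data) / len(data) for c in classes}

def likelihood(attr_index, value, c):
    # P(A_i = value | C = c), estimated from counts
    in_class = [r for r in data if r[-1] == c]
    return sum(r[attr_index] == value for r in in_class) / len(in_class)

def predict(a1, a2):
    # P(C | A1, A2) is proportional to P(A1|C) * P(A2|C) * P(C)
    scores = {c: likelihood(0, a1, c) * likelihood(1, a2, c) * prior[c] for c in classes}
    return max(scores, key=scores.get)

print(predict("sunny", "yes"))
```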
Support Vector Machine
- constraint \(y_i(W^Tx_i-b)-1 \geq 0\)
- objective (to maximize the margin): minimize \(\frac{\|W\|^2}{2}\)
- Lagrangian: \(L(W,b,\alpha) = \frac{\|W\|^2}{2} - \sum_{i=1}^{n}\alpha_i[y_i(W^Tx_i-b)-1]\)
Kernel Function
The function that computes the inner product of some (feature) space in terms of another (original) space. SVM can find a hyperplane to separate the data in a feature space induced by a kernel function.
Clustering
- cluster: high intra-class similarity, low inter-class similarity
- unsupervised classification
Partition around Medoids(PAM)
- Select k objects arbitrarily as medoids
- Compute the swapping cost for each pair of a medoid and a non-selected object
- Select the pair with the lowest cost; if the cost is negative, swap the medoid with the non-selected object and repeat from the previous step
- For each non-selected object, find the most similar medoid.
Hierarchical Clustering
agglomerative clustering
- compute the proximity matrix
- let each data point be a cluster
- do:
- merge the 2 closest
- update the proximity matrix
- until only one cluster (or the desired number of clusters) remains
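A short scipy sketch of agglomerative clustering on toy points; `linkage` performs the repeated merge-and-update loop, and `fcluster` cuts the resulting dendrogram at a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 9]])

# repeatedly merge the two closest clusters (single-link proximity here)
Z = linkage(X, method="single", metric="euclidean")

# cut the dendrogram so that 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```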
NN and PUF
NN
Check PyTorch.
Physically Unclonable Function(PUF)
- PUF: a promising security primitive for low-end devices in 5G
- Easy to evaluate, hard to predict
- A server sends a challenge to the PUF; the PUF responds, and the server checks whether the response is correct
Modeling attack
- APUF: an arbiter PUF can be modeled as a linear additive model (see the sketch below)
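A hedged sketch of a modeling attack on a simulated APUF: assuming the standard linear additive delay model, an $n$-bit challenge maps to a parity feature vector $\Phi$ and the response is the sign of $w^T\Phi$, so a logistic-regression model trained on challenge-response pairs can learn to predict the PUF (the sizes and the simulation itself are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n_bits, n_crps = 32, 5000

def parity_features(challenges):
    # Phi_i = prod_{j>=i} (1 - 2*c_j), plus a constant term (linear additive delay model)
    signs = 1 - 2 * challenges                     # map bits {0,1} -> {+1,-1}
    phi = np.cumprod(signs[:, ::-1], axis=1)[:, ::-1]
    return np.hstack([phi, np.ones((len(challenges), 1))])

# simulate a "secret" APUF with random delay weights
w_secret = rng.normal(size=n_bits + 1)
challenges = rng.integers(0, 2, size=(n_crps, n_bits))
responses = (parity_features(challenges) @ w_secret > 0).astype(int)

# modeling attack: learn the PUF from observed challenge-response pairs
X = parity_features(challenges)
model = LogisticRegression(max_iter=1000).fit(X[:4000], responses[:4000])
print("accuracy on unseen CRPs:", model.score(X[4000:], responses[4000:]))
```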
Web and Social Network Analysis
Cloud Computing
- Centralized infrastructure
- Virtualized architecture
- Usage-based resource allocation
- SaaS: software as a service
- PaaS: platform as a service
- IaaS: infrastructure as a service