Data Science Final Review


These are my notes for the class “Data Science”.

Outline

Overview

Sources of Big Data

  • Individuals
  • Business processes
  • Sensors

Big data led to the explosion of ML: Cloud -> Big Data -> AI

Data Mining

Introduction to data

Procedure of Data Mining

  • Obtain the data
  • Decide your goal
  • Data cleaning
  • Feature Selection
  • Apply mining method
  • Decide what and how to output
  • Interpret your results

Data

  • attribute: a property or characteristic of an object
  • object: row, tuple, record, point, …; described by a collection of attributes
  • Types of attribute:
    • Nominal: names or labels only
    • Ordinal: values have an order
    • Interval: differences between values are meaningful
    • Ratio: ratios between values are meaningful
| | Distinctness | Order | Addition | Multiplication |
|---|---|---|---|---|
| Nominal | O | | | |
| Ordinal | O | O | | |
| Interval | O | O | O | |
| Ratio | O | O | O | O |
  • data matrix: with the same set of attributes
  • document data: each term is an attribute; record the number of occurrences of each term
  • transaction data: each record involves a set of items
  • data quality:
    • Noise, outliers
    • Missing values
    • Duplicate data

Data preprocessing

  • Aggregation: combine attributes
  • Sampling: representative
  • Dimensionality Reduction: avoid curse of dimensionality
  • Feature subset selection: another way to reduce dimensionality of data
  • Feature creation
  • Discretization and Binarization
  • Attribute Transformation

Curse of Dimensionality

When dimensionality increases, data becomes increasingly sparse, and density and distances between points become less meaningful.

  • Euclidean Distance \(dist = \sqrt{\sum_{k=1}^{n} {(p_k-q_k)^2}}\)
  • Minkowski Distance \(dist = ({\sum_{k=1}^{n} {|p_k-q_k|^r}})^\frac{1}{r}\)
  • Cosine Similarity \(cos(d_1,d_2)=\frac{<d_1,d_2>}{|d_1||d_2|}\)
  • Correlation \(corr(x, y)=\frac{S_{xy}}{S_xS_y}\)

| | Euclidean Distance | Cosine Similarity | Correlation |
|---|---|---|---|
| Invariant to Scaling | no | yes | yes |
| Invariant to Translation | no | no | yes |
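
As a quick sanity check of the proximity measures above, here is a minimal NumPy sketch; the vectors `p` and `q` are made-up toy data, not from the course.

```python
import numpy as np

p = np.array([3.0, 2.0, 0.0, 5.0])   # made-up vectors
q = np.array([1.0, 0.0, 0.0, 2.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))                  # Euclidean distance
r = 3
minkowski = np.sum(np.abs(p - q) ** r) ** (1 / r)          # Minkowski distance with r = 3
cosine = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))   # cosine similarity
correlation = np.corrcoef(p, q)[0, 1]                      # Pearson correlation

print(euclidean, minkowski, cosine, correlation)
```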

Association

  • support count $\sigma$: frequency of an itemset
  • support $s$: fraction of transactions that contain the itemset
  • frequent itemset: itemset whose support >= minsup
  • association rule: X -> Y
    • support: s(X,Y)
    • confidence: s(X,Y)/s(X)

Mining association rules

  1. Frequent Itemset Generation
  2. Rule Generation

Apriori

  • with minsup, derive the candidate itemsets $C_k$ and frequent itemsets $L_k$ level by level, starting from $C_1$ and $L_1$
  • Maximal Frequent Itemset: no superset is frequent
  • Closed Frequent Itemset: no superset has the same support
  • criterion: if confidence(X -> Y) equals P(Y), then X and Y are statistically independent and the rule is not interesting
  • Lift = $\frac{P(Y|X)}{P(Y)}$
  • Interest = $\frac{P(X,Y)}{P(X)P(Y)}$
  • PS = ${P(X,Y)-P(X)P(Y)}$
  • $\phi$-coefficient = $\frac{P(X,Y)-P(X)P(Y)}{\sqrt{P(X)[1-P(X)]P(Y)[1-P(Y)]}}$
  • Simpson’s paradox: observed relationship in data may be influenced by the presence of other confounding factors
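
A minimal sketch of how the rule measures above (support, confidence, lift, PS) can be computed by counting over a transaction list; the items and the rule {diapers} -> {beer} are made up for illustration.

```python
# Made-up transaction data, not from the course.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n

X, Y = {"diapers"}, {"beer"}
s_xy = support(X | Y)
confidence = s_xy / support(X)
lift = confidence / support(Y)          # = P(Y|X) / P(Y)
ps = s_xy - support(X) * support(Y)     # PS measure

print(s_xy, confidence, lift, ps)
```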

Databases

  • Database Management System(DBMS): the software system that manages organized data
  • Data Warehouse: integrates batches of data from many DBMSs
  • Data Lake: all data, structured or unstructured

Relational Database Model

Use mathematical relation as its building block

  • NULL value:
    • the value exists but is unknown
    • it is unknown whether a value exists
    • the attribute does not apply to this tuple
  • Constraints
    • Key constraints:
      • all tuples in a relation must be distinct
      • superkey(SK): a set of attributes such that no 2 distinct tuples can have the same values on it
      • one primary key is chosen from the candidate keys to identify tuples in the relation
    • Constraints on NULL: specify whether NULL is permitted
    • Entity Integrity Constraints: no primary key value can be NULL
    • Referential Integrity Constraints: maintain the consistency among tuples

Probability and Statistics

  • Sample space: the collection of all possible outcomes
  • Event space($\sigma$-field)
    • $S\in F$
    • if $A\in F$ then $A^c\in F$
    • if $A_j\in F$ for $j \geq{1}$, then $\bigcup_{j=1}^{\infty}A_j \in F$
  • Random Variable: a function from the sample space into the set of real numbers
  • Kolmogorov’s axioms
    • $P(x) \geq 0$
    • $P(S) = 1$
    • $P(x_1\cup x_2) = P(x_1)+P(x_2)$ for disjoint $x_1, x_2$ (countable additivity)
  • Statistic: a function of a random sample, free of $\theta$
  • Estimator: a special designed statistic to guess $\theta$
  • Hypothesis testing: test whether the characteristics of some statistic fulfills certain condition
  • joint pdf of a random sample: $f(x_1,x_2,…,x_n;\theta) = \prod_{i=1}^{n}f(x_i;\theta)$
  • Statistical inference: a statement based on sample information about the population
    • estimation
    • hypothesis testing
    • prediction
  • Loss
    • squared error loss(L2 loss): $L2(\theta,\hat \theta) = (\hat \theta- \theta)^2$
    • absolute error loss(L1 loss): $L1(\theta,\hat \theta) =|\hat \theta- \theta|$

Estimation and Hypothesis Testing

Methods of finding estimators

  • Moment Method
    • population moment: the kth moment of random variable X about 0 is \(E(X^k) = \int_{-\infty}^{\infty}x^kf(x;\theta)dx\)
    • the kth sample moment about 0 is \(M_k = \frac{\Sigma_{i=1}^{n}X_i^k}{n}\)
  • Maximum Likelihood Method
    • Likelihood function \(L(\theta) = \prod_{i=1}^{n}f(x_i;\theta)\)
    • it is often easier to work with the log-likelihood \(ln(L(\theta))\)
    • find \(\theta\) that maximizes L: \(\frac{d\,ln(L(\theta))}{d\theta} = 0\)
  • Bayesian Method
    • find the posterior pdf $f(\theta|x_1,…,x_n)$
    • with squared error loss
      • find the posterior mean $E(\theta|x_1,..,x_n)$
    • with absolute error loss
      • find the posterior median m
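
A minimal sketch contrasting the moment and maximum-likelihood estimators for the rate of an exponential distribution, where both reduce to the same closed form; the true parameter and sample size are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
x = rng.exponential(scale=1 / theta_true, size=1000)  # assumed toy sample

# Moment method: set the first population moment E(X) = 1/theta
# equal to the first sample moment M_1 = mean(x) and solve for theta.
theta_mm = 1 / x.mean()

# Maximum likelihood: d ln L(theta) / d theta = n/theta - sum(x) = 0
theta_mle = len(x) / x.sum()

print(theta_mm, theta_mle)
```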

Goodness of estimators

  • Unbiasedness: $E(\hat \theta) = \theta$
  • Efficiency: $Var(\hat \theta_1)$ smaller, more efficient
  • Uniform Minimum Variance Unbiasedness
  • Sufficiency
  • Consistency

Interval Estimation

  • interval estimator: a pair of statistics L and U such that $\theta$ falls in ${[L,U]}$
  • confidence interval: $P(L \leq \theta \leq U) = 1-\alpha$
  • pivotal quantity: a function of a random sample, whose probability distribution is independent of the parameter $\theta$

Hypothesis Testing

  • statistical hypothesis: a conjecture about the distribution of a population
  • critical region: the subset of the sample space for which $H_0$ is rejected (the alternative hypothesis is accepted)
| situations | $H_0$ is true | $H_0$ is false |
|---|---|---|
| accept $H_0$ | correct | Type II error |
| reject $H_0$ | Type I error | correct |
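
For concreteness, a small SciPy sketch of a one-sample t-test on made-up data, with an assumed $H_0: \mu = 5$ and $\alpha = 0.05$; rejecting a true $H_0$ would be a Type I error, and failing to reject a false $H_0$ a Type II error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=5.3, scale=1.0, size=30)   # made-up sample

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)  # H0: mu = 5
alpha = 0.05
print(t_stat, p_value, "reject H0" if p_value < alpha else "accept H0")
```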

Dimension Reduction

  • Feature Selection: a subset of the original features
  • Feature Projection: a linear combination of the original features
  • Manifold Modeling: a non-linear combination of the original features

Principal Component Analysis (PCA)

PCA is a feature projection method. When a dataset has many attributes, we want to keep the directions that have large variance and small covariance (correlation); my personal understanding is that these are the attributes that differ most clearly from the other attributes.

Covariance matrix:

\[S_x = \frac{1}{n-1}XX^T\]
  • The diagonal element $(S_x)_{ii}$ is the variance of attribute i.
  • The off-diagonal element $(S_x)_{ij}$ is the covariance of attributes i and j.

transform to:

\[S_x = \begin{bmatrix}d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & d_m \end{bmatrix}\]

Assume there are m attributes. Find the regression line with the smallest squared distance to the data (this line has the largest variance), then find m such lines in mutually orthogonal directions; these lines are the principal components. Finally, select the lines with the largest variance and project the data onto them for dimensionality reduction.
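
A minimal NumPy sketch of this procedure on made-up data (objects as rows, attributes as columns): center the data, build the covariance matrix, take its eigenvectors, and project onto the top-k components.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # made-up data: 100 objects, 5 attributes

Xc = X - X.mean(axis=0)                # center each attribute
S = np.cov(Xc, rowvar=False)           # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(S)   # eigendecomposition of the symmetric matrix

order = np.argsort(eigvals)[::-1]      # sort components by variance, descending
k = 2
W = eigvecs[:, order[:k]]              # top-k principal components
X_reduced = Xc @ W                     # project the data onto the components
print(eigvals[order], X_reduced.shape)
```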

Singular Value Decomposition

We can achieve PCA by performing SVD first. \(A=U\Sigma V^*\)

  • U: m*m
  • $\Sigma$: m*n
  • V: n*n
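
A corresponding sketch of PCA through SVD of the centered data matrix (same made-up setup as before): the columns of V are the principal directions, and the squared singular values divided by n-1 give the component variances.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # made-up data
Xc = X - X.mean(axis=0)                # center the data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt.T                      # principal directions (columns of V)
explained_variance = s ** 2 / (len(X) - 1)
print(components.shape, explained_variance)
```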

Linear Discriminant Analysis(LDA)

For 2 classes:

\(S_1 = \Sigma_{x \in X_1}(x-u_1)(x-u_1)^T\)
\(S_2 = \Sigma_{x \in X_2}(x-u_2)(x-u_2)^T\)
\(S_W = S_1 + S_2\)
\(S_B = (u_1-u_2)(u_1-u_2)^T\)

Objective function:

\(J(W) = \frac{W^TS_BW}{W^TS_WW}\)

Fisher’s linear discriminant solves:

\(S_W^{-1}S_BW = \lambda W\)
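
A minimal two-class sketch on made-up data: build $S_W$ as above and use the closed-form direction $w = S_W^{-1}(u_1 - u_2)$, which solves the eigenproblem for the 2-class case.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=0.0, size=(50, 3))   # made-up class-1 samples
X2 = rng.normal(loc=2.0, size=(50, 3))   # made-up class-2 samples

u1, u2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - u1).T @ (X1 - u1)             # class-1 scatter
S2 = (X2 - u2).T @ (X2 - u2)             # class-2 scatter
S_W = S1 + S2                            # within-class scatter

w = np.linalg.solve(S_W, u1 - u2)        # Fisher's discriminant direction
print(w / np.linalg.norm(w))
```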

Feature Selection

Naive Approach

Measurement Function

Select features with large E or T

  • Entropy: $E=-\Sigma_{i=1}^{N}\Sigma_{j=1}^{N}(s_{ij}log(s_{ij})+(1-s_{ij})log(1-s_{ij}))$
  • T-statistics: $T(f_j)=\frac{u_j^1-u_j^{-1}}{\sqrt{\frac{(\delta_j^1)^2}{n_{1}}+\frac{(\delta_j^{-1})^2}{n_{-1}}}}$ (see the sketch below)
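
A small sketch of T-statistic feature ranking on made-up two-class data: score each feature by the difference of class means over the pooled standard error, then keep the features with the largest scores.

```python
import numpy as np

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[1.0, 0.0, 0.5], size=(40, 3))   # made-up class +1 samples
X_neg = rng.normal(loc=[0.0, 0.0, 0.5], size=(60, 3))   # made-up class -1 samples

mean_diff = X_pos.mean(axis=0) - X_neg.mean(axis=0)
se = np.sqrt(X_pos.var(axis=0, ddof=1) / len(X_pos)
             + X_neg.var(axis=0, ddof=1) / len(X_neg))
t_scores = mean_diff / se

top = np.argsort(-np.abs(t_scores))      # feature indices, best first
print(t_scores, top)
```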

Framework

  • Filter approach: select features, evaluate with statistics-criteria, done
  • Wrapper approach: select features, build model to evaluate, done
  • Embedded approach: build a model to select features and evaluate

Simulated Annealing

Particle Swarm Optimization

Introduction to Time Series

ARMA

ARIMA

  • prediction error: $e_{t+k|t} = Y_{t+k} - \hat Y_{t+k}$ (see the sketch below)
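
A minimal statsmodels sketch of fitting an ARIMA model and computing the k-step prediction errors; the simulated series, the order (1, 1, 1), and the train/test split are assumptions for illustration only.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))        # made-up random-walk-like series

train, test = y[:190], y[190:]
res = ARIMA(train, order=(1, 1, 1)).fit()  # assumed ARIMA(p=1, d=1, q=1)

forecast = res.forecast(steps=len(test))   # Y_hat for k = 1..10 steps ahead
errors = test - forecast                   # prediction errors e_{t+k|t}
print(errors)
```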

OLAP, Classification

  • On-Line Analytical Process: use multidimensional array representation
    • slicing
    • dicing
    • roll-up: aggregate up to a coarser level
    • drill-down: break out finer detail

Classification

Tree Induction

  • GINI index \(GINI(t) = 1-\Sigma_{j}[p(j|t)]^2\)
  • Splitting \(GINI_split(t) = \Sigma_{i=1}^{k}\frac{n_i}{n}GINI(i)\)
  • Entropy \(Entropy(t) = -\Sigma_jp(j|t)log(p(j|t))\)
  • Splitting \(GAIN_split = Entropy(p)-\Sigma_{i=1}^{k}\frac{n_i}{n}Entropy(i)\)
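
A minimal sketch of the impurity measures above: node GINI from class counts, and the weighted GINI of a candidate split; the class counts are made up.

```python
import numpy as np

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for one node, given its class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(children):
    """Weighted GINI over child nodes, each given as a list of class counts."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [7, 3]                       # 7 objects of class A, 3 of class B
split = [[6, 1], [1, 2]]              # candidate split into two child nodes
print(gini(parent), gini_split(split))
```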

More on Classification

Rule-based

  • Easy to interpret and generate
  • Can classify new instances rapidly

k-Nearest neighbors

  • Euclidean Distance
  • top-k nearest
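
A minimal scikit-learn sketch of k-NN on made-up 2-D data; k = 5 and the query points are arbitrary choices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])  # made-up points
y = np.array([0] * 50 + [1] * 50)                                      # their labels

# Classify a point by the majority class among its 5 nearest neighbors
# under Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
print(knn.predict([[0.5, 0.5], [3.5, 4.0]]))
```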

Bayes Classifier

\[P(C|A) = \frac{P(A|C)P(C)}{P(A)}\]

Naive Bayes Classifier

$$ P(C|A_1,A_2,…,A_n) = \frac{P(A_1,A_2,…,A_n|C)P(C)}{P(A_1,A_2,…,A_n)}$$
$$P(A_1,A_2,…,A_n|C)=P(A_1|C)P(A_2|C)…P(A_n|C)$$
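
A minimal sketch of this factorization for categorical attributes, estimated by simple counting on a tiny made-up dataset (no smoothing).

```python
import numpy as np

# Made-up training data: each row is (A1, A2), y is the class label.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

def predict(x):
    scores = {}
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / len(X)                       # P(C)
        # product of per-attribute conditionals P(A_i | C)
        likelihood = np.prod([(Xc[:, i] == x[i]).mean() for i in range(len(x))])
        scores[c] = prior * likelihood                 # proportional to P(C | A)
    return max(scores, key=scores.get)

print(predict([1, 1]), predict([0, 0]))
```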

Support Vector Machine

  • constraint \(y_i(W^Tx_i-b)-1 \geq 0\)
  • objective (to maximize the margin): minimize \(\frac{||W||^2}{2}\), with Lagrangian \(L(w,b,\alpha) = \frac{||W||^2}{2} - \Sigma_{i=1}^{n}\alpha_i[y_i(W^Tx_i-b)-1]\)

Kernel Function

A kernel function computes the inner product in some (feature) space using only points in the other (original) space. SVM can then find a hyperplane to separate the data in the feature space induced by the kernel function, as in the sketch below.
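
A small scikit-learn sketch contrasting a linear SVM with an RBF-kernel SVM on data that is not linearly separable in the original space; the two-moons toy data is an arbitrary choice, not from the course.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)  # made-up toy data

linear = SVC(kernel="linear").fit(X, y)   # hyperplane in the original space
rbf = SVC(kernel="rbf").fit(X, y)         # hyperplane in the kernel-induced space
print(linear.score(X, y), rbf.score(X, y))
```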

Clustering

  • cluster: high intra-class similarity, low inter-class similarity
  • unsupervised classification

Partition around Medoids(PAM)

  • Arbitrarily select k objects as medoids
  • For each pair of a selected (medoid) object and a non-selected object, compute the swapping cost
  • Select the pair with the lowest cost; if that cost is negative, swap the medoid with the non-selected object and go back to the previous step
  • For each non-selected object, assign it to the most similar medoid
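
A minimal PAM-style sketch on made-up points (a simplified version of the steps above): start from arbitrary medoids and greedily perform the medoid/non-medoid swap with the lowest total cost until no swap improves the clustering.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(4, 0.5, (15, 2))])  # made-up points
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # pairwise distance matrix

def total_cost(medoids):
    # each object is assigned to its most similar (closest) medoid
    return D[:, medoids].min(axis=1).sum()

k = 2
medoids = list(range(k))                 # arbitrary initial medoids
improved = True
while improved:
    improved = False
    for m in list(medoids):
        for o in range(len(X)):
            if o in medoids:
                continue
            candidate = [o if x == m else x for x in medoids]
            if total_cost(candidate) < total_cost(medoids):
                medoids, improved = candidate, True    # accept the improving swap

labels = D[:, medoids].argmin(axis=1)
print(medoids, labels)
```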

Hierarchical Clustering

agglomerative clustering

  • compute the proximity matrix
  • let each data point be its own cluster
  • repeat:
    • merge the 2 closest clusters
    • update the proximity matrix
  • until only one cluster remains
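
A minimal SciPy sketch of agglomerative clustering on made-up points: compute the proximity matrix, let `linkage` repeatedly merge the two closest clusters, then cut the tree at a chosen number of clusters (single-link is an arbitrary choice here).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])  # made-up points

D = pdist(X, metric="euclidean")                 # condensed proximity matrix
Z = linkage(D, method="single")                  # single-link (MIN) agglomeration
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```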

NN and PUF

NN

Check PyTorch.

Physically Unclonable Function(PUF)

  • PUF: a promising security primitive for low-end devices in 5G
  • Easy to evaluate, hard to predict
  • A server sends a challenge to the PUF; the PUF responds, and the server checks whether the response is correct

Modeling attack

  • APUF: an arbiter PUF can be modeled with a linear additive (delay) model, which makes it vulnerable to machine-learning attacks (see the sketch below)
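
A minimal sketch of such a modeling attack on a simulated arbiter PUF: the response is sign(w · Φ(c)) for a parity feature vector Φ(c), so a linear classifier trained on challenge-response pairs can learn it. The stage count, number of CRPs, and train/test split are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_stages, n_crps = 64, 5000                       # assumed sizes

w = rng.normal(size=n_stages + 1)                 # hidden delay parameters
C = rng.integers(0, 2, size=(n_crps, n_stages))   # random challenges

def parity_features(C):
    # phi_i(c) = prod_{j >= i} (1 - 2*c_j), plus a constant term
    phi = np.cumprod((1 - 2 * C)[:, ::-1], axis=1)[:, ::-1]
    return np.hstack([phi, np.ones((len(C), 1))])

Phi = parity_features(C)
r = (Phi @ w > 0).astype(int)                     # simulated responses

model = LogisticRegression(max_iter=1000).fit(Phi[:4000], r[:4000])
print("attack accuracy on unseen CRPs:", model.score(Phi[4000:], r[4000:]))
```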

Web and Social Network Analysis

Cloud Computing

  • Centralized infrastructure
  • Virtualized architecture
  • Usage-based resource allocation
  1. SaaS: software as a service
  2. PaaS: platform as a service
  3. IaaS: infrastructure as a service
