Data Science Final Review
This is my note for the class “Data Science”
Outline
Overview
Sources of Big Data
- Individuals
- Business process
- Sensors
Big data led to the explosion of ML: Cloud -> Big Data -> AI
Data Mining
Introduction to data
Procedure of Data Mining
- Obtain the data
- Decide your goal
- Data cleaning
- Feature Selection
- Apply mining method
- Decide what and how to output
- Interpret your results
Data
- attribute: a property or characteristic of an object
- object: row, tuple, record, point, … a collection of attributes to describe
- Types of attribute:
- Nominal: names only, no order
- Ordinal: values have an order relationship
- Interval: differences between values (intervals) are meaningful
- Ratio: ratios between values are meaningful
| | Distinctness | Order | Addition | Multiplication |
|---|---|---|---|---|
| Nominal | O | | | |
| Ordinal | O | O | | |
| Interval | O | O | O | |
| Ratio | O | O | O | O |
- data matrix: with the same set of attributes
- document data: each term is an attribute; record the number of occurrences of each term
- transaction data: each record involves a set of items
- data quality:
- Noise, outliers
- Missing values
- Duplicate data
Data preprocessing
- Aggregation: combine attributes
- Sampling: choose a representative subset of the data
- Dimensionality Reduction: avoid curse of dimensionality
- Feature subset selection: another way to reduce dimensionality of data
- Feature creation
- Discretization and Binarization
- Attribute Transformation
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse, and density and distance between points become less meaningful.
- Euclidean Distance \(dist = \sqrt{\sum_{k=1}^{n} {(p_k-q_k)^2}}\)
- Minkowski Distance \(dist = ({\sum_{k=1}^{n} {|p_k-q_k|^r}})^\frac{1}{r}\)
- Cosine Similarity \(cos(d_1,d_2)=\frac{<d_1,d_2>}{|d_1||d_2|}\)
- Correlation \(corr(x, y)=\frac{S_{xy}}{S_x S_y}\)

| | Euclidean Distance | Cosine Similarity | Correlation |
|---|---|---|---|
| Invariant to Scaling | no | yes | yes |
| Invariant to Translation | no | no | yes |
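A quick numpy sketch of these three measures on two toy vectors (the values of `p` and `q` are arbitrary):

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 7.0])

# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((p - q) ** 2))

# Cosine similarity: inner product divided by the product of the norms
cosine = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

# Correlation: covariance divided by the product of the standard deviations
correlation = np.corrcoef(p, q)[0, 1]

print(euclidean, cosine, correlation)
```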
Association
- support count $\sigma$: frequency of an itemset
- support $s$: fraction of transactions that contain the itemset
- frequent itemset: itemset whose support >= minsup
- association rule: X -> Y
- support: s(X,Y)
- confidence: s(X,Y)/s(X)
Mining association rules
- Frequent Itemset Generation
- Rule Generation
Apriori
- with minsup, iteratively derive candidate itemsets $C_k$ and frequent itemsets $L_k$, starting from $C_1$ and $L_1$ (a sketch follows at the end of this section)
- Maximal Frequent Itemset: no superset is frequent
- Closed Frequent Itemset: no superset has the same support
- independence criterion: if confidence(x -> y) = P(y), then x and y are statistically independent and the rule is not interesting
- Lift = $\frac{P(Y|X)}{P(Y)}$
- Interest = $\frac{P(X,Y)}{P(X)P(Y)}$
- PS = $P(X,Y)-P(X)P(Y)$
- $\phi$-coefficient = $\frac{P(X,Y)-P(X)P(Y)}{\sqrt{P(X)[1-P(X)]P(Y)[1-P(Y)]}}$
- Simpson’s paradox: observed relationship in data may be influenced by the presence of other confounding factors
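A rough Python sketch of Apriori-style frequent itemset generation, as referenced above (the toy `transactions` and the `minsup` value are made up for illustration):

```python
transactions = [{"bread", "milk"}, {"bread", "diaper", "beer"},
                {"milk", "diaper", "beer"}, {"bread", "milk", "diaper"}]
minsup = 0.5  # minimum support threshold (fraction of transactions)

def support(itemset):
    # fraction of transactions that contain the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]

k = 1
while frequent[-1]:
    # candidate generation: join frequent k-itemsets into (k+1)-itemsets
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
    # prune candidates whose support is below minsup
    frequent.append({c for c in candidates if support(c) >= minsup})
    k += 1

for level in frequent:
    for itemset in level:
        print(set(itemset), support(itemset))
```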
Databases
- Database Management System (DBMS): the software system that manages organized data
- Data Warehouse: consolidates batches of data from many DBMSs
- Data Lake: all data, structured or unstructured
Relational Database Model
Use mathematical relation as its building block
- NULL value:
- exist but don’t know
- don’t know whether exist
- don’t apply to the tuple
- Constraints
- Key constraints:
- all tuples in a relation must also be distinct
- superkey (SK): a set of attributes for which no two distinct tuples can have the same values
- one primary key is chosen from the candidate keys to identify tuples in a relation
- Constraints on NULL: specify whether NULL is permitted
- Entity Integrity Constraints: no primary key value can be NULL
- Referential Integrity Constraints: maintain the consistency among tuples
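A small sqlite3 sketch of these constraints in action (the `dept` and `employee` tables are hypothetical examples):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Entity integrity: primary key; constraint on NULL: NOT NULL
conn.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
# Referential integrity: employee.dept_id must reference an existing dept
conn.execute("""CREATE TABLE employee (
    emp_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    dept_id INTEGER REFERENCES dept(dept_id))""")

conn.execute("INSERT INTO dept VALUES (1, 'Research')")
conn.execute("INSERT INTO employee VALUES (1, 'Alice', 1)")     # OK
try:
    conn.execute("INSERT INTO employee VALUES (2, 'Bob', 99)")  # violates referential integrity
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```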
Probability and Statistics
- Sample space: the collection of all possible outcomes
- Event space($\sigma$-field)
- $S\in F$
- if $A\in F$ then $A^c\in F$
- if $A_j\in F$ for $j \geq{1}$, then $\bigcup_{j=1}^{\infty}A_j \in F$
- Random Variable: a function from the sample space into the set of real numbers
- Kolmogorov’s axioms
- $P(A) \geq 0$ for any event $A$
- $P(S) = 1$
- $P(A_1\cup A_2) = P(A_1)+P(A_2)$ for mutually exclusive events $A_1$ and $A_2$
- Statistic: a function of a random sample, free of $\theta$
- Estimator: a specially designed statistic to guess $\theta$
- Hypothesis testing: test whether the characteristics of some statistic fulfills certain condition
- joint pdf of a random sample: $f(x_1,x_2,\dots,x_n;\theta) = \prod_{i=1}^{n}f(x_i;\theta)$
- Statistical inference: a statement based on sample information about the population
- estimation
- hypothesis testing
- prediction
- Loss
- squared error loss (L2 loss): $L2(\theta,\hat \theta) = (\hat \theta- \theta)^2$
- absolute error loss (L1 loss): $L1(\theta,\hat \theta) = |\hat \theta- \theta|$
Estimation and Hypothesis Testing
Methods of finding estimators
- Moment Method
- population moment: the kth moment of random variable X about 0 is \(E(X^k) = \int_{-\infty}^{\infty}x^k f(x;\theta)\,dx\)
- the kth sample moment about 0 is \(M_k = \frac{\sum_{i=1}^{n}X_i^k}{n}\)
- Maximum Likelihood Method
- Likelihood function \(L(\theta) = \prod_{i=1}^{n}f(x_i;\theta)\)
- take the log-likelihood \(\ln L(\theta)\)
- find $\theta$ that maximizes $L$ by solving \(\frac{d\ln L(\theta)}{d\theta} = 0\) (a numerical sketch follows after the Bayesian method below)
- Bayesian Method
  - find the posterior pdf $f(\theta \mid x_1,\dots,x_n)$
  - with squared error loss, use the posterior mean $E(\theta \mid x_1,\dots,x_n)$
  - with absolute error loss, use the posterior median $m$
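A minimal numpy/scipy sketch comparing the moment method with numerical maximum likelihood for an exponential sample (the simulated data and true rate are illustrative; for the exponential both estimators reduce to $1/\bar{x}$):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)   # true rate theta = 1/2

# Moment method: E[X] = 1/theta, so match the first sample moment
theta_mom = 1.0 / x.mean()

# Maximum likelihood: maximize ln L(theta) = n*ln(theta) - theta*sum(x)
def neg_log_likelihood(theta):
    return -(len(x) * np.log(theta) - theta * x.sum())

theta_mle = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method="bounded").x

print(theta_mom, theta_mle)   # both should be close to 0.5
```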
Goodness of estimators
- Unbiasedness: $E(\hat \theta) = \theta$
- Efficiency: the estimator with smaller $Var(\hat \theta)$ is more efficient
- Uniform Minimum Variance Unbiasedness
- Sufficiency
- Consistency
Interval Estimation
- interval estimator: a pair of statistics $L$ and $U$ such that $\theta$ falls in $[L,U]$
- confidence interval: $P(L \leq \theta \leq U) = 1-\alpha$
- pivotal quantity: a function of a random sample, whose probability distribution is independent of the parameter $\theta$
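A short sketch of a $1-\alpha$ confidence interval for a normal mean with known $\sigma$, built from the pivotal quantity $Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$ (the simulated data is illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma, n, alpha = 2.0, 100, 0.05
x = rng.normal(loc=5.0, scale=sigma, size=n)   # true mean is 5.0

# The pivot's distribution does not depend on mu, so we can invert it
z = norm.ppf(1 - alpha / 2)
half_width = z * sigma / np.sqrt(n)
L, U = x.mean() - half_width, x.mean() + half_width

print(f"{1 - alpha:.0%} CI for the mean: [{L:.3f}, {U:.3f}]")
```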
Hypothesis Testing
- statistical hypothesis: a conjecture about the distribution of a population
- critical region: the subset of the sample space for which $H_0$ is rejected and the alternative hypothesis is accepted
| situations | $H_0$ is true | $H_0$ is false |
|---|---|---|
| accept $H_0$ | correct | Type II error |
| reject $H_0$ | Type I error | correct |
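A small one-sample t-test sketch with scipy (the data and the null value $\mu_0 = 5$ are made up); we reject $H_0$ when the p-value falls below the chosen significance level $\alpha$:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(2)
x = rng.normal(loc=5.4, scale=1.0, size=30)   # sample drawn with a shifted mean, so we expect rejection

# H0: mu = 5  vs  H1: mu != 5
t_stat, p_value = ttest_1samp(x, popmean=5.0)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```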
Dimension Reduction
- Feature Selection: a subset of the original features
- Feature Projection: a linear combination of the original features
- Manifold Modeling: a non-linear combination of the original features
Principal Component Analysis (PCA)
PCA is a feature projection method. When a dataset has many attributes, we want to pick out those with large variance and small covariance (correlation); my personal understanding is that these are the attributes that differ significantly from the other attributes.
Covariance matrix:
\[S_x = \frac{1}{n-1}XX^T\]
- The element $S_{ii}$ is the variance of attribute $i$.
- The element $S_{ij}$ is the covariance of attributes $i$ and $j$.
Transform it into a diagonal matrix:
\[S_x = \begin{bmatrix}d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_m \end{bmatrix}\]
Assuming there are $m$ attributes: find the line with the smallest squared distance to the data (this line has the largest variance), then find $m$ such lines in mutually orthogonal directions. These lines are the principal components, and we can pick the ones with the largest variance to project onto for dimensionality reduction.
Singular Value Decomposition
We can achieve PCA by performing SVD first. \(A=U\Sigma V^*\)
- U: m*m
- $\Sigma$: m*n
- V: n*n
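A minimal numpy sketch of PCA via SVD on a centered data matrix (the random data and the choice of 2 components are arbitrary; here rows are objects and columns are attributes, so the covariance convention is transposed relative to the $XX^T$ form above):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))          # 100 objects, 5 attributes (rows = objects)

Xc = X - X.mean(axis=0)                # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Principal components are the rows of Vt; variances come from the singular values
explained_variance = S**2 / (len(X) - 1)
k = 2
X_reduced = Xc @ Vt[:k].T              # project onto the top-k components

print(explained_variance)
print(X_reduced.shape)                 # (100, 2)
```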
Linear Discriminant Analysis (LDA)
For 2 classes:
\[S_1 = \sum_{x \in X_1}(x-u_1)(x-u_1)^T, \quad S_2 = \sum_{x \in X_2}(x-u_2)(x-u_2)^T\]
\[S_W = S_1 + S_2, \quad S_B = (u_1-u_2)(u_1-u_2)^T\]
Objective function:
\[J(W) = \frac{W^T S_B W}{W^T S_W W}\]
Fisher's linear discriminant:
\[S_W^{-1} S_B W = \lambda W\]
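A rough 2-class numpy sketch of Fisher's linear discriminant (the random class data is illustrative); instead of solving the eigenproblem it uses the equivalent closed form $W \propto S_W^{-1}(u_1-u_2)$:

```python
import numpy as np

rng = np.random.default_rng(4)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # class 1 samples
X2 = rng.normal(loc=[3, 2], scale=1.0, size=(50, 2))   # class 2 samples

u1, u2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - u1).T @ (X1 - u1)            # within-class scatter of class 1
S2 = (X2 - u2).T @ (X2 - u2)            # within-class scatter of class 2
S_W = S1 + S2

# Fisher's direction: maximizes between-class over within-class scatter
W = np.linalg.solve(S_W, u1 - u2)

# Project both classes onto W; the 1-D projections should separate well
print((X1 @ W).mean(), (X2 @ W).mean())
```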
Feature Selection
Naive Approach
Measurement Function
Select features with large E or T
- Entropy: $E=-\sum_{i=1}^{N}\sum_{j=1}^{N}\left(s_{ij}\log(s_{ij})+(1-s_{ij})\log(1-s_{ij})\right)$
- T-statistic: $T(f_j)=\frac{|u_j^{1}-u_j^{-1}|}{\sqrt{\frac{(\delta_j^{1})^2}{n_{1}}+\frac{(\delta_j^{-1})^2}{n_{-1}}}}$
Framework
- Filter approach: select features, evaluate with statistics-criteria, done
- Wrapper approach: select features, build model to evaluate, done
- Embedded approach: build a model to select features and evaluate
Simulated Annealing
Particle Swarm Optimization
Introduction to Time Series
ARMA
ARIMA
prediction error: $e_{t+k|t} = Y_{t+k} - \hat Y_{t+k}$
OLAP, Classification
- On-Line Analytical Processing: use a multidimensional array representation
- slicing
- dicing
- roll-up: aggregate up into coarser groups
- drill-down: break a group out into finer detail
Classification
Tree Induction
- GINI index \(GINI(t) = 1-\Sigma_{j}[p(j|t)]^2\)
- Splitting \(GINI_split(t) = \Sigma_{i=1}^{k}\frac{n_i}{n}GINI(i)\)
- Entropy \(Entropy(t) = -\Sigma_jp(j|t)log(p(j|t))\)
- Splitting \(GAIN_{split} = Entropy(p)-\Sigma_{i=1}^{k}\frac{n_i}{n}Entropy(i)\)
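A tiny Python sketch of the GINI index and the split criterion (the class counts are made-up examples):

```python
import numpy as np

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p**2)

def gini_split(children):
    # weighted average of the child nodes' GINI values
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [7, 5]                     # 7 of class A, 5 of class B at the parent node
split = [[6, 1], [1, 4]]            # class counts in the two child nodes
print(gini(parent), gini_split(split))
```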
More on Classification
Rule-based
- Easy to interpret and generate
- Can classify new instances rapidly
k-Nearest Neighbors
- Euclidean Distance
- top-k nearest
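A bare-bones k-nearest-neighbor sketch (the toy training data and `k=3` are arbitrary):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from x to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # majority vote among the labels of the top-k nearest neighbors
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))   # expected "B"
```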
Bayes Classifier
\[P(C|A) = \frac{P(A|C)P(C)}{P(A)}\]
Naive Bayes Classifier
$$P(C \mid A_1,A_2,\dots,A_n) = \frac{P(A_1,A_2,\dots,A_n \mid C)P(C)}{P(A_1,A_2,\dots,A_n)}$$
$$P(A_1,A_2,\dots,A_n \mid C)=P(A_1 \mid C)P(A_2 \mid C)\cdots P(A_n \mid C)$$
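A hand-rolled Naive Bayes sketch on made-up categorical records, applying the conditional-independence assumption above (no smoothing, purely for illustration):

```python
# toy records: (Outlook, Windy, Play)
data = [("sunny", "no", "yes"), ("sunny", "yes", "no"),
        ("rain", "no", "yes"), ("rain", "yes", "no"),
        ("sunny", "no", "yes"), ("rain", "no", "yes")]

classes = {r[-1] for r in data}
prior = {c: sum(r[-1] == c for r in data) / len(data) for c in classes}

def likelihood(attr_index, value, c):
    # P(A_i = value | C = c), estimated from counts
    in_class = [r for r in data if r[-1] == c]
    return sum(r[attr_index] == value for r in in_class) / len(in_class)

def predict(a1, a2):
    # P(C | A1, A2) is proportional to P(A1|C) * P(A2|C) * P(C)
    scores = {c: likelihood(0, a1, c) * likelihood(1, a2, c) * prior[c] for c in classes}
    return max(scores, key=scores.get)

print(predict("sunny", "yes"))
```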
Support Vector Machine
- constraint \(y_i(W^Tx_i-b)-1 \geq 0\)
- objective (to maximize the margin): minimize \(\frac{\|W\|^2}{2}\)
- Lagrangian: \(L(W,b,\alpha) = \frac{\|W\|^2}{2} - \sum_{i=1}^{n}\alpha_i[y_i(W^Tx_i-b)-1]\)
Kernel Function
The function that computes the inner product of some (feature) space in terms of another (original) space. SVM can find a hyperplane to separate the data in a feature space induced by a kernel function.
Clustering
- cluster: high intra-class similarity, low inter-class similarity
- unsupervised classification
Partition around Medoids(PAM)
- Select k objects arbitrarily as medoids
- Compute the swapping cost for each pair of a medoid and a non-selected object
- Select the pair with the lowest cost; if the cost is negative, swap the medoid with the non-selected object and repeat from the previous step
- For each non-selected object, find the most similar medoid.
Hierarchical Clustering
agglomerative clustering
- compute the proximity matrix
- let each data point be a cluster
- do:
- merge the 2 closest
- update the proximity matrix
- until only one cluster (or the desired number of clusters) remains
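A short scipy sketch of agglomerative clustering on toy points; `linkage` performs the repeated merge-and-update loop, and `fcluster` cuts the resulting dendrogram at a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 9]])

# repeatedly merge the two closest clusters (single-link proximity here)
Z = linkage(X, method="single", metric="euclidean")

# cut the dendrogram so that 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```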
NN and PUF
NN
Check PyTorch.
Physically Unclonable Function(PUF)
- PUF: a promising security primitive for low-end devices in 5G
- Easy to evaluate, hard to predict
- A server sends a challenge to the PUF; the PUF responds, and the server checks whether the response is correct
Modeling attack
- APUF: an arbiter PUF can be modeled as a linear additive model (see the sketch below)
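A hedged sketch of a modeling attack on a simulated APUF: assuming the standard linear additive delay model, an $n$-bit challenge maps to a parity feature vector $\Phi$ and the response is the sign of $w^T\Phi$, so a logistic-regression model trained on challenge-response pairs can learn to predict the PUF (the sizes and the simulation itself are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n_bits, n_crps = 32, 5000

def parity_features(challenges):
    # Phi_i = prod_{j>=i} (1 - 2*c_j), plus a constant term (linear additive delay model)
    signs = 1 - 2 * challenges                     # map bits {0,1} -> {+1,-1}
    phi = np.cumprod(signs[:, ::-1], axis=1)[:, ::-1]
    return np.hstack([phi, np.ones((len(challenges), 1))])

# simulate a "secret" APUF with random delay weights
w_secret = rng.normal(size=n_bits + 1)
challenges = rng.integers(0, 2, size=(n_crps, n_bits))
responses = (parity_features(challenges) @ w_secret > 0).astype(int)

# modeling attack: learn the PUF from observed challenge-response pairs
X = parity_features(challenges)
model = LogisticRegression(max_iter=1000).fit(X[:4000], responses[:4000])
print("accuracy on unseen CRPs:", model.score(X[4000:], responses[4000:]))
```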
Web and Social Network Analysis
Cloud Computing
- Centralized infrastructure
- Virtualized architecture
- Usage-based resource allocation
- SaaS: software as a service
- PaaS: platform as a service
- IaaS: infrastructure as a service