Big data is one of the most widely discussed topics in the world today. I recommend the book Big Data, written by Viktor Mayer-Schönberger and Kenneth Cukier, which I read several months ago. As the authors put it, big data is a revolution that will transform how we live, work and think; it is not, as many people assume, just a new computing technology. That view is too narrow, even for a computer engineer.
Big data differs from traditional data analysis in three ways. First, the data are not drawn from a sample survey; we work with all the data we can reach. Second, we no longer have to guarantee that every record is completely correct. Third, we pay more attention to the correlations among events and stop obsessing over the causes behind them. For example, when we apply big-data thinking to medical treatment, we do not need to check millions or billions of records one by one, correct every false value, or fill in every blank attribute, as a traditional sample survey would require. We sacrifice some accuracy in exchange for a sample that is orders of magnitude larger. With a sample that large, we can set aside the supervised learning methods we used before and rely on probability and statistics instead, which helps us find many more valuable correlations between things that seemingly have nothing to do with each other.
Sampling, on the other hand, breaks down when the data are subdivided. Suppose we divide a sample of ten billion records by ten attributes, each with ten possible values. That gives 10^10 cells, so each cell holds only about one record on average. It is ridiculous to draw a conclusion from a cell that small, because it cannot be representative. What is more, the preferences of the investigators influence the composition of the sample in some way.
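The subdivision argument above can be checked with a few lines of arithmetic. The following is a minimal Python sketch; the function name `average_records_per_cell` and its parameters are illustrative assumptions, not something taken from the book. It assumes that every combination of attribute values forms one cell and that records spread evenly across the cells.

```python
def average_records_per_cell(num_records: int,
                             num_attributes: int,
                             values_per_attribute: int) -> float:
    """Average records per cell when the data is split by every
    combination of attribute values (hypothetical helper)."""
    # Each attribute multiplies the number of cells by its number of values.
    num_cells = values_per_attribute ** num_attributes
    return num_records / num_cells


if __name__ == "__main__":
    records = 10_000_000_000  # ten billion records, as in the example above
    # Show how quickly the cells thin out as attributes are added.
    for attributes in (1, 2, 5, 10):
        avg = average_records_per_cell(records, attributes, 10)
        print(f"{attributes:>2} attributes -> {avg:,.1f} records per cell")
```

Each extra attribute divides the average by another factor of ten, so even a very large sample thins out quickly once it is sliced along several dimensions.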
Even within computer science, big data is not a single technique. It is a framework that many techniques support, and we get different value depending on what we fill the framework with. Here are some skills that a big data engineer must master.
Basic Knowledge
1 MSE(Mean Square Error)
2 LMS(Least Mean Square)
3 LSM(Least Square Methods)
4 MLE(Maximum Likelihood Estimation)
5 QP(Quadratic Programming)
6 CP(Conditional Probability)
7 JP(Joint Probability)
8 MP(Marginal Probability)
9 Bayesian Formula
10 L1/L2 Regularization
11 GD(Gradient Descent)
12 SGD(Stochastic Gradient Descent)
13 Eigenvalue
14 Eigenvector
15 QR-decomposition
16 Quantile
17 Covariance
18 Matrix Calculation
Discrete Distribution
19 Bernoulli Distribution/Binomial Distribution
20 Negative Binomial Distribution
21 Multinomial Distribution
22 Geometric Distribution
23 Hypergeometric Distribution
24 Poisson Distribution
Continuous Distribution
25 Uniform Distribution
26 Normal Distribution/Gaussian Distribution
27 Exponential Distribution
28 Lognormal Distribution
29 Gamma Distribution
30 Beta Distribution
31 Dirichlet Distribution
32 Rayleigh Distribution
33 Cauchy Distribution
34 Weibull Distribution
Three Sampling Distributions
35 Chi-square Distribution
36 t-distribution
37 F-distribution
Data Pre-processing
39 Missing Value Imputation
40 Discretization Mapping
41 Normalization
Sampling
42 Simple Random Sampling
43 Offline Sampling
44 Online Sampling
45 Ratio-based Sampling
46 Acceptance-Rejection Sampling
47 Importance Sampling
48 MCMC(Markov Chain Monte Carlo: Metropolis-Hastings & Gibbs)
Optimization
49 Lagrange Multipliers
50 Non-constrained Optimization
51 Cyclic Variable Methods
52 Pattern Search Methods
53 Variable Simplex Methods
54 Gradient Descent Methods
55 Newton Methods
56 Quasi-Newton Methods
57 Conjugate Gradient Methods
Constrained Optimization
58 Approximation Programming Methods
59 Feasible Direction Methods
60 Penalty Function Methods
61 Multiplier Methods
62 Heuristic Algorithm
63 SA(Simulated Annealing)
64 GA(Genetic Algorithm)
Similarity Measure & Distance Measure
65 Euclidean Distance
66 Manhattan Distance
67 Chebyshev Distance
68 Minkowski Distance
69 Standardized Euclidean Distance
70 Mahalanobis Distance
71 Cos(Cosine)
72 Hamming Distance/Edit Distance
73 Jaccard Distance
74 Correlation Coefficient Distance
75 Information Entropy
76 KL(Kullback-Leibler Divergence/Relative Entropy)
Data Structure
77 Stack, Queue, List
78 Hash Table
79 Binary Tree, Red-Black Tree, B-Tree
80 Graph
Common Algorithm
81 Sort
82 Maximum Subarray
83 Longest Common Subsequence
84 Minimum Spanning Tree
85 Shortest Path
86 Matrix Storage and Computation
Consistency Algorithm (Distributed Theory)
87 Paxos
88 Raft
89 Gossip
Clustering
90 K-Means
91 K-Medoids
92 Bisecting K-Means
93 FK-Means
94 Canopy
95 Spectral-KMeans
96 GMM-EM
97 K-Prototypes/CLARANS
98 BIRCH
99 CURE
100 DBSCAN
101 CLIQUE
Classification & Regression
102 LR(Linear Regression)
103 LR(Logistic Regression)
104 SR(Softmax Regression)
105 GLM(Generalized Linear Model)
106 RR(Ridge Regression)
107 LASSO(Least Absolute Shrinkage and Selection Operator)
108 RF(Random Forest)
109 DT(Decision Tree)
110 GBDT(Gradient Boosting Decision Tree)
111 CART(Classification And Regression Tree)
112 KNN(K-Nearest Neighbor)
113 SVM(Support Vector Machine)
114 KF(Kernel Function)/Polynomial Kernel Function
115 Gaussian Kernel Function/RBF(Radial Basis Function)
116 String Kernel Function
117 NB(Naive Bayes)/BN(Bayesian Network/Bayesian Belief Network/Belief Network)
118 LDA(Linear Discriminant Analysis/Fisher Linear Discriminant)
119 EL(Ensemble Learning: Boosting, Bagging, Stacking)
120 AdaBoost(Adaptive Boosting)
121 MEM(Maximum Entropy Model)
Effectiveness Evaluation
122 Confusion Matrix
123 Precision, Recall
124 Accuracy, F-score
125 ROC Curve, AUC
126 Lift Curve, KS Curve
PGM(Probabilistic Graphical Models)
127 BN(Bayesian Network/Bayesian Belief Network/Belief Network)
128 MC(Markov Chain)
129 HMM(Hidden Markov Model)
130 MEMM(Maximum Entropy Markov Model)
131 CRF(Conditional Random Field)
132 MRF(Markov Random Field)
NN(Neural Network)
133 ANN(Artificial Neural Network)
134 BP(Error BackPropagation)
Deep Learning
135 Auto-encoder
136 SAE(Stacked Auto-encoders)
137 Sparse Auto-encoders
138 Denoising Auto-encoders
139 Contractive Auto-encoders
140 RBM(Restricted Boltzmann Machine)
141 DBN(Deep Belief Network)
142 CNN(Convolutional Neural Network)
143 Word2Vec
Dimensionality Reduction
144 LDA(Linear Discriminant Analysis/Fisher Linear Discriminant)
145 PCA(Principal Component Analysis)
146 ICA(Independent Component Analysis)
147 SVD(Singular Value Decomposition)
148 FA(Factor Analysis)
Text Mining
149 VSM(Vector Space Model)
150 Word2Vec
151 TF(Term Frequency)
152 TF-IDF(Term Frequency-Inverse Document Frequency)
153 MI(Mutual Information)
154 ECE(Expected Cross Entropy)
155 QEMI
156 IG(Information Gain)
157 IGR(Information Gain Ratio)
158 Gini
159 χ² Statistic
160 TEW(Text Evidence Weight)
161 OR(Odds Ratio)
162 N-Gram Model
163 LSA(Latent Semantic Analysis)
164 PLSA(Probabilistic Latent Semantic Analysis)
165 LDA(Latent Dirichlet Allocation)
Association Mining
166 Apriori
167 FP-growth(Frequent Pattern Tree Growth)
168 AprioriAll
169 SPADE
Recommendation Engine
170 DBR(Demographic-based Recommendation)
171 CBR(Content-based Recommendation)
172 CF(Collaborative Filtering)
173 UCF(User-based Collaborative Filtering Recommendation)
174 ICF(Item-based Collaborative Filtering Recommendation)
Feature Selection
175 Mutual Information
176 Document Frequency
177 Information Gain
178 Chi-squared Test
179 Gini
Outlier Detection
180 Statistic-based
181 Distance-based
182 Density-based
183 Clustering-based
Learning to Rank
184 Pointwise: McRank
185 Pairwise: Ranking SVM, RankNet, FRank, RankBoost
186 Listwise: AdaRank, SoftRank, LambdaMART
Tool
187 MATLAB
188 TensorFlow
189 Spark MLlib
Processing Platform
190 Hadoop
191 Spark
192 Weka
193 Flink
Data Warehouse And Data Analysis (SQL)
194 Pig
195 Hive
196 Spark SQL
197 Elasticsearch
Message Queue
198 Kafka
Stream Computing
199 Storm
200 Spark Streaming
Programming Language
201 Java
202 Python
203 R
Data Visualization
204 D3.js
Here is a system architecture diagram that I think will help readers understand the big data framework.
Email me if you want the original diagram (callouweicheng@gmail.com).