Front Matter
Foreword
DSPA Application and Use Disclaimer
2nd Edition Preface
Book Content
Notations
1 Chapter 1: Introduction
1.1 Motivation
1.1.1 DSPA Mission and Objectives
1.1.2 Examples of driving motivational problems and challenges
1.1.3 Common Characteristics of Big (Biomedical and Health) Data
1.1.4 Data Science
1.1.5 Predictive Analytics
1.1.6 High-throughput Big Data Analytics
1.1.7 Examples of data repositories, archives and services
1.1.8 Responsible Data Science and Ethical Predictive Analytics
1.1.9 DSPA Expectations
1.2 Foundations of R
1.2.1 Why use R?
1.2.2 Getting started with R
1.2.3 Mathematics, Statistics, and Optimization
1.2.4 Advanced Data Processing
1.2.5 Basic Plotting
1.2.6 Basic R Programming
1.2.7 Data Simulation Primer
1.3 Practice Problems
1.3.1 Long-to-Wide Data format translation
1.3.2 Data Frames
1.3.3 Data stratification
1.3.4 Simulation
1.3.5 Programming
1.4 Appendix
1.4.1 Tidyverse
1.4.2 Additional R documentation and resources
1.4.3 HTML SOCR Data Import
1.4.4 R Debugging
2 Chapter 2: Basic Visualization and Exploratory Data Analytics
2.1 Data Handling
2.1.1 Saving and Loading R Data Structures
2.1.2 Importing and Saving Data from CSV Files
2.1.3 Importing Data from ZIP and SAV Files
2.1.4 Exploring the Structure of Data
2.1.5 Exploring Numeric Variables
2.1.6 Measuring Central Tendency - mean, median, mode
2.1.7 Measuring Spread - variance, quartiles and the five-number summary
2.1.8 Visualizing Numeric Variables - boxplots
2.1.9 Visualizing Numeric Variables - histograms
2.1.10 Uniform and normal distributions
2.1.11 Exploring Categorical Variables
2.1.12 Exploring Relationships Between Variables
2.1.13 Missing Data
2.1.14 Parsing web pages and visualizing tabular HTML data
2.1.15 Cohort-Rebalancing (for Imbalanced Groups)
2.2 Exploratory Data Analytics (EDA)
2.2.1 Classification of visualization methods
2.2.2 Composition
2.2.3 Comparison
2.2.4 Relationships
2.3 Practice Problems
2.3.1 Data Manipulation
2.3.2 Bivariate relations
2.3.3 Missing data
2.3.4 Surface plots
2.3.5 Unbalanced groups
2.3.6 Common plots
2.3.7 Trees and Graphs
2.3.8 Data EDA examples
2.3.9 Data reports
3 Chapter 3: Linear Algebra, Matrix Computing and Regression Modeling
3.1 Linear Algebra
3.1.1 Building Matrices
3.1.2 Matrix subscripts
3.1.3 Addition and subtraction
3.1.4 Multiplication
3.2 Matrix Computing
3.2.1 Solving Systems of Equations
3.2.2 The identity matrix
3.2.3 Vectors, Matrices, and Scalars
3.2.4 Sample Statistics
3.2.5 Applications of Matrix Algebra in Linear Modeling
3.2.6 Finding function extrema (min/max) using calculus
3.2.7 Linear modeling in R
3.3 Eigenspectra - Eigenvalues and Eigenvectors
3.4 Matrix notation
3.5 Linear regression
3.5.1 Sample covariance matrix
3.6 Multivariate linear regression modeling
3.6.1 Simple linear regression
3.6.2 Ordinary least squares estimation
3.6.3 Regression Model Assumptions
3.6.4 Correlations
3.6.5 Multiple Linear Regression
3.7 Case Study 1: Baseball Players
3.7.1 Step 1 - collecting data
3.7.2 Step 2 - exploring and preparing the data
3.7.3 Step 3 - training a model on the data
3.7.4 Step 4 - evaluating model performance
3.7.5 Step 5 - improving model performance
3.8 Regression trees and model trees
3.8.1 Adding regression to trees
3.9 Bayesian Additive Regression Trees (BART)
3.9.1 1D Simulation
3.9.2 Higher-Dimensional Simulation
3.9.3 Heart Attack Hospitalization Case-Study
3.9.4 Another look at Case Study 1: Baseball Players
3.10 Practice Problems
3.10.1 How is matrix multiplication defined?
3.10.2 Scalar vs. Matrix Multiplication
3.10.3 Matrix Equations
3.10.4 Least Square Estimation
3.10.5 Matrix manipulation
3.10.6 Matrix Transposition
3.10.7 Sample Statistics
3.10.8 Eigenvalues and Eigenvectors
3.10.9 Regression Forecasting using Numerical Data
4 Chapter 4: Linear and Nonlinear Dimensionality Reduction
4.1 Motivational Example: Reducing 2D to 1D
4.2 Matrix Rotations
4.3 Summary (PCA, ICA, and FA)
4.4 Principal Component Analysis (PCA)
4.4.1 Principal Components
4.5 Independent component analysis (ICA)
4.6 Factor Analysis (FA)
4.7 Singular Value Decomposition (SVD)
4.7.1 SVD Summary
4.8 t-distributed Stochastic Neighbor Embedding (t-SNE)
4.8.1 t-SNE Formulation
4.8.2 t-SNE Example: Hand-written Digit Recognition
4.9 Uniform Manifold Approximation and Projection (UMAP)
4.9.1 Mathematical formulation
4.9.2 Hand-Written Digits Recognition
4.9.3 Applying UMAP for class prediction using new data
4.10 UMAP Parameters
4.10.1 Stability, Replicability, and Reproducibility
4.10.2 UMAP Interpretation
4.11 Dimensionality Reduction Case Study (Parkinson's Disease)
4.11.1 Step 1: Collecting Data
4.11.2 Step 2: Exploring and preparing the data
4.11.3 PCA
4.11.4 Factor analysis (FA)
4.11.5 t-SNE
4.11.6 Uniform Manifold Approximation and Projection (UMAP)
4.12 Practice Problems
4.12.1 Parkinson's Disease example
4.12.2 Allometric Relations in Plants example
4.12.3 3D Volumetric Brain Study
5 Chapter 5: Supervised Classification
5.1 k-Nearest Neighbor Approach
5.2 Distance Function and Dummy Coding
5.2.1 Estimation of the hyperparameter k
5.2.2 Rescaling of the features
5.2.3 Rescaling Formulas
5.2.4 Case Study: Youth Development
5.2.5 Case Study: Predicting Galaxy Spins
5.3 Probabilistic Learning - Naïve Bayes Classification
5.3.1 Overview of the Naïve Bayes Method
5.3.2 Model Assumptions
5.3.3 Bayes Formula
5.3.4 The Laplace Estimator
5.3.5 Case Study: Head and Neck Cancer Medication
5.4 Decision Trees and Divide and Conquer Classification
5.4.1 Motivation
5.4.2 Decision Tree Overview
5.4.3 Case Study 1: Quality of Life and Chronic Disease
5.4.4 Classification rules
5.5 Case Study 2: QoL in Chronic Disease (Take 2)
5.6 Practice Problems
5.6.1 Iris Species
5.6.2 Cancer Study
5.6.3 Baseball Data
5.6.4 Medical Specialty Text-Notes Classification
5.6.5 Chronic Disease Case-Study
6 Chapter 6: Black Box Machine Learning Methods
6.1 Neural Networks
6.1.1 From biological to artificial neurons
6.1.2 Activation functions
6.2 Network topology
6.2.1 Network layers
6.2.2 Training neural networks with backpropagation
6.2.3 Case Study 1: Google Trends and the Stock Market - Regression
6.2.4 Simple NN demo - learning to compute
6.2.5 Case Study 2: Google Trends and the Stock Market - Classification
6.3 Support Vector Machines (SVM)
6.3.1 Classification with hyperplanes
6.3.2 Case Study 3: Optical Character Recognition (OCR)
6.3.3 Case Study 4: Iris Flowers
6.3.4 Parameter Tuning
6.3.5 Improving the performance of Gaussian kernels
6.4 Ensemble meta-learning
6.4.1 Bagging
6.4.2 Boosting
6.4.3 Random forests
6.4.4 Random Forest Algorithm (Pseudo Code)
6.4.5 Adaptive boosting
6.5 Practice Problems
6.5.1 Problem 1: Google Trends and the Stock Market
6.5.2 Problem 2: Quality of Life and Chronic Disease
7 Chapter 7: Qualitative Learning Methods - Text Mining, Natural Language Processing, and Apriori Association Rules Learning
7.1 Natural Language Processing (NLP) and Text Mining (TM)
7.1.1 A simple NLP/TM example
7.1.2 Case-Study: Job ranking
7.1.3 Area Under ROC Curve
7.1.4 TF-IDF
7.1.5 Cosine similarity
7.1.6 Sentiment analysis
7.1.7 NLP/TM Analytics
7.2 Apriori Association Rules Learning
7.2.1 Association Rules
7.2.2 The Apriori algorithm for association rule learning
7.2.3 Rule support and confidence
7.2.4 Building a set of rules with the Apriori principle
7.2.5 A toy example
7.2.6 Case Study 1: Head and Neck Cancer Medications
7.2.7 Graphical depiction of association rules
7.2.8 Saving association rules to a file or a data frame
7.3 Summary
7.4 Practice Problems
7.4.1 Groceries
7.4.2 Titanic Passengers
8 Chapter 8: Unsupervised Clustering
8.1 ML Clustering
8.2 Silhouette plots
8.3 The k-Means Clustering Algorithm
8.3.1 Pseudocode
8.3.2 Choosing the appropriate number of clusters
8.3.3 Case Study 1: Divorce and Consequences on Young Adults
8.3.4 Model improvement
8.3.5 Case Study 2: Pediatric Trauma
8.3.6 Feature selection for k-Means clustering
8.4 Hierarchical Clustering
8.5 Spectral Clustering
8.5.1 Image segmentation using spectral clustering
8.5.2 Point cloud segmentation using spectral clustering
8.6 Gaussian mixture models
8.7 Summary
8.8 Practice Problems
8.8.1 Youth Development
9 Chapter 9: Model Performance Assessment, Validation, and Improvement
9.1 Measuring the performance of classification methods
9.2 Evaluation strategies
9.2.1 Binary outcomes
9.2.2 Cross-tables, contingency tables, and confusion-matrices
9.2.3 Other measures of performance beyond accuracy
9.2.4 Visualizing performance tradeoffs (ROC Curve)
9.3 Estimating future performance (internal statistical cross-validation)
9.3.1 The holdout method
9.3.2 Cross-validation
9.3.3 Bootstrap sampling
9.4 Improving model performance by parameter tuning
9.4.1 Using caret for automated parameter tuning
9.5 Customizing the tuning process
9.6 Comparing the performance of several alternative models
9.7 Forecasting types and assessment approaches
9.7.1 Overfitting
9.8 Internal Statistical Cross-validation
9.8.1 Example (Linear Regression)
9.8.2 Cross-validation methods
9.8.3 Case-Studies
9.8.4 Summary of CV output
9.8.5 Alternative predictor functions
9.8.6 Foundation of LDA and QDA for prediction, dimensionality reduction, or forecasting
9.8.7 Comparing multiple classifiers
10 Chapter 10: Specialized Machine Learning Topics
10.1 Working with specialized data and databases
10.1.1 Data format conversion
10.1.2 Querying data in SQL databases
10.1.3 SPARQL Queries
10.1.4 Real Random Number Generation
10.1.5 Downloading the complete text of web pages
10.1.6 Reading and writing XML with the XML package
10.1.7 Web-page Data Scraping
10.1.8 Parsing JSON from web APIs
10.1.9 Reading and writing Microsoft Excel spreadsheets using XLSX
10.2 Working with domain-specific data
10.2.1 Working with bioinformatics data
10.2.2 Visualizing network data
10.3 Data Streaming
10.3.1 Definition
10.3.2 The stream package
10.3.3 Synthetic example - random Gaussian stream
10.3.4 Generating the stream
10.3.5 Sources of Data Streams
10.3.6 Printing, plotting, and saving streams
10.3.7 Stream animation
10.3.8 Case-Study: SOCR Knee Pain Data
10.3.9 Data Stream clustering and classification (DSC)
10.3.10 Evaluation of data stream clustering
10.4 Optimizing and improving computational performance
10.4.1 Generalizing tabular data structures with dplyr
10.4.2 Making data frames faster with data.table
10.4.3 Creating disk-based data frames with ff
10.4.4 Using massive matrices with bigmemory
10.5 Parallel computing
10.5.1 Measuring execution time
10.5.2 Parallel processing with multiple cores
10.5.3 Parallelization using foreach and doParallel
10.5.4 GPU computing
10.6 Deploying optimized learning algorithms
10.6.1 Building bigger regression models with biglm
10.6.2 Growing bigger and faster random forests with bigrf
10.6.3 Training and evaluating models in parallel with caret
10.7 R Notebook support for other programming languages
10.7.1 R-Python integration
10.7.2 Installing Python
10.7.3 Install the reticulate package
10.7.4 Installing and importing Python Modules
10.7.5 Python-based data modeling
10.7.6 Visualization of the results in R
10.7.7 R integration with C/C++
10.8 Practice Problem
11 Chapter 11: Variable Importance and Feature Selection
11.1 Feature selection methods
11.1.1 Filtering techniques
11.1.2 Wrapper
11.1.3 Embedded Techniques
11.1.4 Random Forest Feature Selection
11.1.5 Case Study - ALS
11.2 Regularized Linear Modeling and Controlled Variable Selection
11.2.1 General Questions
11.2.2 Model Regularization
11.2.3 Matrix notation
11.2.4 Regularized Linear Modeling
11.2.5 Predictor Standardization
11.2.6 Estimation Goals
11.2.7 Linear Regression
11.2.8 Drawbacks of Linear Regression
11.2.9 Variable Selection
11.2.10 Simple Regularization Framework
11.2.11 General Regularization Framework
11.2.12 Likelihood Ratio Test (LRT), False Discovery Rate (FDR), and Logistic Transform
11.2.13 Logistic Transformation
11.2.14 Implementation of Regularization
11.2.15 Computational Complexity
11.2.16 LASSO and Ridge Solution Paths
11.2.17 Regression Solution Paths - Ridge vs. LASSO
11.2.18 Choice of the Regularization Parameter
11.2.19 Cross Validation Motivation
11.2.20 n-Fold Cross Validation
11.2.21 LASSO 10-Fold Cross Validation
11.2.22 Stepwise OLS (ordinary least squares)
11.2.23 Final Models
11.2.24 Model Performance
11.2.25 Summary
11.3 Knockoff Filtering (FDR-Controlled Feature Selection)
11.3.1 Simulated Knockoff Example
11.3.2 Knockoff invocation
11.3.3 PD Neuroimaging-genetics Case-Study
11.4 Practice Problems
12 Chapter 12: Big Longitudinal Data Analysis
12.1 Classical Time-Series Analytic Approaches
12.1.1 Time series analysis
12.1.2 Structural Equation Modeling (SEM) - latent variables
12.1.3 Longitudinal data analysis - Linear Mixed Model
12.1.4 Generalized estimating equations (GEE)
12.1.5 PD/PPMI Case-Study: SEM, GLMM, and GEE modeling
12.2 Network-based Approaches
12.2.1 Background
12.2.2 Recurrent Neural Networks (RNN)
12.2.3 Tensor Format Representation
12.2.4 Simulated RNN case-study
12.2.5 Climate Data Study
12.2.6 Keras-based Multi-covariate LSTM Time-series Analysis and Forecasting
13 Chapter 13: Function Optimization
13.1 General optimization approach
13.1.1 First-order Gradient-based Optimization
13.1.2 Second-order Hessian-based Optimization
13.1.3 Gradient-free Optimization
13.2 Free (unconstrained) optimization
13.2.1 Example 1: minimizing a univariate function (inverse-CDF)
13.2.2 Example 2: minimizing a bivariate function
13.2.3 Example 3: using simulated annealing to find the maximum of an oscillatory function
13.3 Constrained Optimization
13.3.1 Equality constraints
13.3.2 Lagrange Multipliers
13.3.3 Inequality constrained optimization
13.3.4 Quadratic Programming (QP)
13.4 General Nonlinear Optimization
13.4.1 Dual problem optimization
13.5 Manual vs. Automated Lagrange Multiplier Optimization
13.6 Data Denoising
13.7 Sparse Matrices
13.8 Parallel Computing
13.9 Foundational Methods for Function Optimization
13.9.1 Basics
13.9.2 Gradient Descent
13.9.3 Convexity
13.9.4 Foundations of the Newton-Raphson Method
13.9.5 Stochastic Gradient Descent
13.9.6 Simulated Annealing (SANN)
13.10 Hands-on Examples
13.10.1 Example 1: Healthcare Manufacturer Product Optimization
13.10.2 Example 2: Optimization of Booth's Function
13.10.3 Example 3: Extrema of the bivariate Goldstein-Price Function
13.10.4 Example 4: Bivariate Oscillatory Function
13.10.5 Nonlinear Constraint Optimization Problem
13.11 Examples of explicit optimization use in AI/ML
13.12 Practice Problems
14 Chapter 14: Deep Learning, Neural Networks
14.1 Perceptrons
14.2 Biological Relevance
14.3 Simple Neural Net Examples
14.3.1 Exclusive OR (XOR) Operator
14.3.2 NAND Operator
14.3.3 Complex networks designed using simple building blocks
14.4 Neural Network Modeling using Keras
14.4.1 Iterations - Samples, Batches and Epochs
14.4.2 Use-Case: Predicting Titanic Passenger Survival
14.4.3 EDA/Visualization
14.4.4 Data Preprocessing
14.4.5 Keras Modeling
14.4.6 NN Model Fitting
14.4.7 Convolutional Neural Networks (CNNs)
14.4.8 Model Exploration
14.4.9 Passenger Survival Forecasting using New Data
14.4.10 Fine-tuning the NN Model
14.4.11 Model Export and Import
14.5 Case-Studies
14.5.1 Classification example using Sonar data
14.5.2 Schizophrenia Neuroimaging Study
14.5.3 ALS regression example
14.5.4 IBS Study
14.5.5 Country QoL Ranking Data
14.5.6 Handwritten Digits Classification
14.6 Classifying Real-World Images using Pre-Trained Tensorflow and Keras Models
14.6.1 Load the Pre-trained Model
14.6.2 Load and Preprocess a New Image
14.6.3 Image Classification
14.6.4 Additional Image Classification Examples
14.7 Data Generation: simulating synthetic data
14.7.1 Fractal shapes
14.7.2 Fake images
14.7.3 Generative Adversarial Networks (GANs)
14.8 Transfer Learning
14.8.1 Text Classification using Deep Network Transfer Learning
14.8.2 Multinomial Transfer Learning classification of Clinical Text
14.8.3 Binary Classification of Film Reviews
14.9 Image classification
14.9.1 Performance Metrics
14.9.2 Torch Deep Convolutional Neural Network (CNN)
14.9.3 Tensorflow Image Pre-processing Pipeline
14.10 Additional References
14.11 Practice Problems
14.11.1 Deep learning Classification
14.11.2 Deep learning Regression
14.11.3 Image classification
14.11.4 (Challenging Problem) Deep Convolutional Networks for 3D Volume Segmentation
15 Summary
16 Electronic Appendix
17 Glossary
18 Index