Front Matter
Foreword
DSPA Application and Use Disclaimer
2nd Edition Preface
Book Content
Notations
1 Chapter 1: Introduction
1.1 Motivation
1.1.1 DSPA Mission and Objectives
1.1.2 Examples of driving motivational problems and challenges
1.1.3 Common Characteristics of Big (Biomedical and Health) Data
1.1.4 Data Science
1.1.5 Predictive Analytics
1.1.6 High-throughput Big Data Analytics
1.1.7 Examples of data repositories, archives and services
1.1.8 Responsible Data Science and Ethical Predictive Analytics
1.1.9 DSPA Expectations
1.2 Foundations of R
1.2.1 Why use R?
1.2.2 Getting started with R
1.2.3 Mathematics, Statistics, and Optimization
1.2.4 Advanced Data Processing
1.2.5 Basic Plotting
1.2.6 Basic R Programming
1.2.7 Data Simulation Primer
1.3 Practice Problems
1.3.1 Long-to-Wide Data format translation
1.3.2 Data Frames
1.3.3 Data stratification
1.3.4 Simulation
1.3.5 Programming
1.4 Appendix
1.4.1 Tidyverse
1.4.2 Additional R documentation and resources
1.4.3 HTML SOCR Data Import
1.4.4 R Debugging
2 Chapter 2: Basic Visualization and Exploratory Data Analytics
2.1 Data Handling
2.1.1 Saving and Loading R Data Structures
2.1.2 Importing and Saving Data from CSV Files
2.1.3 Importing Data from ZIP and SAV Files
2.1.4 Exploring the Structure of Data
2.1.5 Exploring Numeric Variables
2.1.6 Measuring Central Tendency - mean, median, mode
2.1.7 Measuring Spread - variance, quartiles and the five-number summary
2.1.8 Visualizing Numeric Variables - boxplots
2.1.9 Visualizing Numeric Variables - histograms
2.1.10 Uniform and normal distributions
2.1.11 Exploring Categorical Variables
2.1.12 Exploring Relationships Between Variables
2.1.13 Missing Data
2.1.14 Parsing web pages and visualizing tabular HTML data
2.1.15 Cohort-Rebalancing (for Imbalanced Groups)
2.2 Exploratory Data Analytics (EDA)
2.2.1 Classification of visualization methods
2.2.2 Composition
2.2.3 Comparison
2.2.4 Relationships
2.3 Practice Problems
2.3.1 Data Manipulation
2.3.2 Bivariate relations
2.3.3 Missing data
2.3.4 Surface plots
2.3.5 Unbalanced groups
2.3.6 Common plots
2.3.7 Trees and Graphs
2.3.8 Data EDA examples
2.3.9 Data reports
3 Chapter 3: Linear Algebra, Matrix Computing and Regression Modeling
3.1 Linear Algebra
3.1.1 Building Matrices
3.1.2 Matrix subscripts
3.1.3 Addition and subtraction
3.1.4 Multiplication
3.2 Matrix Computing
3.2.1 Solving Systems of Equations
3.2.2 The identity matrix
3.2.3 Vectors, Matrices, and Scalars
3.2.4 Sample Statistics
3.2.5 Applications of Matrix Algebra in Linear Modeling
3.2.6 Finding function extrema (min/max) using calculus
3.2.7 Linear modeling in R
3.3 Eigenspectra - Eigenvalues and Eigenvectors
3.4 Matrix notation
3.5 Linear regression
3.5.1 Sample covariance matrix
3.6 Multivariate linear regression modeling
3.6.1 Simple linear regression
3.6.2 Ordinary least squares estimation
3.6.3 Regression Model Assumptions
3.6.4 Correlations
3.6.5 Multiple Linear Regression
3.7 Case Study 1: Baseball Players
3.7.1 Step 1 - collecting data
3.7.2 Step 2 - exploring and preparing the data
3.7.3 Step 3 - training a model on the data
3.7.4 Step 4 - evaluating model performance
3.7.5 Step 5 - improving model performance
3.8 Regression trees and model trees
3.8.1 Adding regression to trees
3.9 Bayesian Additive Regression Trees (BART)
3.9.1 1D Simulation
3.9.2 Higher-Dimensional Simulation
3.9.3 Heart Attack Hospitalization Case-Study
3.9.4 Another look at Case Study 1: Baseball Players
3.10 Practice Problems
3.10.1 How is matrix multiplication defined?
3.10.2 Scalar vs. Matrix Multiplication
3.10.3 Matrix Equations
3.10.4 Least Square Estimation
3.10.5 Matrix manipulation
3.10.6 Matrix Transposition
3.10.7 Sample Statistics
3.10.8 Eigenvalues and Eigenvectors
3.10.9 Regression Forecasting using Numerical Data
4 Chapter 4: Linear and Nonlinear Dimensionality Reduction
4.1 Motivational Example: Reducing 2D to 1D
4.2 Matrix Rotations
4.3 Summary (PCA, ICA, and FA)
4.4 Principal Component Analysis (PCA)
4.4.1 Principal Components
4.5 Independent component analysis (ICA)
4.6 Factor Analysis (FA)
4.7 Singular Value Decomposition (SVD)
4.7.1 SVD Summary
4.8 t-distributed Stochastic Neighbor Embedding (t-SNE)
4.8.1 t-SNE Formulation
4.8.2 t-SNE Example: Hand-written Digit Recognition
4.9 Uniform Manifold Approximation and Projection (UMAP)
4.9.1 Mathematical formulation
4.9.2 Hand-Written Digits Recognition
4.9.3 Applying UMAP for class prediction using new data
4.10 UMAP Parameters
4.10.1 Stability, Replicability, and Reproducibility
4.10.2 UMAP Interpretation
4.11 Dimensionality Reduction Case Study (Parkinson's Disease)
4.11.1 Step 1: Collecting Data
4.11.2 Step 2: Exploring and preparing the data
4.11.3 PCA
4.11.4 Factor analysis (FA)
4.11.5 t-SNE
4.11.6 Uniform Manifold Approximation and Projection (UMAP)
4.12 Practice Problems
4.12.1 Parkinson's Disease example
4.12.2 Allometric Relations in Plants example
4.12.3 3D Volumetric Brain Study
5 Chapter 5: Supervised Classification
5.1 k-Nearest Neighbor Approach
5.2 Distance Function and Dummy Coding
5.2.1 Estimation of the hyperparameter k
5.2.2 Rescaling of the features
5.2.3 Rescaling Formulas
5.2.4 Case Study: Youth Development
5.2.5 Case Study: Predicting Galaxy Spins
5.3 Probabilistic Learning - Naïve Bayes Classification
5.3.1 Overview of the Naïve Bayes Method
5.3.2 Model Assumptions
5.3.3 Bayes Formula
5.3.4 The Laplace Estimator
5.3.5 Case Study: Head and Neck Cancer Medication
5.4 Decision Trees and Divide and Conquer Classification
5.4.1 Motivation
5.4.2 Decision Tree Overview
5.4.3 Case Study 1: Quality of Life and Chronic Disease
5.4.4 Classification rules
5.5 Case Study 2: QoL in Chronic Disease (Take 2)
5.6 Practice Problems
5.6.1 Iris Species
5.6.2 Cancer Study
5.6.3 Baseball Data
5.6.4 Medical Specialty Text-Notes Classification
5.6.5 Chronic Disease Case-Study
6 Chapter 6: Black Box Machine Learning Methods
6.1 Neural Networks
6.1.1 From biological to artificial neurons
6.1.2 Activation functions
6.2 Network topology
6.2.1 Network layers
6.2.2 Training neural networks with backpropagation
6.2.3 Case Study 1: Google Trends and the Stock Market - Regression
6.2.4 Simple NN demo - learning to compute
6.2.5 Case Study 2: Google Trends and the Stock Market - Classification
6.3 Support Vector Machines (SVM)
6.3.1 Classification with hyperplanes
6.3.2 Case Study 3: Optical Character Recognition (OCR)
6.3.3 Case Study 4: Iris Flowers
6.3.4 Parameter Tuning
6.3.5 Improving the performance of Gaussian kernels
6.4 Ensemble meta-learning
6.4.1 Bagging
6.4.2 Boosting
6.4.3 Random forests
6.4.4 Random Forest Algorithm (Pseudo Code)
6.4.5 Adaptive boosting
6.5 Practice Problems
6.5.1 Problem 1: Google Trends and the Stock Market
6.5.2 Problem 2: Quality of Life and Chronic Disease
7 Chapter 7: Qualitative Learning Methods - Text Mining, Natural Language Processing, and Apriori Association Rules Learning
7.1 Natural Language Processing (NLP) and Text Mining (TM)
7.1.1 A simple NLP/TM example
7.1.2 Case-Study: Job ranking
7.1.3 Area Under ROC Curve
7.1.4 TF-IDF
7.1.5 Cosine similarity
7.1.6 Sentiment analysis
7.1.7 NLP/TM Analytics
7.2 Apriori Association Rules Learning
7.2.1 Association Rules
7.2.2 The Apriori algorithm for association rule learning
7.2.3 Rule support and confidence
7.2.4 Building a set of rules with the Apriori principle
7.2.5 A toy example
7.2.6 Case Study 1: Head and Neck Cancer Medications
7.2.7 Graphical depiction of association rules
7.2.8 Saving association rules to a file or a data frame
7.3 Summary
7.4 Practice Problems
7.4.1 Groceries
7.4.2 Titanic Passengers
8 Chapter 8: Unsupervised Clustering
8.1 ML Clustering
8.2 Silhouette plots
8.3 The k-Means Clustering Algorithm
8.3.1 Pseudocode
8.3.2 Choosing the appropriate number of clusters
8.3.3 Case Study 1: Divorce and Consequences on Young Adults
8.3.4 Model improvement
8.3.5 Case Study 2: Pediatric Trauma
8.3.6 Feature selection for k-Means clustering
8.4 Hierarchical Clustering
8.5 Spectral Clustering
8.5.1 Image segmentation using spectral clustering
8.5.2 Point cloud segmentation using spectral clustering
8.6 Gaussian mixture models
8.7 Summary
8.8 Practice Problems
8.8.1 Youth Development
9 Chapter 9: Model Performance Assessment, Validation, and Improvement
9.1 Measuring the performance of classification methods
9.2 Evaluation strategies
9.2.1 Binary outcomes
9.2.2 Cross-tables, contingency tables, and confusion-matrices
9.2.3 Other measures of performance beyond accuracy
9.2.4 Visualizing performance tradeoffs (ROC Curve)
9.3 Estimating future performance (internal statistical cross-validation)
9.3.1 The holdout method
9.3.2 Cross-validation
9.3.3 Bootstrap sampling
9.4 Improving model performance by parameter tuning
9.4.1 Using caret for automated parameter tuning
9.5 Customizing the tuning process
9.6 Comparing the performance of several alternative models
9.7 Forecasting types and assessment approaches
9.7.1 Overfitting
9.8 Internal Statistical Cross-validation
9.8.1 Example (Linear Regression)
9.8.2 Cross-validation methods
9.8.3 Case-Studies
9.8.4 Summary of CV output
9.8.5 Alternative predictor functions
9.8.6 Foundation of LDA and QDA for prediction, dimensionality reduction, or forecasting
9.8.7 Comparing multiple classifiers
10 Chapter 10: Specialized Machine Learning Topics
10.1 Working with specialized data and databases
10.1.1 Data format conversion
10.1.2 Querying data in SQL databases
10.1.3 SPARQL Queries
10.1.4 Real Random Number Generation
10.1.5 Downloading the complete text of web pages
10.1.6 Reading and writing XML with the XML package
10.1.7 Web-page Data Scraping
10.1.8 Parsing JSON from web APIs
10.1.9 Reading and writing Microsoft Excel spreadsheets using XLSX
10.2 Working with domain-specific data
10.2.1 Working with bioinformatics data
10.2.2 Visualizing network data
10.3 Data Streaming
10.3.1 Definition
10.3.2 The stream package
10.3.3 Synthetic example - random Gaussian stream
10.3.4 Generating the stream
10.3.5 Sources of Data Streams
10.3.6 Printing, plotting, and saving streams
10.3.7 Stream animation
10.3.8 Case-Study: SOCR Knee Pain Data
10.3.9 Data Stream clustering and classification (DSC)
10.3.10 Evaluation of data stream clustering
10.4 Optimizing and improving computational performance
10.4.1 Generalizing tabular data structures with dplyr
10.4.2 Making data frames faster with data.table
10.4.3 Creating disk-based data frames with ff
10.4.4 Using massive matrices with bigmemory
10.5 Parallel computing
10.5.1 Measuring execution time
10.5.2 Parallel processing with multiple cores
10.5.3 Parallelization using foreach and doParallel
10.5.4 GPU computing
10.6 Deploying optimized learning algorithms
10.6.1 Building bigger regression models with biglm
10.6.2 Growing bigger and faster random forests with bigrf
10.6.3 Training and evaluating models in parallel with caret
10.7 R Notebook support for other programming languages
10.7.1 R-Python integration
10.7.2 Installing Python
10.7.3 Install the reticulate package
10.7.4 Installing and importing Python Modules
10.7.5 Python-based data modeling
10.7.6 Visualization of the results in R
10.7.7 R integration with C/C++
10.8 Practice Problem
11 Chapter 11: Variable Importance and Feature Selection
11.1 Feature selection methods
11.1.1 Filtering techniques
11.1.2 Wrapper
11.1.3 Embedded Techniques
11.1.4 Random Forest Feature Selection
11.1.5 Case Study - ALS
11.2 Regularized Linear Modeling and Controlled Variable Selection
11.2.1 General Questions
11.2.2 Model Regularization
11.2.3 Matrix notation
11.2.4 Regularized Linear Modeling
11.2.5 Predictor Standardization
11.2.6 Estimation Goals
11.2.7 Linear Regression
11.2.8 Drawbacks of Linear Regression
11.2.9 Variable Selection
11.2.10 Simple Regularization Framework
11.2.11 General Regularization Framework
11.2.12 Likelihood Ratio Test (LRT), False Discovery Rate (FDR), and Logistic Transform
11.2.13 Logistic Transformation
11.2.14 Implementation of Regularization
11.2.15 Computational Complexity
11.2.16 LASSO and Ridge Solution Paths
11.2.17 Regression Solution Paths - Ridge vs. LASSO
11.2.18 Choice of the Regularization Parameter
11.2.19 Cross Validation Motivation
11.2.20 n-Fold Cross Validation
11.2.21 LASSO 10-Fold Cross Validation
11.2.22 Stepwise OLS (ordinary least squares)
11.2.23 Final Models
11.2.24 Model Performance
11.2.25 Summary
11.3 Knockoff Filtering (FDR-Controlled Feature Selection)
11.3.1 Simulated Knockoff Example
11.3.2 Knockoff invocation
11.3.3 PD Neuroimaging-genetics Case-Study
11.4 Practice Problems
12 Chapter 12: Big Longitudinal Data Analysis
12.1 Classical Time-Series Analytic Approaches
12.1.1 Time series analysis
12.1.2 Structural Equation Modeling (SEM) - latent variables
12.1.3 Longitudinal data analysis - Linear Mixed Model
12.1.4 Generalized estimating equations (GEE)
12.1.5 PD/PPMI Case-Study: SEM, GLMM, and GEE modeling
12.2 Network-based Approaches
12.2.1 Background
12.2.2 Recurrent Neural Networks (RNN)
12.2.3 Tensor Format Representation
12.2.4 Simulated RNN case-study
12.2.5 Climate Data Study
12.2.6 Keras-based Multi-covariate LSTM Time-series Analysis and Forecasting
13 Chapter 13: Function Optimization
13.1 General optimization approach
13.1.1 First-order Gradient-based Optimization
13.1.2 Second-order Hessian-based Optimization
13.1.3 Gradient-free Optimization
13.2 Free (unconstrained) optimization
13.2.1 Example 1: minimizing a univariate function (inverse-CDF)
13.2.2 Example 2: minimizing a bivariate function
13.2.3 Example 3: using simulated annealing to find the maximum of an oscillatory function
13.3 Constrained Optimization
13.3.1 Equality constraints
13.3.2 Lagrange Multipliers
13.3.3 Inequality constrained optimization
13.3.4 Quadratic Programming (QP)
13.4 General Nonlinear Optimization
13.4.1 Dual problem optimization
13.5 Manual vs. Automated Lagrange Multiplier Optimization
13.6 Data Denoising
13.7 Sparse Matrices
13.8 Parallel Computing
13.9 Foundational Methods for Function Optimization
13.9.1 Basics
13.9.2 Gradient Descent
13.9.3 Convexity
13.9.4 Foundations of the Newton-Raphson Method
13.9.5 Stochastic Gradient Descent
13.9.6 Simulated Annealing (SANN)
13.10 Hands-on Examples
13.10.1 Example 1: Healthcare Manufacturer Product Optimization
13.10.2 Example 2: Optimization of Booth's Function
13.10.3 Example 3: Extrema of the bivariate Goldstein-Price Function
13.10.4 Example 4: Bivariate Oscillatory Function
13.10.5 Nonlinear Constraint Optimization Problem
13.11 Examples of explicit optimization use in AI/ML
13.12 Practice Problems
14 Chapter 14: Deep Learning, Neural Networks
14.1 Perceptrons
14.2 Biological Relevance
14.3 Simple Neural Net Examples
14.3.1 Exclusive OR (XOR) Operator
14.3.2 NAND Operator
14.3.3 Complex networks designed using simple building blocks
14.4 Neural Network Modeling using Keras
14.4.1 Iterations - Samples, Batches and Epochs
14.4.2 Use-Case: Predicting Titanic Passenger Survival
14.4.3 EDA/Visualization
14.4.4 Data Preprocessing
14.4.5 Keras Modeling
14.4.6 NN Model Fitting
14.4.7 Convolutional Neural Networks (CNNs)
14.4.8 Model Exploration
14.4.9 Passenger Survival Forecasting using New Data
14.4.10 Fine-tuning the NN Model
14.4.11 Model Export and Import
14.5 Case-Studies
14.5.1 Classification example using Sonar data
14.5.2 Schizophrenia Neuroimaging Study
14.5.3 ALS regression example
14.5.4 IBS Study
14.5.5 Country QoL Ranking Data
14.5.6 Handwritten Digits Classification
14.6 Classifying Real-World Images using Pre-Trained Tensorflow and Keras Models
14.6.1 Load the Pre-trained Model
14.6.2 Load and Preprocess a New Image
14.6.3 Image Classification
14.6.4 Additional Image Classification Examples
14.7 Data Generation: simulating synthetic data
14.7.1 Fractal shapes
14.7.2 Fake images
14.7.3 Generative Adversarial Networks (GANs)
14.8 Transfer Learning
14.8.1 Text Classification using Deep Network Transfer Learning
14.8.2 Multinomial Transfer Learning classification of Clinical Text
14.8.3 Binary Classification of Film Reviews
14.9 Image classification
14.9.1 Performance Metrics
14.9.2 Torch Deep Convolutional Neural Network (CNN)
14.9.3 Tensorflow Image Pre-processing Pipeline
14.10 Additional References
14.11 Practice Problems
14.11.1 Deep learning Classification
14.11.2 Deep learning Regression
14.11.3 Image classification
14.11.4 (Challenging Problem) Deep Convolutional Networks for 3D Volume Segmentation
15 Summary
16 Electronic Appendix
17 Glossary
18 Index