{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## DS-GA 1003, Machine Learning Spring 2021\n", "### Lab 3 : 17-Feb-2021, Wednesday\n", "### Prostate Cancer Analysis with LASSO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this demo, we illustrate the classic technique of **LASSO regularization**. You will learn to:\n", "* Fit a LASSO model using the `sklearn` package\n", "* Determine the regularization level with cross-validation\n", "* Draw the coefficient path as a function of the regularization level" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Background\n", "\n", "We use a classic prostate cancer dataset from the paper:\n", "\n", "> Stamey, Thomas A., et al. \"[Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients](http://www.sciencedirect.com/science/article/pii/S002253471741175X).\" The Journal of urology 141.5 (1989): 1076-1083.\n", "\n", "In the study, **the level of [prostate specific antigen](https://en.wikipedia.org/wiki/Prostate-specific_antigen)** (PSA) was measured in 102 men before they had a prostatectomy. Elevated PSA values are believed to be associated with the presence of prostate cancer and other disorders. To study this hypothesis, various features of the prostate were measured after the prostatectomy. Data analysis is then used to understand the relation between the PSA level and the prostate features. The study is old, and much more is known about PSA today, but the analysis is typical of medical problems and illustrates the basic tools well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The specific analysis presented in this demo is taken from the class text:\n", "\n", "> Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 
[The Elements of Statistical Learning](https://www.amazon.com/exec/obidos/ASIN/0387952845/trevorhastie-20), New York: Springer series in statistics, 2001.\n", "\n", "The text provides an excellent discussion of LASSO and other methods on this dataset.\n", "\n", "Special thanks to [Phil Schniter](http://www2.ece.ohio-state.edu/~schniter/) at Ohio State for pointing out an error in an earlier version of this demo.\n", "\n", "First, we load the usual packages." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the Data\n", "\n", "We begin by downloading the data from the website for the class text, hosted on Trevor Hastie's Stanford page." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Download the tab-separated data file\n", "url = 'https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data'\n", "df = pd.read_csv(url, sep='\\t', header=0)\n", "df = df.drop('Unnamed: 0', axis=1) # drop the redundant column of row indices" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"| | lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45 | lpsa | train |\n",
"|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | -0.579818 | 2.769459 | 50 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.430783 | T |\n",
"| 1 | -0.994252 | 3.319626 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.162519 | T |\n",
"| 2 | -0.510826 | 2.691243 | 74 | -1.386294 | 0 | -1.386294 | 7 | 20 | -0.162519 | T |\n",
"| 3 | -1.203973 | 3.282789 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.162519 | T |\n",
"| 4 | 0.751416 | 3.432373 | 62 | -1.386294 | 0 | -1.386294 | 6 | 0 | 0.371564 | T |\n", "