(*
The goal in this exercise is to create a model to predict
how many bicycles will be used in a day, given the
available information.

Data source: UC Irvine Machine Learning dataset, "bike sharing":
https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

We will first explore the data, using the CSV type provider
and FSharp.Charting to visualize it.

Then we will develop a regression model to predict usage,
using Math.NET and a bit of linear algebra, starting simple,
and progressively increasing power and complexity.
*)
(*
Step 1. Opening the data using the CSV Type Provider
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Type Providers make it extremely easy to access data in
various formats: point to a sample, generate a type based
on the sample, and start exploring using IntelliSense.
*)

// TODO: run the following code, step-by-step

#I @"../packages/"
#r @"FSharp.Data.2.2.5/lib/net40/FSharp.Data.dll"

open FSharp.Data

// we create a type based on sample data
type Dataset = CsvProvider<"day.csv">
type Datapoint = Dataset.Row

// we can now read data...
let dataset = Dataset.Load("day.csv")
let data = dataset.Rows

// ... which is statically typed
let firstObservation = data |> Seq.head
printfn "First date: %s" (firstObservation.Dteday.ToShortDateString())

// we can print out the file headers
match dataset.Headers with
| None -> printfn "No headers found"
| Some(headers) ->
    headers
    |> Seq.iter (fun header -> printfn "%s" header)

// ... or print the total number of users for the first 5 days
data
|> Seq.take 5
|> Seq.iter (fun day -> printfn "Total users: %i" day.Cnt)

// TODO: explore the dataset

// what was the day with the most wind?
data
|> Seq.maxBy (fun e -> e.Windspeed)
|> (fun x -> x.Dteday.ToLongDateString())

// what is the average number of riders?
data |> Seq.averageBy (fun e -> float e.Cnt)

// what is the average number of riders
// during holidays? On Sundays?
data |> Seq.filter (fun e -> e.Holiday) |> Seq.averageBy (fun e -> float e.Cnt)
data |> Seq.filter (fun e -> e.Weekday = 0) |> Seq.averageBy (fun e -> float e.Cnt)
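// As an extra exploration step (not part of the original TODOs), a minimal
// sketch that compares all weekdays at once by grouping observations on the
// Weekday column (0 = Sunday, as used above) and averaging Cnt:
data
|> Seq.groupBy (fun e -> e.Weekday)
|> Seq.map (fun (day, obs) -> day, obs |> Seq.averageBy (fun e -> float e.Cnt))
|> Seq.sortBy fst
|> Seq.iter (fun (day, avg) -> printfn "Weekday %i: %.0f riders on average" day avg)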
(*
Step 2. Visually exploring the data using F# Charting
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

It is usually easier to understand a dataset by looking at
it with charts. Let's use F# Charting to visualize our data.
*)

// TODO: run the following code, step-by-step

#load @"FSharp.Charting.0.90.13/FSharp.Charting.fsx"
open FSharp.Charting

// Creating basic charts with F# Charting
Chart.Column [ 1 .. 10 ]
Chart.Bar [ 1 .. 10 ]
Chart.Line [ 1 .. 10 ]
Chart.Point [ (1.0, 1); (2.5, 3); (3.9, 8) ]
Chart.Line [ "Monday", 1; "Tuesday", 2; "Wednesday", 3 ]
Chart.Line [ for x in 1 .. 20 -> x, x * x ]

// Using additional features
Chart.Line [ "Monday", 1; "Tuesday", 2; "Wednesday", 3 ]
|> Chart.WithTitle("Value by day")

// Combining charts
Chart.Combine [
    Chart.Line [ 1 .. 10 ]
    Chart.Line [ 3 .. 12 ]
    ]

// TODO: plot the number of users, Cnt, over time.
// Is the curve flat? Is there a trend?
let users =
    data
    |> Seq.map (fun e -> e.Dteday, e.Cnt)
    |> Chart.Line

// Scatterplots (Chart.Point) are often helpful to see if
// some features / variables are related.
// TODO: plot windspeed against count, temperature against
// count. Which one seems more informative?
let users1 =
    data
    |> Seq.map (fun e -> e.Windspeed, e.Cnt)
    |> Chart.Point

data |> Seq.map (fun e -> e.Weathersit, e.Hum) |> Chart.Point
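// The TODO above also asks for temperature against count; a minimal sketch:
data
|> Seq.map (fun e -> float e.Temp, e.Cnt)
|> Chart.Point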
(*
Step 3. Defining a Regression Model to predict Cnt
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

We want to create a model that, given data for a day, will
return a number, the predicted total number of users for
that day. This is called a Regression Model.
*)

// TODO: run the following code, step-by-step

// Model structure: take a day, predict the number of users
type Model = Datapoint -> float

// Our first model will be a straight line through the data.
// The parameter theta is called the 'coefficient'.
let prediction (theta:float) (day:Datapoint) =
    theta * (float day.Instant)

// compare the results of 3 models, using 3 different
// values for Theta (1.0, 5.0, 20.0), against the 'real'
// value we are trying to predict.
Chart.Combine [
    Chart.Line (data |> Seq.map (fun day -> day.Cnt))
    Chart.Line (data |> Seq.map (fun day -> prediction 1.0 day))
    Chart.Line (data |> Seq.map (fun day -> prediction 5.0 day))
    Chart.Line (data |> Seq.map (fun day -> prediction 20.0 day))
    ]

(*
Step 4. What is a 'good model'?
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

What is a good model? A model where, for each point, the
error (the difference between the correct and the predicted
values) is small.
*)

// TODO: run the following code, step-by-step

// There are many possible ways to measure error; for
// instance, the absolute difference between the 2.
let absError (model:Model) (day:Datapoint) =
    abs (model day - (float day.Cnt))

let meanAbsError (model:Model) (data:Datapoint seq) =
    data |> Seq.averageBy (absError model)

// TODO: for theta = 0.0, 1.0, .. 20.0, compute the
// model error. What is the best value of Theta?
let thetas = [ 0 .. 20 ] |> List.map float

thetas
|> List.map (fun t -> t, meanAbsError (prediction t) data)
|> Chart.Line
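// To read the best Theta off the list rather than off the chart (a sketch):
thetas
|> List.map (fun t -> t, meanAbsError (prediction t) data)
|> List.minBy snd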
(*
Step 5. More complex models
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

So far we have used only one coefficient. We can build more
complex models by adding coefficients, for instance:
Prediction = Theta0 + Theta1 * day.Instant + Theta2 * day.Temp
*)

// TODO: run the following code, step-by-step

let complexPrediction (theta0:float, theta1:float, theta2:float) (day:Datapoint) =
    theta0 + theta1 * (float day.Instant) + theta2 * (float day.Temp)

// now we have 3 coefficients, instead of just one.
// note that complexModel still has a signature of
// Datapoint -> float: it is still a valid model
let complexModel = complexPrediction (500.0, 7.5, 2500.0)
let complexModelError = meanAbsError complexModel data

Chart.Combine [
    Chart.Line (data |> Seq.map (fun day -> day.Cnt))
    Chart.Line (data |> Seq.map (fun day -> complexModel day))
    ]
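// Hand-picking coefficients quickly becomes tedious; as an illustration only
// (the values below are arbitrary), try another set and compare the error:
let complexModel2 = complexPrediction (1000.0, 10.0, 4000.0)
let complexModelError2 = meanAbsError complexModel2 data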
(*
Step 6. Using Math.NET & Algebra to find the best Theta
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

WARNING: MATH ALERT. The following section contains a bit
of algebra. If you don't follow everything, that's OK: we
are just using algebra to create an algorithm that will
automatically find the best model. In Section 7, we will
use our algorithm, showing a practical usage.

Manually searching for the best value of Theta is painful.
Fortunately, using Linear Algebra, we can find the best
solution for Theta in one shot, using what is called the
'normal equation'.
*)

// TODO: run the following code, step-by-step

#r @"MathNet.Numerics.3.8.0/lib/net40/MathNet.Numerics.dll"
#r @"MathNet.Numerics.FSharp.3.8.0/lib/net40/MathNet.Numerics.FSharp.dll"

open MathNet
open MathNet.Numerics.LinearAlgebra
open MathNet.Numerics.LinearAlgebra.Double

type Vec = Vector<float>
type Mat = Matrix<float>

// instead of explicitly defining coefficients
// (theta0, theta1, theta2, ...), we are going to
// transform our model: we use a vector Theta to
// store all these values, and we will transform
// datapoints into another vector, so that our
// prediction becomes simply the product of 2 vectors:
let predict (theta:Vec) (v:Vec) = theta * v

// using the normal equation to estimate a model.
// For more information, see:
// https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)
let estimate (Y:Vec) (X:Mat) =
    (X.Transpose() * X).Inverse() * X.Transpose() * Y
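// Quick sanity check of the normal equation on a tiny hand-made system
// (illustration only): the data below follows Y = 1 + 2 * x exactly,
// so estimate should recover theta close to [1; 2].
let sanityTheta =
    let X = matrix [ [ 1.0; 1.0 ]
                     [ 1.0; 2.0 ]
                     [ 1.0; 3.0 ] ]
    let Y = vector [ 3.0; 5.0; 7.0 ]
    estimate Y X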
// We create a 'Featurizer', a function that will turn
// a Datapoint into a Vector:
type Featurizer = Datapoint -> float list

let predictor (f:Featurizer) (theta:Vec) =
    f >> vector >> (*) theta

let evaluate = meanAbsError

// Given a Featurizer (what data we want to use
// from the Datapoints) and data, this will
// return the 'best' coefficients, as well as a
// function that predicts a value, given a Datapoint:
let model (f:Featurizer) (data:Datapoint seq) =
    let Yt, Xt =
        data
        |> Seq.toList
        |> List.map (fun obs -> float obs.Cnt, f obs)
        |> List.unzip
    let theta = estimate (vector Yt) (matrix Xt)
    let predict = predictor f theta
    theta, predict
(*
Step 7. Using our algorithm to create better models
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Now that we have an algorithm in place, let's use it on our
dataset, to see if we can produce better prediction models!
*)

// illustration: start with no constant term, then add a constant

// we split the dataset into 2 parts: one for training,
// one for testing the quality of our model.
let train = Dataset.Load("train.csv")
let test = Dataset.Load("test.csv")

// for our first model, we will use a constant (1.0), and
// day.Instant as features.
let featurizer0 (obs:Datapoint) =
    [ 1.0
      float obs.Instant ]

let (theta0, model0) = model featurizer0 train.Rows

Chart.Combine [
    Chart.Line [ for obs in data -> float obs.Cnt ]
    Chart.Line [ for obs in data -> model0 obs ] ]

// we can now compare how well the model does,
// both on train and test sets. A good model should
// have similar performance on both.
evaluate model0 train.Rows |> printfn "Training set: %.0f"
evaluate model0 test.Rows |> printfn "Testing set: %.0f"

Chart.Point [ for obs in data -> float obs.Cnt, model0 obs ]

theta0 |> Seq.iteri (fun i x -> printfn "Coeff. %i: %.1f" i x)

// TODO: add temperature to the features
// is this better? worse?
let freq = 2.0 * System.Math.PI / 365.25

let featurizer1 (obs:Datapoint) =
    [ 1.0
      float obs.Instant
      -System.Math.Cos(freq * float obs.Instant)
      float obs.Temp
      float obs.Temp * float obs.Temp
      (if obs.Holiday then 1.0 else 0.0)
      float obs.Hum
      float obs.Windspeed
    ]

let (theta1, model1) = model featurizer1 train.Rows

evaluate model1 test.Rows |> printfn "Testing set: %.0f"
evaluate model1 train.Rows |> printfn "Training set: %.0f"
evaluate model1 data |> printfn "Full dataset: %.0f"

Chart.Point [ for obs in test.Rows -> float obs.Cnt, model1 obs ]

theta1 |> Seq.iteri (fun i x -> printfn "Coeff. %i: %.1f" i x)

// TODO: visualize the result (see the sketch below)
// TODO: try out some others, eg windspeed
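// One way to visualize model1 over time, actual vs. predicted (a sketch):
Chart.Combine [
    Chart.Line [ for obs in data -> float obs.Cnt ]
    Chart.Line [ for obs in data -> model1 obs ] ]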
(*
Step 8. Using discrete features
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

So far we have used numeric values (e.g. temperature, wind)
as model input. Let's see if we can use information from
values that are not numbers, such as "is it a holiday?".
*)

let featurizer2 (obs:Datapoint) =
    [ 1.0
      float obs.Instant
      float obs.Temp
      (if obs.Holiday then 1.0 else 0.0) ]

let (theta2, model2) = model featurizer2 train.Rows

evaluate model2 test.Rows |> printfn "Testing set: %.0f"

Chart.Point [ for obs in data -> float obs.Cnt, model2 obs ]

// Coeff. 3 shows you the impact of holidays:
// on average, we lose 760 users.
theta2 |> Seq.iteri (fun i x -> printfn "Coeff. %i: %.1f" i x)

// TODO: are Mondays bigger days than Saturdays? (see the sketch below)
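// One possible way to approach the Monday vs. Saturday question (a sketch,
// with hypothetical names featurizer2b / model2b): give each of the two
// weekdays its own 0/1 feature (0 = Sunday, as in Step 1) and compare the
// fitted coefficients.
let featurizer2b (obs:Datapoint) =
    [ 1.0
      float obs.Instant
      float obs.Temp
      (if obs.Weekday = 1 then 1.0 else 0.0)   // Monday
      (if obs.Weekday = 6 then 1.0 else 0.0) ] // Saturday

let (theta2b, model2b) = model featurizer2b train.Rows
theta2b |> Seq.iteri (fun i x -> printfn "Coeff. %i: %.1f" i x)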
(*
Bonus section: even more features!
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

At this point, we have pretty much thrown everything we had
into our model - is that all we can do? Well, no: we can
potentially do more, by combining features into new ones.
*)

// If you look at the temperature vs. users chart, it
// looks like very low or very high temperatures result in
// lower usage:
data
|> Seq.map (fun day -> day.Temp, day.Cnt)
|> Chart.Point

// instead of modelling usage = theta0 + theta1 x temp,
// you could do instead
// usage = theta0 + theta1 x temp + theta2 x (temp x temp).

// TODO: include the square of the temperature in the model
let featurizer3 (obs:Datapoint) =
    [ 1.0
      float obs.Instant
      // use your features
      // use temp * temp
    ]

let (theta3, model3) = model featurizer3 train.Rows

// TODO: Is the quality improving the same way on train and test?
// TODO: play with more features
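// One possible completion of featurizer3 (a sketch, with hypothetical names
// featurizer3b / model3b), adding temperature and its square as suggested above:
let featurizer3b (obs:Datapoint) =
    [ 1.0
      float obs.Instant
      float obs.Temp
      float obs.Temp * float obs.Temp ]

let (theta3b, model3b) = model featurizer3b train.Rows
evaluate model3b train.Rows |> printfn "Training set: %.0f"
evaluate model3b test.Rows |> printfn "Testing set: %.0f"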
(*
Bonus section: accelerate computations with MKL
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

One of the beauties of linear algebra is that a lot of work
has gone into making it run fast. Math.NET supports MKL,
Intel's Math Kernel Library, which replaces the managed
linear algebra routines with highly optimized native ones.
To use it, install the corresponding NuGet package,
configure Math.NET to use MKL, and compare speed:

System.Environment.CurrentDirectory <- __SOURCE_DIRECTORY__
open MathNet.Numerics
open MathNet.Numerics.Providers.LinearAlgebra.Mkl
Control.LinearAlgebraProvider <- MklLinearAlgebraProvider()
*)