In the previous post of this series we described some of the basics of linear regression, one of the most well-known models in machine learning. We saw that we can relate the values of input variables [latex]x_i[/latex] to the target variable [latex]y[/latex] to be predicted. In this post we are going to create a linear regression model to predict the price of houses in Boston (based on valuations from the 1970s). The dataset provides information such as the per capita crime rate (CRIM), the proportion of non-retail business acres per town (INDUS), the proportion of owner-occupied units built before 1940 (AGE), the average number of rooms per dwelling (RM) and the median value of owner-occupied homes in $1000s (MEDV), among other attributes.

Let us start by exploring the data. We are going to use Scikit-learn and, fortunately, the dataset comes with the module. The input variables are held in the `data` attribute and the prices in the `target` attribute. We are going to load the input variables into the dataframe `boston_df` and the prices into the array `y`:

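```python
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this snippet needs an older scikit-learn release.
from sklearn import datasets
import pandas as pd

boston = datasets.load_boston()

# Input variables as a dataframe with named columns
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Target: median home value in $1000s
y = boston.target
```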

We are going to build our model using only a limited number of inputs. In this case let us pay attention to the average number of rooms and the crime rate:

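```python
# Keep only the crime rate and the average number of rooms;
# copy so that renaming the columns does not touch boston_df
X = boston_df[['CRIM', 'RM']].copy()
X.columns = ['Crime', 'Rooms']
X.describe()
```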

The description of these two attributes is as follows:

```
            Crime       Rooms
count  506.000000  506.000000
mean     3.593761    6.284634
std      8.596783    0.702617
min      0.006320    3.561000
25%      0.082045    5.885500
50%      0.256510    6.208500
75%      3.647423    6.623500
max     88.976200    8.780000
```

As we can see, the average number of rooms ranges from a minimum of 3.56 to a maximum of 8.78. The crime rate ranges from 0.006 to 88.98, yet its median is only 0.26, so most towns have a low crime rate and a few have extreme values. We will use some of these values to define the ranges offered to our users when they ask for price predictions.
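As a rough sketch of how such ranges might be derived, the snippet below widens the observed minimum and maximum outwards to a round step and uses the median as a default value. The helper `input_range`, the `SUMMARY` dictionary and the step sizes are hypothetical choices made for illustration, not part of the original code.

```python
import math

# Summary statistics copied from the describe() output above.
SUMMARY = {
    "Crime": {"min": 0.006320, "median": 0.256510, "max": 88.976200},
    "Rooms": {"min": 3.561000, "median": 6.208500, "max": 8.780000},
}


def input_range(stats, step):
    """Return (low, high, default) for a user-facing input.

    The bounds are widened outwards to the nearest multiple of `step`,
    so every observed value stays inside the range, and the median is
    used as the default. `step` is an illustrative choice.
    """
    low = math.floor(stats["min"] / step) * step
    high = math.ceil(stats["max"] / step) * step
    return low, high, stats["median"]


# Hypothetical step sizes: 0.5 for the crime rate, 0.25 for rooms.
crime_range = input_range(SUMMARY["Crime"], step=0.5)
rooms_range = input_range(SUMMARY["Rooms"], step=0.25)
```

Using the median rather than the mean as the default seems the safer choice for the crime rate, since the mean is pulled upwards by a handful of very large values.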

Finally, let us visualise the data:
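One way to produce such a plot is sketched below, assuming matplotlib is available. The helper `scatter_against_price` and its axis labels are hypothetical and not the author's plotting code; it simply draws one scatter panel per input against the price.

```python
import matplotlib.pyplot as plt


def scatter_against_price(features, prices,
                          price_label="Median value (thousands of USD)"):
    """Draw one scatter panel per feature, each against the price.

    `features` maps a label to a sequence of values and `prices` is a
    sequence of the same length. Both names are illustrative.
    """
    n = len(features)
    # squeeze=False keeps `axes` two-dimensional even for a single panel
    fig, axes = plt.subplots(1, n, figsize=(5 * n, 4),
                             sharey=True, squeeze=False)
    for ax, (label, values) in zip(axes[0], features.items()):
        ax.scatter(values, prices, alpha=0.5)
        ax.set_xlabel(label)
    axes[0][0].set_ylabel(price_label)
    fig.tight_layout()
    return fig, axes[0]
```

With the dataframe `X` and the array `y` defined earlier, a call might look like `scatter_against_price({col: X[col] for col in X.columns}, y)` followed by `plt.show()`.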

We shall bear these values in mind when building our regression model in subsequent posts.
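For reference, with only these two inputs the linear model from the previous post describes a plane. The coefficient symbols [latex]\beta_0, \beta_1, \beta_2[/latex] and the error term [latex]\varepsilon[/latex] are notation assumed here for illustration:

[latex]y = \beta_0 + \beta_1 x_{\text{Crime}} + \beta_2 x_{\text{Rooms}} + \varepsilon[/latex]

where [latex]y[/latex] is the median home value in $1000s.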

You can look at the code (in development) on my GitHub site here.

*Also published on Medium.*
