Victor: Logistic Regression

Problem Definition

Logistic regression belongs to a family of statistical models called generalized linear models. It allows one to predict a discrete outcome, such as group membership, from a set of predictor variables that may be continuous, discrete, dichotomous, or a mix of these.
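Concretely, the model scores an example with a linear function of its features and passes that score through the logistic (sigmoid) function to obtain a probability. The sketch below is illustrative only; the function names `sigmoid` and `predict_prob` are not part of Victor.

```python
from math import exp

def sigmoid(z):
    # Standard logistic function; maps any real-valued score into (0, 1).
    return 1.0 / (1.0 + exp(-z))

def predict_prob(w, x):
    # P(y = +1 | x) under a logistic regression model with weight vector w.
    score = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(score)

p = predict_prob([0.5, -1.0], [2.0, 1.0])  # score = 0.0, so p = 0.5
```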

Data set

DBLife is a prototype system that manages information for the database research community (see dblife.cs.wisc.edu). It is developed by the Database Group at the University of Wisconsin-Madison and Yahoo! Research, and contains information about database papers, conferences, people, and more.

Victor Model and Code

The following code shows the model specification and model instantiation for logistic regression applied to the DBLife data set. The model is used to classify papers from the DBLife Web site into categories.

-- This deletes the model specification
DELETE MODEL SPECIFICATION logit_l1_two;

-- This creates the model specification
CREATE MODEL SPECIFICATION logit_l1_two (
   model_type=(python) as (w),
   data_item_type=(int[], float8[], int) as (k, v, label),
   objective=examples.LINEAR_MODELS.loss_functions.logit_loss_ss,
   objective_agg=SUM,
   grad_step=examples.LOGISTIC_REGRESSION.logit_l1_sparse_split.logit_l1_grad
);

-- This instantiates the model
CREATE MODEL INSTANCE paperareami
   EXAMPLES dblife_tfidf_split(k, v, label)
   MODEL SPEC logit_l1_two
   INIT_FUNCTION examples.LOGISTIC_REGRESSION.logit_l1_sparse_split.init
   STOP WHEN examples.LOGISTIC_REGRESSION.logit_l1_sparse_split.stop_condition
;

1. Model Specification

Above we have defined a Python-type model, which means that it is stored as a byte array in the database. Each data item is composed of three values: a vector of feature indexes k, a vector of feature values v, and a label, stored as an integer array, a float array, and an integer, respectively. (Note: the labels must be +1 and -1, not 1 and 0.) We then specify the loss function, declare that per-example losses are aggregated with the SUM aggregator, and finally define the gradient step for the model.
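Because the loss and gradient below assume the -1/+1 label encoding, data sets labeled with 0/1 must be converted before training. The helper below is an illustrative sketch, not part of Victor.

```python
def to_pm_one(label):
    # Convert a 0/1 class label to the -1/+1 encoding the loss functions expect.
    return 1 if label == 1 else -1

labels = [0, 1, 1, 0]
converted = [to_pm_one(l) for l in labels]  # [-1, 1, 1, -1]
```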

In the code section below, you can see the loss and gradient functions that the user provides. Note that each is defined in a few lines of Python using the utilities that Victor provides.

# Performs one stochastic gradient step for L1-regularized logistic
# regression on a sparse example (indexes, vectors) with label y.

def logit_l1_grad(model, (indexes, vectors, y)):

   lm  = model[0]
   # Dot product of the dense weight vector with the sparse example.
   wx  = victor_utils.dot_dss(lm.w, indexes, vectors)
   # sigma(-y * wx): magnitude of the logistic-loss gradient.
   sig = victor_utils.sigma(-wx * y)

   # Gradient step: w += stepsize * y * sig * x, touching only the
   # coordinates listed in indexes.
   victor_utils.scale_and_add_dss(lm.w, indexes, vectors, lm.stepsize * y * sig)
   # L1 shrinkage (soft-thresholding) on the touched coordinates.
   victor_utils.l1_shrink_mask(lm.w, lm.mu * lm.stepsize, indexes)
   model[0].take_step()
   return model

# Calculates the logit loss, log(1 + exp(-y * <w, x>)), for one sparse
# example and returns the value.

from math import exp, log

def logit_loss_ss(model, (index, vecs, y)):

   lm  = model[0]
   wx  = victor_utils.dot_dss(lm.w, index, vecs)
   err = log(1 + exp(-y * wx))
   return err
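To make the sparse helpers concrete, the sketch below reimplements the same loss and gradient step on a plain Python list. The functions `dot_sparse`, `logit_loss`, and `grad_step` are assumptions written for illustration; they mirror what `victor_utils.dot_dss`, `logit_loss_ss`, and `logit_l1_grad` appear to compute, but are not Victor's actual implementations.

```python
from math import exp, log

def dot_sparse(w, indexes, values):
    # Dense analogue of a sparse dot product: dense weight vector w
    # against a sparse example given as parallel (indexes, values) lists.
    return sum(w[i] * v for i, v in zip(indexes, values))

def logit_loss(w, indexes, values, y):
    # log(1 + exp(-y * <w, x>)), as in logit_loss_ss above.
    return log(1.0 + exp(-y * dot_sparse(w, indexes, values)))

def grad_step(w, indexes, values, y, stepsize, mu):
    # One stochastic gradient step with L1 shrinkage: scale-and-add on
    # the touched coordinates, then soft-threshold them by mu * stepsize.
    wx  = dot_sparse(w, indexes, values)
    sig = 1.0 / (1.0 + exp(y * wx))        # sigma(-y * wx)
    for i, v in zip(indexes, values):
        w[i] += stepsize * y * sig * v
    thresh = mu * stepsize
    for i in indexes:
        w[i] = max(w[i] - thresh, 0.0) if w[i] > 0 else min(w[i] + thresh, 0.0)
    return w

w = [0.0, 0.0, 0.0]
before = logit_loss(w, [0, 2], [1.0, 2.0], 1)
grad_step(w, [0, 2], [1.0, 2.0], 1, 0.1, 1e-2)
after = logit_loss(w, [0, 2], [1.0, 2.0], 1)
# One step on the example reduces its loss (after < before).
```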

2. Model Instantiation

To instantiate the model, we specify how to initialize it by naming an initialization function, and we specify when to stop refining it by naming a stop condition. Again, these functions are written in a few lines of Python, as seen below:

def init():
   # A linear model with 41,270 features and L1 regularization mu = 1e-2.
   return (simple_linear.LinearModel(41270, mu=1e-2),)

def stop_condition(s, loss):
   # Stop after 10 refinement passes, counting them in the state dict s.
   if 'state' not in s:
      s['state'] = 0
   s['state'] += 1
   return s['state'] > 10
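The stop condition above ignores its `loss` argument and simply counts passes. A common alternative is to stop once the loss stops improving. The variant below is a sketch under the same `(s, loss)` calling convention; it is an illustration, not a Victor-provided function.

```python
def stop_on_small_improvement(s, loss, tol=1e-4, max_epochs=50):
    # Stop when the loss improves by less than tol between passes,
    # or after max_epochs passes, whichever comes first.
    prev = s.get('prev_loss')
    s['prev_loss'] = loss
    s['epoch'] = s.get('epoch', 0) + 1
    if s['epoch'] >= max_epochs:
        return True
    return prev is not None and abs(prev - loss) < tol
```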

3. Model Application

Coming soon.

Running the Example

Run the following command to execute the logistic regression example.

$ VICTOR_SQL/bin/victor_front.py VICTOR_SQL/examples/LOGISTIC_REGRESSION/logit_l1.spec

The expected output for this example is shown in the installation guide.