Columbus Usage with Examples

We now explain the syntax for using Columbus operations to perform exploratory feature selection. The operations are invoked from the R console.

Get the Dataset Names and Handle

An analyst can list the available dataset names and obtain the handle for a dataset as follows:
GetDatasetnames()
[1] "Telecom"
id <- GetDatasetId("Telecom")
GetDatasetnames retrieves the available datasets by querying the database. The analyst can then print the features in the dataset by issuing the following command:
print(GetFeatureNames(GetFeatureIndices(id), dataset.id = id))

Feature Set Operations

In the Columbus system, three feature set operations are supported:
  • AssignFeatureSet: The analyst can create a set of features from the dataset using AssignFeatureSet. Each feature set is specific to a dataset.
  • AddFeatureSet: The analyst can combine two feature sets using AddFeatureSet.
  • DelFeatureSet: The analyst can remove a set of features using DelFeatureSet.
An example usage of the feature set operations is illustrated below:
feat.set.1 <- AssignFeatureSet(c("DATAVOLUME", "NUMMMSOUT", "NUMVASOUT", "NUMSMSVASINC"), dataset.id = id)
feat.set.2 <- AssignFeatureSet(c("DURATIONFIXEDINC", "NUMSMSCMPINC", "NUMSMSINTEROUT"), dataset.id = id)
feat.set.3 <- AddFeatureSet(feat.set.1, feat.set.2, dataset.id = id)
feat.set.4 <- AssignFeatureSet(c("NUMSMSINTEROUT"), dataset.id=id)
feat.set.5 <- DelFeatureSet(feat.set.3, feat.set.4, dataset.id = id)
The data types of the parameters are given below:
  • FeatureSetVector : [Quoted string] or [Integer] Denotes the features in the dataset, given by name. If integer values are given, the indices are internally mapped to feature names.
  • dataset.id : [Integer] Dataset identifier.
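
As a rough illustration of what the example above produces, the feature set operations can be mimicked on plain character vectors in base R. This sketch assumes that AddFeatureSet behaves like a set union and DelFeatureSet like a set difference; this page does not state how duplicates or ordering are handled. The all.features vector, used only to show how integer indices could map to names, is hypothetical.

```r
# Sketch only: plain character vectors stand in for Columbus feature sets.
# Assumption: AddFeatureSet ~ set union, DelFeatureSet ~ set difference.
set.1 <- c("DATAVOLUME", "NUMMMSOUT", "NUMVASOUT", "NUMSMSVASINC")
set.2 <- c("DURATIONFIXEDINC", "NUMSMSCMPINC", "NUMSMSINTEROUT")

set.3 <- union(set.1, set.2)     # analogue of AddFeatureSet
set.4 <- c("NUMSMSINTEROUT")
set.5 <- setdiff(set.3, set.4)   # analogue of DelFeatureSet

# Hypothetical column order, used only to show how integer indices
# could be mapped to feature names.
all.features <- c(set.1, set.2)
by.index <- all.features[c(1, 3)]

print(set.5)
print(by.index)
```

Under these assumptions, set.5 holds every feature from the first two sets except NUMSMSINTEROUT.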

Descriptive Statistic Operation

In the Columbus system, we support the following descriptive statistic operations:
  • CorrelationX: Given a feature set and a dataset id, the function computes the pairwise correlation among the features in the dataset.
  • CorrelationY: The function computes the correlation of each feature with the target. Note that the target is implied by the dataset.id given.
  • CoeffLearner: The function learns the coefficients for the given feature set and dataset. The function is generic, and we currently support two learning methods: Incremental Gradient Descent (IGD) and Conjugate Gradient (CG). The configuration parameters for the learning methods are exposed to the analyst, who can specify appropriate values.
An example usage of the descriptive statistic operations is illustrated below:
corrx.val <- CorrelationX(feat.set.1, dataset.id = id)
corry.val <- CorrelationY(feat.set.2, dataset.id = id)
igd.coef.learn <- CoeffLearner(feat.set.1, type="igd", num.iters = 5, step.size = 0.01, decay = 1, init.wt = 0)
cg.coef.learn <- CoeffLearner(feat.set.2, type="cg", num.iters = 5, init.wt = 0)
The data types of the parameters are given below:
  • type : [Quoted string] Identifies the type of coefficient learner. Allowed string constants: "igd", "cg"
  • num.iters : [Integer] Denotes the number of iterations the coefficient learner should run.
  • step.size : [Float] Denotes the learning rate in IGD.
  • decay : [Float] Denotes the decay value in IGD.
  • init.wt : [Float] Initial weight to be assigned.
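
To make these operations and parameters more concrete, the following base R sketch approximates them on synthetic data. It is not the Columbus implementation. The least-squares model, the IgdSketch function, and the rule that step.size is multiplied by decay after each pass are all assumptions made for illustration; this page does not specify the model or how decay is applied.

```r
# Sketch only: base R on synthetic data, not the Columbus implementation.
set.seed(1)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- as.vector(X %*% c(2, -1, 0.5) + rnorm(n, sd = 0.1))

# Analogue of CorrelationX: pairwise correlation among the features.
corr.x <- cor(X)

# Analogue of CorrelationY: correlation of each feature with the target.
corr.y <- cor(X, y)

# Incremental gradient descent for least squares (the model is an assumption).
# Assumption: the step size is multiplied by `decay` after each pass.
IgdSketch <- function(X, y, num.iters = 5, step.size = 0.01,
                      decay = 1, init.wt = 0) {
  w <- rep(init.wt, ncol(X))
  for (iter in seq_len(num.iters)) {
    for (i in seq_len(nrow(X))) {
      err <- sum(X[i, ] * w) - y[i]       # residual for one example
      w <- w - step.size * err * X[i, ]   # update from that example only
    }
    step.size <- step.size * decay
  }
  w
}

coefs <- IgdSketch(X, y, num.iters = 5, step.size = 0.01,
                   decay = 1, init.wt = 0)
print(round(coefs, 2))
```

With decay = 1 the step size stays constant across passes; in this sketch a value below 1 would shrink it after every pass over the data.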

Evaluate Operation

In the Columbus system, a feature set evaluation involves two phases: train and test. The train phase learns coefficients for the features, and the test phase scores the result using either cross-validation or the Akaike Information Criterion (AIC). An example usage of the evaluate operation is given below:
cv.eval <- Evaluate(feat.set.1, train.type = "igd", eval.type = "cv", num.iters = 5, step.size = 0.01, decay = 1, init.wt = 0, num.folds = 5)
aic.eval <- Evaluate(feat.set.1, train.type = "cg", eval.type = "aic", num.iters = 5, init.wt = 0)
Additional parameters for the evaluate operation include:
  • eval.type : [Quoted string] Denotes the evaluation type to be used. Valid string constants are "cv" and "aic".
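
The two evaluation types can be illustrated with a small base R sketch on synthetic data. This is only an approximation under stated assumptions: a linear model fitted with lm stands in for the Columbus learners, mean squared error is assumed as the cross-validation score, and CvSketch is a hypothetical helper, not a Columbus function.

```r
# Sketch only: base R linear models on synthetic data stand in for the
# Columbus learners; the model Columbus actually fits is not stated here.
set.seed(2)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1.5 * dat$x1 - 0.5 * dat$x2 + rnorm(n, sd = 0.2)

# eval.type = "cv" analogue: k-fold cross-validation, mean squared error.
CvSketch <- function(dat, formula, num.folds = 5) {
  # Randomly assign each row to one of num.folds folds.
  fold <- sample(rep(seq_len(num.folds), length.out = nrow(dat)))
  errs <- sapply(seq_len(num.folds), function(k) {
    fit <- lm(formula, data = dat[fold != k, ])        # train phase
    pred <- predict(fit, newdata = dat[fold == k, ])   # test phase
    mean((dat$y[fold == k] - pred)^2)
  })
  mean(errs)
}

cv.score <- CvSketch(dat, y ~ x1 + x2, num.folds = 5)

# eval.type = "aic" analogue: fit on all rows, then score with AIC.
aic.score <- AIC(lm(y ~ x1 + x2, data = dat))

print(cv.score)
print(aic.score)
```

For both scores in this sketch, lower values indicate a better feature set.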

Explore Operation

In the Columbus system, given a feature set, an explore operation either adds one feature to it or deletes one feature from it, drawing on the available set of features. It involves evaluating a group of candidate feature sets and choosing the best one. An example usage of the explore operation is given below:
add.feat.set <- StepAdd(inp.set = feat.set.1, mask.set = feat.set.2, train.type = "igd", eval.type = "cv", num.iters = 5, step.size = 0.01, decay = 1, init.wt = 0, num.folds = 5)
del.feat.set <- StepDel(inp.set = feat.set.1, train.type = "cg", eval.type = "aic", num.iters = 5, init.wt = 0)
Additional parameters for the explore operation include:
  • mask.set : [Feature set] Denotes the list of features that should be omitted while adding a new feature.
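
The idea behind a single add step can be sketched in base R as follows. StepAddSketch is a hypothetical helper, not the Columbus StepAdd; it assumes a linear model scored by AIC and treats every column other than the target as an available feature.

```r
# Sketch only: a forward step using base R; StepAddSketch is hypothetical
# and is not the Columbus StepAdd implementation.
set.seed(3)
n <- 150
dat <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
dat$y <- dat$a + 2 * dat$b + rnorm(n, sd = 0.3)

StepAddSketch <- function(dat, inp.set, mask.set = character(0),
                          target = "y") {
  # Candidates: every feature not already chosen and not masked out.
  candidates <- setdiff(names(dat), c(inp.set, mask.set, target))
  # Evaluate the input set extended by each candidate in turn.
  scores <- sapply(candidates, function(f) {
    AIC(lm(reformulate(c(inp.set, f), response = target), data = dat))
  })
  # Lower AIC is better; return the input set plus the best candidate.
  c(inp.set, names(which.min(scores)))
}

added <- StepAddSketch(dat, inp.set = "a", mask.set = "c")
print(added)
```

Here the masked feature c is never considered, so only b and d are evaluated as additions to the input set. A delete step would follow the same pattern, evaluating the input set with each of its features removed in turn.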
An analyst can combine the above operations to write a feature selection program. An example feature selection program is given below:

fs1 <- AssignFeatureSet(c("DATAVOLUME", "NUMMMSOUT", "NUMVASOUT", "NUMCALLSFIXEDOUT", "DURATIONFIXEDINC", "NUMSMSINTEROUT", "NUMSMSCMPINC", "NUMSMSVASINC"), dataset.id = id)
fm1 <- CorrelationX(fs1, dataset.id = id) 
fs2 <- AssignFeatureSet(c("NUMCALLSFIXEDOUT"), dataset.id = id) 
fs3 <- DelFeatureSet(fs1, fs2, dataset.id = id) 
fm2 <- CoeffLearner(fs3, "igd", num.iters = 2, dataset.id = id) 
fs4 <- BestK(fm2, 6, dataset.id = id) 
fm6 <- Evaluate(fs4, "cg", "cv", num.iters=2, num.folds=3, dataset.id=id) # change to 5 folds
fs5 <- StepDel(fs4, "cg", "aic", dataset.id = id) 

The analyst can choose to save the program as shown below:
SaveSession("ColumbusProgram1", dataset.id = id)
She can then run the same program in batch mode, on the same or a different dataset, as shown below:
ExecuteProgram("ColumbusProgram1", dataset.id = id)

For more detailed examples, please refer to the Examples Page.