Anomaly Detection (One Class SVM) in R with MicrosoftML

Discussion in 'Official Microsoft News' started by Tsuyoshi Matsuzaki, Apr 3, 2017.

Thread Status:
Not open for further replies.
    In my previous post, I described text featurization using MicrosoftML.
    In this post, I give a brief introduction to anomaly detection with MicrosoftML.

    Note: As I mentioned in the previous post, MicrosoftML is currently available on Windows only (not on Linux, including Spark clusters). Sorry, but please wait for the update.

    MicrosoftML provides a one-class support vector machine (OC-SVM) function called rxOneClassSvm, which is used for unbalanced binary classification. This function is an unsupervised learner, i.e., it doesn’t need any anomalous examples in the training phase. (Only normal data is used for training; the data is mapped into a high-dimensional space and separated there by an optimal hyperplane with maximum margin.)
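
    For background, the textbook one-class SVM problem that this description corresponds to can be written as below. This is a sketch of the standard formulation, on the assumption that rxOneClassSvm follows it; the symbols (the feature map \phi, the offset \rho, the slack variables \xi_i, and the parameter \nu in (0, 1]) are notation introduced here, not taken from the MicrosoftML documentation.

```latex
\min_{w,\,\xi,\,\rho}\quad
  \frac{1}{2}\lVert w \rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i
  - \rho
\qquad \text{subject to} \qquad
  \langle w, \phi(x_i) \rangle \ge \rho - \xi_i,
  \quad \xi_i \ge 0,
  \quad i = 1,\dots,n

% A new point x is treated as normal when the decision function is positive:
f(x) = \operatorname{sgn}\bigl(\langle w, \phi(x) \rangle - \rho\bigr)
```

    In this formulation, \nu bounds the fraction of training points that may fall on the wrong side of the hyperplane, which is why the training set should consist of (mostly) normal data.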

    First, here is a brief example of this function.

    library(MicrosoftML)

    # training data: normal data only
    train_count <- 500
    ndivall <- rnorm(train_count)
    ndivnorm <-
      (ndivall - min(ndivall)) / (max(ndivall) - min(ndivall))
    traindata <-
      data.frame(AvailableMemory = round(200 * ndivnorm, digits = 2))
    ndivall <- rnorm(train_count)
    ndivnorm <- (ndivall - min(ndivall)) / (max(ndivall) - min(ndivall))
    traindata$DiskIO <- round(100 * ndivnorm, digits = 2)

    # test data: normal data plus two injected anomalies (rows 3 and 7)
    test_count <- 10
    ndivall <- rnorm(test_count)
    ndivnorm <-
      (ndivall - min(ndivall)) / (max(ndivall) - min(ndivall))
    testdata <-
      data.frame(AvailableMemory = round(200 * ndivnorm, digits = 2))
    ndivall <- rnorm(test_count)
    ndivnorm <- (ndivall - min(ndivall)) / (max(ndivall) - min(ndivall))
    testdata$DiskIO <- round(100 * ndivnorm, digits = 2)
    testdata$AvailableMemory[c(3, 7)] <- c(100, 0)
    testdata$DiskIO[c(3, 7)] <- c(150, 120)

    # train by OC-SVM with normal data
    model <- rxOneClassSvm(
      formula = ~AvailableMemory + DiskIO,
      data = traindata)

    # predict, keeping the input columns next to the scores
    result <- rxPredict(
      model,
      data = testdata,
      extraVarsToWrite = c("AvailableMemory", "DiskIO"))

    As you can see, rows #3 and #7 in the test data are outliers.
    The following plot shows the data, with the normal data as blue dots and the outliers as red dots.

    [​IMG]

    The following is the result, showing how the outliers in rows #3 and #7 are scored.

    [​IMG]
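
    The data generation in the example above repeats the same min-max rescaling four times. A minimal base R sketch of how it could be factored out is shown below; rescale01 and make_metrics are hypothetical helper names introduced here (they are not part of MicrosoftML), and the fixed seed is an assumption added only to make the sketch reproducible.

```r
# Hypothetical helper (not part of MicrosoftML): min-max rescaling to [0, 1].
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical helper: n rows with the same two columns as the example,
# each column rescaled independently (memory to 0..200, disk I/O to 0..100).
make_metrics <- function(n) {
  data.frame(
    AvailableMemory = round(200 * rescale01(rnorm(n)), digits = 2),
    DiskIO          = round(100 * rescale01(rnorm(n)), digits = 2))
}

set.seed(42)  # assumption: fixed seed, only for reproducibility
traindata <- make_metrics(500)
testdata  <- make_metrics(10)

# Inject the same two anomalies as in the example (rows 3 and 7).
testdata$AvailableMemory[c(3, 7)] <- c(100, 0)
testdata$DiskIO[c(3, 7)]          <- c(150, 120)

# The injected DiskIO values lie outside the range seen during training.
print(range(traindata$DiskIO))
print(testdata[c(3, 7), ])
```

    The resulting traindata and testdata have the same shape as in the example, so they could be passed to rxOneClassSvm and rxPredict in the same way.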

    Let’s look at a real scenario.
    Here I use the “Breast Cancer Wisconsin (Diagnostic) Data Set” (see here). The data includes the patient ID, the diagnosis (M = malignant, B = benign), and many attributes computed from a digitized image of a breast mass (radius, texture, perimeter, etc.). Each sample is high-dimensional, with 30 numeric attributes.

    This dataset is already well-formed for analysis, but in a real application you must do some preparatory work, such as selecting appropriate attributes, vectorizing, cleaning the data, and eliminating dependencies between attributes.
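
    As one illustration of "eliminating dependencies", the sketch below drops one attribute from each pair of highly correlated attributes using base R only. drop_correlated and the 0.95 cutoff are assumptions made for this example, and the data is synthetic (it only borrows attribute names; the values are not taken from the data set). The rule is greedy and simple, so treat it as a starting point rather than a recommended procedure.

```r
# Hypothetical helper: drop a column when it is highly correlated
# with any column that appears earlier in the data frame.
drop_correlated <- function(df, cutoff = 0.95) {
  cm <- abs(cor(df))
  # Zero the upper triangle and diagonal so row i only "sees" columns j < i.
  cm[upper.tri(cm, diag = TRUE)] <- 0
  drop <- apply(cm, 1, function(r) any(r > cutoff))
  df[, !drop, drop = FALSE]
}

set.seed(7)
radius <- rnorm(100, mean = 14, sd = 3)
demo <- data.frame(
  radius    = radius,
  # almost a linear function of radius, so nearly redundant
  perimeter = 2 * pi * radius + rnorm(100, sd = 0.1),
  # generated independently of radius
  texture   = rnorm(100, mean = 19, sd = 4))

reduced <- drop_correlated(demo)
print(names(reduced))
```

    Here perimeter is removed because it carries almost the same information as radius, while texture is kept.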

    8510426, B, 13.54, 14.36, 87.46, ...
    8510653, B, 13.08, 15.71, 85.63, ...
    8510824, B, 9.504, 12.44, 60.34, ...

    ...

    Here I train and predict with the following steps.

    1. Split the data into a training set and a test set.
    2. Train a model with rxOneClassSvm on the training data. We use all the attributes except the patient ID and the diagnosis (‘M’ or ‘B’).
    3. Predict with the model on the test data, and evaluate the results. (Here I use the ROCR package.)
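
    The data preparation in steps 1 and 2 can be tried out on a small stand-in data frame before touching the real file. The sketch below uses base R only; the values are synthetic (not real patient records), and the 80/20 positional split is an assumption chosen for illustration.

```r
set.seed(123)  # assumption: fixed seed, only for reproducibility

# Synthetic stand-in with the same kinds of columns (NOT real data).
n <- 20
alldata <- data.frame(
  patientid   = seq_len(n),
  outcome     = sample(c("B", "M"), n, replace = TRUE),
  radius_mean = round(runif(n, min = 7, max = 28), 2),
  stringsAsFactors = FALSE)

# Step 1: positional split, first 80% for training and the rest for testing.
n_train    <- floor(0.8 * n)
train_rows <- alldata[seq_len(n_train), ]
testdata   <- alldata[(n_train + 1):n, ]

# Step 2 (data part): train on benign rows only, and drop the ID and the
# label so that only the attributes remain. drop = FALSE keeps a data frame
# even when a single attribute column is left.
traindata <- train_rows[
  train_rows$outcome == "B",
  !(names(train_rows) %in% c("patientid", "outcome")),
  drop = FALSE]

print(names(traindata))
print(table(testdata$outcome))
```

    The test set keeps the outcome column, because it is needed later to compare the scores with the real diagnosis.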

    The code for this example is as follows:

    library(MicrosoftML)
    library(ROCR)

    # read data
    # (backslashes in the Windows path must be escaped;
    #  the raw file has no header row, so header = FALSE keeps the first record)
    alldata <- read.csv(
      "C:\\tmp\\wdbc.data",
      header = FALSE,
      col.names = c(
        "patientid",
        "outcome",
        "radius_mean",
        "texture_mean",
        "perimeter_mean",
        "area_mean",
        "smoothness_mean",
        "compactness_mean",
        "concavity_mean",
        "concavepoints_mean",
        "symmetry_mean",
        "fractaldimension_mean",
        "radius_error",
        "texture_error",
        "perimeter_error",
        "area_error",
        "smoothness_error",
        "compactness_error",
        "concavity_error",
        "concavepoints_error",
        "symmetry_error",
        "fractaldimension_error",
        "radius_worst",
        "texture_worst",
        "perimeter_worst",
        "area_worst",
        "smoothness_worst",
        "compactness_worst",
        "concavity_worst",
        "concavepoints_worst",
        "symmetry_worst",
        "fractaldimension_worst"))

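    # split data
    # (Note that the training data must contain normal data only)
    traindata <- alldata[1:449, ]
    # keep benign rows only
    traindata <-
      traindata[traindata$outcome == "B", ]
    # drop the ID and the label; only attributes are used for training
    traindata <-
      traindata[, !(names(traindata) %in% c("patientid", "outcome"))]

    # all remaining rows are used for testing
    testdata <- alldata[450:nrow(alldata), ]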

    # train by OC-SVM with normal data
    model <- rxOneClassSvm(
      formula = ~ .,
      data = traindata)

    # predict using the trained model
    result <- rxPredict(
      model,
      data = testdata,
      extraVarsToWrite = c("outcome"))

    # evaluate results (compare with the real diagnostic results) and plot
    pred <- prediction(
      predictions = result$Score,
      labels = result$outcome,
      label.ordering = c('B', 'M'))
    roc.perf <- performance(
      pred,
      measure = "tpr",
      x.measure = "fpr")
    plot(roc.perf)

    The following is the ROC curve plotted by ROCR. (The result seems to be fairly good!)

    [​IMG]
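
    To put a number on "fairly good", the area under the ROC curve (AUC) can be computed from the scores and labels. The base R sketch below uses the rank-sum (Mann-Whitney) identity; auc_rank is a hypothetical helper introduced here, and the toy scores are made up for illustration (they are not output of rxPredict).

```r
# Hypothetical helper: AUC from scores and labels via the rank-sum identity.
# Higher scores are assumed to indicate the positive class; ties get
# average ranks, which counts them as half-correct.
auc_rank <- function(scores, labels, positive) {
  is_pos <- labels == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r <- rank(scores)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores for illustration only.
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c("B", "B", "M", "M")

# 3 of the 4 (malignant, benign) pairs are ordered correctly.
print(auc_rank(scores, labels, positive = "M"))  # 0.75
```

    With the objects from the post, the call would be auc_rank(result$Score, result$outcome, positive = "M"). ROCR can also report this value through performance(pred, measure = "auc").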

    rxOneClassSvm uses the radial basis function (RBF) kernel as the SVM kernel by default. For more complex cases, you can specify another kernel function (linear, polynomial, or sigmoid) with appropriate parameters.

    model <- rxOneClassSvm(
      formula = ~TestAttr1 + TestAttr2,
      kernel = polynomialKernel(a = .2, deg = 2),
      data = traindata)
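
    To make the kernel choice more concrete, the base R sketch below evaluates the textbook RBF and polynomial kernels on two small vectors. rbf_kernel and poly_kernel are hypothetical functions written for this illustration; the formulas are the common textbook forms, and it is an assumption that MicrosoftML's kernels are parameterized the same way, so check its documentation for the exact definitions.

```r
# Textbook RBF kernel: exp(-gamma * ||x - y||^2).
rbf_kernel <- function(x, y, gamma) {
  exp(-gamma * sum((x - y)^2))
}

# Textbook polynomial kernel: (a * <x, y> + bias)^deg.
poly_kernel <- function(x, y, a = 1, bias = 0, deg = 3) {
  (a * sum(x * y) + bias)^deg
}

x <- c(1, 2)
y <- c(3, 0)

# Identical points have RBF similarity 1; it decays with squared distance.
print(rbf_kernel(x, x, gamma = 0.5))
print(rbf_kernel(x, y, gamma = 0.5))

# Same a and deg values as in the rxOneClassSvm call above.
print(poly_kernel(x, y, a = 0.2, bias = 0, deg = 2))
```

    The RBF kernel depends only on the distance between points, which suits compact clusters of normal data, while the polynomial kernel depends on their inner product.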
