This file provides a basic, concrete example of how to use the Faith Cluster.

Open your terminal and log in using SSH:

ssh your_username@diufrd200.unifr.ch

Create a workspace (folder) and a venv for your project (while connected via SSH):

  • Create a folder for your project in your home directory: your_username@diufrd200:~$ mkdir faith_demo

  • Move into the directory: your_username@diufrd200:~$ cd faith_demo

  • Create a venv for your project: your_username@diufrd200:~/faith_demo$ python -m venv .venv

Move files and manage your workspace (2 options)

  1. Use Git or another repository to manage your files and pull them from there using your favorite terminal.
  2. Connect Visual Studio Code via SSH to your remote workspace and manage files from there (provides a visual interface and a terminal). See https://code.visualstudio.com/docs/remote/ssh for instructions.
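Either way, for quick one-off transfers you can also copy files directly from your local machine with standard tools such as scp or rsync (run these locally, not on the cluster; the paths below are illustrative):

```shell
# Run on your local machine, not on the cluster.
# Copy a single file into the project directory:
scp faith_demo.py your_username@diufrd200.unifr.ch:~/faith_demo/

# Or sync a whole local folder (rsync skips unchanged files):
rsync -av ./faith_demo/ your_username@diufrd200.unifr.ch:~/faith_demo/
```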

Install necessary packages in your venv

  • Note that using a requirements.txt file helps.
  • Activate the venv: your_username@diufrd200:~/faith_demo$ source .venv/bin/activate
  • Install packages manually with pip, or run pip install -r requirements.txt.

Create a Slurm script

Here is a working example:

#!/bin/bash
#SBATCH --job-name=faith_demo
#SBATCH --output=logs/output.txt
#SBATCH --error=logs/error.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your_mail@unifr.ch

source .venv/bin/activate
python3 faith_demo.py

Request processing time for your task by submitting your Slurm script. Make sure the logs/ directory referenced by the script exists first (mkdir logs); Slurm does not create it and cannot write the output files otherwise.

your_username@diufrd200:~$ sbatch slurm_script.sh
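After submission, sbatch prints a job ID. You can monitor or cancel the job with standard Slurm commands (assuming they are enabled on this cluster; replace <jobid> with the ID printed by sbatch):

```shell
squeue -u your_username    # list your pending and running jobs
scancel <jobid>            # cancel a job
sacct -j <jobid>           # show accounting info once the job has ended
```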

Source files

Three files are required to make this example work. They should all be located within your project directory (faith_demo):

1) Python script (faith_demo.py)
2) Requirements file (requirements.txt)
3) Slurm script (slurm_script.sh)

Python script

"""
Train and evaluate different models, either on the Iris dataset (very easy classification task) or Forest Covert dataset (slightly more complex classification task).
Several models are proposed. A GridSearchCV is also available for the RandomForest dataset, which requires more computation time.
"""

from sklearn.datasets import load_iris
from sklearn.datasets import fetch_covtype

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import neighbors
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


import joblib

def train_dt(X, y):
    print("Training the decision tree model")
    my_model = DecisionTreeClassifier(max_depth=2)
    my_model.fit(X, y)
    return my_model

def train_lr(X, y):
    print("Training the linear model")
    my_model = LogisticRegression()
    my_model.fit(X, y)
    return my_model

def train_knn(X, y):
    print("Training the knn model")
    my_model = neighbors.KNeighborsClassifier()
    my_model.fit(X, y)
    return my_model

def train_rf(X, y):
    print("Training the rf model")
    my_model = RandomForestClassifier()
    my_model.fit(X, y)
    return my_model

def gridSearchCV_rf(X, y):
    print("Training the rf model with GridSearchCV (might take some time ...)")
    # Create the parameter grid based on the results of random search 
    param_grid = {
        'bootstrap': [True],
        'max_depth': [80, 90, 100, 110],
        'max_features': [2, 3],
        'min_samples_leaf': [3, 4, 5],
        'min_samples_split': [8, 10, 12],
        'n_estimators': [100, 200, 300, 1000]
    }
    # Create a base model
    rf = RandomForestClassifier()
    # Instantiate the grid search model
    grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=0)
    # Fit the grid search to the data
    grid_search.fit(X, y)
    print("Best params found:\n")
    print(str(grid_search.best_params_), "\n")
    my_model_gs = grid_search.best_estimator_
    return my_model_gs


def save_model(model, filename):
    print("Saving the model")
    joblib.dump(model, filename)

def load_iris_dataset():
    print("Loading iris dataset\n")
    iris = load_iris()
    return iris

def load_forest_covtype_dataset():
    print("Loading Forest Covertype dataset\n")
    covtype = fetch_covtype()
    return covtype


def main():
    #dset = load_iris_dataset()
    dset = load_forest_covtype_dataset()

    X = dset['data']
    y = dset['target']
    x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.75)

    model_dt = train_dt(x_train, y_train)
    print("Decision tree accuracy is", accuracy_score(y_test, model_dt.predict(x_test)), "\n")

    model_lr = train_lr(x_train, y_train)
    print("Linear regression accuracy is", accuracy_score(y_test, model_lr.predict(x_test)), "\n")

    model_knn = train_knn(x_train, y_train)
    print("KNN accuracy is", accuracy_score(y_test, model_knn.predict(x_test)), "\n")

    model_rf = train_rf(x_train, y_train)
    print("Random Forest accuracy is", accuracy_score(y_test, model_rf.predict(x_test)), "\n")

    #model_rfgs = gridSearchCV_rf(x_train, y_train)
    #print("Random Forest from GridSearch accuracy is", accuracy_score(y_test, model_rfgs.predict(x_test)), "\n")

if __name__ == "__main__":
    main()
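Note that save_model is defined but never called in main(). If you want to persist a trained model, joblib's dump/load round-trip works as sketched below (a minimal, self-contained example that uses a plain dictionary as a stand-in for a fitted estimator; any picklable object behaves the same way):

```python
import joblib

# Stand-in for a fitted estimator; joblib handles any picklable object.
model = {"max_depth": 2, "n_estimators": 100}

joblib.dump(model, "model.joblib")      # what save_model() does internally
restored = joblib.load("model.joblib")  # reload in a later session or job
print(restored == model)  # True
```

In the script above you would call save_model(model_rf, "model_rf.joblib") at the end of main(), then reload it with joblib.load in a later job.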

Requirements file

scikit-learn 
joblib

Slurm script

#!/bin/bash
#SBATCH --job-name=faith_demo
#SBATCH --output=logs/output.txt
#SBATCH --error=logs/error.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your_mail@unifr.ch


source .venv/bin/activate
python3 faith_demo.py