A Machine Learning Workflow for Deploying and Predicting Unknown Well Logs Using Kubeflow Pipelines, Bagging and Boosting Methods


Recent and rapid advancements in machine learning have given rise to many applications of ML techniques to real-world problems. The oil and gas industry, with its vast amount of subsurface data, is well positioned to leverage these approaches for significant benefit.

This subsurface data consists of measurements of physical properties gathered while drilling boreholes. These measurements are recorded in well logs: sequential records of properties captured at regular depth increments. Because the raw measurements come from a variety of tools, well logs are naturally noisy and incomplete, and some logs are omitted entirely. This motivates the idea of estimating unknown well logs from the ones that are available.


The goal of this workflow is to train, deploy and use an ML model that estimates an unknown sonic well log.

Data Source

The data is from Equinor's Volve field in the Norwegian North Sea. The Volve dataset is one of the most complete open-source Exploration & Production datasets available and provides detailed insight into the life of the field, from exploration to abandonment.

The Volve Data Village is a collection of 11 folders containing 9 data types. In this workflow, only the logging data is considered, and only logs from wells 15_9-F-11A, 15_9-F-1B and 15_9-F-1A are used. Wells 15_9-F-11A and 15_9-F-1B were used for training, while 15_9-F-1A was held out for prediction.

Log Information

The logs used from these wells are Depth, Gamma Ray (GR), Neutron Porosity (NPHI), Bulk Density (RHOB), Caliper (CALI), Photoelectric absorption factor (PEF), True Resistivity (RT), and Sonic (DT).

well log visualization

Data Cleaning

The wells contain several missing values, so only consistent sections of the wells without NaN values were used for preprocessing and modelling. Borehole size, rate of penetration, and several other logs with little or no correlation to the target were dropped.
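A minimal sketch of this cleaning step with pandas; the column mnemonics ("BS" for borehole size, "ROP" for rate of penetration) are assumptions based on common log naming, not taken from the actual Volve files:

```python
import pandas as pd

def clean_logs(df: pd.DataFrame) -> pd.DataFrame:
    # Drop logs with little or no correlation to the target (DT)
    df = df.drop(columns=["BS", "ROP"], errors="ignore")
    # Keep only rows where every remaining log is present
    return df.dropna()

# Tiny illustrative frame standing in for a loaded LAS section
raw = pd.DataFrame({
    "GR": [45.0, None, 60.2, 58.1],
    "RHOB": [2.31, 2.45, None, 2.40],
    "DT": [88.0, 90.5, 85.2, 87.3],
    "ROP": [20.1, 19.8, 21.0, 20.5],
})
clean = clean_logs(raw)
# ROP is dropped and any row containing a NaN is removed
```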

Correlation Heatmap of train data

Exploratory Data Analysis

From the EDA, we deduced that the wells contain many outliers, that the distribution of most logs is not Gaussian, and that some logs are highly correlated with one another, e.g. NPHI, RHOB and DT.

Pairplot of train data
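The correlations noted above can be checked with a correlation matrix. A sketch on synthetic stand-in data (the NPHI/RHOB/DT relationships are fabricated here to mimic what the EDA showed; in the real workflow `df` would be the loaded train logs, and `seaborn`'s `heatmap`/`pairplot` would visualize the result):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Construct NPHI, RHOB and DT so they are strongly correlated,
# as observed in the EDA; GR is left uncorrelated.
nphi = rng.normal(0.25, 0.05, 500)
df = pd.DataFrame({
    "NPHI": nphi,
    "RHOB": 2.7 - 1.2 * nphi + rng.normal(0, 0.02, 500),
    "DT": 50 + 180 * nphi + rng.normal(0, 3, 500),
    "GR": rng.normal(75, 20, 500),
})

corr = df.corr()
# Strongly correlated pairs stand out, e.g. NPHI vs DT (positive)
# and NPHI vs RHOB (negative)
print(corr.round(2))
```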

Data Preprocessing and Transformation

The logs were normalized and scaled for a better data distribution and outlier reduction. Only the true resistivity was log-transformed initially, because of the logarithmic scale it conventionally takes on well log plots. The normalization technique used was the power transform with the Yeo-Johnson method. It was chosen over techniques like StandardScaler and Normalizer because of its ability to make a distribution more Gaussian-like.

# perform a Yeo-Johnson transform of the train dataset
from sklearn.preprocessing import PowerTransformer

ptrain = PowerTransformer(method='yeo-johnson')
train_df_yj = ptrain.fit_transform(train_df.drop('DT', axis=1))

Outlier Detection and Removal

During the EDA, we deduced that the well logs contain a lot of outliers. Outliers can skew statistical measures and data distributions, providing a misleading representation of the underlying data and relationships. Removing outliers from training data prior to modeling can result in a better fit of the data and, in turn, more skillful predictions.

Thankfully, there are a variety of automatic model-based methods for identifying outliers in input data. The scikit-learn library provides several built-in options: Isolation Forest, Minimum Covariance Determinant (via EllipticEnvelope), Local Outlier Factor, and the One-Class Support Vector Machine.

Of the four methods, the One-Class Support Vector Machine dealt with the outliers best.
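A sketch of the One-Class SVM step on synthetic stand-in data; the `nu` value here is an assumption (it sets the expected outlier fraction), not the value tuned for the Volve logs:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
# Inlier cloud standing in for the scaled training logs,
# with a few obvious outliers injected
X = rng.normal(0, 1, (200, 3))
X[:5] += 8.0

svm = OneClassSVM(nu=0.05)  # nu ~ expected fraction of outliers
labels = svm.fit_predict(X)  # +1 for inliers, -1 for outliers

X_clean = X[labels == 1]
print(len(X), "->", len(X_clean))
```

The same pattern applies to the other three detectors, since `IsolationForest`, `EllipticEnvelope` and `LocalOutlierFactor` all return the same +1/-1 labelling from `fit_predict`.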

box-plot before and after outlier removal

Model Building

Several models were trained on the logs, ranging from bagging techniques to boosting techniques: RandomForestRegressor, CatBoostRegressor and ExtraTreesRegressor. The root mean squared error (RMSE) was selected as the evaluation metric. The resulting RMSE values were 4.79, 4.47 and 4.72 respectively. All the models performed well.

My best-performing model was CatBoost, with an RMSE of 4.47.
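A sketch of the train-and-score loop on synthetic data; the two scikit-learn ensembles are shown directly, and `CatBoostRegressor` (a separate `catboost` package) exposes the same fit/predict interface, so it slots into the same loop. The hyperparameters and data here are placeholders, not those of the actual workflow:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: in the real workflow X would be the
# Yeo-Johnson-transformed logs and y the sonic (DT) log
X, y = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "RandomForest": RandomForestRegressor(random_state=0),
    "ExtraTrees": ExtraTreesRegressor(random_state=0),
    # "CatBoost": CatBoostRegressor(verbose=0),  # same interface
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    scores[name] = rmse
print(scores)
```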

Kubeflow Pipeline

The model was deployed on Kubeflow. You can view the workflow of the modelling process here.

kubeflow pipeline


Below is a visualization of the predicted log compared with the actual one.

Prediction and true values

Discussion and Conclusion

  • To find out more about everything discussed, click here
  • All the models performed well.
  • Given the performance of these models in predicting the unknown sonic log, the oil and gas industry can leverage machine learning techniques to mitigate data quality and integrity issues.



