A Machine Learning Workflow on the Deployment and Prediction of Unknown Logs using Kubeflow Pipeline, Bagging and Boosting Methods.
Recent and rapid advancements in machine learning have given rise to many applications of ML techniques to real-world problems. The oil and gas industry, with its vast amount of subsurface data, is well positioned to leverage these approaches and deliver significant benefit.
This subsurface data consists of measurements of physical properties gathered while drilling boreholes. The measurements are recorded in well logs: sequential records of properties taken at regular depth increments. Well logs capture raw readings from a variety of tools, so they are naturally noisy and incomplete, and some logs are sometimes omitted altogether. This gap motivated the idea of estimating unknown well logs from the ones that are available.
Goal
The goal of this workflow is to train and deploy an ML model that predicts an unknown sonic well log.
Data Source
The data comes from Equinor's Volve field in the Norwegian North Sea. The Volve dataset is the most complete open-source Exploration & Production dataset available and provides detailed insight into the life of the field, from exploration right through to abandonment.
The Volve Data Village is a collection of 11 folders containing 9 data types. In this workflow, only the logging data is considered, and only logs from wells 15_9-F-11A, 15_9-F-1B and 15_9-F-1A are used. Wells 15_9-F-11A and 15_9-F-1B were used for training, while 15_9-F-1A was reserved for prediction.
Log Information
The logs used from these wells are Depth, Gamma Ray (GR), Neutron Porosity (NPHI), Bulk Density (RHOB), Caliper (CALI), Photoelectric absorption factor (PEF), True Resistivity (RT), and Sonic (DT).
Data Cleaning
The wells contain many missing values, so only continuous sections without NaN values were used for preprocessing and modelling. Borehole size, rate of penetration, and several other logs with little or no correlation to the target were dropped.
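Assuming the logs have already been loaded into pandas DataFrames, the cleaning step might look like the minimal sketch below; the DataFrame and column names are illustrative assumptions, not the exact LAS mnemonics.

import pandas as pd

# Curves kept for modelling and curves dropped for low correlation to DT
# (names are assumptions for illustration; actual LAS mnemonics may differ).
KEEP_COLS = ['DEPTH', 'GR', 'NPHI', 'RHOB', 'CALI', 'PEF', 'RT', 'DT']
DROP_COLS = ['BS', 'ROP']  # borehole size, rate of penetration

def clean_well(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the relevant curves and drop rows containing NaN values."""
    df = df.drop(columns=[c for c in DROP_COLS if c in df.columns])
    df = df[[c for c in KEEP_COLS if c in df.columns]]
    return df.dropna()

# train_df = clean_well(pd.concat([well_11a, well_1b]))  # training wells
# test_df = clean_well(well_1a)                          # prediction well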
Exploratory Data Analysis
From the EDA, we deduced that the wells contain a lot of outliers, that the distributions of most logs are not completely Gaussian, and that some logs are highly correlated, e.g. NPHI, RHOB and DT.
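The checks behind these observations can be reproduced with a few lines of pandas and seaborn; seaborn and the train_df name are assumptions here, not part of the original workflow.

import matplotlib.pyplot as plt
import seaborn as sns

# Skewness per curve hints at non-Gaussian distributions and outliers.
print(train_df.skew())

# Pairwise correlation; NPHI, RHOB and DT should stand out as strongly (anti-)correlated.
sns.heatmap(train_df.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation between well logs')
plt.show()

# Box plots make the outliers in each curve visible.
train_df.plot(kind='box', subplots=True, layout=(2, 4), figsize=(14, 6), sharex=False)
plt.tight_layout()
plt.show()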
Data Preprocessing and Transformation
The logs were normalized and scaled for a better data distribution and to reduce the impact of outliers. Only the true resistivity was log-transformed beforehand, because of the logarithmic scale it is plotted on in well log displays. The normalization technique used was the power transform with the Yeo-Johnson method. It was chosen over techniques such as StandardScaler and Normalizer because of its ability to make a distribution more Gaussian-like.
from sklearn.preprocessing import PowerTransformer

# perform a Yeo-Johnson transform of the training dataset (target DT excluded)
ptrain = PowerTransformer(method='yeo-johnson')
train_df_yj = ptrain.fit_transform(train_df.drop('DT', axis=1))
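The transformer fitted on the training wells should also be reused on the prediction well so the scaling stays consistent; test_df is an assumed name for the 15_9-F-1A DataFrame.

# apply the fitted transform to the prediction well
test_df_yj = ptrain.transform(test_df.drop('DT', axis=1))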
Outlier Detection and Removal
During the EDA, we deduced that the well logs contain a lot of outliers. Outliers can skew statistical measures and data distributions, providing a misleading representation of the underlying data and relationships. Removing outliers from training data prior to modeling can result in a better fit of the data and, in turn, more skillful predictions.
Thankfully, there are a variety of automatic, model-based methods for identifying outliers in input data. The scikit-learn library provides several of them: Isolation Forest, Minimum Covariance Determinant (Elliptic Envelope), Local Outlier Factor, and One-Class Support Vector Machine.
Out of the four methods, the One-Class Support Vector Machine handled the outliers best.
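A minimal sketch of how the One-Class SVM can be used to drop outlier rows from the training set is shown below; the nu value and variable names are assumptions, not the exact settings used in this workflow.

from sklearn.svm import OneClassSVM

# Fit the one-class SVM on the transformed training features.
# nu roughly bounds the fraction of points flagged as outliers (value assumed).
ocsvm = OneClassSVM(nu=0.05)
labels = ocsvm.fit_predict(train_df_yj)   # +1 = inlier, -1 = outlier

# Keep only the inlier rows in both the features and the target.
X_train = train_df_yj[labels == 1]
y_train = train_df['DT'].to_numpy()[labels == 1]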
Model Building
Several models, spanning bagging and boosting techniques, were trained on the logs: RandomForestRegressor, CatBoostRegressor and ExtraTreesRegressor. The root mean squared error (RMSE) was used as the evaluation metric, and the resulting RMSE values were 4.79, 4.47 and 4.72 respectively. All the models performed well.
The best performing model was the CatBoost model, with an RMSE of 4.47.
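A sketch of the training and evaluation loop is given below; the hyperparameters, random seeds and the X_test/y_test split from the prediction well are assumptions, not the exact configuration used.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor

models = {
    'RandomForest': RandomForestRegressor(random_state=42),
    'CatBoost': CatBoostRegressor(verbose=0, random_state=42),
    'ExtraTrees': ExtraTreesRegressor(random_state=42),
}

# X_train/y_train come from the outlier-removal step; X_test/y_test from well 15_9-F-1A.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f'{name} RMSE: {rmse:.2f}')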
Kubeflow Pipeline
The model was deployed on Kubeflow. You can view the workflow of the modelling process here.
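For readers unfamiliar with Kubeflow Pipelines, below is a minimal sketch of what such a pipeline definition can look like with the kfp v1 SDK; the component functions, base image and step wiring are illustrative assumptions, not the author's actual pipeline.

import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def preprocess(data_path: str) -> str:
    # placeholder: load, clean and transform the well logs
    return data_path

def train(data_path: str) -> str:
    # placeholder: train the CatBoost model and persist it
    return 'model_path'

# Wrap the Python functions as lightweight pipeline components.
preprocess_op = create_component_from_func(preprocess, base_image='python:3.9')
train_op = create_component_from_func(train, base_image='python:3.9')

@dsl.pipeline(name='sonic-log-prediction',
              description='Preprocess the Volve logs and train a DT model')
def sonic_pipeline(data_path: str = 'path/to/volve-logs'):
    prep = preprocess_op(data_path)
    train_op(prep.output)

if __name__ == '__main__':
    # Compile to a YAML spec that can be uploaded to a Kubeflow Pipelines instance.
    kfp.compiler.Compiler().compile(sonic_pipeline, 'sonic_pipeline.yaml')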
Result
This is a visualization of the predicted sonic log compared with the actual one.
Discussion and Conclusion
- To find out more about everything discussed in this post, click here
- All the models performed well.
- With the performance of these models in predicting the unknown sonic log, the oil and gas industry can leverage machine learning techniques to reduce data quality and integrity issues.