A Machine Learning Based Approach to Automate Stratigraphic Correlation through Marker Determination

Summary
Stratigraphic correlation is widely recognized as an essential process that provides information on stratigraphy and compartmentalization in a reservoir. It is the starting point for subsurface evaluation, from reservoir characterization to reserves and resources estimation and economic evaluation, and it has long been a focus of both traditional and modern research. Several practices are used for stratigraphic correlation, including direct tracing from outcrop, relating geological markers, and comparing organism characteristics. This work focuses on only one of the traditional work processes, the use of geological markers, and studies the potential adoption of data analytics and machine learning to identify geological markers and connect them to derive a stratigraphic correlation. Well-logging information is the primary data source for interpreting geological markers. Markers have previously been determined from specific well-log characteristics that are rare and uniquely identifiable in the geological area. Finding a particular marker in well-logging information usually takes tremendous effort, especially when the number of wells is large. Computer-assisted technology based on machine learning is therefore a key enabler for accelerating and enhancing the business process. The machine-learning-assisted system has been trained with the geoscientists' marker interpretations. The system consists of two connected machine learning models. The first model, designed as a multi-class classifier, identifies the geological markers from well-logging information. Its predicted markers are then fed as input to the second model, designed as a binary classifier, which analyzes the relationship between markers in the same wellbore. Subsequently, the predicted markers resulting from the two connected models are linked between two or more wells in the same region to create the stratigraphic correlation. To determine the practicality of the approach and its potential adoption from one field to another, this study implements the same model concept on two data sets from two fields in the Gulf of Thailand. The system has proven successful in model development and deployment and has achieved nearly human-level performance.


Introduction
In recent decades, a new asset class, unconventional resources, has been developed to meet increasing energy demand. Several activities have been carried out in the energy industry, including exploring new oil and gas areas, delineating additional reservoirs in existing fields, and drilling new wells to optimize field production. These activities significantly increase the amount of data and the effort required to complete all tasks in limited time. Stratigraphic correlation is an essential process from the exploration phase through asset development. It provides information on stratigraphy and compartmentalization in a reservoir (Howell 1983; Olea and Davis 1986; Waterman and Raymond 1987; Bakke and Griffiths 1989; Fang et al. 1992; Luthi and Bryant 1997).
Stratigraphic correlation is a focus area of both traditional and modern research. Rudman and Lankston (1973) and Mann and Dowell Jr (1978) identified a stratigraphic correlation, also simply called correlation, using the cross-correlation technique. Smith and Waterman (1980), Anderson and Gaby (1983), Howell (1983), Waterman and Raymond (1987), Fang et al. (1992), Edwards et al. (2018), Behdad (2019), and Le et al. (2019) determined the similarity between two well-log sequences using dynamic time warping, also called the dynamic waveform matching technique. Zimmermann et al. (2018), Brazell et al. (2019), Bakdi et al. (2020), Tokpanov et al. (2020), and Parimontonsakul (2021) focused on applying machine learning models to stratigraphic correlation identification. This work aims to address the correlation tasks through both the geoscientists' and the data analytics lens, in step with the business workflow related to stratigraphic correlation.
This work presents the use of data analytics and machine learning models to assist or automate stratigraphic correlation tasks. It focuses only on the traditional work process that uses geological markers and explores the potential adoption of data analytics and machine learning models to enhance that process. The same model concept is applied to two fields in the Gulf of Thailand to determine its practicality and potential adoption from one field to another.
As a result, the models achieve nearly human-level performance, supporting the idea of integrating the data analytics workflow into the business workflow. The results also emphasize the importance of a thorough understanding of the work process: connecting two models in series, mirroring that process, improves the model performance. In addition, the two connected models may not be easily achieved without the ability to adjust or tweak the model setup, which emphasizes the significance of understanding data analytics.

Stratigraphic Correlation
Several kinds of information can be used to identify a stratigraphic correlation. For example, a similarity in fossil content can be interpreted as correlative since it indicates the same organism characteristics, which imply the same age of the rock units. A similarity in unique lithology sequences can also be interpreted as correlative since it identifies a distinct lithology sequence in the geological area. Given these examples, geologists can derive a stratigraphic correlation from geological information by finding similarities or specific characteristics in that information, such as lithology, organisms, and geological period.
Well logging is one of the primary data acquisition processes, providing information on lithology, petroleum reservoirs, and petrophysical properties. This information is sufficient to imply a stratigraphic correlation. One of the traditional interpretations that geoscientists usually start with is a marker. A marker, also called a geological marker or horizon, is defined as a rare and uniquely identified lithology sequence that one can map over a geological area (Neuendorf et al. 2011). Connecting the same marker encountered in several wells can imply a stratigraphic correlation because it connects the unique lithology sequence across the area. However, the correlation interpretation does not guarantee that the correlative reservoirs are always in pressure communication, since several unknown factors, such as faults and unconformities, can contribute to compartmentalization (Parimontonsakul 2021). Because geologists can use well-logging information to interpret a marker, this study proposes that the same process can be assisted by a data analytics process such as machine learning or another computational process. If the geologists' marker interpretation is available, a classification model can be implemented using well-logging information as independent variables, or features, and the labeled marker information as the target values. When the only available information is the well-logging data, a clustering model can be used instead; in that case the machine learning model provides groups of well-logging patterns, also known as pseudo markers (Parimontonsakul 2021). This study focuses only on the classification model, where the geologists' marker interpretation is available. It applies the work process to two fields in the Gulf of Thailand to examine the use case on actual field data. Two areas are selected because they provide evidence that the workflow is applicable to other fields.

Analytical Problem Formulation
As previously mentioned, marker identification is an essential process in stratigraphic correlation. This section elaborates on the analytical problem formulation and its implications for the model. This study focuses on only one analytical formulation, the Convolutional Neural Network, to demonstrate the implementation of the work process on actual field data.
The Convolutional Neural Network (CNN) model has become predominant in image recognition and computer vision research, with several architectures aiming to achieve higher accuracy and more efficient calculations (Krizhevsky et al. 2012; Krizhevsky et al. 2017; Wang et al. 2020; Parimontonsakul 2021). Because of limited computational capability, this study employs one of the simplest and most efficient architectures, MobileNets, to identify the markers. Parimontonsakul (2021) proposes a transformation of well-logging information into an image. The primary concept is that one column vector represents one section of a well-log sequence, while well-log interval and compression factors, which correspond to zooming in and out of an image, are applied to the well-log sequence. This process creates additional column vectors from the same well-log series. That work initially applies eight factors to the same well-log sequence, resulting in eight column vectors of the image, defined as a pad. The same process is applied to the other well-log sequences to create the complete image.
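The pad construction described above can be sketched as follows. This is a minimal illustration, not the implementation of Parimontonsakul (2021): the window length (`base_half`), the image height, the factor values 1 to 8, and the use of linear interpolation as the compression step are all assumptions made for the example.

```python
def resample(values, height):
    """Linearly resample a list of numbers to exactly `height` samples (height >= 2)."""
    if len(values) == 1:
        return [values[0]] * height
    out = []
    for k in range(height):
        pos = k * (len(values) - 1) / (height - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def make_pad(log, centre, factors, base_half=4, height=16):
    """Build one pad: one column per zoom factor, all centred on index `centre`.

    A larger factor takes a longer interval of the log and compresses it to
    the same height, which mimics zooming out (hypothetical parameterisation).
    """
    columns = []
    for f in factors:
        half = base_half * f
        start = max(0, centre - half)
        stop = min(len(log), centre + half + 1)
        columns.append(resample(log[start:stop], height))
    return columns


def make_image(logs, centre, factors=(1, 2, 3, 4, 5, 6, 7, 8), height=16):
    """Concatenate the pads of every well log into one image (list of columns)."""
    image = []
    for log in logs:
        image.extend(make_pad(log, centre, factors, height=height))
    return image


# Four synthetic logs standing in for gamma-ray, resistivity, neutron, density.
logs = [[float((i * (j + 1)) % 17) for i in range(200)] for j in range(4)]
image = make_image(logs, centre=100)
print(len(image), "columns x", len(image[0]), "rows")
```

With four logs and eight factors, the example yields four pads of eight columns each, that is, a 32-column image centred on one depth sample.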
Applying the same concept, this study creates several well-logging images by changing the well-log interval and compression factors, as illustrated in Figure 1, which demonstrates the transformation of a well log into an image. The well-log series from left to right are the gamma-ray log, resistivity log, neutron log, and density log in the sandstone scale. Figure 2 presents the MobileNets architecture. The transformed well-log image is the input layer of the CNN model, while the output layer is the geoscientists' marker interpretations. Since there is more than one marker in the output layer, the problem is set up as a multi-class classification problem. Note that the transformed image consists only of well-log sequences; it carries no additional information, such as well location or well depth, that would be useful in identifying the geological marker. This study introduces two approaches to resolve this concern. The first approach is to create a placeholder column vector in the image as a place to input supplemental information such as well location and well depth. This method is the simplest solution since it does not require any adjustment to the MobileNets architecture. The second approach is to include an additional data input connected directly to the fully connected layer. In this case, a modified MobileNets architecture needs to be constructed to add the extra information to the CNN model. The two approaches are illustrated in Figure 3. In this work, the first approach is applied for simplicity in model creation, but this does not imply that the first approach is the best. As a result, this study proposes the modified MobileNets architecture shown in Table 1.
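A minimal sketch of the first approach, assuming the image is stored as a list of equal-length columns. The function name, the zero fill value, and the choice of normalised well coordinates and depth as supplemental values are hypothetical; the source only states that a placeholder column carries supplemental information such as well location and well depth.

```python
def add_placeholder_column(image, extras, fill=0.0):
    """Return a new image with one extra column holding supplemental values.

    `image` is a list of columns of equal height. `extras` are written to the
    top of the new column and the remaining cells are padded with `fill`.
    """
    height = len(image[0])
    if len(extras) > height:
        raise ValueError("more supplemental values than image rows")
    column = list(extras) + [fill] * (height - len(extras))
    return image + [column]


# Toy image with two columns of height four.
image = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
# Hypothetical supplemental values: normalised x, y and depth of the sample.
augmented = add_placeholder_column(image, extras=[0.25, 0.75, 0.40])
print(augmented[-1])
```

Because the extra values travel inside the image, the network input only grows by one column and the convolutional layers need no structural change, which is the simplicity argument made above.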
This study addresses incorrect marker predictions from the CNN model by creating an additional machine learning model that estimates the probability that a marker prediction is correct given the nearby marker predictions. The intuition is that nearby marker predictions should be similar; if the markers are named alphabetically from shallow to deep, nearby marker predictions should also follow the same alphabetical order. There are several techniques to formulate this kind of model. One approach is to formulate it as a classification problem in which the independent variables are the predicted markers from the CNN model, including the predicted marker depth and mathematical aggregations of the numerical values such as min, mean, mode, and max, and the target value is the correctness of the marker prediction within a given depth threshold.
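The feature and label construction for this second model might be sketched as below. The encoding of markers as ordinal codes (A = 0, B = 1, ...), the neighbourhood size `k`, and the feature names are assumptions for illustration; the source only specifies that predicted markers, their depths, and aggregations such as min, mean, mode, and max are used, with correctness within a depth threshold as the target.

```python
from statistics import mean, mode


def association_features(picks, index, k=2):
    """Features for picks[index] built from up to k neighbours on each side.

    `picks` is a depth-sorted list of (depth, marker) predictions from one
    wellbore and must contain at least two entries.
    """
    depth, marker = picks[index]
    neighbours = picks[max(0, index - k):index] + picks[index + 1:index + 1 + k]
    codes = [ord(m) - ord("A") for _, m in neighbours]
    return {
        "depth": depth,
        "code": ord(marker) - ord("A"),
        "nb_min": min(codes),
        "nb_mean": mean(codes),
        "nb_mode": mode(codes),  # first most common value (Python 3.8+)
        "nb_max": max(codes),
    }


def is_correct(pick, truth, threshold):
    """Binary target: same marker interpreted within `threshold` depth units."""
    depth, marker = pick
    return any(m == marker and abs(d - depth) <= threshold for d, m in truth)


# Made-up predictions: the "D" pick is out of sequence between A and B picks.
picks = [(1000, "A"), (1010, "A"), (1200, "D"), (1250, "B"), (1260, "B"), (1500, "C")]
truth = [(1005, "A"), (1255, "B"), (1495, "C")]
for i, pick in enumerate(picks):
    print(pick, association_features(picks, i), is_correct(pick, truth, threshold=10))
```

In this toy well the out-of-sequence pick has a code well above the maximum of its neighbours, which is the kind of pattern a binary classifier could learn to flag as likely incorrect.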

Metrics
This section discusses the metrics used to validate the model performance. This study presents metrics for the 3-class classification problem, as illustrated in Table 2; however, the same formulas apply to the general multi-class classification problem. The precision of a class is the number of correctly classified samples divided by the number of all samples predicted as that class, while the recall is the number of correctly classified samples divided by the number of all samples that truly belong to that class. The F1-score is another metric commonly applied in classification problems. It is the harmonic mean of precision and recall. The mathematical expressions of precision, recall, and F1-score are shown in Eqs. 1 to 3 (Sokolova and Lapalme 2009). In this work, the model prediction performance for each marker is measured by the F1-score, while the macro average of the F1-scores measures the overall model prediction performance. During the development of the first model, called the marker prediction model, both marker and non-marker data are required as input to the model. The imbalance caused by the significantly different amounts of marker and non-marker data needs to be addressed. This work applies a simple sampling approach that downsamples the non-marker data, which reduces the class imbalance and keeps computational time and memory within an acceptable range.
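A minimal sketch of the per-class precision, recall, and F1-score and of the macro-averaged F1-score described above. It assumes a confusion matrix whose rows are true classes and whose columns are predicted classes; the orientation of Table 2 is an assumption here, and the counts are made up for illustration.

```python
def per_class_scores(cm):
    """Return (precision, recall, f1) per class for a square confusion matrix.

    cm[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(cm)
    scores = []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n))  # column sum
        actual = sum(cm[c])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


def macro_f1(cm):
    """Unweighted mean of the per-class F1-scores."""
    scores = per_class_scores(cm)
    return sum(f1 for _, _, f1 in scores) / len(scores)


# Hypothetical 3-class confusion matrix (not data from this study).
cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 7],
]
for label, (p, r, f1) in zip("ABC", per_class_scores(cm)):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1(cm):.2f}")
```

The macro average weights every marker equally regardless of how many samples it has, so a poorly predicted rare marker lowers the overall score as much as a poorly predicted common one.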

During the marker prediction model deployment, each interval of the well-log sequence is evaluated with the trained machine learning model defined in the previous step. The step size between evaluations is a major concern. If the step size is too small, the number of function evaluations is high, leading to a longer computational time, and vice versa. If the step size is too large, however, it negatively impacts the second process, making it less accurate. This study does not recommend a concrete method that always yields the best step size and suggests further study in this area as needed.
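The step-size trade-off can be illustrated with a small sliding-window scan. This is a sketch under stated assumptions: the window length, the toy `classify` function, and the convention of reporting a pick at the window centre are hypothetical stand-ins for the trained marker prediction model.

```python
def window_starts(n_samples, window, step):
    """Start indices of every window evaluated along one well."""
    return list(range(0, n_samples - window + 1, step))


def scan_well(log, window, step, classify):
    """Evaluate `classify` on each window and collect (centre_index, marker)."""
    picks = []
    for start in window_starts(len(log), window, step):
        label = classify(log[start:start + window])
        if label is not None:
            picks.append((start + window // 2, label))
    return picks


def toy_classifier(section):
    """Stand-in for the trained model: flags a window containing a spike."""
    return "A" if 1 in section else None


log = [0] * 100
log[50] = 1  # a single marker-like feature

for step in (1, 5, 10):
    n_eval = len(window_starts(len(log), window=10, step=step))
    picks = scan_well(log, window=10, step=step, classify=toy_classifier)
    print(f"step={step}: {n_eval} evaluations, {len(picks)} marker predictions")
```

A step of 1 costs 91 evaluations and yields ten overlapping predictions around the feature, while a step of 10 costs 10 evaluations and yields a single prediction, leaving the second model far fewer nearby predictions to work with.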
For the second model, named the marker association model, the development and deployment processes are straightforward because, unlike the marker prediction model, no additional step is required before model deployment.
Two fields in the Gulf of Thailand, both classified as fluvial depositional environments, are evaluated (Table 3). Geoscientists have identified the geological markers in both fields, which serve as the initial data set for the correlation tasks. However, the markers identified in each area are not the same. Since each location may have a different set of marker interpretations and origins, this study develops and deploys a separate model for each area to honor its data distribution. Field A and Field B are trained and validated separately with the same model concept to demonstrate that the same process can be applied in other locations without any adjustment. Less than half of the field data are used for training because of limited computational capacity, while the models are tested on all remaining data.

Discussion
The marker prediction models, the CNN models, yield strong performance. The models trained for fields A and B in the Gulf of Thailand show comparable performance. The F1-score reaches up to 0.95 for most markers, except the markers L, M, N, O, and NA in field A and field B. This result demonstrates that the same model concept can be applied in other fields without further adjustment, emphasizing that only the training data of the target field are required to implement the same model concept in other areas. The loss function evaluation for the training and validation data is shown in Figure 6. Figure 7 demonstrates that applying the marker prediction model alone in the deployment phase, without the marker association model, produces chaotic correlations, as shown by the solid light blue line. Integrating the marker association model is deemed necessary because it significantly reduces the number of incorrect correlation identifications, as shown by the solid blue line. This also indicates that the marker association model remains important as long as the precision of the marker prediction model is not perfect, since it complements the marker prediction model and helps reduce its errors. It further suggests a possible optimization: tuning the marker prediction model for higher recall so that it interprets more markers, then applying the marker association model to remove the incorrectly predicted ones. However, this optimization idea has not been tested yet.
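How the two models combine before wells are linked can be sketched as follows. This is an illustration only: the probability threshold `min_prob`, the rule of keeping the highest-probability pick per marker, and all depths and probabilities are assumptions, not values or rules from this study.

```python
def link_markers(well_a, well_b, min_prob=0.5):
    """Pair identical markers in two wells, keeping only trusted picks.

    Each well is a list of (depth, marker, prob), where `prob` stands for the
    marker association model's estimated probability that the pick is correct.
    Returns a list of (marker, depth_in_a, depth_in_b).
    """
    def trusted(picks):
        best = {}
        for depth, marker, prob in picks:
            if prob < min_prob:
                continue
            if marker not in best or prob > best[marker][1]:
                best[marker] = (depth, prob)
        return best

    a, b = trusted(well_a), trusted(well_b)
    return [(m, a[m][0], b[m][0]) for m in sorted(a) if m in b]


well_a = [(1000, "A", 0.90), (1200, "D", 0.20), (1250, "B", 0.80)]
well_b = [(1020, "A", 0.95), (1100, "D", 0.30), (1290, "B", 0.70), (1400, "B", 0.10)]

# Marker prediction model alone: every predicted marker is connected.
print(link_markers(well_a, well_b, min_prob=0.0))
# With the association filter: low-probability picks are dropped first.
print(link_markers(well_a, well_b, min_prob=0.5))
```

Without the filter, the doubtful "D" picks are connected and cross the A and B correlation lines; with the filter, only the A and B correlations remain, which mirrors the difference between panels (a) and (b) of Figure 7.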

Figure 3-The modified MobileNets architecture alternatives for additional data input (modified from Parimontonsakul 2021). (a) The modified MobileNets architecture with an additional column vector; (b) the modified MobileNets architecture with an additional data input to the fully connected layer.

Figure 4-The training and testing dataset definition.

Figure 6-The loss function evaluation in trained and validated data set in field A.

Figure 7-The stratigraphic correlation results from the models. The solid light blue line represents the results from the marker prediction model, and the solid blue line represents the results from the marker association model. (a) The stratigraphic correlation results from the marker prediction model, showing a few incorrect correlation identifications from connecting identical markers; (b) the stratigraphic correlation results from the marker prediction and marker association models, showing improved correlation identifications obtained by correcting the errors of the marker prediction model.

Table 2-The 3-class classification problem confusion matrix.
Figures 4 and 5 demonstrate the two processes of model development and deployment, one for each model. The first process covers the development and deployment of the first model, which identifies the geological markers, while the second process covers the development and deployment of the second model, which analyzes the relationship between markers in the same wellbore.