Comparative Analysis of Random Forest and Decision Tree for Arsenic Prediction in Groundwater in Begusarai, Bihar
1Department of Civil Engineering, National Institute of Technology Patna, Bihar, India
2Department of Civil Engineering, Bakhtiyarpur College of Engineering, Bihar, India
3Civil Engineering Department, Madan Mohan Malaviya University of Technology, Gorakhpur, Uttar Pradesh, India
Corresponding Author E-mail:nsmaurya@nitp.ac.in
Download this article as:
ABSTRACT:A comprehensive statistical analysis was conducted on groundwater samples from Matihani and Cheria Bariarpur blocks of Begusarai District, Bihar, to assess physicochemical properties and heavy metal concentrations. While pH levels in both blocks remained within acceptable limits (6.5–8.5), arsenic (As) concentrations significantly exceeded the permissible limit (0.01 µg/L), with mean values of 0.129 µg/L in Matihani and 0.1656 µg/L in Cheria Bariarpur. Machine learning models—Decision Tree Regressor (DTR) and Random Forest Regressor (RFR)—were applied to predict arsenic concentrations. RFR outperformed DTR, achieving a lower MSE (0.001867), RMSE (0.043208), and higher NSE and R² values (0.895), demonstrating superior predictive accuracy. The results indicate that arsenic contamination poses a serious public health risk, and Random Forest offers a robust tool for spatial prediction and water quality management. This study highlights the importance of integrating AI-driven modeling with field data to enhance groundwater risk assessment and support informed mitigation strategies.
KEYWORDS:Arsenic; Decision tree; Machine learning; Modelling; Prediction
Introduction
For the Earth to remain green and habitable, water is necessary. Effective management of water resources is essential to every country’s progress. The UN General Assembly acknowledged access to clean drinking and recreational water in 2008.1 Consuming water tainted with bacteria, viruses, and heavy metals may spread a number of illnesses with high death and morbidity rates, including cholera, dysentery, diarrhea, skin cancer, and typhoid. Since these illnesses make up 80% of all diseases that have been discovered so far, they have important.2 In Bihar’s cities and rural areas, groundwater is a vital resource for residential, commercial, and agricultural uses.3 Due to considerable industrial expansion and population increase, the need for water is increasing daily.4, 5 However, one of the biggest environmental problems facing any country today is the supply of fresh water, which is a result of several human activities. One of the main concerns is groundwater contamination.6 Groundwater quality is significantly impacted, particularly in India, by the discharge of untreated sewage, poor solid waste management, widespread use of pesticides and fertilizers in agriculture, overexploitation of groundwater, and changes in land use and land cover.7, 8 The geography of the area, soil characteristics, groundwater and rock interactions, and climate may all have an impact on groundwater quality in addition to pollution.9 In light of the previously described elements, it is now essential to assess the quality of groundwater and carry out corrective. Contamination of drinking water may happen during treatment and transportation processes or at the source as a result of heavy metals and minerals (geogenic). With careful monitoring and targeted water quality management techniques, the latter element may be controlled. However, when naturally occurring heavy metal concentrations above allowable limits, pollution becomes a significant problem. About seventeen of the fifty heavy metals are known to cause cancer, with lead, arsenic, mercury, cadmium, and vanadium being the most hazardous.10 Anthropogenic and natural sources of heavy metals seep into groundwater, causing environmental issues around the globe. One dangerous water contaminant is arsenic (As).11 Because arsenic is a trace element and may be released by geogenic or human activity, it is important to identify and quantify the hazards of consuming it through the skin, inhalation, and ingestion. Arsenic exhibits four oxidation states, which are designated as arsine, arsenic, arsenite, and arsenate. As (III) and As (V) are more commonly found in groundwater and surface water, respectively. Its solubility depends on the pH and ionic environment.12 In general, As (V) is less hazardous than As (III), and organic forms of arsenic are less dangerous than inorganic ones.13 Arsenic oxides are recognized to have the potential to be more hazardous than other arsenic compounds. Even when taken at smaller amounts over longer periods of time, many substances have a higher toxicity index.14 Significant non-carcinogenic hazards, including gangrene, keratosis, hyperpigmentation, black foot disease, and vascular disorders, as well as carcinogenic risks, such cancer, have been linked to long-term chronic exposure to inorganic arsenic.15 The International Agency for Research on Cancer (IARC) has designated arsenic as a class I human carcinogenic metalloid due to the potentially fatal effects of exposure.16 As per IS 10500:2012 permissible limit of arsenic is 0.01 mg/L17 Because of this, arsenic is currently regarded as a dangerous material worldwide, posing a number of immediate and long-term health risks. The incidence of diseases like cancer has increased significantly. A number of things may infect humans with arsenic, including food, beverages, and inadvertently.18 Drinking water is considered a significant exposure route for arsenic when compared to other ingestible channels, such as cutaneous and inhalation. Since arsenic in India’s arsenic-rich groundwater is causing increasing worries about both human health and the environment, it is imperative to evaluate the associated health risks.19
Rapid urbanization and excessive water usage are now occurring in the research locations that were chosen. Understanding the relationships between arsenic levels and other medium characteristics is thus crucial. Since it is highly difficult to acquire and quantify arsenic samples, a variety of machine learning models were created, and their effectiveness in predicting arsenic concentrations based on other input characteristics was evaluated.
Material and Methods
Study Area
One of India’s most fertile and highly inhabited areas is the Gangetic Plains. There are over 83 million people residing in the Middle and partially Upper Ganga Plains of Bihar, which has an area of about 94,163 km². The Mid-Gangetic Plains and the floodplains of the Sone and Gandak rivers, two of its tributaries, were the study’s locations. To achieve this, samples were drawn at random from tube wells close to and distant from the Ganga River’s banks at locations on both banks. Samples were specifically taken from Begusarai, with an emphasis on the blocks of Matihani and Cheria Bariarpur. Fig. 1 below shows the map that describes the research area:
![]() |
Figure 1: Study area of Begusarai (Cheria Bariarpur and Matihani). Click here to View Figure |
Geology and Hydrogeology
The district of Begusarai is located on the Ganga River’s northern bank. This district is traversed by the Burhi Gandak, Balan, Bainty, Baya, and Chandrabhaga rivers. The district is 1,918 square kilometers in size and is home to 2,954,367 people. Rainfall in the area averages 138.4 cm per year. Temperatures may dip to as low as 8° C in the winter, while they can rise as high as 40° C in the summer. The primary source for the people living in the Begusarai area for agricultural and domestic purposes is groundwater. For this research, the blocks of Cheria Bariarpur and Matihani were chosen in order to measure the amounts of arsenic in regions both close to and far from the Ganga River’s banks. While Cheria Bariarpur block is close to the rivers Burhi Gandak, Balan, Bainty, Baya, and Chandrabhaga, Matihani block is situated on the Ganga River’s bank.
Sample collection and analysis
From the tube wells (depth 60-300 feet) in the Begusarai district (Matihani and Cheria Bariarpur Block), a total of 100 samples were taken. (Figure 1). The sample protocol outlined in [20] was adhered to. Prior to sampling, the tube well’s stagnant water was eliminated. Groundwater samples were collected using pre-cleaned polyethylene bottles that had been rinsed with fresh water and demineralized water after being cleaned with a 10% nitric acid solution. Prior to sample collection, the sampling vials were washed with the sample water. EUTECH’s multi-parameter PCSTestr 35 was used on-site to measure the sample’s pH. To determine the presence of heavy metals, samples were taken from each tube well and acidified using concentrated HNO3 to bring the pH down to 2. After being collected, the samples were placed in an ice box and brought to the lab, where they were maintained at 4 °C. Prior to laboratory examination, samples were filtered via a 0.45-µm pore-size membrane. An Agilent 5110 ICP-OES inductively coupled plasma emission spectrometer was used to measure the levels of heavy metals (iron and arsenic).
Model Development
Scaling or Normalization
When the input data has different scales, machine learning algorithms often perform badly.21 Consequently, Eqn. 1 was used to scale the input variables (from 0 to 1):

Thirty percent of the dataset was put aside for testing, while seventy percent was used to train the machine learning models in order to adequately assess the models. The arsenic prediction was tested against the observed values after measured water quality parameters, aside from the quantity of arsenic, were specified as input variables. Python was used to implement the different models. The theoretical ideas and outcomes of the varoius machine learning models are shown and discussed below.
Random forest regressor model (RFR)
Breiman (2001) created the Random Forest Regressor Model (RFR), an ensemble technique that builds many decision trees using random sampling with replacement for making repetitive target variable prediction. The method is presented in Fig. 2 and is explained by [21], utilizes the performance of many decision tree algorithms to predict the level of arsenic. A training subset chosen at random with replacement is used by each decision tree, and this subset is repeated as many times as there are trees in the ensemble. A final forecast is then generated by combining the results of various decision trees.22 Using bootstrap sampling, each tree is constructed using a random subsample of the training dataset; the samples that are excluded from this subsample are known as “Out of Bag” (OOB) samples.23 The RFR model is internally cross-validated using these OOB samples.24 Hyperparameter tweaking was used to improve the prediction of arsenic concentration levels after the first RFR model was developed.
![]() |
Figure 2: Flow chart of the Random Forest regressor algorithm Click here to View Figure |
Decision Tree regressor model (DTR)
Regression and classification issues are addressed using non-parametric machine learning methods known as decision trees (DTs). Unlike black-box algorithms, DTs feature an easy-to-understand decision-making process and are very intuitive [21]. Before rendering decisions based on a range of input variables organized into layers of decision branches, the algorithm begins at a root node and proceeds via internal and terminal nodes. Following the first split of the data into two subsets, the decision tree (DT) uses the same reasoning to divide each subsequent subset recursively. Until the maximum designated depth is achieved or no further splits that minimize the loss function can be identified, this procedure is repeated [25]. Decision trees (DTs) are widely used classifiers and regressors for constructing binary classification and regression, respectively, because of their simplicity and interpretability. Compared to other algorithms, DTs handle numerical and categorical data efficiently and with fewer assumptions.
Results and discussion
Statistical Summary of Water Quality Parameters
A comprehensive statistical analysis of the groundwater quality from two blocks of Begusarai District—Matihani and Cheria Bariarpur—was performed to understand the distribution of key physicochemical parameters and heavy metals. Table 1 presents the water quality statistics for Matihani Block (N = 50). pH of samples were found (6.7 to 7.9), with a mean value of 7.3 and a standard deviation (SD) of 0.3142, which lies well within the acceptable range (6.5–8.5) specified by the Indian Standard (IS:10500, 2012). However, elevated concentrations of arsenic (As) were observed, ranging from 0.0163 µg/L to 0.203 µg/L, with a mean of 0.129 µg/L. This mean concentration significantly exceeds the permissible limit of 0.01 µg/L, suggesting a potential public health risk due to long-term exposure. Iron (Fe) levels in the water ranged from 0.0031 µg/L to 0.398 µg/L, with an average concentration of 0.0412 µg/L, which remains within the acceptable limit of 0.3 µg/L.
Table 1: Statistical description of Matihani Block, Begusarai District, N=50.
| Parameters | Min. | Max. | Mean | S.D. | Acceptable limit IS:10500, 2012 |
| Physicochemical Parameter | |||||
| pH | 6.7 | 7.9 | 7.3 | 0.3142 | 6.5-8.5 |
| Heavy metals (µg/L) | |||||
| As | 0.0163 | 0.203 | 0.1290 | 0.0433 | 0.01 |
| Fe | 0.0031 | 0.398 | 0.0412 | 0.0632 | 0.3 |
Table 2: Statistical description of Cheria Bariarpur block, Begusarai District, N=36.
| Parameters | Min. | Max. | Mean | S.D. | Acceptable limit IS:10500, 2012 |
| Physicochemical Parameters | |||||
| pH | 6.5 | 7.9 | 7.275 | 0.297 | 6.5-8.5 |
| Heavy metals (µg/L) | |||||
| As | 0.0007 | 0.5900 | 0.1656 | 0.1299 | 0.01 |
| Fe | 0.0020 | 0.2080 | 0.0600 | 0.0550 | 0.30 |
Similarly, the groundwater quality data for Cheria Bariarpur Block (N = 36) is summarized in Table 2. The pH values ranged from 6.5 to 7.9, with a mean of 7.275 and a SD of 0.297, also indicating a neutral to slightly basic nature of the water and compliance with the IS standard. Average arsenic concentration was found to be 0.1656 µg/L, with values as high as 0.5900 µg/L, far exceeding the acceptable limit. This block exhibits an even higher mean arsenic level than Matihani, highlighting a more severe contamination issue. The mean concentration of iron was 0.0600 µg/L, varying between 0.0020 µg/L and 0.2080 µg/L, which, like Matihani, remained within permissible limits.
These results indicate that while the general physicochemical characteristics of groundwater (such as pH) are within acceptable bounds, the widespread occurrence of arsenic contamination in both blocks poses a significant environmental and health concern. The spatial variability in arsenic concentrations suggests possible geogenic sources, such as the dissolution of arsenic-bearing minerals, although anthropogenic contributions cannot be ruled out. Such findings underscore the necessity for regular monitoring and the implementation of suitable mitigation strategies to ensure safe drinking water for the affected population.
Evaluation of Machine Learning Model Performance
To predict groundwater quality and associated risk parameters with improved accuracy, multiple machine learning (ML) models were employed, and their performance was evaluated based on standard statistical indices, namely Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Nash–Sutcliffe Efficiency (NSE), and the coefficient of determination (R²). Table 3 summarizes the comparative performance of the Decision Tree Regressor (DTR) and the Random Forest Regressor (RFR).
Table 3: Performance of various ML models
| Models | Performance criteria (Accuracy for testing sets) | |||
| MSE | RMSE | NSE | R2 | |
| DTRM | 0.002356 | 0.048538 | 0.874 | 0.874 |
| RFRM | 0.001867 | 0.043208 | 0.895 | 0.895 |
![]() |
Figure 3: Prediction accuracy plots for RFR Click here to View Figure |
The DTR exhibited a MSE of 0.002356 and an RMSE of 0.048538, with both NSE and R² values standing at 0.874. These metrics indicate that DTRM can capture nonlinear relationships in the dataset reasonably well. However, the RFRM outperformed the DTRM, achieving a lower MSE (0.001867) and RMSE (0.043208), along with higher NSE and R² values of 0.895. The superior performance of RFRM can be attributed to its ensemble-based nature, which reduces overfitting and enhances generalization by aggregating predictions from multiple decision trees.
Overall, the results suggest that Random Forest is a more reliable and robust model for predicting groundwater quality parameters in arsenic-affected regions. Its improved accuracy and predictive efficiency make it a valuable tool for water quality assessment and management. Integrating such data-driven approaches with field-level water testing can significantly improve decision-making processes for public health and resource planning.
![]() |
Figure 4: Prediction accuracy plots for DTR Click here to View Figure |
Conclusions
This study presents a comprehensive assessment of groundwater quality in two blocks of Begusarai District, Bihar—Matihani and Cheria Bariarpur—focusing on key physicochemical parameters and toxic heavy metals. While pH levels in both regions remain within the acceptable range defined by IS: 10500 (2012), arsenic concentrations significantly exceed permissible limits, posing serious health risks. Iron levels were found to be within safe boundaries, although variations were observed across locations. The elevated arsenic contamination suggests a geogenic origin, but the possibility of anthropogenic contributions cannot be dismissed. To enhance groundwater quality prediction, machine learning models were employed and evaluated. Among the tested models, the Random Forest Regression Model (RFRM) demonstrated superior performance with lower error values and higher predictive efficiency compared to the Decision Tree Regression Model (DTRM). These findings highlight the potential of ensemble learning techniques in supporting water quality monitoring and risk assessment frameworks. In conclusion, the combination of statistical analysis and machine learning provides a reliable methodology for identifying contamination hotspots and forecasting groundwater quality. The study emphasizes the urgent need for targeted mitigation strategies and continuous monitoring, especially in arsenic-prone areas, to ensure the provision of safe drinking water and protect public health.
Acknowledgement
We want to thank the lab facility of NIT Patna and also like to mention that no funding was obtained for this study.
Funding Sources
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Conflict of Interest
The author(s) do not have any conflict of interest.
Data Availability Statement
This statement does not apply to this article.
Ethics Statement
This research did not involve human participants, animal subjects, or any material that requires ethical approval.
References
- WHO. 2008. Guidelines for Drinking-Water Quality SECOND ADDENDUM TO THIRD EDITION WHO Library Cataloguing-in-Publication Data. World Health Organization 1: 1–103.http://www.who.int/water_sanitation_health/ dwq/secondaddendum 20081119.pdf.
- WHO. Safe Water Technology for Arsenic Removal. UNICEF: 2001, 1–22.
- Kanth, K.M., Singh, S.K., Kashyap, A., Vijay Kumar Gupta, V.K., Shalini, S., Kumari, S., Kumari, R., and Puja, K. Bacteriological assessment of drinking water supplied inside the Government schools of Patna District, Bihar, India. Am. J. Environ. Prot. 2018, 6(1), 10-13. http://pubs.sciepub.com/env/6/1/2/index.html.
- Sarath Prasanth, S.V., Magesh, N.S., Jitheshlal, K.V., Chandrasekar, N. and Gangadhar, K.J.A.W.S., Evaluation of groundwater quality and its suitability for drinking and agricultural use in the coastal stretch of Alappuzha District, Kerala, India. Appl. Water Sci. 2012, 2, 165-175.
CrossRef - Sappa, G., Ergul, S., Ferranti, F., Sweya, L.N. and Luciani, G. Effects of seasonal change and seawater intrusion on water quality for drinking and irrigation purposes, in coastal aquifers of Dar es Salaam, Tanzania. J. Afr. Earth Sci. 2015, 105, 64-84. http://dx.doi.org/10.1016/j.jafrearsci.2015.02.007.
CrossRef - Mishra, A. K., & Maurya, N. S. Assessment Of Groundwater Quality Of Patna Urban And Sub-Urban Areas For Its Uses As Drinking And Irrigation Water. Environ. Eng. Manag. J., 2024, 23(8).
CrossRef - Stigter, T.Y., Van Ooijen, S.P.J., Post, V.E.A., Appelo, C.A.J. and Dill, A.C. A hydrogeological and hydrochemical explanation of the groundwater composition under irrigated land in a Mediterranean environment, Algarve, Portugal. J. Hydrol., 1998, 208(3-4), 262-279.
CrossRef - Peterson, E.W., Davis, R.K., Brahana, J.V. and Orndorff, H.A., . Movement of nitrate through regolith covered karst terrane, northwest Arkansas. J. Hydrol., 2002, 256(1-2), 35-47.
CrossRef - Reghunath, R., Murthy, T.S. and Raghavan, B.R., 2002. The utility of multivariate statistical techniques in hydrogeochemical studies: an example from Karnataka, India. Water Res., 2002, 36(10), 2437-2442.
CrossRef - Basu, A., Saha, D., Saha, R., Ghosh, T. and Saha, B. A review on sources, toxicity and remediation technologies for removing arsenic from drinking water. Res. Chem. Intermed.,2014, 40, 447-485.
CrossRef - Abeer, N., Khan, S.A., Muhammad, S., Rasool, A. and Ahmad, I. Health risk assessment and provenance of arsenic and heavy metal in drinking water in Islamabad, Pakistan. Environ. Technol. Innov., 2020, 20, 101171. https://doi.org/10.1016/j.eti.2020.101171.
CrossRef - Jain, C.K. and Ali, I. Arsenic: occurrence, toxicity and speciation techniques. Water Res., 2000, 34(17), 4304-4312.
CrossRef - Sharma, V.K. and Sohn, M. Aquatic arsenic: toxicity, speciation, transformations, and remediation. Environ. Int., 2009, 35(4), 743-759. http://dx.doi.org/10.1016/j.envint.2009.01.005.
CrossRef - Choong, T.S., Chuah, T.G., Robiah, Y., Koay, F.G. and Azni, I., . Arsenic toxicity, health hazards and removal techniques from water: an overview. Desalination, 2007, 217(1-3), 139-166.
CrossRef - Uddh-Söderberg, T.E., Gunnarsson, S.J., Hogmalm, K.J., Lindegård, M.B.G. and Augustsson, A.L., . An assessment of health risks associated with arsenic exposure via consumption of homegrown vegetables near contaminated glassworks. Sci. Total Environ., 2015, 536, 189-197.
CrossRef - IARC. Monographs on the Evaluation of Carcinogenic Risks to Humans. IARC Monographs 2004, 84: 41–267.
- IS 10500. 2012. Indian Standard Drinking Water — Specification (Second Revision) ICS 13.060.20.
- Hering, J.G., . Risk assessment for arsenic in drinking water: limits to achievable risk levels. J. Hazard. Mater., 1996, 45(2-3), 175-184.
CrossRef - USEPA. Quantitative Risk Assessment Calculations: Sustainable Futures /P2 Framework Manual. USEPA 2012, 748: 1–11.
- Bhardwaj, V., Singh, D.S. and Singh, A.K. Water quality of the Chhoti Gandak River using principal component analysis, Ganga Plain, India. J. Earth Syst. Sci., 2010, 119, 117-127.
CrossRef - Ibrahim, B., Ewusi, A., Ahenkorah, I. and Ziggah, Y.Y. Modelling of arsenic concentration in multiple water sources: a comparison of different machine learning methods. Groundw. Sustain. Dev., 2022, 17, 100745 https://doi.org/10.1016/j.gsd.2022.100745.
CrossRef - Xiangcao, Z., Su, C., Xianjun, X., Ge, W., Xiao, Z., Yang, L. and Pan, H., . Employing machine learning to predict the occurrence and spatial variability of high fluoride groundwater in intensively irrigated areas. Appl. Geochem., 2024, 167, 106000. https://doi.org/10.1016/j.apgeochem.2024.106000.
CrossRef - Jin, Z., Shang, J., Zhu, Q., Ling, C., Xie, W. and Qiang, B. RFRSF: Employee turnover prediction based on random forests and survival analysis. In Web Information Systems Engineering–WISE 2020: 21st International Conference, Amsterdam, The Netherlands, October 20–24, 2020, Proceedings, Part II 21,503-515. Springer International Publishing.
CrossRef - Chakraborty, M., Sarkar, S., Mukherjee, A., Shamsudduha, M., Ahmed, K.M., Bhattacharya, A. and Mitra, A. Modeling regional-scale groundwater arsenic hazard in the transboundary Ganges River Delta, India and Bangladesh: Infusing physically-based model with machine learning. Sci. Total Environ., 2020, 748, 141107. https://doi.org/10.1016/j.scitotenv.2020.141107.
CrossRef - Singh, S.K., Taylor, R.W., Pradhan, B., Shirzadi, A. and Pham, B.T. Predicting sustainable arsenic mitigation using machine learning techniques. Ecotoxicol. Environ. Saf., 2022, 232, 113271.. https://doi.org/10.1016/j.ecoenv.2022.113271.
CrossRef
Second Review by: Dr. Neetu Singh
Final Approval by: Dr. Tawkir Sheikh












