Use of Generalized Additive Models to identify risk factors of HIV/AIDS

HIV/AIDS transmission is a major problem in Sub-Saharan African countries, and Ethiopia is no exception. The purpose of this research was to identify social, geographic and demographic risk factors of HIV/AIDS based on HIV test results. The data come from the 2011 Ethiopian Demographic and Health Survey (EDHS) for female respondents. A Generalized Additive Model (GAM) was fitted with the HIV test result as the outcome variable. The results give more insight into the effects of current age of respondents, age at first cohabitation, age at first sex, age at first birth, husband/partner's age, family size and children ever born. The results from the model confirm that the HIV test result is high at early ages. Moreover, respondents who know about HIV/AIDS and STIs have a higher chance of preventing HIV/AIDS. In addition, using condoms and having one sexual partner reduce the risk of HIV/AIDS.


Introduction
HIV/AIDS is a serious problem in Africa. Like other African countries, Ethiopia has experienced a growing HIV epidemic over the years. The first diagnosed HIV infection was confirmed in 1984 [1]. Within the country, AIDS cases were first identified in Addis Ababa in 1986 [2]. The first cases of HIV/AIDS were found among commercial sex workers [3]. After that, the infection spread to the general population. Among the HIV positive cases, the majority were identified among pregnant women and blood donors. Because the epidemic is so widespread, HIV/AIDS is considered a serious problem in the country. According to recent findings of the Joint United Nations Programme on HIV/AIDS, the estimated prevalence of HIV is currently 1.4 percent. This result is based on tests conducted on 5,780 men and 5,300 women aged 15 to 49 [4]. A better understanding of how the HIV/AIDS problem spreads and changes within regions is important for decreasing the risk of HIV/AIDS. A number of researchers have conducted studies to identify the risk factors of HIV/AIDS in Africa. Studies conducted in Sub-Saharan Africa have revealed large differences in the prevalence of HIV by demographic, geographic and economic factors within and between countries [5]. These factors include educational attainment, occupation and exposure to the media [6][7][8]. These factors lead to increased risk of HIV infection [9]. Furthermore, these studies have revealed that having many sexual partners and having casual sexual partners increase the risk of exposure to HIV/AIDS [10][11][12][13]. The studies have also found that having many sexual

RESEARCH ARTICLE
partners increases the speed at which HIV spreads to others [14,15]. Therefore, limiting oneself to one partner can decrease the risk of exposure to HIV infection [16]. Clinical trials conducted in Sub-Saharan Africa have also revealed that male circumcision can decrease the risk of exposure [17,18]. In addition, awareness of HIV prevention methods and accepting attitudes towards persons living with HIV have contributed to reducing the spread of HIV infection [19].
Identifying the socio-economic, demographic and geographic risk indicators related to the prevalence of HIV/AIDS using data from the 2011 Ethiopian Demographic and Health Survey is important. This kind of study is crucial for detecting individuals who need serious intervention. Therefore, the purpose of this work was to identify socio-economic, demographic and geographic risk factors of HIV/AIDS based on the Ethiopian Demographic and Health Survey (EDHS), analysed using semi-parametric statistical methods.

Study design
Ethiopia is located in the eastern part of Africa. The first Ethiopian Demographic and Health Survey (EDHS) was conducted in 2000. The second and third rounds were conducted in 2005 and 2011. The EDHS provides data on demographic and health risk factors to support the country's health and development plans. Its main purpose is to provide information for planning, policy formulation, monitoring and evaluation of population and health programmes. Furthermore, the EDHS provides useful baseline data for monitoring and evaluating the growth and transformation strategies of various sector programmes.

Source of data
This study uses the 2011 EDHS, the third nationwide Demographic and Health Survey. The 2011 EDHS was conducted over a five-month period, from 27 December 2010 through 3 June 2011. In the survey, 17,817 households were selected. The 2007 Population and Housing Census of Ethiopia was used as the sampling frame. A stratified two-stage cluster design was used to select households for the survey. Enumeration areas (EAs) were the first-stage sampling units. In total, 624 EAs or clusters (187 urban and 437 rural) were selected. Data were obtained from each of the eleven geographic regions of Ethiopia.
All data from the survey were weighted to produce estimates at the national level. In the survey, 9,096 women aged 15-49 and 6,033 men aged 15-59 were interviewed. The 2011 EDHS sample therefore provides estimates for all eleven geographical regions [20].
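As an illustration of how sampling weights enter a national-level estimate, the following minimal Python sketch computes a weighted proportion. The records, the weights and the function name `weighted_prevalence` are hypothetical and are not taken from the EDHS data; the sketch also ignores the stratification and clustering that a full design-based analysis would account for.

```python
def weighted_prevalence(outcomes, weights):
    """Return the weighted proportion of positive outcomes.

    outcomes: iterable of 0/1 test results (hypothetical data)
    weights:  iterable of sampling weights, one per respondent
    """
    outcomes = list(outcomes)
    weights = list(weights)
    if len(outcomes) != len(weights):
        raise ValueError("outcomes and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Each respondent contributes in proportion to the sampling weight.
    return sum(w * y for y, w in zip(outcomes, weights)) / total_weight


# Hypothetical respondents: 1 = positive test, 0 = negative test.
results = [0, 0, 1, 0, 1]
sample_weights = [1.5, 0.5, 2.0, 1.0, 1.0]
print(weighted_prevalence(results, sample_weights))
```

With equal weights the function reduces to the ordinary sample proportion.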

Variables in the study
Response variable: The outcome of interest is the HIV test result. To obtain it, a blood sample was collected on a filter paper card from respondents who consented. The blood sample was dried overnight before being sent to a laboratory for HIV testing. The testing procedure was completely anonymous, so it was not possible to identify the respondent.

Independent variables: The socio-economic, demographic and geographic covariates used in the study are region, place of residence, highest educational level, religion, family size, frequency of reading newspaper, listening to radio and watching television, wealth index, total children, respondent's current age, age at first birth, age at first cohabitation, age at first sex, husband's age, current pregnancy, current marital status, number of unions, number of wives, recent sexual activity, ever heard of STIs, ever heard of AIDS, condom use, had any STI in last 12 months, ever been tested for HIV and total lifetime sexual partners.

Statistical methods
Sometimes the relationship between the response variable and some confounding covariates has an unknown functional form. In that case, a model that can incorporate unknown functional forms is necessary, and semi-parametric additive models can be used. Generalized Additive Models (GAM) were proposed by Hastie and Tibshirani in 1986. This model generalizes the additive predictor by incorporating nonlinear smooth components alongside the parametric part [21][22][23]. The semi-parametric model construction has extensive applications in several scientific problems. Sometimes, the use of parametric methods might result in confounding effects for some predictors. The effects of such predictors are important to assess nonparametrically, which requires nonparametric smoothing methods. These smoothing methods are flexible techniques for finding structure and connections within data. In theory, the GAM extends the generalized linear model (GLM) by replacing the linear form $\sum_j \beta_j X_j$ with the additive form $\sum_j f_j(X_j)$. The estimation steps in the GLM can be replaced by nonparametric additive regression steps to find suitable smooth functions $f_j$. Hence, the GAM can be presented as [23]
$$g(\mu_i) = X_i^{*}\beta + \sum_j f_j(x_{ji}) \qquad (1)$$
where $\mu_i \equiv E(Y_i)$ and $Y_i$ has some exponential family distribution, $X_i^{*}$ is the $i$th row of the design matrix for the parametric terms, $\beta$ is the corresponding parameter vector, and $f_j(.)$ are smooth functions of the covariates. Equation (1) reduces to a simple additive model when $g$ is the identity link and the response is normally distributed. The choice of smoothing basis is important for estimating the parameters of a GAM. A smoother is a method for summarizing the trend of a response as a function of independent variables [22].
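To make the structure of the additive predictor concrete, the following Python sketch evaluates a fitted probability under a logit link from a parametric part plus a sum of smooth terms. All coefficients, the stand-in smooth function and the names `inverse_logit` and `gam_probability` are hypothetical; nothing here is estimated from the EDHS data.

```python
import math


def inverse_logit(eta):
    """Map an additive predictor to a probability (inverse of the logit link)."""
    return 1.0 / (1.0 + math.exp(-eta))


def gam_probability(x_parametric, beta, x_smooth, smooth_functions):
    """Evaluate mu = g^{-1}(x* beta + sum_j f_j(x_j)) for a logit link g."""
    parametric_part = sum(x * b for x, b in zip(x_parametric, beta))
    smooth_part = sum(f(x) for f, x in zip(smooth_functions, x_smooth))
    return inverse_logit(parametric_part + smooth_part)


# Hypothetical values, for illustration only.
beta = [-3.0, 0.4]          # intercept and one dummy-coded covariate
x_parametric = [1.0, 1.0]   # 1.0 for the intercept, 1.0 for the dummy

# Stand-in for an estimated smooth function of age (an assumption).
smooth_functions = [lambda age: 0.5 * math.sin(age / 10.0)]
x_smooth = [30.0]

print(gam_probability(x_parametric, beta, x_smooth, smooth_functions))
```

Replacing the smooth term by a single slope times the covariate recovers ordinary logistic regression, which is the sense in which the GAM extends the GLM.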
GAM analysis requires inference on the nonparametric functions $f_j(.)$, estimation of the smoothing parameters $\lambda$, and inference on the variance component $\theta$. In theory, smoothing spline estimators and linear models are closely related [24][25][26]. For given values of $\lambda$ and $\theta$, the natural cubic smoothing spline estimators of $f_j(.)$ maximize the penalized log quasi-likelihood [27]
$$\ell\{y; \beta_0, f_1(.), \ldots, f_p(.), \theta\} - \frac{1}{2}\sum_{j=1}^{p} \lambda_j \int_{s_j}^{t_j} f_j''(x)^2\,dx = \ell\{y; \beta_0, f_1, \ldots, f_p, \theta\} - \frac{1}{2}\sum_{j=1}^{p} \lambda_j f_j^{\top} K_j f_j$$
where $(s_j, t_j)$ defines the range of the $j$th covariate and $\lambda = (\lambda_1, \ldots, \lambda_p)^{\top}$ is a vector of smoothing parameters. In the analysis, $\lambda$ controls the trade-off between goodness of fit and the smoothness of the estimated functions. Moreover, $f_j$ is an $r_j \times 1$ unknown vector of the values of $f_j(.)$ evaluated at the $r_j$ ordered values of the $x_{ji}$ $(i = 1, \ldots, n)$ and $K_j$ is the smoothing matrix [34].
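The trade-off governed by the smoothing parameter can be illustrated with a small Python sketch. It is not the natural cubic spline estimator described above: as a simplifying assumption it uses a Gaussian response on equally spaced points and a discrete second-difference penalty in place of the integrated squared second derivative, with the penalty scaled by `lam` rather than `lam / 2`. The data and all function names are hypothetical.

```python
def second_difference_penalty(n):
    """Return K = D^T D, where D takes second differences of a length-n vector."""
    K = [[0.0] * n for _ in range(n)]
    for r in range(n - 2):
        row = {r: 1.0, r + 1: -2.0, r + 2: 1.0}
        for a, da in row.items():
            for b, db in row.items():
                K[a][b] += da * db
    return K


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def penalized_smooth(y, lam):
    """Minimize sum (y_i - f_i)^2 + lam * f^T K f, i.e. solve (I + lam K) f = y."""
    n = len(y)
    K = second_difference_penalty(n)
    A = [[(1.0 if i == j else 0.0) + lam * K[i][j] for j in range(n)]
         for i in range(n)]
    return solve(A, y)


def roughness(f):
    """Sum of squared second differences (the discrete penalty f^T K f)."""
    return sum((f[i] - 2 * f[i + 1] + f[i + 2]) ** 2 for i in range(len(f) - 2))


def residual_ss(y, f):
    """Residual sum of squares between data and fitted values."""
    return sum((a - b) ** 2 for a, b in zip(y, f))


# Hypothetical noisy observations on an equally spaced grid.
y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0]
for lam in (0.0, 1.0, 100.0):
    fit = penalized_smooth(y, lam)
    print(lam, round(residual_ss(y, fit), 4), round(roughness(fit), 4))
```

A larger `lam` gives a smoother fit at the cost of a larger residual sum of squares, and `lam = 0` reproduces the data exactly.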

Results
In earlier studies, the HIV test result was modelled on different covariates by means of parametric models that assumed linear effects of age, family size, age at first sex, etc. [11,29-31]. However, some factors may have an unknown relationship with the response variable. Therefore, the purpose of this work was to model the influence of current age of respondents, age at first cohabitation, age at first sex, age at first birth, husband/partner's age, family size and total children ever born nonparametrically, keeping the other covariates parametric, using a GAM. These covariates are continuous and might have non-linear relationships with the HIV test result, so fitting them nonparametrically is crucial. The final GAM consists of different socio-economic, demographic and geographic covariates: region, place of residence, highest educational level, religion, family size, frequency of reading newspaper, listening to radio and watching television, wealth index, total children, respondent's current age, age at first birth, age at first cohabitation, age at first sex, husband's age, current pregnancy, current marital status, number of unions, number of wives, recent sexual activity, ever heard of STIs, ever heard of AIDS, condom use, having an STI in the last 12 months, tested for HIV and lifetime sexual partners. A semi-parametric logistic regression model for the HIV test result was therefore fitted with all covariates, including potential interaction effects. Unlike the past studies [11,29-31], this model treats the continuous covariates listed above as smooth terms. Table 1 gives the significant parametric factors of the model. The results show that region, type of place of residence, educational level, frequency of watching television, wealth index, current pregnancy, marital status, total lifetime sex partners, ever having heard of sexually transmitted infections and ever having been tested for HIV have significant main effects on the HIV test result. Among these main effects, region, current pregnancy, wealth index, type of place of residence, frequency of watching TV, highest educational level and use of condoms were involved in the joint effects. These joint effects are region and current pregnancy, wealth index and current pregnancy, type of place of residence and frequency of watching TV, and highest educational level and use of condoms (Table 2).
The results from the Generalized Additive Model (GAM) analysis show that the odds of a positive HIV test result for individuals who have never been in union were 0.357 ($e^{-1.031}$; C.I. (0.096, 0.986)) times the odds for those currently not in union. In contrast, the odds of a positive HIV test result for respondents who are married or living with a partner were 1.301 (C.I. (1.083, 1.722)) times the odds for respondents who are currently not in union. Also, the odds of a positive HIV test result for respondents who have one lifetime sex partner were 0.721 (C.I. (0.476, 0.927)) times the odds for persons who have more than one lifetime sex partner. Similarly, the odds of a positive HIV test result for respondents who have never heard of sexually transmitted infections (STIs) were 5.789 times the odds for respondents who have heard about STIs. On the other hand, the odds for respondents who have heard about HIV/AIDS were 0.151 (C.I. (0.049, 0.656)) times the odds for persons who have not heard about AIDS. Also, the odds for respondents who have not been tested for HIV were 1.513 (C.I. (1.036, 2.211)) times the odds for respondents who have been tested for HIV.
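The odds ratios above are obtained by exponentiating the logit-scale coefficients. The following short Python sketch shows the conversion for the one coefficient quoted in the text (-1.031); the function names are illustrative only.

```python
import math


def odds_ratio(log_odds_coefficient):
    """Convert a logit-scale coefficient into an odds ratio."""
    return math.exp(log_odds_coefficient)


def percent_change_in_odds(ratio):
    """Express an odds ratio as a percentage change relative to the reference group."""
    return (ratio - 1.0) * 100.0


never_in_union = odds_ratio(-1.031)   # coefficient reported in the text
print(round(never_in_union, 3))        # 0.357
print(round(percent_change_in_odds(never_in_union), 1))
```

An odds ratio below 1 therefore corresponds to lower odds than in the reference group, and one above 1 to higher odds.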
In addition to the main parametric factors, the fitted GAM has four two-way joint (interaction) effects. These are region and current pregnancy, wealth index and current pregnancy, type of place of residence and frequency of watching TV, and highest educational level and use of condoms (Table 2).
The interaction effect between region and current pregnancy is given in Figure 1. The figure shows that the positive HIV test result was significantly higher for respondents who are not pregnant in all regions. The HIV positive result is higher in the Somali and Benishangul-Gumuz regions for both pregnant and non-pregnant women. For the Afar region, however, the HIV test result is lower than in other regions for all respondents (Figure 1).
The other significant two-way joint effect was between wealth index and current pregnancy (Table 2). The positive HIV test result was significantly higher for respondents who are not pregnant. Among currently pregnant women, respondents from richer households have the highest rate of positive HIV tests, followed by poor and middle income households (Figure 2). Among women who are not pregnant, respondents from poor households have the highest risk of HIV, followed by rich and middle income households. Figure 3 gives the joint effect between type of place of residence and frequency of watching TV. As can be seen from the figure, among respondents who did not watch TV at all, the occurrence of HIV was significantly higher for rural than for urban respondents. For rural respondents who watched TV, however, the positive HIV result is much lower.
The interaction effect between highest educational level and use of condoms is given in Figure 4. The figure shows that the odds of a positive HIV test for respondents who have not used condoms are higher in all education groups. For respondents who have used condoms, the risk of having HIV is similar across education groups.
Besides the parametric effects, some effects were modelled nonparametrically in the GAM. Current age of respondents, age at first cohabitation, age at first sex, age at first birth, husband/partner's age, family size and children ever born were examined as smooth terms. The findings in Table 3 show that these smooth terms have significant effects on the HIV test result. The estimated smoothing components for the HIV test result with A) family size, B) children ever born, C) current age of respondents, D) age at first birth, E) age at first cohabitation, F) age at first sex and G) husband/partner's age are given in Figure 5. In each panel, the smooth line is the estimated trend from the model.
Figure 5(C) shows the estimated smooth function of current age of respondents, $\hat{f}(\text{age})$, and its 95% confidence limits. The y-axis denotes the effect of the current age of respondents. The figure suggests that the HIV test result is lower in the first age group, increases for a few years and then decreases up to age 35. After that it increases up to age 40 and decreases steadily afterwards. The test statistic was 6386 with 3 df, giving strong evidence (p-value < 0.0001) against the hypothesis that current age of respondents is linearly related to the HIV test result (Table 3).
Figure 5(A) gives the estimated smooth function for family size. The small p-value indicates a nonlinear association. The HIV test result is higher for households of up to 6 members and decreases up to 14 members. After that, the HIV positive test result increases with family size. In addition, Table 3 gives the significant effect of total children ever born on the HIV test result. For family size (Figure 5(A)), the estimated smooth function illustrates an increasing nonlinear association, and the F-value of 226.11 with p-value < 0.0001 suggests that family size is not linearly related to the HIV test result.
The other significant results were for age at first birth, age at first cohabitation, age at first sex and husband/partner's age, presented in Figure 5. These panels suggest nonlinear relationships with the HIV test result. Figure 5(D) shows that the HIV test result is higher for respondents who gave birth at an earlier age. After age 20, however, the risk of being positive for HIV decreases. For respondents

Discussion
HIV is mainly related to poor socio-economic conditions. In many sub-Saharan African countries, HIV disproportionately affects poor people with limited access to health care. Poverty can be expressed in terms of socio-economic factors. To understand the risk of HIV, it is important to study the relationship between HIV risk and socio-economic and demographic factors. Identifying the factors that influence HIV can help policy makers tackle the problem.
In this study, emphasis was given to statistical methods for analysing data from a complex survey design with a binary outcome. The study used the 2011 Ethiopian Demographic and Health Survey (EDHS) data, and the outcome variable was the HIV test result. The explanatory variables were classified as socio-economic, demographic and geographic determinants of HIV. Because females are more at risk than males, the analysis was done for female respondents. HIV testing was successfully conducted for eligible women.
In the sampling procedure used to select households for the 2011 EDHS, the sample was designed to represent all regions (administrative classifications) of Ethiopia. All eleven regions were included in the sample selection. The sampling frame was taken from the 2007 Ethiopian Population and Housing Census.
For the analysis, Generalized Additive Models (GAM) were used. GAMs have been applied in different studies by different researchers [27,32,33], and many sophisticated applications have been developed. These methods are important for discovering hidden structure in the data and help to reduce the modelling bias of parametric methods. Because parametric models have some restrictions, there has been strong demand in recent years for nonparametric regression methods. With these methods it is possible to uncover hidden structure by using flexible functional forms, which helps to detect possibly complicated associations between response and predictor variables. These data-analytic approaches are also referred to as nonparametric procedures. The basic principle behind the nonparametric method is determining the most appropriate form of the function [34]. Therefore, in this work, the influence of socio-economic, demographic and geographic variables was investigated using a GAM with nonparametric terms for family size, children ever born, current age of respondents, age at first birth, age at first cohabitation, age at first sex and husband/partner's age. Additionally, the other effects were included parametrically in the model. These effects were region, place of residence, highest educational level, religion, family size, frequency of reading newspaper, listening to radio and watching television, wealth index, current pregnancy, current marital status, number of unions, number of wives, recent sexual activity, ever heard of STIs, ever heard of AIDS, condom use, had any STI in last 12 months, tested for HIV and lifetime sexual partners, with possible two-way interaction effects. The interaction effects were region and current pregnancy, wealth index and current pregnancy, type of place of residence and frequency of watching TV, and highest educational level and use of condoms.
The results from this study suggest that respondents who have never been in union, who have one lifetime sex partner, who have heard of sexually transmitted infections, who have heard about HIV/AIDS and who are single have a lower chance of being positive for HIV. Based on the interaction effects, the HIV test result is higher for respondents who are not pregnant in all regions except the Afar region. Among currently pregnant women, respondents from richer households have the highest rate of positive HIV tests, followed by poor and middle income households. Moreover, the findings give more insight into the effects of family size, children ever born, current age of respondents, age at first birth, age at first cohabitation, age at first sex and husband/partner's age. The results from the nonparametric part of the model indicate that the HIV test result is lower in the first age group and increases up to age 40. The HIV test result is higher for households with fewer members. Besides this, children ever born, age at first birth, age at first cohabitation, age at first sex and husband/partner's age have non-linear effects.
Finally, the government of Ethiopia has adopted many policies to control HIV/AIDS. Early diagnosis and prompt treatment are among the strategies adopted by the government to reduce the problem. The results from this study showed that HIV/AIDS is related to socio-economic, demographic and geographic factors. Based on the results, respondents who know about HIV/AIDS and STIs have a higher chance of preventing HIV/AIDS. In addition, using condoms and having one sexual partner reduce the risk of HIV/AIDS.

Figure 5. Estimated smoothing components for HIV test result with A) Family size, B) Children ever born, C) Current age of respondents, D) Age at 1st birth, E) Age at first cohabitation, F) Age at 1st sex and G) Husband/partner's age.

Total life time sex partners (Ref. Three and more
where $g(.)$ is the logit link, $\beta$ are parametric coefficients, $f_j$ are smooth functions, region is region, pres is place of residence, educ is highest educational level, religion is religion, famsize is family size, newsp is frequency of reading newspaper, radio is listening to radio, TV is watching television, wealth is wealth index, pregnancy is current pregnancy, marital is current marital status, unions is number of unions, wives is number of wives, sexact is recent sexual activity, SDI is ever heard of STIs, AIDS is ever heard of AIDS, condom is condom use, SDI_I is STI in last 12 months, HIV_T is tested for HIV, partner is lifetime sexual partners, cage is current age of respondents, age_C is age at first cohabitation, age_sex is age at first sex, age_b is age at first birth, age_h is husband/partner's age, fam_s is family size and TCE is total children ever born. For the analysis, PROC GAM in SAS was used.
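The displayed equation that these definitions refer to is not present in the text. A sketch of a model consistent with them could be written as below, assuming dummy-coded main effects for each parametric covariate, the four two-way interactions listed in the Results, and one smooth term per continuous covariate. The symbols $z_{ki}$, $\gamma_{kl}$ and $\mathcal{I}$ are notation introduced here for illustration and do not come from the original.

```latex
\begin{aligned}
g(\mu_i) ={}& \beta_0 + \sum_{k} \beta_k z_{ki}
             + \sum_{(k,l) \in \mathcal{I}} \gamma_{kl}\, z_{ki} z_{li} \\
            & + f_1(\text{cage}_i) + f_2(\text{age\_C}_i) + f_3(\text{age\_sex}_i)
              + f_4(\text{age\_b}_i) \\
            & + f_5(\text{age\_h}_i) + f_6(\text{fam\_s}_i) + f_7(\text{TCE}_i)
\end{aligned}
```

Here $z_{ki}$ stands for the dummy-coded parametric covariates (region, pres, educ, religion, famsize, newsp, radio, TV, wealth, pregnancy, marital, unions, wives, sexact, SDI, AIDS, condom, SDI_I, HIV_T, partner) and $\mathcal{I}$ for the set of interacting pairs: region and pregnancy, wealth and pregnancy, pres and TV, and educ and condom.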

Type of place of residence and Frequency of watching TV (Ref. Urban At least once)
The findings in Table 3 gave the significant effects of current age of respondents, age at first cohabitation, age at first sex, age at first birth, husband/partner's age, family size and children ever born on the HIV test result. The smooth terms for the nonparametric effects are given in Figure 5. The figure proposes that current age of respondents, age at first cohabitation, age at 1st sex, age at 1st birth,