MATCHING RECORDS IN 2018 PMA NUTRITION SURVEYS
How to Create a Pointer Variable With Functional Programming in R
The Performance Monitoring for Action 2018 nutrition survey module for Burkina Faso and Kenya features hundreds of indicators measuring diet and nutritional status for children under age 5 and women aged 10-49. While the survey is not expressly designed to capture relationships between sampled individuals living together in the same household, there may be certain research contexts in which these relationships are key.
For example, each woman aged 10-49 who gave birth to a living child within 2 years prior to the survey was given a series of questions related to the antenatal care and nutritional assistance she received during her most recent pregnancy. If the child from that pregnancy was also included in the nutrition sample, it may be possible for researchers to link the mother and child together. Such connections could be used to determine whether certain types of antenatal interventions referenced on the female questionnaire ultimately improve nutritional outcomes for children after birth.
In the 2018 data, a match between mother and child can be made 1) if they each reside in the same household (HHID), and 2) if the date of the woman’s most recent birth (LASTBIRTHMO and LASTBIRTHYR) is the same as the birth date of the child (KIDBIRTHMO and KIDBIRTHYR). Both criteria can be used together in a function creating a user-generated variable representing a match between mother and child.
For R users, the most concise approach to building a function like this one comes from the functional programming toolkit purrr, attached with the package tidyverse (STATA users: check out our example using egen at the end of this post). Here, the purrr function pmap_chr iterates over each record in the Burkina Faso 2018 dataset, applying our match criteria to create a new pointer variable, MOMID. This new variable contains the PERSONID associated with a child’s mother if a match has been found.
Load libraries and import data from the IPUMS PMA website
library(tidyverse)
library(ipumsr)
bf2018 <- read_ipums_micro(ddi = "pma_00001.xml")
The loaded libraries include:
tidyverse, which attaches purrr and other packages needed to use the “tidy” syntax used in this example. See tidyverse.org for a full list of imported packages.
ipumsr, which includes the function read_ipums_micro used to import both the dataset and its associated codebook downloaded from pma.ipums.org (see the ipumsr notes on CRAN for detailed instructions). In this example, our extract includes only the sample “Burkina Faso 2018 Round 2 Nutrition”, along with all of the variables used below (available under the “NUTRITION - PERSON” unit of analysis).
Note: the file pma_00001.xml should be in your R working directory, or else its full path should be specified.
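As a minimal sketch of the note above, the path to the codebook can be built explicitly and checked before reading. The folder name "data" is a hypothetical location, not something the extract requires; substitute wherever the files were saved.

```r
# Hypothetical location of the downloaded extract; adjust as needed.
ddi_path <- file.path("data", "pma_00001.xml")

if (file.exists(ddi_path)) {
  # read_ipums_micro() uses the codebook (DDI) to locate and read the data file
  bf2018 <- ipumsr::read_ipums_micro(ddi = ddi_path)
} else {
  # Show the full path that was tried, to make a wrong working directory obvious
  message("DDI file not found at: ", normalizePath(ddi_path, mustWork = FALSE))
}
```

Checking the path first tends to give a clearer message than the read error that appears when the working directory is not the one expected.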
Matching by household and birth date with pmap
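bf2018 <- bf2018 %>%
  mutate(MOMID = case_when(ELIGTYPE < 20 ~ pmap_chr(
    bf2018 %>%
      select(HHID,
             KIDBIRTHYR,
             KIDBIRTHMO,
             LASTBIRTHYR,
             LASTBIRTHMO),
    function(...){
      # One-row tibble holding the selected variables for the current record
      kid <- tibble(...)
      # Row indices of records matching this record on each criterion
      sameYr <- which(LASTBIRTHYR == kid$KIDBIRTHYR)
      sameMo <- which(LASTBIRTHMO == kid$KIDBIRTHMO)
      sameHH <- which(HHID == kid$HHID)
      mom <- intersect(intersect(sameYr, sameMo), sameHH)
      # Accept only an unambiguous match
      if(length(mom) == 1){
        return(PERSONID[mom])
      } else {
        # Character NA, so that pmap_chr always receives a string
        return(NA_character_)
      }
    }
  )))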
The variable MOMID is created in several steps:
mutate creates the variable MOMID.
case_when assigns NA to MOMID, except in defined cases.
MOMID cases are defined only for children, indicated by ELIGTYPE < 20.
pmap_chr iterates through the records (case_when keeps its result only for children), looking for data within a certain group of variables: HHID, KIDBIRTHYR, KIDBIRTHMO, LASTBIRTHYR, and LASTBIRTHMO. (The variant pmap_chr returns a character vector, rather than the list returned by the general-purpose function pmap.)
function(...) creates an unnamed function; we could list the variables in the parentheses again here, but using ... passes them to the function automatically.
kid <- tibble(...) creates a tibble data structure that resembles the larger dataset, but includes only one record in each iteration of the function. Using ... again here efficiently passes only the variables we selected above.
sameYr <- which(LASTBIRTHYR == kid$KIDBIRTHYR) finds all records where the variable for “most recent birth-year” LASTBIRTHYR holds the same value as the child’s birth-year kid$KIDBIRTHYR.
sameMo <- which(LASTBIRTHMO == kid$KIDBIRTHMO) finds all records where the variable for “most recent birth-month” LASTBIRTHMO holds the same value as the child’s birth-month kid$KIDBIRTHMO.
sameHH <- which(HHID == kid$HHID) finds all records where the “household identification number” HHID matches the number on the child’s record kid$HHID.
mom includes the row index for all records where a match was found in each of sameYr, sameMo, and sameHH. If there is only one such person, if(length(mom)==1), they must be the child’s mother, and her identification number is returned as the child’s MOMID with return(PERSONID[mom]). If there were no matches, or if there were two or more possible matches, the child’s MOMID must be NA.
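To see these steps on something small, here is a minimal sketch using a four-row toy dataset. All values are invented for illustration, including the use of 9999 and 99 as stand-ins for "not in universe" codes, and the anonymous function takes named arguments rather than ... to keep it short. It loads dplyr and purrr directly rather than the full tidyverse.

```r
library(dplyr)
library(purrr)

# Invented toy data: household A has a sampled woman and her infant;
# household B has a child but no sampled woman.
toy <- tibble(
  PERSONID    = c("A1", "A2", "B1", "B2"),
  HHID        = c("A", "A", "B", "B"),
  ELIGTYPE    = c(20, 11, 31, 14),
  LASTBIRTHYR = c(2017, 9999, 9999, 9999),
  LASTBIRTHMO = c(6, 99, 99, 99),
  KIDBIRTHYR  = c(9999, 2017, 9999, 2015),
  KIDBIRTHMO  = c(99, 6, 99, 3)
)

toy <- toy %>%
  mutate(MOMID = case_when(ELIGTYPE < 20 ~ pmap_chr(
    list(HHID, KIDBIRTHYR, KIDBIRTHMO),
    function(hh, yr, mo) {
      # Rows in the same household whose most recent birth matches this record
      mom <- which(HHID == hh & LASTBIRTHYR == yr & LASTBIRTHMO == mo)
      # Keep only an unambiguous match
      if (length(mom) == 1) PERSONID[mom] else NA_character_
    }
  )))

toy %>% select(PERSONID, ELIGTYPE, MOMID)
```

Here the infant A2 is linked to A1, while the child B2 stays NA because nobody in household B reports a matching birth. Note that pmap_chr also finds a spurious "match" for the woman A1 (her missing-data codes equal those on A2), which is why the case_when condition ELIGTYPE < 20 matters.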
Which children were linked to a mother’s record?
To check our results, create a binary indicator LINKED that shows whether each record is a child linked to a mother via MOMID.
bf2018 %>%
  mutate(LINKED = case_when(
    is.na(MOMID) ~ "no",
    !is.na(MOMID) ~ "yes"
  )) %>%
  group_by(ELIGTYPE, LINKED) %>%
  count()
## # A tibble: 9 x 3
## # Groups: ELIGTYPE, LINKED [9]
## ELIGTYPE LINKED n
## <int+lbl> <chr> <int>
## 1 11 [Infant under age 2 (INF)] no 731
## 2 11 [Infant under age 2 (INF)] yes 463
## 3 14 [Aged 2-5 (K)] no 1329
## 4 14 [Aged 2-5 (K)] yes 333
## 5 20 [Selected women aged 10-49 (WN)] no 2411
## 6 31 [Member of household included in nutrition sample] no 9045
## 7 32 [Member of household excluded from nutrition sample] no 3535
## 8 95 [Not interviewed (female questionnaire)] no 51
## 9 96 [Not interviewed (household questionnaire)] no 101
In total, 796 children were linked to a mother: this is only about 28% of the 2856 children listed by the grouping variable ELIGTYPE.
Why were so few children matched successfully? Part of the answer has to do with the nutrition sample design: in Burkina Faso, women aged 10-49 were included from just 45% of households selected at random from the household screening sample (in the Kenya 2018 sample, 25% of households were randomly selected). Meanwhile, children under age 5 from all households were selected for the nutrition sample. As a result, we might estimate that only around 45% of children could possibly be linked to a mother included in the nutrition sample:
bf2018 %>%
  mutate(MOMS_IN_HH = case_when(ELIGTYPE < 20 ~ pmap_chr(
    bf2018 %>%
      select(HHID, ELIGTYPE),
    function(...){
      kid <- tibble(...)
      # Sampled women (ELIGTYPE 20) living in this record's household
      sameHH <- which(HHID == kid$HHID & ELIGTYPE == 20)
      if(length(sameHH) > 0){return("yes")}else{return("no")}
    }
  ))) %>%
  group_by(ELIGTYPE, MOMS_IN_HH) %>%
  count()
## # A tibble: 9 x 3
## # Groups: ELIGTYPE, MOMS_IN_HH [9]
## ELIGTYPE MOMS_IN_HH n
## <int+lbl> <chr> <int>
## 1 11 [Infant under age 2 (INF)] no 655
## 2 11 [Infant under age 2 (INF)] yes 539
## 3 14 [Aged 2-5 (K)] no 924
## 4 14 [Aged 2-5 (K)] yes 738
## 5 20 [Selected women aged 10-49 (WN)] <NA> 2411
## 6 31 [Member of household included in nutrition sample] <NA> 9045
## 7 32 [Member of household excluded from nutrition sample] <NA> 3535
## 8 95 [Not interviewed (female questionnaire)] <NA> 51
## 9 96 [Not interviewed (household questionnaire)] <NA> 101
This confirms our estimate: only 1277 of the 2856 sampled children live in a household with a woman aged 10-49 who was also included in the nutrition sample. In other words, 45% of sampled children live in a household with a sampled mother. Our search found matches for 28% of all children, but this reflects 62% of the children who live in a household with a sampled mother.
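The percentages quoted so far can be recomputed directly from the counts in the two tables above. This is only a worked check of the arithmetic; the object names are ours.

```r
# Counts copied from the LINKED and MOMS_IN_HH tables above
kids_total   <- 731 + 463 + 1329 + 333  # all children (ELIGTYPE 11 and 14)
kids_linked  <- 463 + 333               # children with a MOMID
kids_with_wn <- 539 + 738               # children living with a sampled woman

round(100 * kids_linked  / kids_total,   1)  # 27.9: share of all children linked
round(100 * kids_with_wn / kids_total,   1)  # 44.7: share who could be linked
round(100 * kids_linked  / kids_with_wn, 1)  # 62.3: share of linkable children linked
```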
Fortunately, because households were randomly selected for the female nutrition sample, we should have no reason to suspect that the 796 children matched to mothers by our search criteria represent a biased sub-sample of the 2856 children overall. However, before proceeding with further analysis, it would be worthwhile to see if we can increase the number of linked children with additional search criteria.
An Expanded Search for Additional Cases
In the example above, the use of LASTBIRTHMO and LASTBIRTHYR ensured that only a mother’s most recent child could be linked to her record.
It’s possible to expand these criteria in certain circumstances using RELATEKID, which describes the relationship between each child and the person who provided responses to the interviewer on their behalf. When RELATEKID == 1, this respondent is the child’s mother. So, if RELATEKID == 1 and only one woman in the child’s household has ever given birth, we should identify that person as the child’s mother.
bf2018 <- bf2018 %>%
  mutate(MOMID = case_when(
    ELIGTYPE < 20 ~ pmap_chr(
      bf2018 %>%
        select(
          HHID,
          KIDBIRTHYR,
          KIDBIRTHMO,
          LASTBIRTHYR,
          LASTBIRTHMO,
          RELATEKID
        ),
      function(...){
        kid <- tibble(...)
        sameYr <- which(LASTBIRTHYR == kid$KIDBIRTHYR)
        sameMo <- which(LASTBIRTHMO == kid$KIDBIRTHMO)
        sameHH <- which(HHID == kid$HHID)
        mom <- intersect(intersect(sameYr, sameMo), sameHH)
        # Women who have ever given birth (9000+ are missing-data codes)
        all_moms <- which(LASTBIRTHYR < 9000)
        moms_in_hh <- intersect(sameHH, all_moms)
        if(length(mom) == 1){
          # First choice: unique match on household and birth date
          return(PERSONID[mom])
        } else if(length(moms_in_hh) == 1 && kid$RELATEKID == 1){
          # Fallback: the only mother in the household answered for the child
          return(PERSONID[moms_in_hh])
        } else {
          return(NA_character_)
        }
      }
    )
  ))
This is the same function as before, but with a few additions:
RELATEKID is also selected in the first argument of pmap_chr.
all_moms <- which(LASTBIRTHYR < 9000) finds all women who have ever given birth (values above 9000 are codes for different types of missing data, rather than years).
moms_in_hh <- intersect(sameHH, all_moms) finds the women in all_moms who live in the child’s household.
else if(length(moms_in_hh) == 1 & kid$RELATEKID == 1) establishes an additional criterion for children who were not linked in the example above: if there is only one possible mother in the child’s household (moms_in_hh), and if the respondent for the child (kid$RELATEKID) was their mother, that mother’s identification number is returned as the child’s MOMID with return(PERSONID[moms_in_hh]).
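A minimal sketch of the fallback rule on invented data may help: one household with a single woman who has given birth, her most recent child, and an older child whose birth date does not match her LASTBIRTHYR and LASTBIRTHMO. All values, including 9999 and 99 as stand-ins for missing-data codes, are assumptions made for illustration, and the anonymous function uses named arguments rather than ... for brevity.

```r
library(dplyr)
library(purrr)

# Invented toy household: C1 is a sampled woman, C2 her most recent child,
# C3 an older child for whom the mother was the respondent (RELATEKID == 1).
toy <- tibble(
  PERSONID    = c("C1", "C2", "C3"),
  HHID        = c("C", "C", "C"),
  ELIGTYPE    = c(20, 11, 14),
  LASTBIRTHYR = c(2018, 9999, 9999),
  LASTBIRTHMO = c(1, 99, 99),
  KIDBIRTHYR  = c(9999, 2018, 2015),
  KIDBIRTHMO  = c(99, 1, 3),
  RELATEKID   = c(99, 1, 1)
)

toy <- toy %>%
  mutate(MOMID = case_when(ELIGTYPE < 20 ~ pmap_chr(
    list(HHID, KIDBIRTHYR, KIDBIRTHMO, RELATEKID),
    function(hh, yr, mo, rel) {
      # First criterion: same household and matching birth date
      mom <- which(HHID == hh & LASTBIRTHYR == yr & LASTBIRTHMO == mo)
      # Second criterion: women in the household who have ever given birth
      moms_in_hh <- which(HHID == hh & LASTBIRTHYR < 9000)
      if (length(mom) == 1) {
        PERSONID[mom]
      } else if (length(moms_in_hh) == 1 && rel == 1) {
        PERSONID[moms_in_hh]
      } else {
        NA_character_
      }
    }
  )))

toy %>% select(PERSONID, ELIGTYPE, MOMID)
```

In this toy household, C2 is linked by birth date and C3 is linked only through the fallback rule; both point to C1.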
How many additional children were linked?
bf2018 %>%
  mutate(LINKED = case_when(
    is.na(MOMID) ~ "no",
    !is.na(MOMID) ~ "yes"
  )) %>%
  group_by(ELIGTYPE, LINKED) %>%
  count()
## # A tibble: 9 x 3
## # Groups: ELIGTYPE, LINKED [9]
## ELIGTYPE LINKED n
## <int+lbl> <chr> <int>
## 1 11 [Infant under age 2 (INF)] no 688
## 2 11 [Infant under age 2 (INF)] yes 506
## 3 14 [Aged 2-5 (K)] no 1089
## 4 14 [Aged 2-5 (K)] yes 573
## 5 20 [Selected women aged 10-49 (WN)] no 2411
## 6 31 [Member of household included in nutrition sample] no 9045
## 7 32 [Member of household excluded from nutrition sample] no 3535
## 8 95 [Not interviewed (female questionnaire)] no 51
## 9 96 [Not interviewed (household questionnaire)] no 101
With these expanded search criteria, 1079 children were matched to a mother’s record: an improvement of 283 cases. Because only 1277 children in the nutrition sample live with a sampled mother, we have now established almost 85% of the possible links between mothers and children. However, these additional 283 cases should be used with some degree of caution: they represent children from smaller households compared to the remaining 15% of linkable children who, for the most part, live in households where two or more mothers reside together. Further analysis should determine whether and how the selection of these smaller households might bias the social and economic composition of our sub-sample.
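As a quick check of these figures, the counts from the tables above can be combined as follows (object names are ours):

```r
linked_first    <- 463 + 333   # children linked by household and birth date
linked_expanded <- 506 + 573   # children linked after adding the RELATEKID rule
linkable        <- 539 + 738   # children living with a sampled woman

linked_expanded - linked_first                # 283 additional links
round(100 * linked_expanded / linkable, 1)    # 84.5, i.e. almost 85%
```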
Replicating in STATA
In STATA, we can reproduce these results with the command egen (although users should note that, without a direct analogue to pmap, this script ran for us in just under an hour):
*Step 1: match on household and birth date
bys hhid: egen housesize = max(lineno)
levelsof hhid, local(household)
gen momid = ""
foreach x in `household' {
    local i = 1
    levelsof housesize if hhid == "`x'", local(j)
    while `i' <= `j' {
        bys hhid: replace momid = personid[`i'] if lastbirthyr[`i'] == kidbirthyr & lastbirthmo[`i'] == kidbirthmo
        local i = `i' + 1
    }
}
*Step 2: expanded search using relatekid
*Step 2 recreates housesize and momid, so drop any versions left over from Step 1
capture drop housesize momid
gen flag = 1 if lastbirthyr < 9000
bys hhid: egen moms_in_hh = count(flag)
bys hhid: egen housesize = max(lineno)
levelsof hhid, local(household)
gen momid = ""
foreach x in `household' {
    local i = 1
    levelsof housesize if hhid == "`x'", local(j)
    while `i' <= `j' {
        bys hhid: replace momid = personid[`i'] if lastbirthyr[`i'] == kidbirthyr & lastbirthmo[`i'] == kidbirthmo & relatekid == 1
        bys hhid: replace momid = personid[`i'] if relatekid == 1 & moms_in_hh == 1 & flag[`i'] == 1
        local i = `i' + 1
    }
}
gen linked = 1 if momid != ""
Next steps
With the pointer variable MOMID, it’s now possible to begin exploring the relationship between different types of antenatal intervention and nutritional outcomes for the children in our sample. For example, among those women who received any type of antenatal care for a pregnancy in the last two years, the variable RPFEEDINFO indicates whether she was specifically given instructions about how to feed her newborn child as a part of that care. Another variable, KIDMEASTOLD, reports what, if anything, health providers have mentioned about a living child’s growth & malnourishment.
bf2018 <- bf2018 %>%
  mutate(RPFEEDINFO_M = pmap_int(
    bf2018 %>%
      select(
        MOMID,
        PERSONID,
        RPFEEDINFO
      ),
    function(...){
      kid <- tibble(...)
      # Records without a linked mother get NA
      if(is.na(kid$MOMID)){
        return(NA)
      } else {
        # Look up the mother's row and copy her RPFEEDINFO value
        mom <- which(PERSONID == kid$MOMID)
        return(RPFEEDINFO[mom])
      }
    }
  ))
# Copy the value labels from the original variable
attributes(bf2018$RPFEEDINFO_M) <- attributes(bf2018$RPFEEDINFO)
bf2018 %>%
  filter(KIDMEASTOLD < 90) %>%
  group_by(RPFEEDINFO_M, KIDMEASTOLD) %>%
  count() %>%
  print(n = Inf)
## # A tibble: 12 x 3
## # Groups: RPFEEDINFO_M, KIDMEASTOLD [12]
## RPFEEDINFO_M KIDMEASTOLD n
## <int+lbl> <int+lbl> <int>
## 1 0 [No] 0 [Not told about child's growth] 81
## 2 0 [No] 1 [Growing well / not malnourished] 34
## 3 0 [No] 2 [Not growing well / malnourished] 12
## 4 1 [Yes] 0 [Not told about child's growth] 51
## 5 1 [Yes] 1 [Growing well / not malnourished] 33
## 6 1 [Yes] 2 [Not growing well / malnourished] 2
## 7 99 [NIU (not in universe)] 0 [Not told about child's growth] 46
## 8 99 [NIU (not in universe)] 1 [Growing well / not malnourished] 14
## 9 99 [NIU (not in universe)] 2 [Not growing well / malnourished] 4
## 10 NA 0 [Not told about child's growth] 290
## 11 NA 1 [Growing well / not malnourished] 174
## 12 NA 2 [Not growing well / malnourished] 31
Here, a new variable RPFEEDINFO_M is created on the child’s record: it points to the value for RPFEEDINFO on their mother’s record (the line attributes(bf2018$RPFEEDINFO_M) <- attributes(bf2018$RPFEEDINFO) ensures that it has the same value labels).
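Once MOMID exists, the same kind of lookup can also be sketched without pmap, using base R's match(). This is a different technique from the one used in this post, shown on invented toy values, and it assumes PERSONID is unique.

```r
# Invented toy records: A2 is a child whose mother is A1
toy <- data.frame(
  PERSONID   = c("A1", "A2", "B2"),
  MOMID      = c(NA, "A1", NA),
  RPFEEDINFO = c(1L, 99L, 99L),
  stringsAsFactors = FALSE
)

# match() returns the row position of each MOMID among PERSONID (NA if none),
# and indexing by those positions copies the mother's value onto the child
toy$RPFEEDINFO_M <- toy$RPFEEDINFO[match(toy$MOMID, toy$PERSONID)]
toy
```

With labelled IPUMS variables the value labels may still need to be restored afterwards, as in the attributes() line above.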
In this simple example, it appears that, among children whose linked mother received no information about newborn feeding during her pregnancy, over 9% were later diagnosed with poor growth / malnourishment. By comparison, among children whose linked mother did receive this information, only about 2% were later diagnosed with poor growth / malnourishment.
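The two percentages can be reproduced from the first six rows of the table above. The sketch below simply re-enters those counts by hand; the column names are ours.

```r
# Counts from rows 1-6 of the table (children with a linked mother who was
# in universe for RPFEEDINFO)
counts <- data.frame(
  feedinfo = rep(c("No", "Yes"), each = 3),
  told     = rep(c(0, 1, 2), times = 2),   # 2 = not growing well / malnourished
  n        = c(81, 34, 12, 51, 33, 2)
)

# Share of each KIDMEASTOLD value within each RPFEEDINFO_M group
counts$pct <- round(100 * counts$n / ave(counts$n, counts$feedinfo, FUN = sum), 1)
counts[counts$told == 2, ]
```

The groups are small (127 and 86 children), so these shares are best read as descriptive rather than as evidence of an effect.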
We hope this post has been helpful! Let us know how you plan to use this pointer variable or ask us questions by email at ipums@umn.edu, or by tweeting us @ipums.