| Title: | Tools and Methods for Working with MIDFIELD Data in 'R' |
|---|---|
| Description: | Provides tools in 'R' for working with undergraduate, longitudinal, student-level records modeled on the MIDFIELD database. Tools facilitate identifying academic program codes, excluding post-baccalaureate terms, excluding records for insufficient data, and assessing timely completion. The tools support the workflow of collecting programs, refining the population, constructing blocs of records for aggregation, and calculating quantitative metrics. 'midfieldr' interacts with practice data provided in the 'midfielddata' package or with any data modeled on the MIDFIELD database. The development of 'midfieldr' and 'midfielddata' was supported by the US National Science Foundation through grant numbers 1545667 and 2142087. |
| Authors: | Richard Layton [cre, aut, cph], Russell Long [aut, cph], Matthew Ohland [aut, cph], Marisa Orr [aut, cph], Susan Lord [aut, cph], US National Science Foundation [fnd] |
| Maintainer: | Richard Layton <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.3.9011 |
| Built: | 2026-07-01 19:25:36 UTC |
| Source: | https://github.com/midfieldr/midfieldr |
These functions were deprecated in midfieldr 1.0.4.
    add_completion_status(dframe, midfield_degree = degree)

    add_data_sufficiency(dframe, midfield_term = term)

    filter_cip(keep_text = NULL, drop_text = NULL, cip = NULL, select = NULL)

    select_required(midfield_x, select_add = NULL)

    add_timely_term(
      dframe,
      midfield_term = term,
      ...,
      sched_span = NULL,
      span = NULL
    )
| Argument | Description |
|---|---|
| dframe | Data frame or data frame extension (e.g., data.table or tibble). |
| midfield_degree | MIDFIELD records degree data frame or data frame extension. |
| midfield_term | MIDFIELD records term data frame or data frame extension. |
| keep_text | Deprecated |
| drop_text | Deprecated |
| cip | Deprecated |
| select | Deprecated |
| midfield_x | Deprecated |
| select_add | Deprecated |
| ... | Not used for passing values; forces subsequent arguments to be referable only by name. |
| sched_span | Integer scalar |
| span | Integer scalar |
add_completion_status() is deprecated in favor of
completion_status(). The replacement updates midfieldr file names and
argument names, drops columns not used by the function, and preserves
data frame class.
add_data_sufficiency() is deprecated in favor of
data_sufficiency(). The replacement updates midfieldr file names and
argument names, drops columns not used by the function, and preserves
data frame class.
add_timely_term() is deprecated in favor of
timely_term(). The replacement updates midfieldr file names and
argument names, drops columns not used by the function, and preserves
data frame class.
filter_cip() is deprecated in favor of
filter_programs(). The new function is similar but with the CIP
data frame as the first argument, enabling chained functions like those
encountered using dplyr and friends.
select_required() is deprecated in favor of
select_records(). The new functionality is similar but with
exact matching to the default column names plus preserving data
frame class.
Data frame of IDs after processing the practice data for data sufficiency and degree seeking. Provides a convenient bloc to start many of the analyses illustrated in the package articles.

    baseline_mcid

data.table with 76875 rows and 1 column:
mcid: Character, de-identified student ID. Key column.
Other case-study-data:
study_observations,
study_programs,
study_results
A wrapper on base::tryCatch() for previewing an error message, if any.
    catch_error(f)

| Argument | Description |
|---|---|
| f | Function with arguments expecting an error |
Does not return anything. The side effect is to output to the terminal.
    # Example data frames
    s <- toy_student[, .(mcid)]
    t <- toy_term[, .(mcid, term)]
    d <- toy_degree[, .(mcid, term_degree)]

    # No error
    catch_error(post_bacc_terms(t, d))

    # Error, no term variable
    catch_error(post_bacc_terms(s, d))

    # Error, missing dframe argument
    catch_error(post_bacc_terms())

    # Error, missing degree argument
    catch_error(post_bacc_terms(t))
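The idea of previewing an error message with tryCatch() can be sketched in base R. This is a minimal illustration and not the package's implementation; the helper name preview_error() is hypothetical and chosen only for this example.

```r
# Hypothetical helper (not part of midfieldr): evaluate an expression and,
# if it signals an error, print the message instead of halting.
preview_error <- function(expr) {
  tryCatch(
    {
      expr # forces evaluation of the lazily passed expression
      invisible(NULL)
    },
    error = function(e) {
      cat("Error:", conditionMessage(e), "\n")
      invisible(NULL)
    }
  )
}

# No error: nothing is printed
preview_error(sqrt(4))

# Error: the message is printed and the session continues
preview_error(stop("term variable not found"))
```

Because the expression is forced inside tryCatch(), the error is intercepted before it can stop a script or a document build.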
A data table based on the US National Center for Education Statistics (NCES), Integrated Postsecondary Education Data System (IPEDS), 2010 CIP. The data are codes and names for 1582 instructional programs organized on three levels: a 2-digit series, a 4-digit series, and a 6-digit series.
    cip

A data.table with 1582 rows and 6 columns keyed by the 6-digit CIP code:
cip6name: Character, program name at the 6-digit level.
cip6: Character, 6-digit code representing "specific instructional programs" (US National Center for Education Statistics).
cip4name: Character, program name at the 4-digit level.
cip4: Character, 4-digit code (the first 4 digits of cip6) representing "intermediate groupings of programs that have comparable content and objectives."
cip2name: Character, program name at the 2-digit level.
cip2: Character, 2-digit code (the first 2 digits of cip6) representing "the most general groupings of related programs."
The midfielddata taxonomy includes one non-IPEDS code (999999) for Undecided or Unspecified, instances in which institutions reported no program information or that students were not enrolled in a program.
https://nces.ed.gov/ipeds/cipcode/
Other cip-data:
cip2010,
fye_proxy
A data table of the 2010 Classification of Instructional Programs (CIP)
accessed in 2026 from the US National Center for Education Statistics
(NCES). Like the cip data set originally included with midfieldr,
cip2010 provides codes and names for instructional programs organized
on three levels: a 2-digit series, a 4-digit series, and a 6-digit series.
    cip2010

data.table with 1849 rows and 6 columns keyed by the 6-digit CIP code:
cip6: Character, 6-digit code representing "specific instructional programs" (US National Center for Education Statistics).
cip6name: Character, program name at the 6-digit level.
cip4: Character, 4-digit code (the first 4 digits of cip6) representing "intermediate groupings of programs that have comparable content and objectives."
cip4name: Character, program name at the 4-digit level.
cip2: Character, 2-digit code (the first 2 digits of cip6) representing "the most general groupings of related programs."
cip2name: Character, program name at the 2-digit level.
The midfielddata taxonomy includes one non-IPEDS code (999999) for Undecided or Unspecified, instances in which institutions reported no program information or that students were not enrolled in a program.
https://nces.ed.gov/ipeds/cipcode/
Other cip-data:
cip,
fye_proxy
To a data frame keyed by student ID, add a column indicating if a student completed their program, and if so, whether their completion was timely or late. Columns of supporting information are also added. Unrelated columns are dropped.
    completion_status(dframe, midfield_rec = degree)

| Argument | Description |
|---|---|
| dframe | Data frame or data frame extension (e.g., data.table or tibble). Required variables: |
| midfield_rec | MIDFIELD records degree data frame or data frame extension. Required variables: |
In many studies, students must complete their programs in a specified time span to be considered "timely", for example 4, 6, or 8 years after admission. By "completion" we mean an undergraduate earning their first baccalaureate degree (or degrees, for students earning more than one degree in the same term).
The goal of determining timely completion is to refine a population, that
is, obtain a data frame of IDs that satisfy our constraints. Thus
completion_status() yields a column of completion status values and
columns of supporting information keyed by ID. All other columns in
dframe (if any) are dropped.
The supporting information in the output is provided so that the user can review the findings. After review, we usually delete all columns except the IDs, yielding the refined population that was our goal.
Data frame with the following properties:
Data frame class is preserved. Groups and keys are not preserved.
Rows are filtered for unique mcid values.
Columns {mcid, timely_term} are retained (all other columns
are dropped). New columns added:
term_degree. Character. Term in which the first degree(s) are
completed, encoded YYYYT. Joined from midfield_rec.
completion_status. Character. Possible values are "timely"
for students completing a degree no later than their timely
completion terms; "late" for students completing their program
after their timely completion term; and "NA" for non-completers.
    # Start with an excerpt from the student data set
    dframe <- toy_student[1:10, .(mcid)]

    # Timely term column is required to add completion status column
    dframe <- timely_term(dframe, toy_term)

    # Add completion status column
    completion_status(dframe, toy_degree)

    # Existing completion_status column, if any, is overwritten
    dframe[, completion_status := NA_character_][]
    completion_status(dframe, toy_degree)
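The comparison behind the completion status values can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the student IDs and term values are hypothetical toy data, and non-completers are represented here by a missing value.

```r
# Hypothetical toy data: three students with a common timely completion term
dframe <- data.frame(
  mcid = c("A", "B", "C"),
  timely_term = c("20141", "20141", "20141"),
  stringsAsFactors = FALSE
)

# Hypothetical degree records: student C has no degree on record
degree <- data.frame(
  mcid = c("A", "B"),
  term_degree = c("20133", "20151"),
  stringsAsFactors = FALSE
)

# Left join keeps non-completers, with a missing degree term
out <- merge(dframe, degree, by = "mcid", all.x = TRUE)

# Terms encoded YYYYT can be compared as integers
on_time <- as.integer(out$term_degree) <= as.integer(out$timely_term)
out$completion_status <- ifelse(on_time, "timely", "late")
out
```

Student A finishes before the timely term, student B finishes after it, and student C has no degree term, so the comparison yields a missing status for C.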
To a data frame keyed by student ID, add a column indicating that an institution's data range is sufficient to reliably assess a student's program completion. Columns of supporting information are also added. Unrelated columns are dropped.
    data_sufficiency(dframe, midfield_rec = term)

| Argument | Description |
|---|---|
| dframe | Data frame or data frame extension (e.g., data.table or tibble). Required variables: |
| midfield_rec | MIDFIELD records term data frame or data frame extension. Required variables: |
Because the time span of MIDFIELD term data varies by institution, each institution has its own lower and upper bounds. When assessing a student's program completion, an unavoidable ambiguity arises for student records at or near these bounds. Such records must be identified and in most cases excluded to prevent false summary counts.
The data sufficiency criterion states that student records are limited to those for which available data are sufficient to assess timely completion without biased counts of completers or non-completers. In practice, the criterion is implemented via two filters. Rows are labeled for exclusion when: 1) a student ID appears in the term at the non-summer lower limit of an institution's data range; or 2) a student ID has a timely completion term that exceeds the upper limit of the institution's data range.
The goal of determining data sufficiency is to refine a population, that
is, obtain a data frame of IDs that satisfy our constraints. Thus
data_sufficiency() yields a column of data sufficiency values and
columns of supporting information keyed by ID. All other columns in
dframe (if any) are dropped.
The supporting information in the output is provided so that the user can review the findings. After review, we usually delete all columns except the IDs, yielding the refined population that was our goal.
Data frame with the following properties:
Data frame class is preserved. Groups and keys are not preserved.
Rows are filtered for unique mcid values.
Columns {mcid, term_i, timely_term} are retained (all other columns
are dropped). New columns added:
institution. Character. Institution in which the student is
enrolled in the given term. Extracted from midfield_rec. The
limits given in the next two columns are specific to the institution.
lower_limit. Character. Initial term of an institution's
data range, encoded YYYYT. Extracted from midfield_rec.
Compared to term_i to determine the lower-limit exclusion.
upper_limit. Character. Final term of an institution's
data range, encoded YYYYT. Extracted from midfield_rec.
Compared to timely_term to determine upper-limit exclusion.
data_sufficiency. Character. Possible values are "include",
if the data are sufficient; and "exclude-lower" or "exclude-upper"
if not, indicating at which boundary of the data range the ambiguity
occurs.
    # Start with an excerpt from the student data set
    dframe <- toy_student[1:10, .(mcid)]

    # Timely term column is required to add data sufficiency column
    dframe <- timely_term(dframe, toy_term)

    # Add data sufficiency column
    data_sufficiency(dframe, toy_term)

    # Existing data_sufficiency column, if any, is replaced
    dframe[, data_sufficiency := NA_character_][]
    data_sufficiency(dframe, toy_term)
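The two filters of the data sufficiency criterion can be sketched in base R. This sketch rests on assumptions: the data are hypothetical, all students belong to one institution, the lower-limit test is simplified to "first term equals the lower limit", and the special handling of summer terms is omitted. It is not the package's implementation.

```r
# Hypothetical toy data for one institution with data range 19881 to 20181
dframe <- data.frame(
  mcid = c("A", "B", "C"),
  term_i = c("19881", "19951", "20151"),      # first term on record
  timely_term = c("19933", "20003", "20203"), # timely completion term
  lower_limit = "19881",
  upper_limit = "20181",
  stringsAsFactors = FALSE
)

# Terms encoded YYYYT can be compared as integers
at_lower <- as.integer(dframe$term_i) == as.integer(dframe$lower_limit)
past_upper <- as.integer(dframe$timely_term) > as.integer(dframe$upper_limit)

# Assumed precedence: the lower-limit filter is checked first
dframe$data_sufficiency <- ifelse(
  at_lower, "exclude-lower",
  ifelse(past_upper, "exclude-upper", "include")
)
dframe[, c("mcid", "term_i", "timely_term", "data_sufficiency")]
```

Student A is already present in the first term of the data range, so their true starting term is ambiguous; student C's timely completion term lies beyond the last term of data, so timely completion cannot be observed.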
Subset a CIP data frame, retaining rows that match or partially match any string in a vector of character strings.
    filter_programs(dframe, pattern, ..., negate = FALSE)

| Argument | Description |
|---|---|
| dframe | Data frame or data frame extension (e.g., data.table or tibble). Expected variables (or subset thereof): |
| pattern | Character vector of search strings, including regular expressions. |
| ... | Not used for passing values; forces subsequent arguments to be referable only by name. |
| negate | Logical (default FALSE). If TRUE, inverts the resulting Boolean vector. |
Each element of the pattern vector is matched row-wise to every
value in dframe using grepl(). Row values are coerced to character
strings if possible. If negate = FALSE (default), a match retains
the full row; if negate = TRUE, a match removes the full row.
Data frame with the following properties:
Data frame class is preserved. Groups and keys are not preserved.
Rows are a subset of the input and appear in the same order.
Columns are not modified.
    # Subset using keywords
    filter_programs(cip, pattern = "history")

    # Subset using codes
    filter_programs(cip, pattern = "^54")

    # Multiple passes to narrow the results
    first_pass <- filter_programs(cip, "math")[, .(cip6name, cip6)]
    first_pass

    second_pass <- filter_programs(first_pass, c("bio", "educ"), negate = TRUE)
    second_pass

    third_pass <- filter_programs(second_pass, c("^27", "^30"))
    third_pass
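The row-wise matching described above can be sketched with grepl() in base R. The helper name filter_rows() and the toy data are hypothetical, and case-insensitive matching is an assumption of this sketch rather than a documented property of filter_programs().

```r
# Hypothetical helper (not part of midfieldr): keep rows in which any column
# matches any of the search strings
filter_rows <- function(dframe, pattern, negate = FALSE) {
  # Combine the search strings into one regular expression
  rx <- paste(pattern, collapse = "|")
  # One logical vector per column, then combine across columns with OR
  hits <- lapply(dframe, function(col) {
    grepl(rx, as.character(col), ignore.case = TRUE)
  })
  keep <- Reduce(`|`, hits)
  if (negate) keep <- !keep
  dframe[keep, , drop = FALSE]
}

# Hypothetical excerpt shaped like the cip data
toy <- data.frame(
  cip6 = c("270101", "540101", "260101"),
  cip6name = c("Mathematics, General", "History, General", "Biology, General"),
  stringsAsFactors = FALSE
)

filter_rows(toy, "history")
filter_rows(toy, c("^27", "^54"))
filter_rows(toy, "bio", negate = TRUE)
```

Anchored patterns such as "^27" match only values that begin with those characters, which is why they select codes rather than program names.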
Proxies are the degree-granting engineering programs we estimate that First-Year Engineering (FYE) students would have declared had they not been required to enroll in FYE. Keyed by student ID. Proxies are provided for all students in the midfielddata practice data who enroll in FYE in their first term.
    fye_proxy

data.table with 4623 rows and 2 columns keyed by student ID:
mcid: Character, de-identified student ID. Key column.
proxy: Character, 6-digit CIP code of the estimated proxy program.
The proxy variable contains 6-digit CIP codes of degree-granting engineering programs, e.g., Electrical Engineering, Mechanical Engineering, etc., that are substituted for the FYE CIP code when an analysis requires degree-granting starting programs. The most common application is a graduation rate calculation.
The estimation is based on students' first post-FYE programs and a multiple imputation suitable for categorical variables using the mice package. The predictor variables are institution, race, and sex. The estimated variable is the 6-digit CIP code of a degree-granting engineering program at their institution.
fye_proxy holds only for the practice data in midfielddata—these values
cannot be commingled with the MIDFIELD research database.
Data frame of letter grades and conventional point assignments used for computing grade point averages.
    grade_scale

data.table with 12 rows and 2 columns:
letter_grade: Character, letter grades using the conventional US scale from A to F.
points: Numerical, 4.0 scale of points assigned to letter grades.
Other scales:
sat_act_scale
A wrapper on base::str() with arguments set to not show attributes,
to not show length, and to cut the width.
    look_at(x)

| Argument | Description |
|---|---|
| x | Any R object. |
Does not return anything. The side effect is to output to the terminal.
    # data frames
    look_at(cip)
    look_at(toy_degree)

    # character vectors
    x <- sort(unique(toy_degree$institution))
    look_at(x)
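A wrapper of this kind can be approximated in base R as follows. The helper name peek() is hypothetical, and the specific argument values are an assumption about how the described behavior maps onto utils::str(); the package's own settings may differ.

```r
# Hypothetical helper (not part of midfieldr): compact structure display
peek <- function(x) {
  utils::str(
    x,
    give.attr = FALSE,    # do not show attributes
    give.length = FALSE,  # do not show vector lengths
    strict.width = "cut"  # cut lines at the console width
  )
}

# Character vector
peek(letters)

# Data frame
peek(data.frame(id = 1:3, grade = c("A", "B", "C")))
```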
Transform a data frame such that two independent categorical variables are factors with levels ordered for display in a multiway dot plot.
    order_multiway(
      dframe,
      quantity,
      categories,
      ...,
      method = NULL,
      ratio_of = NULL
    )

| Argument | Description |
|---|---|
| dframe | Data frame or data frame extension (e.g., data.table or tibble). Required variables: a single quantitative value for every combination of levels of two categorical variables. |
| quantity | Character, name of the single multiway quantitative variable |
| categories | Character, vector of names of the two multiway categorical variables |
| ... | Not used for passing values; forces subsequent arguments to be referable only by name. |
| method | Character, “median” (default) or “percent”, method of ordering the levels of the categories. The median method computes the medians of the quantitative column grouped by category. The percent method computes percentages based on the same ratio underlying the quantitative percentage variable except grouped by category. |
| ratio_of | Character vector with the names of the numerator and denominator columns that produced the quantitative variable, required when |
Multiway data comprise a single quantitative value (or response) for every combination of levels of two categorical variables. The ordering of the rows and panels is crucial to the perception of effects (Cleveland, 1993).
In our context, "multiway" refers to the data structure and graph design defined by Cleveland (1993), not to the methods of analysis described by Kroonenberg (2008).
Multiway data comprise three variables: a categorical variable of m levels; a second independent categorical variable of n levels; and a quantitative variable (or response) of length mn that cross-classifies the categories, that is, there is a value of the response for each combination of levels of the two categorical variables.
In a multiway dot plot, one category is encoded by the panels, the second category is encoded by the rows of each panel, and the quantitative variable is encoded along identical horizontal scales.
Data frame with the following properties:
Data frame class is preserved. Groups and keys are not preserved.
Rows are preserved.
Column specified by quantity is converted to type double.
Columns specified by categories are converted to factors and ordered.
Other columns are preserved with the exception that columns added by the function overwrite existing columns of the same name (if any).
Two new columns CATEGORY_median added when method is "median." Numeric medians of the quantitative variable grouped by the categorical variables. The CATEGORY placeholder in the column name is replaced by a category name from the categories argument. For example, suppose categories = c("program", "people") and method = "median". The two new column names would be program_median and people_median.
Two new columns CATEGORY_QUANTITY added when method is "percent." Numeric percentages based on the same ratio that produces the quantitative variable except grouped by the categorical variables. The CATEGORY placeholder in the column name is replaced by a category name from the categories argument; the QUANTITY placeholder is replaced by the quantitative variable name in the quantity argument. For example, suppose quantity = "grad_rate", categories = c("program", "people"), and method = "percent". The two new column names would be program_grad_rate and people_grad_rate.
Cleveland WS (1993). Visualizing Data. Hobart Press, Summit, NJ.
Kroonenberg PM (2008). Applied Multiway Data Analysis. Wiley, Hoboken, NJ.
    # Subset of built-in data set
    dframe <- study_results[program == "EE" | program == "ME"]
    dframe[, people := paste(race, sex)]
    dframe[, c("race", "sex") := NULL]
    data.table::setcolorder(dframe, c("program", "people"))

    # Class before ordering
    class(dframe$program)
    class(dframe$people)

    # Class and levels after ordering
    mw1 <- order_multiway(dframe,
      quantity = "stickiness",
      categories = c("program", "people")
    )
    class(mw1$program)
    levels(mw1$program)
    class(mw1$people)
    levels(mw1$people)

    # Display category medians
    mw1

    # Existing factors (if any) are re-ordered
    mw2 <- dframe
    mw2$program <- factor(mw2$program, levels = c("ME", "EE"))

    # Levels before conditioning
    levels(mw2$program)

    # Levels after conditioning (pass mw2, the data frame with the factor)
    mw2 <- order_multiway(mw2,
      quantity = "stickiness",
      categories = c("program", "people")
    )
    levels(mw2$program)

    # Ordering using percent method
    order_multiway(dframe,
      quantity = "stickiness",
      categories = c("program", "people"),
      method = "percent",
      ratio_of = c("graduates", "ever_enrolled")
    )
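The median method of ordering can be sketched in base R. The toy values, the group labels, and the helper name order_by_median() are hypothetical, and ascending order of medians is an assumption of this sketch; it illustrates the idea rather than reproducing order_multiway().

```r
# Hypothetical multiway data: one value per combination of two categories
dframe <- data.frame(
  program = c("EE", "EE", "ME", "ME"),
  people = c("Group 1", "Group 2", "Group 1", "Group 2"),
  stickiness = c(40, 50, 70, 30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert a category to a factor whose levels are
# ordered by the median of the quantitative variable within each level
order_by_median <- function(values, category) {
  med <- tapply(values, category, median)
  factor(category, levels = names(med)[order(med)])
}

dframe$program <- order_by_median(dframe$stickiness, dframe$program)
dframe$people <- order_by_median(dframe$stickiness, dframe$people)

levels(dframe$program)
levels(dframe$people)
```

With these toy values the program medians are 45 (EE) and 50 (ME), and the people medians are 55 (Group 1) and 40 (Group 2), so the rows of a dot plot would list Group 2 before Group 1.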
To a data frame keyed by student ID and containing an academic term variable, add a column that clusters terms with respect to a student's first degree term. Post-baccalaureate terms are typically excluded from the term, course, and degree data tables.
    post_bacc_terms(dframe, midfield_rec = degree)

| Argument | Description |
|---|---|
| dframe | Data frame or data frame extension (e.g., data.table or tibble). Required variables: |
| midfield_rec | MIDFIELD records degree data frame or data frame extension. Required variables: |
In a typical analysis, one is interested in a student's progress up to and including the term in which they earn their first degree or degrees. Any terms later than the first baccalaureate can usually be excluded from study.
Data frame with the following properties:
Data frame class is preserved. Groups and keys are not preserved.
Rows are not modified.
Columns are not modified except new columns overwrite old columns of the same name. New columns:
first_degree_term. Character. Term of a student's first
baccalaureate, encoded YYYYT or, if no degree recorded, NA.
Joined from midfield_rec$term_degree.
term_cluster. Character, indicating that a term belongs
to one of three clusters: terms that are prior to ("pre-degree"),
equal to ("first-degree"), or subsequent to ("post-first-degree")
the student’s first degree term.
    # Examples TBD
    x <- 1
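The clustering of terms relative to the first degree term can be sketched in base R. The data are hypothetical toy values, and leaving the cluster missing for students with no degree on record is an assumption of this sketch, not a documented behavior of post_bacc_terms().

```r
# Hypothetical term records: student A has three terms, student B has one
term <- data.frame(
  mcid = c("A", "A", "A", "B"),
  term = c("20131", "20141", "20151", "20131"),
  stringsAsFactors = FALSE
)

# Hypothetical degree records: only student A has a degree
degree <- data.frame(
  mcid = "A",
  term_degree = "20141",
  stringsAsFactors = FALSE
)

# Left join the first degree term onto every term record
out <- merge(term, degree, by = "mcid", all.x = TRUE)
names(out)[names(out) == "term_degree"] <- "first_degree_term"
out <- out[order(out$mcid, out$term), ]

# Terms encoded YYYYT can be compared as integers
t_int <- as.integer(out$term)
d_int <- as.integer(out$first_degree_term)
out$term_cluster <- ifelse(
  t_int < d_int, "pre-degree",
  ifelse(t_int == d_int, "first-degree", "post-first-degree")
)
out
```

Rows labeled "post-first-degree" are the ones an analysis would typically drop before computing undergraduate metrics.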
Constructs a data frame of student-level records of First-Year Engineering (FYE) programs and conditions the data for later use as an input to the mice R package for multiple imputation. Sets up three variables as predictors (institution, race/ethnicity, and sex) and one variable to be estimated (program CIP code).
    prep_fye_mice(
      midfield_student = student,
      midfield_term = term,
      ...,
      fye_codes = NULL
    )

| Argument | Description |
|---|---|
| midfield_student | MIDFIELD records student data frame or data frame extension. Required variables: |
| midfield_term | MIDFIELD records term data frame or data frame extension. Required variables: |
| ... | Not used for passing values; forces subsequent arguments to be referable only by name. |
| fye_codes | Optional character vector of 6-digit CIP codes to identify FYE programs, default "140102". Codes must be 6-digit strings of numbers; regular expressions are prohibited. Non-engineering codes (those that do not start with 14) produce an error. |
At some US institutions, engineering students are required to complete a First-Year Engineering (FYE) program as a prerequisite for declaring an engineering major. Administratively, degree-granting engineering programs such as Electrical Engineering or Mechanical Engineering treat their incoming post-FYE students as their "starting" cohorts. However, when computing a metric that requires a count of starters—graduation rate, for example—FYE records must be treated with special care to avoid a miscount.
To illustrate the potential for miscounting starters, suppose we wish to calculate a Mechanical Engineering (ME) graduation rate. Students starting in ME constitute the starting pool and the fraction of that pool graduating in ME is the graduation rate. At FYE institutions, an ME program would typically define their starting pool as the post-FYE cohort entering their program. This may be the best information available, but it invariably undercounts starters by failing to account for FYE students who do not transition (post-FYE) to degree-granting engineering programs—students who may have left the institution or switched to non-engineering majors. In either case, in the absence of the FYE requirement, some of these students would have been ME starters. By neglecting these students, the count of ME starters is artificially low resulting in an ME graduation rate that is artificially high. The same is true for every degree-granting engineering discipline in an FYE institution.
Therefore, to avoid miscounting starters at FYE institutions, we have to estimate an "FYE proxy", that is, the 6-digit CIP codes of the degree-granting engineering programs that FYE students would have declared had they not been required to enroll in FYE. The purpose of prep_fye_mice() is to prepare the data for making that estimation.
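The effect of the miscount on a graduation rate can be made concrete with a small calculation. All counts below are hypothetical and chosen only to illustrate the arithmetic.

```r
# Hypothetical counts at an FYE institution
me_post_fye_starters <- 80 # students entering ME after completing FYE
me_graduates <- 60         # of those, students who graduate in ME
fye_proxy_me <- 20         # FYE students who never reached a degree-granting
                           # program but whose estimated proxy is ME

# Starters counted without and with the FYE proxies
rate_without_proxy <- me_graduates / me_post_fye_starters
rate_with_proxy <- me_graduates / (me_post_fye_starters + fye_proxy_me)

round(100 * c(without = rate_without_proxy, with = rate_with_proxy), 1)
```

With these counts the rate drops from 75 percent to 60 percent once the proxy students are included among the starters, which is the direction of bias the text describes.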
After running prep_fye_mice() but before running mice(), one can edit
variables or add variables to create a custom set of predictors. The mice
package expects all predictors and the proxy variables to be factors. Do not
delete the institution variable because it ensures that a student's imputed
program is available at their institution.
In addition, ensure that the only missing values are in the proxy column. Other variables are expected to be complete (no NA values). A value of "unknown" in a predictor column, e.g., race/ethnicity or sex, is an acceptable value, not missing data. Observations with missing or unknown values in the ID or institution columns (if any) should be removed.
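These conditions can be checked before imputation with a few lines of base R. The data frame below is hypothetical and only mimics the column layout described for the output of prep_fye_mice(); the category labels and institution names are placeholders.

```r
# Hypothetical data shaped like the prepared FYE data
fye <- data.frame(
  mcid = c("A", "B", "C"),
  race = factor(c("Group 1", "Group 2", "unknown")),
  sex = factor(c("Female", "Male", "Female")),
  institution = factor(c("Institution X", "Institution X", "Institution Y")),
  proxy = factor(c("141901", NA, "141001")),
  stringsAsFactors = FALSE
)

# Count missing values per column; only proxy should be non-zero.
# Note that "unknown" is an ordinary factor level, not a missing value.
na_count <- colSums(is.na(fye))
na_count

# Confirm that the predictors and the proxy are factors
model_cols <- c("race", "sex", "institution", "proxy")
is_fct <- vapply(fye[model_cols], is.factor, logical(1))
is_fct
```

If a column other than proxy shows a non-zero count, those rows would need attention before the data are passed to mice().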
A data frame in data.table format conditioned for later use as
an input to the mice R package for multiple imputation. The data frame
comprises one row for every FYE student, first-term and migrator.
Grouping structures are not preserved. The columns returned are:
mcid: Character, anonymized student identifier. Returned as-is.
race: Factor, race/ethnicity as self-reported by the student. An imputation predictor variable.
sex: Factor, sex as self-reported by the student. An imputation predictor variable.
institution: Factor, anonymized institution name. An imputation predictor variable.
proxy: Factor, 6-digit CIP code of a student's known, post-FYE engineering program or NA representing missing values to be imputed.
The function extracts all terms for all FYE students,
including those who migrate to enter Engineering after their first term,
and identifies the first post-FYE program in which they enroll, if any.
This treatment yields two possible outcomes for values returned in the
proxy column:
The student completes FYE and enrolls in an engineering major. For this outcome, we know that at the student's first opportunity, they enrolled in an engineering program of their choosing. The CIP code of that program is returned as the student's FYE proxy.
The student does not enroll post-FYE in an engineering major. Such
students have no further records in the database or switched from
Engineering to another program. For this outcome, the data provide no
information regarding what engineering program the student would have
declared originally had the institution not required them to enroll in
FYE. For these students a proxy value of NA is returned. These are the
data treated as missing values to be imputed by mice().
In cases where students enter FYE, change programs, and re-enter FYE, only the first group of FYE terms is considered. Any programs before FYE are ignored.
The resulting data frame is ready for use as input for the mice package,
with all variables except mcid returned as factors.
    # Using toy data
    prep_fye_mice(toy_student, toy_term)

    # Other columns, if any, are dropped
    colnames(toy_student)
    colnames(prep_fye_mice(toy_student, toy_term))

    # Optional argument permits multiple CIP codes for FYE
    prep_fye_mice(
      midfield_student = toy_student,
      midfield_term = toy_term,
      fye_codes = c("140101", "140102")
    )
Data frame for converting between ACT and SAT scores. A range of SAT scores converts to a single ACT score; an ACT score converts to a single equivalent SAT score.

    sat_act_scale

data.table with 28 rows and 4 columns:
act_comp: Numerical, ACT composite score.
sat_lower: Numerical, total SAT, lower limit of range corresponding to the ACT composite score.
sat_equiv: Numerical, total SAT, value to use when converting ACT score to a single SAT score.
sat_upper: Numerical, total SAT, upper limit of range corresponding to the ACT composite score.
ACT/SAT Concordance (2018) ACT Education Corp. https://www.act.org/content/dam/act/unsecured/documents/ACT-SAT-Concordance-Tables.pdf
Other scales:
grade_scale
Subset a data frame, selecting columns by matching a vector of character strings. A convenience function to reduce the dimensions of a MIDFIELD data table by selecting only those columns required by other midfieldr functions or that are required to form a composite key. Particularly useful in interactive sessions when viewing the data tables at various stages of an analysis.
    select_records(dframe, type = NULL, ..., col_pattern = NULL)

| Argument | Description |
|---|---|
| dframe | Data frame of student records from which columns are selected. Expected choices are |
| type | Character identifying the record type. Possible values are "s", "t", "c", "d", or "a". See Details. |
| ... | Not used for passing values; forces subsequent arguments to be referable only by name. |
| col_pattern | Character vector containing strings or regular expressions to be matched or partially matched to the column names of |
Several midfieldr functions require input data frames containing
specific variables (column names) such as mcid or cip6. In addition,
the MIDFIELD data tables have specific variables that act as keys
or composite keys to the information in that table. The type argument
determines which columns are returned, if those columns exist in dframe:
type = "s" (student table) returns columns mcid, race, sex
type = "t" (term table) returns columns mcid, term, cip6, institution, level
type = "c" (course table) returns columns mcid, term_course, abbrev, number
type = "d" (degree table) returns columns mcid, term_degree, cip6
type = "a" (default) returns all the above
Additional column names can be included by using the col_pattern
argument. In all cases, unmatched search strings are silently ignored.
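The matching behavior described above (partial or regular-expression matches, unmatched strings ignored, input column order kept) can be sketched in base R. `select_cols()` and the `demo` data frame are hypothetical and written only for illustration; this is not the package's implementation of select_records().

```r
# Hypothetical helper: keep columns whose names match any pattern.
select_cols <- function(dframe, patterns) {
  # grep() returns integer(0) for a pattern with no match,
  # so unmatched patterns contribute nothing.
  keep <- unique(unlist(lapply(patterns, grep, x = names(dframe))))
  # Sorting the indices keeps the columns in their original order.
  dframe[, sort(keep), drop = FALSE]
}

# Made-up data with column names modeled on the student table.
demo <- data.frame(
  mcid = c("id1", "id2"),
  institution = c("Institution A", "Institution B"),
  transfer = c("First-Time in College", "First-Time Transfer"),
  hours_transfer = c(NA, 30),
  sex = c("Female", "Male")
)

out <- select_cols(demo, c("transfer", "^mcid$", "no_such_column"))
names(out)
```

Note that the pattern "transfer" matches both transfer and hours_transfer, so a partial string can pull in more than one column.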
A data frame of the same type as dframe. The output has the
following properties:
Rows are not modified.
Columns are a subset of the input, but appear in the same order.
Groups are not necessarily preserved.
Data frame attributes are preserved with the exception of grouped tibbles.
# Basic usage
select_records(toy_student[1:5])
select_records(toy_term[1:5])
select_records(toy_course[1:5])
select_records(toy_degree[1:5])

# Return columns by record type
select_records(toy_student[1:5], type = "s")
select_records(toy_term[1:5], "t")
select_records(toy_course[1:5], "c")
select_records(toy_degree[1:5], "d")

# With col_pattern for additional columns
DT <- toy_student[141:146]
select_records(DT, "t", col_pattern = c("transfer", "hours_transfer"))

# Using regular expressions
these_IDs <- DT$mcid
DT <- toy_term[mcid %chin% these_IDs]
select_records(DT, "t", col_pattern = c("^gpa"))
A strict version of sort() and unique() (without ...)
applied to vectors only.
sort_uniq(x, ..., na.rm = FALSE, decreasing = FALSE, na.last = FALSE)
x |
Vector of values to be sorted with any duplicate values removed. |
... |
Not used for passing values; forces subsequent arguments to be referable only by name. |
na.rm |
Logical. Indicates if missing values (including NaN)
should be removed. Passed to |
decreasing |
Logical. Should the sort be increasing or decreasing?
Passed to |
na.last |
Logical. Position of NA values. Passed to |
A vector of unique values, sorted.
# Character vector
x <- toy_student$race
sort_uniq(x)

# Numeric vector
x <- toy_term$hours_term_attempt
sort_uniq(x)
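For readers without the toy data at hand, the underlying idea can be sketched with base R alone. `sort_unique_sketch()` is a hypothetical stand-in that simply composes sort() and unique(); how the package's na.rm argument interacts with na.last is not reproduced here.

```r
# Hypothetical stand-in: unique values of an atomic vector, sorted.
sort_unique_sketch <- function(x, decreasing = FALSE, na.last = FALSE) {
  stopifnot(is.atomic(x))
  # na.last = FALSE places any NA first instead of dropping it.
  sort(unique(x), decreasing = decreasing, na.last = na.last)
}

sort_unique_sketch(c("b", "a", "b", NA))
sort_unique_sketch(c(3, 1, 3, 2), decreasing = TRUE)
```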
Data table of post-processed observations of students ever enrolled in, and students graduating from, the four programs of the case study. Keyed by student ID. Provided for the convenience of vignette users.
study_observations
data.table with 8919 rows and 5 columns.
The variables are:
mcid: Character, de-identified student ID. Key column.
race: Character, race/ethnicity as self-reported by the student, e.g., Asian, Black, Hispanic, etc.
sex: Character, sex as self-reported by the student, possible values are Female, Male, and Unknown.
program: Character, academic program label.
bloc: Character, indicating the grouping (ever_enrolled or graduates) to which an observation belongs.
Starting with the case-study pool of students ever enrolled in the four programs of the study (Civil, Electrical, Industrial/Systems, and Mechanical Engineering), we filtered the data for data sufficiency, degree seeking, program, and timely completion.
A data frame of "ever enrolled" and a data frame of "timely graduates" were
bound using shared column names and are distinguished in the bloc variable.
This data structure facilitates grouping and summarizing by race, sex,
program, and bloc.
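The binding step described above can be sketched in base R with made-up records. The data frames `ever` and `grad` and their IDs are hypothetical; only the column names and the bloc labels come from the text.

```r
# Made-up records: students ever enrolled, and the subset who graduated timely.
ever <- data.frame(mcid = c("id1", "id2", "id3"), program = c("CE", "CE", "EE"))
grad <- data.frame(mcid = c("id1", "id3"), program = c("CE", "EE"))

# Label each bloc, then bind rows on the shared column names.
ever$bloc <- "ever_enrolled"
grad$bloc <- "graduates"
obs <- rbind(ever, grad)

# One long table can now be counted by program and bloc.
counts <- table(obs$program, obs$bloc)
counts
```

Keeping both blocs in one long table means a single grouping operation yields the counts needed for a ratio such as stickiness.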
Other case-study-data:
baseline_mcid,
study_programs,
study_results
Data table of program CIP codes and labels of the four programs of the case study. Keyed by 6-digit CIPs. Provided for the convenience of vignette users.
study_programs
data.table with 15 rows and 2 columns. The variables are:
cip6: Character, 6-digit CIP code of program in which a student is enrolled in a term.
program: Character, abbreviated labels for four engineering programs. Values are "CE" (Civil Engineering), "EE" (Electrical Engineering), "ISE" (Industrial/Systems Engineering), and "ME" (Mechanical Engineering).
Starting with the midfieldr cip data set, we extracted the CIPs of the four
programs of the case study and assigned them a custom label to be used for
grouping and summarizing.
Other case-study-data:
baseline_mcid,
study_observations,
study_results
Data table of longitudinal stickiness for the four programs of the case study (Civil, Electrical, Industrial/Systems, and Mechanical Engineering) grouped by program, race/ethnicity, and sex. Provided for the convenience of vignette users.
study_results
data.table with 50 rows and 6 columns:
program: Character, academic program label.
sex: Character, sex as self-reported by the student, possible values are Female, Male, and Unknown.
race: Character, race/ethnicity as self-reported by the student, e.g., Asian, Black, Hispanic, etc.
ever_enrolled: Numerical, number of students ever enrolled in a program.
graduates: Numerical, number of students completing a program.
stickiness: Numerical, program stickiness, the ratio of graduates to ever_enrolled, in percent.
Longitudinal stickiness is the ratio of the number of students graduating from a program to the number of students ever enrolled in the program over the time span of available data. Results are based on data that have been filtered for data sufficiency, degree seeking, and timely completion.
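As a small worked example of the ratio defined above, the following base R sketch computes stickiness in percent from made-up counts. The `tally` data frame and its numbers are hypothetical and are not taken from study_results.

```r
# Made-up counts per program (not the case-study results).
tally <- data.frame(
  program = c("CE", "EE"),
  ever_enrolled = c(200, 150),
  graduates = c(90, 60)
)

# Stickiness: graduates divided by ever enrolled, expressed in percent.
tally$stickiness <- round(100 * tally$graduates / tally$ever_enrolled, 1)
tally
```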
Other case-study-data:
baseline_mcid,
study_observations,
study_programs
To a data frame keyed by student ID, add a column indicating the student's timely completion term. Columns of supporting information are also added. Unrelated columns are dropped.
timely_term(dframe, midfield_rec = term, ..., sched_span = NULL, span = NULL)
dframe |
Data frame or data frame extension (e.g., data.table or tibble). Required variable: |
midfield_rec |
MIDFIELD records term data frame or data frame extension. Required variables:
|
... |
Not used for passing values; forces subsequent arguments to be referable only by name. |
sched_span |
Integer scalar (default 4), the number of years an institution officially schedules for completing a program. |
span |
Integer scalar (default 6), number of years to define timely
completion, typically 4, 6, or 8 years (100%, 150%, 200% respectively
of |
In many studies, students must complete their programs in a specified time span to be considered "timely", for example 4, 6, or 8 years after admission. The latest term by which program completion would be considered timely is the timely completion term. By "completion" we mean an undergraduate earning their first baccalaureate degree (or degrees, for students earning more than one degree in the same term).
The timely completion term is required for determining data sufficiency
as well as timely completion status. The goal in either case is to refine
a population, that is, obtain a data frame of IDs that satisfy our
constraints. Thus timely_term() yields a column of timely term
values and columns of supporting information keyed by ID. All other columns
in dframe (if any) are dropped.
Our heuristic assigns span number of years (default 6) to every
student. For students admitted at second-year level or higher, the span is
reduced by one year for each full year the student is assumed to have
completed. For example, a student admitted at the second-year level is
assumed to have completed one year of a program, so their span is reduced by
one year. The adjusted span is added to their initial term to create the
timely_term values.
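The span adjustment in this heuristic can be sketched in base R. `timely_term_sketch()` is a hypothetical helper, and it makes two assumptions not confirmed by the text: the initial level is read from the two leading digits of level_i, and the adjusted span is added to the year part of the YYYYT code while the term digit is kept unchanged. The package's actual term arithmetic may differ.

```r
# Hypothetical helper illustrating the span adjustment only.
timely_term_sketch <- function(term_i, level_i, span = 6L) {
  # "01 Freshman" -> 1, "02 Sophomore" -> 2, and so on.
  level_num <- as.integer(substr(level_i, 1, 2))
  # One year off the span for each full year assumed completed.
  adj_span <- span - (level_num - 1L)
  # ASSUMPTION: add the adjusted span to the year, keep the term digit.
  year <- as.integer(substr(term_i, 1, 4))
  term_digit <- substr(term_i, 5, 5)
  data.frame(
    term_i = term_i,
    level_i = level_i,
    adj_span = adj_span,
    timely_term = paste0(year + adj_span, term_digit)
  )
}

out <- timely_term_sketch(
  term_i = c("19881", "19963"),
  level_i = c("01 Freshman", "02 Sophomore")
)
out
```

In this sketch the second student, admitted at the second-year level, receives an adjusted span of 5 years instead of 6.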
The supporting information in the output is provided so that the user
can review the findings. Moreover, data_sufficiency() and
completion_status() require one or both of the added columns
{term_i, timely_term}.
Data frame with the following properties:
Data frame class is preserved. Groups and keys are not preserved.
Rows are filtered for unique mcid values.
Column {mcid} is retained (all other columns are dropped). New
columns added:
term_i. Initial term of a student's longitudinal record,
encoded YYYYT. Extracted from midfield_rec.
level_i. Character. Student level (01 Freshman, 02 Sophomore,
etc.) in their initial term. Extracted from midfield_rec.
adj_span. Numeric. Integer span of years for timely completion
adjusted for a student's initial level.
timely_term. Character. Latest term by which program completion
would be considered timely for every student. Encoded YYYYT.
# Start with an excerpt from the student data set
dframe <- toy_student[1:10, .(mcid)]

# Add timely completion term column
timely_term(dframe, toy_term)

# Define timely completion as 200% of scheduled span (8 years)
timely_term(dframe, toy_term, span = 8)

# Existing timely_term column, if any, is overwritten
dframe[, timely_term := NA_character_][]
timely_term(dframe, toy_term)
Selected variables modeled on those in the course practice data for use in
package examples and articles. Sampled from an early version of the practice
data, the toy data are not a current practice data sample.
toy_course
data.table with 5812 rows and 12 columns keyed by student ID,
term, course abbreviation, and course number.
mcid: Character, de-identified student ID. Key column.
term: Character, academic year and term, format YYYYT. Key column.
abbrev: Character, course alphabetical identifier, e.g. ENGR, MATH, ENGL. Key column.
number: Character, course numeric identifier, e.g. 101, 3429. Key column.
institution: Character, de-identified institution name, e.g., Institution A, Institution B, etc.
course: Character, course name, e.g., Astrophysics III, Calculus For Social Science And Business, Corp Financial Rprtng 1, Environmental Sanitation II, Fitness and Wellness, Introductory Astronomy 2, Our Changing Environment, etc.
section: Character, course section identifier, from one to four characters, e.g., 1, 2, 01, 14, 001, 040, 785, H02, R01, 300E, 888R, etc.
type: Character, predominant delivery method for this section, e.g., Blended, Distance Education, Face-to-Face, Online, etc.
faculty_rank: Character, academic rank of the person teaching the course, e.g., Assistant Professor, Associate Professor, Graduate Assistant, Visiting Faculty, etc.
hours_course: Numeric, number of credit-hours for successful course completion.
grade: Character, course grade, e.g., A+, A, A-, B+, I, NG, etc.
discipline_midfield: Character, a variable for grouping courses by academic discipline assigned by the pre-2023 MIDFIELD data curator, e.g., Anthropology, Business, Computer Science, Engineering, Language and Literature, Mathematics, Visual and Performing Arts, etc.
Other toy-data:
toy_degree,
toy_student,
toy_term
Selected variables modeled on those in the degree practice data for use in
package examples and articles. Sampled from an early version of the practice
data, the toy data are not a current practice data sample.
toy_degree
data.table with 96 rows and 4 columns keyed by student ID,
term, and program (CIP code or degree label).
mcid: Character, de-identified student ID. Key column.
term_degree: Character, academic year and term in which a student completes their program, format YYYYT.
cip6: Character, 6-digit CIP code of program in which a student is enrolled in a term. Key column.
institution: Character, de-identified institution name, e.g., Institution A, Institution B, etc.
degree: Character, type of degree awarded, e.g., Bachelor of Arts in Geography, Bachelor of Science in Finance, etc.
Other toy-data:
toy_course,
toy_student,
toy_term
Selected variables modeled on those in the student practice data for use in
package examples and articles. Sampled from an early version of the practice
data, the toy data are not a current practice data sample.
toy_student
data.table with 150 rows and 13 columns keyed by student ID.
mcid: Character, de-identified student ID. Key column.
race: Character, race/ethnicity as self-reported by the student, e.g., Asian, Black, Hispanic, etc.
sex: Character, sex as self-reported by the student, possible values are Female, Male, and Unknown.
institution: Character, de-identified institution name, e.g., Institution A, Institution B, etc.
transfer: Character, transfer status, possible values are First-Time in College, First-Time Transfer.
hours_transfer: Numeric, number of credit hours transferred (or NA).
age_desc: Character, age group, possible values are 25 and Older, Under 25.
us_citizen: Character, US citizenship, possible values are No, Yes.
home_zip: Character, home ZIP code (or NA), e.g., 02056, 20170, 51301, 80129, etc.
high_school: Character, code for the last high school attended before admission (or NA), e.g., 060075, 210512, 431800, 502195, etc.
sat_math: Numeric, SAT mathematics test score (or NA).
sat_verbal: Numeric, SAT reading test score (or NA).
act_comp: Numeric, ACT composite test score (or NA).
Other toy-data:
toy_course,
toy_degree,
toy_term
Selected variables modeled on those in the term practice data for use in
package examples and articles. Sampled from an early version of the practice
data, the toy data are not a current practice data sample.
toy_term
data.table with 1095 rows and 13 columns keyed by student ID
and term.
mcid: Character, de-identified student ID. Key column.
term: Character, academic year and term, format YYYYT. Key column.
cip6: Character, 6-digit CIP code of program in which a student is enrolled in a term, e.g., 090101, 141201, 260901, 420101, etc.
institution: Character, de-identified institution name, e.g., Institution A, Institution B, etc.
level: Character, 01 Freshman, 02 Sophomore, etc. The equivalent values in the current practice data are 01 First-Year, 02 Second-Year, etc.
standing: Character, academic standing, e.g., Good Standing, Academic Warning, etc.
coop: Character, cooperative education term, possible values are Yes, No.
hours_term: Numeric, credit hours earned in the term.
hours_term_attempt: Numeric, credit hours attempted in the term.
hours_cumul: Numeric, cumulative credit hours earned.
hours_cumul_attempt: Numeric, cumulative credit hours attempted.
gpa_term: Numeric, term grade point average.
gpa_cumul: Numeric, cumulative grade point average.
Other toy-data:
toy_course,
toy_degree,
toy_student