Title: | Tools and Methods for Working with MIDFIELD Data in 'R' |
---|---|
Description: | Provides tools and demonstrates methods for working with individual undergraduate student-level records (registrar's data) in 'R'. Tools include filters for program codes, data sufficiency, and timely completion. Methods include gathering blocs of records, computing quantitative metrics such as graduation rate, and creating charts to visualize comparisons. 'midfieldr' interacts with practice data provided in 'midfielddata', an R data package available at <https://midfieldr.github.io/midfielddata/>. 'midfieldr' also interacts with the full MIDFIELD database for users who have access. This work is supported by the US National Science Foundation through grant numbers 1545667 and 2142087. |
Authors: | Richard Layton [cre, aut, cph], Russell Long [aut, cph, dtm], Matthew Ohland [aut, cph], Marisa Orr [aut, cph], Susan Lord [aut, cph] |
Maintainer: | Richard Layton <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.2.9003 |
Built: | 2024-11-05 02:41:37 UTC |
Source: | https://github.com/midfieldr/midfieldr |
Add columns to a data frame of student-level records that indicate whether a student completed a degree, and if so, whether their completion was timely.
add_completion_status(dframe, midfield_degree = degree)
dframe |
Working data frame of student-level records to which completion-status columns are to be added. Required variables are mcid and timely_term. |
midfield_degree |
MIDFIELD degree data table. Default is degree. |
By "completion" we mean an undergraduate earning their first baccalaureate degree (or degrees, for students earning more than one degree in the same term). Additional degrees, if any, earned later than the term of the first degree are ignored.
In many studies, students must complete a degree in a specified time span, for example 4, 6, or 8 years after admission. If they do, their completion is timely; if not, their completion is late and they are grouped with the non-completers when computing a metric such as graduation rate.
Completion status is "timely" for students completing a degree no later than their timely completion terms. See also add_timely_term().
A data frame in data.table format with the following properties: rows are preserved; columns are preserved with the exception that columns added by the function overwrite existing columns of the same name (if any); grouping structures are not preserved. The added columns are:
term_degree
Character. Term in which the first degree(s) are completed. Encoded YYYYT. Joined from the midfield_degree data table.
completion_status
Character. Label each observation to indicate completion status. Possible values are: "timely", indicating completion no later than the timely completion term; "late", indicating completion after the timely completion term; and "NA" indicating non-completion.
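The labeling rule described above can be sketched in base R. This is a minimal illustration on hypothetical toy records, not the package's implementation; the data frame `records` and its values are assumptions made for the example.

```r
# Hypothetical toy records; column names mirror the documented ones.
records <- data.frame(
  mcid        = c("S1", "S2", "S3"),
  timely_term = c("20061", "20061", "20061"),
  term_degree = c("20053", "20071", NA),
  stringsAsFactors = FALSE
)

# Terms encoded YYYYT have equal width, so they can be compared as text.
# No degree term -> NA; degree term at or before timely term -> "timely".
records$completion_status <- ifelse(
  is.na(records$term_degree),
  NA_character_,
  ifelse(records$term_degree <= records$timely_term, "timely", "late")
)

records
```

When computing a graduation rate, rows labeled "late" would be counted with the non-completers, as described in the Details.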
Other add_*: add_data_sufficiency(), add_timely_term()
# Start with an excerpt from the student data set
dframe <- toy_student[1:10, .(mcid)]

# Timely term column is required to add completion status column
dframe <- add_timely_term(dframe, toy_term)

# Add completion status column
add_completion_status(dframe, toy_degree)

# Existing completion_status column, if any, is overwritten
dframe[, completion_status := NA_character_]
add_completion_status(dframe, toy_degree)
Add a column to a data frame of student-level records that labels each row for inclusion or exclusion based on data sufficiency near the upper and lower bounds of an institution's data range.
add_data_sufficiency(dframe, midfield_term = term)
dframe |
Working data frame of student-level records to which data-sufficiency columns are to be added. Required variables are mcid and timely_term. |
midfield_term |
MIDFIELD term data table. Default is term. |
The time span of MIDFIELD term data varies by institution, each having their own lower and upper bounds. For some student records, being at or near these bounds creates unavoidable ambiguity when trying to assess degree completion. Such records must be identified and in most cases excluded to prevent false summary counts.
A data frame in data.table format with the following properties: rows are preserved; columns are preserved with the exception that columns added by the function overwrite existing columns of the same name (if any); grouping structures are not preserved. The added columns are:
term_i
Character. Initial term of a student's longitudinal record, encoded YYYYT. Not overwritten if present in dframe.
lower_limit
Character. Initial term of an institution's data range, encoded YYYYT.
upper_limit
Character. Final term of an institution's data range, encoded YYYYT.
data_sufficiency
Character. Label each observation for inclusion or exclusion based on data sufficiency. Possible values are: include, indicating that available data are sufficient for estimating timely completion; exclude-upper, indicating that data are insufficient at the upper limit of a data range; and exclude-lower, indicating that data are insufficient at the lower limit.
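One plausible reading of these labels can be sketched in base R. The exact rule is an assumption here, not taken from the package source: a record is treated as insufficient at the lower limit when the student's initial term coincides with the institution's first term, and insufficient at the upper limit when the timely completion term falls after the institution's final term. The toy values are hypothetical.

```r
# Hypothetical toy records for one institution.
records <- data.frame(
  mcid        = c("S1", "S2", "S3"),
  term_i      = c("19881", "19901", "20101"),
  timely_term = c("19933", "19953", "20153"),
  lower_limit = "19881",
  upper_limit = "20123",
  stringsAsFactors = FALSE
)

# Assumed rule (see lead-in): check the lower bound first, then the upper.
records$data_sufficiency <- ifelse(
  records$term_i <= records$lower_limit,
  "exclude-lower",
  ifelse(records$timely_term > records$upper_limit, "exclude-upper", "include")
)

records[, c("mcid", "data_sufficiency")]
```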
Other add_*: add_completion_status(), add_timely_term()
# Start with an excerpt from the student data set
dframe <- toy_student[1:10, .(mcid)]

# Timely term column is required to add data sufficiency column
dframe <- add_timely_term(dframe, midfield_term = toy_term)

# Add data sufficiency column
add_data_sufficiency(dframe, midfield_term = toy_term)

# Existing data_sufficiency column, if any, is overwritten
dframe[, data_sufficiency := NA_character_]
add_data_sufficiency(dframe, midfield_term = toy_term)
Add a column to a data frame of student-level records that indicates the latest term by which degree completion would be considered timely for every student.
add_timely_term(dframe, midfield_term = term, ..., span = NULL, sched_span = NULL)
dframe |
Working data frame of student-level records to which timely-term columns are to be added. Required variable is mcid. |
midfield_term |
MIDFIELD term data table. Default is term. |
... |
Not used for passing values; forces subsequent arguments to be referable only by name. |
span |
Optional integer scalar, number of years to define timely completion. Commonly used values are 100, 150, and 200 percent of sched_span. Default 6 years. |
sched_span |
Optional integer scalar, the number of years an institution officially schedules for completing a program. Default 4 years. |
By "completion" we mean an undergraduate earning their first baccalaureate degree (or degrees, for students earning more than one degree in the same term).
In many studies, students must complete their programs in a specified time span, for example 4, 6, or 8 years after admission. If they do, their completion is timely; if not, their completion is late and they are grouped with the non-completers when computing a metric such as graduation rate.
Our heuristic assigns span number of years (default is 6 years) to every student. For students admitted at second-year level or higher, the span is reduced by one year for each full year the student is assumed to have completed. For example, a student admitted at the second-year level is assumed to have completed one year of a program, so their span is reduced by one year.
The adjusted span is added to the initial term to create the timely completion term in the timely_term column.
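The heuristic can be sketched in base R as follows. This is a simplified illustration on hypothetical records, not the package's implementation. In particular, it assumes the adjusted span is added to the YYYY part of the term while the T digit is kept unchanged; the package may treat the term digit differently.

```r
# Hypothetical toy records; level_i uses the "01 Freshman" style encoding.
records <- data.frame(
  mcid    = c("S1", "S2", "S3"),
  term_i  = c("20001", "20001", "20013"),
  level_i = c("01 Freshman", "02 Sophomore", "03 Junior"),
  stringsAsFactors = FALSE
)

span <- 6L  # default span in years

# Reduce the span by one year for each full year assumed completed.
level_num <- as.integer(substr(records$level_i, 1, 2))
records$adj_span <- span - (level_num - 1L)

# Assumption: add the adjusted span to the year, keep the term digit as-is.
year_i     <- as.integer(substr(records$term_i, 1, 4))
term_digit <- substr(records$term_i, 5, 5)
records$timely_term <- paste0(year_i + records$adj_span, term_digit)

records
```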
A data frame in data.table format with the following properties: rows are preserved; columns are preserved with the exception that columns added by the function overwrite existing columns of the same name (if any); grouping structures are not preserved. The added columns are:
term_i
Character. Initial term of a student's longitudinal record, encoded YYYYT
level_i
Character. Student level (01 Freshman, 02 Sophomore, etc.) in their initial term
adj_span
Numeric. Integer span of years for timely completion adjusted for a student's initial level.
timely_term
Character. Latest term by which program completion would be considered timely for every student. Encoded YYYYT.
Other add_*: add_completion_status(), add_data_sufficiency()
# Start with an excerpt from the student data set
dframe <- toy_student[1:10, .(mcid)]

# Add timely completion term column
add_timely_term(dframe, midfield_term = toy_term)

# Define timely completion as 200% of scheduled span (8 years)
add_timely_term(dframe, midfield_term = toy_term, span = 8)

# Existing timely_term column, if any, is overwritten
dframe[, timely_term := NA_character_]
add_timely_term(dframe, midfield_term = toy_term)
Data frame of IDs after processing the practice data for data sufficiency and degree seeking. Provides a convenient bloc to start many of the analyses illustrated in the package articles.
baseline_mcid
data.table with 76875 rows and 1 column:
mcid
Character, de-identified student ID. Key column.
Other case-study-data: study_observations, study_programs, study_results
A data table based on the US National Center for Education Statistics (NCES), Integrated Postsecondary Education Data System (IPEDS), 2010 CIP, http://nces.ed.gov/ipeds/cipcode/. The data are codes and names for 1582 instructional programs organized on three levels: a 2-digit series, a 4-digit series, and a 6-digit series.
cip
data.table with 1582 rows and 6 columns keyed by the 6-digit CIP code:
cip6
Character 6-digit code representing "specific instructional programs" (US National Center for Education Statistics).
cip6name
Character program name at the 6-digit level.
cip4
Character 4-digit code (the first 4 digits of cip6) representing "intermediate groupings of programs that have comparable content and objectives."
cip4name
Character program name at the 4-digit level.
cip2
Character 2-digit code (the first 2 digits of cip6) representing "the most general groupings of related programs."
cip2name
Character program name at the 2-digit level.
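Because the 4-digit and 2-digit codes are prefixes of the 6-digit code, the three levels can be derived from cip6 alone. A small base R sketch, using a few 6-digit codes chosen only for illustration:

```r
# Illustrative 6-digit codes stored as character to keep leading zeros.
cip6 <- c("140801", "141001", "141901")

levels_table <- data.frame(
  cip6 = cip6,
  cip4 = substr(cip6, 1, 4),  # first 4 digits
  cip2 = substr(cip6, 1, 2),  # first 2 digits
  stringsAsFactors = FALSE
)

levels_table
```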
The midfielddata taxonomy includes one non-IPEDS code (999999) for Undecided or Unspecified, instances in which institutions reported no program information or that students were not enrolled in a program.
The MIDFIELD research database includes CIPs for undergraduate pre-majors such as pre-med (511102), pre-law (220001), and pre-vet (511104).
Other cip-data: fye_proxy
Subset a CIP data frame, retaining rows that match or partially match a vector of character strings. Columns are not subset unless selected in an optional argument.
filter_cip(keep_text = NULL, ..., drop_text = NULL, cip = NULL, select = NULL)
keep_text |
Character vector of search text for retaining rows, not case-sensitive. Can be empty if drop_text is used. |
... |
Not used for passing values; forces subsequent arguments to be referable only by name. |
drop_text |
Optional character vector of search text for dropping rows, default NULL. |
cip |
Data frame to be searched. Default is the built-in cip data set. |
select |
Optional character vector of column names to return, default all columns. |
Search terms can include regular expressions. Uses grepl(), therefore non-character columns (if any) that can be coerced to character are also searched for matches. Columns are subset by the values in select after the search concludes.
If none of the optional arguments are specified, the function returns the original data frame.
A data frame in data.table format, a subset of cip, with the following properties: exclude rows that match elements of drop_text; of the remaining rows, include those that match elements of keep_text; if select is empty, all columns are preserved, otherwise only columns included in select are retained; grouping structures are not preserved.
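The drop-then-keep order described above can be sketched in base R with grepl(). This is an illustration of the idea on a small hypothetical table, not the package's implementation; the helper `match_any()` is a name invented for this example.

```r
# Small illustrative table with the same column style as cip.
programs <- data.frame(
  cip6 = c("140801", "150201", "141001", "540101"),
  cip6name = c(
    "Civil Engineering, General",
    "Civil Engineering Technology/Technician",
    "Electrical and Electronics Engineering",
    "History, General"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for rows where any column matches any pattern.
match_any <- function(dframe, patterns) {
  hits <- rep(FALSE, nrow(dframe))
  for (pattern in patterns) {
    for (column in names(dframe)) {
      values <- as.character(dframe[[column]])
      hits <- hits | grepl(pattern, values, ignore.case = TRUE)
    }
  }
  hits
}

# First drop rows matching the drop text, then keep rows matching the keep text.
kept <- programs[!match_any(programs, "technology"), ]
kept <- kept[match_any(kept, "civil"), ]

kept
```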
# Subset using keywords
filter_cip(keep_text = "engineering")

# Multiple passes to narrow the results
first_pass <- filter_cip("civil")
second_pass <- filter_cip("engineering", cip = first_pass)
filter_cip(drop_text = "technology", cip = second_pass)

# drop_text argument, when used, must be named
filter_cip("civil engineering", drop_text = "technology")

# Subset using numerical codes
filter_cip(keep_text = c("050125", "160501"))

# Subset using regular expressions
filter_cip(keep_text = "^54")
filter_cip(keep_text = c("^1407", "^1408"))

# Select columns
filter_cip(keep_text = "^54", select = c("cip6", "cip4name"))
Proxies are the degree-granting engineering programs we estimate that First-Year Engineering (FYE) students would have declared had they not been required to enroll in FYE. Keyed by student ID. Proxies are provided for all students in the midfielddata practice data who enroll in FYE in their first term.
fye_proxy
data.table with 4623 rows and 2 columns keyed by student ID:
mcid
Character, de-identified student ID. Key column.
proxy
Character, 6-digit CIP code of the estimated proxy program.
The proxy variable contains 6-digit CIP codes of degree-granting engineering programs, e.g., Electrical Engineering, Mechanical Engineering, etc., that are substituted for the FYE CIP code when an analysis requires degree-granting starting programs. The most common application is a graduation rate calculation.
The estimation is based on students' first post-FYE programs and a multiple imputation suitable for categorical variables using the mice package. The predictor variables are institution, race, and sex. The estimated variable is the 6-digit CIP code of a degree-granting engineering program at their institution.
fye_proxy holds only for the practice data in midfielddata; these values cannot be commingled with the MIDFIELD research database.
Other cip-data: cip
Data frame of letter grades and conventional point assignments used for computing grade point averages.
grade_scale
data.table with 12 rows and 2 columns:
letter_grade
Character, letter grades using the conventional US scale from A to F.
points
Numerical, 4.0 scale of points assigned to letter grades.
Other scales: sat_act_scale
Transform a data frame such that two independent categorical variables are factors with levels ordered for display in a multiway dot plot. Multiway data comprise a single quantitative value (or response) for every combination of levels of two categorical variables. The ordering of the rows and panels is crucial to the perception of effects (Cleveland, 1993).
order_multiway(dframe, quantity, categories, ..., method = NULL, ratio_of = NULL)
dframe |
Data frame containing a single quantitative value (or response) for every combination of levels of two categorical variables. Categories may be class character or factor. Two additional numeric columns are required when using the "percent" ordering method. |
quantity |
Character, name (in quotes) of the single multiway quantitative variable. |
categories |
Character, vector of names (in quotes) of the two multiway categorical variables. |
... |
Not used for passing values; forces subsequent arguments to be referable only by name. |
method |
Character, "median" (default) or "percent", method of ordering the levels of the categories. The median method computes the medians of the quantitative column grouped by category. The percent method computes percentages based on the same ratio underlying the quantitative percentage variable except grouped by category. |
ratio_of |
Character vector with the names (in quotes) of the numerator and denominator columns that produced the quantitative variable, required when method is "percent". |
In our context, "multiway" refers to the data structure and graph design defined by Cleveland (1993), not to the methods of analysis described by Kroonenberg (2008).
Multiway data comprise three variables: a categorical variable of m levels; a second independent categorical variable of n levels; and a quantitative variable (or response) of length mn that cross-classifies the categories, that is, there is a value of the response for each combination of levels of the two categorical variables.
In a multiway dot plot, one category is encoded by the panels, the second category is encoded by the rows of each panel, and the quantitative variable is encoded along identical horizontal scales.
A data frame in data.table format with the following properties: rows are preserved; columns specified by categories are converted to factors and ordered; the column specified by quantity is converted to type double; other columns are preserved with the exception that columns added by the function overwrite existing columns of the same name (if any); grouping structures are not preserved. The added columns are:
CATEGORY_median columns (when ordering method is "median")
Numeric. Two columns of medians of the quantitative variable grouped by the categorical variables. The CATEGORY placeholder in the column name is replaced by a category name from the categories argument. For example, suppose categories = c("program", "people") and method = "median". The two new column names would be program_median and people_median.
CATEGORY_QUANTITY columns (when ordering method is "percent")
Numeric. Two columns of percentages based on the same ratio that produces the quantitative variable except grouped by the categorical variables. The CATEGORY placeholder in the column name is replaced by a category name from the categories argument; the QUANTITY placeholder is replaced by the quantitative variable name in the quantity argument. For example, suppose categories = c("program", "people"), quantity = "grad_rate", and method = "percent". The two new column names would be program_grad_rate and people_grad_rate.
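The median ordering method can be illustrated in base R. This sketch uses hypothetical values and assumes the factor levels are arranged in ascending order of the category medians; it is not the package's implementation.

```r
# Hypothetical multiway data: one response per program-people combination.
results <- data.frame(
  program    = rep(c("EE", "ME", "CE"), each = 2),
  people     = rep(c("Group A", "Group B"), times = 3),
  stickiness = c(40, 50, 60, 70, 20, 30),
  stringsAsFactors = FALSE
)

# Medians of the response grouped by each category.
program_median <- tapply(results$stickiness, results$program, median)
people_median  <- tapply(results$stickiness, results$people, median)

# Convert categories to factors with levels ordered by their medians.
results$program <- factor(
  results$program,
  levels = names(program_median)[order(program_median)]
)
results$people <- factor(
  results$people,
  levels = names(people_median)[order(people_median)]
)

levels(results$program)
levels(results$people)
```

With ordered factor levels, a plotting function that respects level order would place rows and panels in median order.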
Cleveland WS (1993). Visualizing Data. Hobart Press, Summit, NJ.
Kroonenberg PM (2008). Applied Multiway Data Analysis. Wiley, Hoboken, NJ.
# Subset of built-in data set
dframe <- study_results[program == "EE" | program == "ME"]
dframe[, people := paste(race, sex)]
dframe[, c("race", "sex") := NULL]
data.table::setcolorder(dframe, c("program", "people"))

# Class before ordering
class(dframe$program)
class(dframe$people)

# Class and levels after ordering
mw1 <- order_multiway(dframe,
                      quantity = "stickiness",
                      categories = c("program", "people"))
class(mw1$program)
levels(mw1$program)
class(mw1$people)
levels(mw1$people)

# Display category medians
mw1

# Existing factors (if any) are re-ordered
mw2 <- dframe
mw2$program <- factor(mw2$program, levels = c("ME", "EE"))

# Levels before conditioning
levels(mw2$program)

# Levels after conditioning (pass mw2 so the existing factor is the input)
mw2 <- order_multiway(mw2,
                      quantity = "stickiness",
                      categories = c("program", "people"))
levels(mw2$program)

# Ordering using percent method
order_multiway(dframe,
               quantity = "stickiness",
               categories = c("program", "people"),
               method = "percent",
               ratio_of = c("graduates", "ever_enrolled"))
Constructs a data frame of student-level records of First-Year Engineering (FYE) programs and conditions the data for later use as an input to the mice R package for multiple imputation. Sets up three variables as predictors (institution, race/ethnicity, and sex) and one variable to be estimated (program CIP code).
prep_fye_mice(midfield_student = student, midfield_term = term, ..., fye_codes = NULL)
midfield_student |
MIDFIELD student data table. Default is student. |
midfield_term |
MIDFIELD term data table. Default is term. |
... |
Not used for passing values; forces subsequent arguments to be referable only by name. |
fye_codes |
Optional character vector of 6-digit CIP codes to identify FYE programs, default "140102". Codes must be 6-digit strings of numbers; regular expressions are prohibited. Non-engineering codes—those that do not start with 14—produce an error. |
At some US institutions, engineering students are required to complete a First-Year Engineering (FYE) program as a prerequisite for declaring an engineering major. Administratively, degree-granting engineering programs such as Electrical Engineering or Mechanical Engineering treat their incoming post-FYE students as their "starting" cohorts. However, when computing a metric that requires a count of starters—graduation rate, for example—FYE records must be treated with special care to avoid a miscount.
To illustrate the potential for miscounting starters, suppose we wish to calculate a Mechanical Engineering (ME) graduation rate. Students starting in ME constitute the starting pool and the fraction of that pool graduating in ME is the graduation rate. At FYE institutions, an ME program would typically define their starting pool as the post-FYE cohort entering their program. This may be the best information available, but it invariably undercounts starters by failing to account for FYE students who do not transition (post-FYE) to degree-granting engineering programs—students who may have left the institution or switched to non-engineering majors. In either case, in the absence of the FYE requirement, some of these students would have been ME starters. By neglecting these students, the count of ME starters is artificially low resulting in an ME graduation rate that is artificially high. The same is true for every degree-granting engineering discipline in an FYE institution.
Therefore, to avoid miscounting starters at FYE institutions, we have to estimate an "FYE proxy", that is, the 6-digit CIP codes of the degree-granting engineering programs that FYE students would have declared had they not been required to enroll in FYE. The purpose of prep_fye_mice() is to prepare the data for making that estimation.
After running prep_fye_mice() but before running mice(), one can edit variables or add variables to create a custom set of predictors. The mice package expects all predictors and the proxy variables to be factors. Do not delete the institution variable because it ensures that a student's imputed program is available at their institution.
In addition, ensure that the only missing values are in the proxy column. Other variables are expected to be complete (no NA values). A value of "unknown" in a predictor column, e.g., race/ethnicity or sex, is an acceptable value, not missing data. Observations with missing or unknown values in the ID or institution columns (if any) should be removed.
A data frame in data.table format conditioned for later use as an input to the mice R package for multiple imputation. The data frame comprises one row for every FYE student, first-term and migrator. Grouping structures are not preserved. The columns returned are:
mcid
Character, anonymized student identifier. Returned as-is.
race
Factor, race/ethnicity as self-reported by the student. An imputation predictor variable.
sex
Factor, sex as self-reported by the student. An imputation predictor variable.
institution
Factor, anonymized institution name. An imputation predictor variable.
proxy
Factor, 6-digit CIP code of a student's known, post-FYE engineering program or NA representing missing values to be imputed.
The function extracts all terms for all FYE students, including those who migrate to enter Engineering after their first term, and identifies the first post-FYE program in which they enroll, if any. This treatment yields two possible outcomes for values returned in the proxy column:
The student completes FYE and enrolls in an engineering major. For this outcome, we know that at the student's first opportunity, they enrolled in an engineering program of their choosing. The CIP code of that program is returned as the student's FYE proxy.
The student does not enroll post-FYE in an engineering major. Such students have no further records in the database or switched from Engineering to another program. For this outcome, the data provide no information regarding what engineering program the student would have declared originally had the institution not required them to enroll in FYE. For these students a proxy value of NA is returned. These are the data treated as missing values to be imputed by mice().
In cases where students enter FYE, change programs, and re-enter FYE, only the first group of FYE terms is considered. Any programs before FYE are ignored.
The resulting data frame is ready for use as input for the mice package, with all variables except mcid returned as factors.
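The two outcomes can be sketched in base R on hypothetical term records. This is a simplified illustration of the idea, not the package's implementation; the helper `first_post_fye_proxy()` and the toy data are assumptions made for the example.

```r
# Hypothetical term records: one row per student per term.
terms <- data.frame(
  mcid = c("S1", "S1", "S1", "S2", "S2", "S3"),
  term = c("20001", "20003", "20011", "20001", "20003", "20001"),
  cip6 = c("140102", "140102", "141901", "140102", "520201", "140102"),
  stringsAsFactors = FALSE
)
fye_codes <- "140102"

# Hypothetical helper: first program after the first FYE term, returned only
# if it is an engineering code (starts with "14"), otherwise NA.
first_post_fye_proxy <- function(student_terms) {
  student_terms <- student_terms[order(student_terms$term), ]
  is_fye <- student_terms$cip6 %in% fye_codes
  if (!any(is_fye)) {
    return(NA_character_)
  }
  after_fye <- seq_len(nrow(student_terms)) > which(is_fye)[1]
  candidates <- student_terms$cip6[after_fye & !is_fye]
  if (length(candidates) == 0) {
    return(NA_character_)
  }
  if (startsWith(candidates[1], "14")) candidates[1] else NA_character_
}

proxy <- vapply(split(terms, terms$mcid), first_post_fye_proxy, character(1))
proxy
```

In this toy example, S1 moves from FYE to an engineering program and receives that code; S2 switches to a non-engineering program and S3 has no further records, so both receive NA.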
# Using toy data
prep_fye_mice(toy_student, toy_term)

# Other columns, if any, are dropped
colnames(toy_student)
colnames(prep_fye_mice(toy_student, toy_term))

# Optional argument permits multiple CIP codes for FYE
prep_fye_mice(midfield_student = toy_student,
              midfield_term = toy_term,
              fye_codes = c("140101", "140102"))
Data frame of SAT total scores and corresponding ACT composite scores. Converting from SAT to ACT, a range of SAT scores convert to a single ACT value. Converting from ACT to SAT, a single-value SAT equivalent is provided.
sat_act_scale
data.table with 28 rows and 4 columns:
sat_upper
Numerical, total SAT, upper limit of range corresponding to the ACT composite score.
sat_equiv
Numerical, total SAT, value to use when converting ACT score to a single SAT score.
sat_lower
Numerical, total SAT, lower limit of range corresponding to the ACT composite score.
act_comp
Numerical, ACT composite score.
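The two conversion directions can be sketched in base R. The three rows below are illustrative values entered by hand for this example, not copied from the data set, and the helper names `sat_to_act()` and `act_to_sat()` are hypothetical.

```r
# Illustrative rows with the documented column names (values are assumptions).
scale_rows <- data.frame(
  act_comp  = c(36, 35, 34),
  sat_lower = c(1570, 1530, 1490),
  sat_equiv = c(1590, 1540, 1500),
  sat_upper = c(1600, 1560, 1520)
)

# SAT to ACT: find the row whose SAT range contains the score.
sat_to_act <- function(sat_total) {
  row <- which(sat_total >= scale_rows$sat_lower &
                 sat_total <= scale_rows$sat_upper)
  if (length(row) == 1) scale_rows$act_comp[row] else NA_real_
}

# ACT to SAT: look up the single-value SAT equivalent.
act_to_sat <- function(act_composite) {
  scale_rows$sat_equiv[match(act_composite, scale_rows$act_comp)]
}

sat_to_act(1550)
act_to_sat(34)
```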
ACT/SAT Concordance (2018) ACT Education Corp. https://www.act.org/content/dam/act/unsecured/documents/ACT-SAT-Concordance-Tables.pdf
Other scales: grade_scale
Subset a data frame, selecting columns by matching or partially matching a vector of character strings. A convenience function to reduce the dimensions of a MIDFIELD data table at the start of a session by selecting only those columns required by other midfieldr functions or that are required to form a composite key. Particularly useful in interactive sessions when viewing the data tables at various stages of an analysis.
select_required(midfield_x, ..., select_add = NULL)
midfield_x |
Data frame from which columns are selected. |
... |
Not used for passing values; forces subsequent arguments to be referable only by name. |
select_add |
Character vector of additional column names to return. |
Several midfieldr functions require one or more of the variables mcid, institution, race, sex, ^term, cip6, and level. And if one requires a composite key to uniquely identify rows, course variables abbrev, number and degree variable degree are also required. A vector of these names comprises the default subset.
Additional column names or partial names can be included by using the select_add argument.
The column names of midfield_x are searched for matches or partial matches using grep(), thus search terms can include regular expressions. Variables with names that match or partially match the search terms are returned; all other columns are dropped. Rows are unaffected. Search terms not present are silently ignored.
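The name matching can be sketched in base R with grep(). The data frame below is hypothetical and its column names are assumptions chosen to show which names match; this is not the package's implementation.

```r
# Hypothetical one-row student table.
student <- data.frame(
  mcid = "S1",
  institution = "Institution A",
  transfer = "First-Time in College",
  term_enter = "20001",
  hours_transfer = NA,
  race = "Unknown",
  sex = "Unknown",
  stringsAsFactors = FALSE
)

# Search terms joined into one regular expression separated by OR.
search_terms <- c("mcid", "institution", "race", "sex", "^term", "cip6", "level")
pattern <- paste(search_terms, collapse = "|")

# Keep columns whose names match or partially match; drop the rest.
keep <- grep(pattern, names(student), value = TRUE)
selected <- student[, keep, drop = FALSE]

names(selected)
```

Here term_enter is kept because it starts with "term", while transfer and hours_transfer match no search term and are dropped. The absent terms cip6 and level are simply ignored.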
One could use this function to select columns from a non-MIDFIELD data frame, but with no benefit to the user—conventional column selection syntax is better suited to that task. Here, we specialize the column selection to serve midfieldr functions.
A data frame in data.table format with the following properties: rows are preserved; columns with names that match or partially match values in the default set plus any in select_add are retained; grouping structures are not preserved.
# Default character vector for selecting columns
default_cols <- c("mcid", "institution", "race", "sex", "^term", "cip6", "level")

# Create one string separated by OR
search_pattern <- paste(default_cols, collapse = "|")

# Find names of columns matching or partially matching
x <- select_required(toy_student)
names(x)
grepl(search_pattern, names(x))

x <- select_required(toy_term)
names(x)
grepl(search_pattern, names(x))

x <- select_required(toy_degree)
names(x)
grepl(search_pattern, names(x))

x <- select_required(toy_course)
names(x)
grepl(search_pattern, names(x))

# Adding search terms
x <- select_required(toy_course, select_add = c("abbrev", "number", "grade"))
names(x)
Data table of post-processed observations of students ever enrolled in, and students graduating from, the four programs of the case study. Keyed by student ID. Provided for the convenience of vignette users.
study_observations
data.table with 8917 rows and 5 columns.
The variables are:
mcid
Character, de-identified student ID. Key column.
race
Character, race/ethnicity as self-reported by the student, e.g., Asian, Black, Latine, etc.
sex
Character, sex as self-reported by the student, possible values are Female, Male, and Unknown.
program
Character, academic program label.
bloc
Character, indicating the grouping (ever_enrolled or graduates) to which an observation belongs.
Starting from the case-study pool of students ever enrolled in the four programs of the study (Civil, Electrical, Industrial/Systems, and Mechanical Engineering), we filtered the data for data sufficiency, degree seeking, program, and timely completion.
A data frame of "ever enrolled" and a data frame of "timely graduates" were bound using shared column names and are distinguished in the bloc variable.
This data structure facilitates grouping and summarizing by race, sex, program, and bloc.
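As a minimal sketch of the grouping and summarizing this structure supports, the code below counts observations by program, race, sex, and bloc. The obs table is a hypothetical stand-in with the same column names as study_observations; its values are invented and are not drawn from the case study.

```r
library(data.table)

# Hypothetical stand-in with the same columns as study_observations
obs <- data.table(
  mcid    = c("S01", "S02", "S03", "S04", "S01", "S03"),
  race    = c("Asian", "Black", "Latine", "Asian", "Asian", "Latine"),
  sex     = c("Female", "Male", "Female", "Male", "Female", "Female"),
  program = c("CE", "CE", "ME", "ME", "CE", "ME"),
  bloc    = c("ever_enrolled", "ever_enrolled", "ever_enrolled",
              "ever_enrolled", "graduates", "graduates")
)

# Count observations in each combination of the grouping variables
counts <- obs[, .N, by = c("program", "race", "sex", "bloc")]

# Reshape to one row per group with one count column per bloc;
# groups with no graduates receive a count of zero
wide <- dcast(counts, program + race + sex ~ bloc,
              value.var = "N", fill = 0L)
print(wide)
```

Keeping both blocs in one long table means a single grouping operation yields the counts for each bloc.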
Other case-study-data: baseline_mcid, study_programs, study_results
Data table of program CIP codes and labels of the four programs of the case study. Keyed by 6-digit CIPs. Provided for the convenience of vignette users.
study_programs
data.table with 15 rows and 2 columns.
The variables are:
cip6
Character, 6-digit CIP code of program in which a student is enrolled in a term.
program
Character, abbreviated labels for four engineering programs. Values are "CE" (Civil Engineering), "EE" (Electrical Engineering), "ISE" (Industrial/Systems Engineering), and "ME" (Mechanical Engineering).
Starting with the midfieldr cip data set, we extracted the CIPs of the four programs of the case study and assigned each a custom label to be used for grouping and summarizing.
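One way such a lookup table might be used is to attach program labels to term records by joining on cip6. The sketch below assumes two hypothetical tables, programs and terms, shaped like study_programs and a subset of toy_term; the rows are illustrative and are not taken from the package data.

```r
library(data.table)

# Hypothetical lookup table shaped like study_programs
programs <- data.table(
  cip6    = c("140801", "141001", "143501", "141901"),
  program = c("CE", "EE", "ISE", "ME")
)

# Hypothetical term records with a subset of the toy_term columns
terms <- data.table(
  mcid = c("S01", "S01", "S02", "S03"),
  term = c("19951", "19953", "19951", "19961"),
  cip6 = c("140801", "141901", "260101", "143501")
)

# Inner join: keeps only records whose cip6 appears in the lookup table,
# which both filters by program and adds the label in one step
labeled <- merge(terms, programs, by = "cip6")
print(labeled)
```

The record with cip6 "260101" is dropped because that code is not one of the programs in the lookup table.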
Other case-study-data: baseline_mcid, study_observations, study_results
Data table of longitudinal stickiness for the four programs of the case study (Civil, Electrical, Industrial/Systems, and Mechanical Engineering) grouped by program, race/ethnicity, and sex. Provided for the convenience of vignette users.
study_results
data.table with 50 rows and 6 columns:
program
Character, academic program label.
sex
Character, sex as self-reported by the student, possible values are Female, Male, and Unknown.
race
Character, race/ethnicity as self-reported by the student, e.g., Asian, Black, Latine, etc.
ever_enrolled
Numerical, number of students ever enrolled in a program.
graduates
Numerical, number of students completing a program.
stickiness
Numerical, program stickiness, the ratio of graduates to ever_enrolled, in percent.
Longitudinal stickiness is the ratio of the number of students graduating from a program to the number of students ever enrolled in the program over the time span of available data. Results are based on data that have been filtered for data sufficiency, degree seeking, and timely completion.
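The ratio described above can be sketched in a few lines. The counts in res are invented for illustration, and rounding to one decimal place is an assumption of this sketch rather than something stated in the data description.

```r
library(data.table)

# Hypothetical counts; not values from study_results
res <- data.table(
  program       = c("CE", "EE"),
  ever_enrolled = c(200, 150),
  graduates     = c(90, 60)
)

# Stickiness in percent: graduates divided by ever enrolled, times 100
# (rounding to one decimal place is an assumption made here)
res[, stickiness := round(100 * graduates / ever_enrolled, 1)]
print(res)
```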
Other case-study-data: baseline_mcid, study_observations, study_programs
Selected variables modeled on those in the course practice data for use in package examples and articles. Sampled from an early version of the practice data, the toy data are not a current practice data sample.
toy_course
data.table with 4616 rows and 6 columns keyed by student ID:
mcid
Character, de-identified student ID. Key column.
institution
Character, de-identified institution name, e.g., Institution A, Institution B, etc.
term
Character, academic year and term, format YYYYT. Key column.
abbrev
Character, course alphabetical identifier, e.g., ENGR, MATH, ENGL. Key column.
number
Character, course numeric identifier, e.g., 101, 3429. Key column.
grade
Character, course grade, e.g., A+, A, A-, B+, I, NG, etc.
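Because term is stored as a character string in YYYYT format, the year and the trailing term digit can be separated with substr(). The sketch below uses invented term values; the meaning of each term digit is not stated here, so the code leaves it uninterpreted.

```r
# Hypothetical term values in YYYYT format
term <- c("19881", "19883", "20014")

# First four characters are the academic year
year <- as.integer(substr(term, 1, 4))

# Fifth character is the term digit, kept as character
term_code <- substr(term, 5, 5)

data.frame(term, year, term_code)
```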
Other toy-data: toy_degree, toy_student, toy_term
Selected variables modeled on those in the degree practice data for use in package examples and articles. Sampled from an early version of the practice data, the toy data are not a current practice data sample.
toy_degree
data.table with 65 rows and 4 columns keyed by student ID.
The variables are:
mcid
Character, de-identified student ID. Key column.
institution
Character, de-identified institution name, e.g., Institution A, Institution B, etc.
term_degree
Character, academic year and term in which a student completes their program, format YYYYT.
cip6
Character, 6-digit CIP code of the program in which a student earns a degree.
Other toy-data: toy_course, toy_student, toy_term
Selected variables modeled on those in the student practice data for use in package examples and articles. Sampled from an early version of the practice data, the toy data are not a current practice data sample.
toy_student
data.table with 99 rows and 4 columns keyed by student ID:
mcid
Character, de-identified student ID. Key column.
institution
Character, de-identified institution name, e.g., Institution A, Institution B, etc.
race
Character, race/ethnicity as self-reported by the student, e.g., Asian, Black, Latine, etc.
sex
Character, sex as self-reported by the student, possible values are Female, Male, and Unknown.
Other toy-data: toy_course, toy_degree, toy_term
Selected variables modeled on those in the term practice data for use in package examples and articles. Sampled from an early version of the practice data, the toy data are not a current practice data sample.
toy_term
data.table with 150 rows and 5 columns keyed by student ID.
The variables are:
mcid
Character, de-identified student ID. Key column.
institution
Character, de-identified institution name, e.g., Institution A, Institution B, etc.
term
Character, academic year and term, format YYYYT. Key column.
cip6
Character, 6-digit CIP code of program in which a student is enrolled in a term.
level
Character, 01 Freshman, 02 Sophomore, etc. The equivalent values in the current practice data are 01 First-Year, 02 Second-Year, etc.
Other toy-data: toy_course, toy_degree, toy_student