Package 'midfieldr'

Title: Tools and Methods for Working with MIDFIELD Data in 'R'
Description: Provides tools in 'R' for working with undergraduate, longitudinal, student-level records modeled on the MIDFIELD database. Tools facilitate identifying academic program codes, excluding post-baccalaureate terms, excluding records for insufficient data, and assessing timely completion. The tools support the workflow of collecting programs, refining the population, constructing blocs of records for aggregation, and calculating quantitative metrics. 'midfieldr' interacts with practice data provided in the 'midfielddata' package or with any data modeled on the MIDFIELD database. The development of 'midfieldr' and 'midfielddata' was supported by the US National Science Foundation through grant numbers 1545667 and 2142087.
Authors: Richard Layton [cre, aut, cph], Russell Long [aut, cph], Matthew Ohland [aut, cph], Marisa Orr [aut, cph], Susan Lord [aut, cph], US National Science Foundation [fnd]
Maintainer: Richard Layton <[email protected]>
License: MIT + file LICENSE
Version: 1.0.3.9011
Built: 2026-07-01 19:25:36 UTC
Source: https://github.com/midfieldr/midfieldr

Help Index


midfieldr deprecated functions

Description

These functions were deprecated in midfieldr 1.0.4.

Usage

add_completion_status(dframe, midfield_degree = degree)

add_data_sufficiency(dframe, midfield_term = term)

filter_cip(keep_text = NULL, drop_text = NULL, cip = NULL, select = NULL)

select_required(midfield_x, select_add = NULL)

add_timely_term(
  dframe,
  midfield_term = term,
  ...,
  sched_span = NULL,
  span = NULL
)

Arguments

dframe

Data frame or data frame extension (e.g., data.table or tibble).

midfield_degree

MIDFIELD records degree data frame or data frame extension.

midfield_term

MIDFIELD records term data frame or data frame extension.

keep_text

Deprecated filter_cip(). Character vector of search text to keep.

drop_text

Deprecated filter_cip(). Character vector of search text to drop.

cip

Deprecated filter_cip(). Data frame of programs to be searched.

select

Deprecated filter_cip(). Character vector of column names to select.

midfield_x

Deprecated select_required(). Data frame from which columns are selected.

select_add

Deprecated select_required(). Character vector of col_patterns to search dframe column names.

...

Not used for passing values; forces subsequent arguments to be referable only by name.

sched_span

Integer scalar

span

Integer scalar

Details

add_completion_status()

is deprecated in favor of completion_status(). The replacement updates midfieldr file names and argument names, drops columns not used by the function, and preserves the data frame class.

add_data_sufficiency()

is deprecated in favor of data_sufficiency(). The replacement updates midfieldr file names and argument names, drops columns not used by the function, and preserves the data frame class.

add_timely_term()

is deprecated in favor of timely_term(). The replacement updates midfieldr file names and argument names, drops columns not used by the function, and preserves the data frame class.

filter_cip()

is deprecated in favor of filter_programs(). The new function is similar but with the CIP data frame as the first argument, enabling chained functions like those encountered using dplyr and friends.

select_required()

is deprecated in favor of select_records(). The new functionality is similar but with exact matching to the default column names plus preserving data frame class.


Baseline ID bloc to start a typical analysis

Description

Data frame of IDs after processing the practice data for data sufficiency and degree seeking. Provides a convenient bloc to start many of the analyses illustrated in the package articles.

Usage

baseline_mcid

Format

data.table with 76875 rows and 1 column:

mcid

Character, de-identified student ID. Key column.

See Also

Other case-study-data: study_observations, study_programs, study_results


Error handling

Description

A wrapper on base::tryCatch() for previewing an error message, if any.

Usage

catch_error(f)

Arguments

f

Function call, with its arguments, that may produce an error.

Value

Does not return anything. The side effect is to output to the terminal.
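
The wrapper idea can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: preview_error() is a hypothetical name, and unlike catch_error() it returns the message invisibly so the behavior can be checked.

```r
# Hypothetical stand-in for a tryCatch() wrapper that previews an error.
# The argument is evaluated lazily, so the error is raised (and caught)
# inside tryCatch() rather than at the call site.
preview_error <- function(expr) {
  msg <- tryCatch(
    {
      expr   # forces evaluation of the expression passed by the caller
      NULL   # reached only when no error occurs
    },
    error = function(e) conditionMessage(e)
  )
  if (!is.null(msg)) cat("Error:", msg, "\n")
  invisible(msg)
}

# No error: nothing is printed
preview_error(sqrt(4))

# Error: the message is printed and execution continues
preview_error(stop("no term variable"))
```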

Examples

# Example data frames
s <- toy_student[, .(mcid)]
t <- toy_term[, .(mcid, term)]
d <- toy_degree[, .(mcid, term_degree)]

# No error
catch_error(post_bacc_terms(t, d))

# Error, no term variable 
catch_error(post_bacc_terms(s, d))

# Error, missing dframe argument
catch_error(post_bacc_terms())

# Error, missing degree argument
catch_error(post_bacc_terms(t))

Table of academic programs

Description

A data table based on the US National Center for Education Statistics (NCES), Integrated Postsecondary Education Data System (IPEDS), 2010 CIP. The data are codes and names for 1582 instructional programs organized on three levels: a 2-digit series, a 4-digit series, and a 6-digit series.

Usage

cip

Format

A data.table with 1582 rows and 6 columns keyed by the 6-digit CIP code:

cip6name

Character, program name at the 6-digit level.

cip6

Character, 6-digit code representing "specific instructional programs" (US National Center for Education Statistics).

cip4name

Character, program name at the 4-digit level.

cip4

Character, 4-digit code (the first 4 digits of cip6) representing "intermediate groupings of programs that have comparable content and objectives."

cip2name

Character, program name at the 2-digit level.

cip2

Character, 2-digit code (the first 2 digits of cip6) representing "the most general groupings of related programs."

Details

The midfielddata taxonomy includes one non-IPEDS code (999999) for Undecided or Unspecified, used when institutions reported no program information or reported that students were not enrolled in a program.
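
Because the 4-digit and 2-digit codes are the leading digits of the 6-digit code, the three levels can be derived from cip6 alone. A minimal base R sketch, using a few illustrative code values rather than the cip data set itself:

```r
# Illustrative 6-digit codes (character, not numeric, so leading zeros
# would be preserved)
cip6 <- c("140801", "141901", "540101")

# The 4-digit and 2-digit series are prefixes of the 6-digit code
cip4 <- substr(cip6, 1, 4)
cip2 <- substr(cip6, 1, 2)

data.frame(cip6, cip4, cip2, stringsAsFactors = FALSE)
```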

Source

https://nces.ed.gov/ipeds/cipcode/

See Also

Other cip-data: cip2010, fye_proxy


Alternate table of academic programs

Description

A data table of the 2010 Classification of Instructional Programs (CIP) accessed in 2026 from the US National Center for Education Statistics (NCES). Like the cip data set originally included with midfieldr, cip2010 provides codes and names for instructional programs organized on three levels: a 2-digit series, a 4-digit series, and a 6-digit series.

Usage

cip2010

Format

data.table with 1849 rows and 6 columns keyed by the 6-digit CIP code:

cip6

Character, 6-digit code representing "specific instructional programs" (US National Center for Education Statistics).

cip6name

Character, program name at the 6-digit level.

cip4

Character, 4-digit code (the first 4 digits of cip6) representing "intermediate groupings of programs that have comparable content and objectives."

cip4name

Character, program name at the 4-digit level.

cip2

Character, 2-digit code (the first 2 digits of cip6) representing "the most general groupings of related programs."

cip2name

Character, program name at the 2-digit level.

Details

The midfielddata taxonomy includes one non-IPEDS code (999999) for Undecided or Unspecified, used when institutions reported no program information or reported that students were not enrolled in a program.

Source

https://nces.ed.gov/ipeds/cipcode/

See Also

Other cip-data: cip, fye_proxy


Determine completion status

Description

To a data frame keyed by student ID, add a column indicating if a student completed their program, and if so, whether their completion was timely or late. Columns of supporting information are also added. Unrelated columns are dropped.

Usage

completion_status(dframe, midfield_rec = degree)

Arguments

dframe

Data frame or data frame extension (e.g., data.table or tibble). Required variables: {mcid, timely_term}.

midfield_rec

MIDFIELD records degree data frame or data frame extension. Required variables: {mcid, term_degree}.

Details

In many studies, students must complete their programs in a specified time span to be considered "timely", for example 4, 6, or 8 years after admission. By "completion" we mean an undergraduate earning their first baccalaureate degree (or degrees, for students earning more than one degree in the same term).

The goal of determining timely completion is to refine a population, that is, obtain a data frame of IDs that satisfy our constraints. Thus completion_status() yields a column of completion status values and columns of supporting information keyed by ID. All other columns in dframe (if any) are dropped.

The supporting information in the output is provided so that the user can review the findings. After review, we usually delete all columns except the IDs, yielding the refined population that was our goal.
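
The classification can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the rows are made up, and the sketch relies on YYYYT term strings of equal length comparing correctly as text.

```r
# Made-up records: one row per student, with the timely completion term
# and the first degree term (NA for a non-completer)
status <- data.frame(
  mcid        = c("A", "B", "C"),
  timely_term = c("20053", "20053", "20053"),
  term_degree = c("20041", "20061", NA),
  stringsAsFactors = FALSE
)

# Degree term no later than the timely term is "timely", later is "late",
# and no degree term yields NA
status$completion_status <- ifelse(
  is.na(status$term_degree),
  NA_character_,
  ifelse(status$term_degree <= status$timely_term, "timely", "late")
)
status
```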

Value

Data frame with the following properties:

  • Data frame class is preserved. Groups and keys are not preserved.

  • Rows are filtered for unique mcid values.

  • Columns {mcid, timely_term} are retained (all other columns are dropped). New columns added:

    • term_degree.   Character. Term in which the first degree(s) are completed, encoded YYYYT. Joined from midfield_rec.

    • completion_status.   Character. Possible values are "timely" for students completing a degree no later than their timely completion term; "late" for students completing their program after their timely completion term; and "NA" for non-completers.

Examples

# Start with an excerpt from the student data set 
dframe <- toy_student[1:10, .(mcid)]

# Timely term column is required to add completion status column
dframe <- timely_term(dframe, toy_term)

# Add completion status column
completion_status(dframe, toy_degree)

# Existing completion_status column, if any, is overwritten
dframe[, completion_status := NA_character_][]
completion_status(dframe, toy_degree)

Determine data sufficiency

Description

To a data frame keyed by student ID, add a column indicating that an institution's data range is sufficient to reliably assess a student's program completion. Columns of supporting information are also added. Unrelated columns are dropped.

Usage

data_sufficiency(dframe, midfield_rec = term)

Arguments

dframe

Data frame or data frame extension (e.g., data.table or tibble). Required variables: {mcid, term_i, timely_term}.

midfield_rec

MIDFIELD records term data frame or data frame extension. Required variables: {mcid, term, institution}.

Details

Because the time span of MIDFIELD term data varies by institution, each institution has its own lower and upper bounds. When assessing a student's program completion, an unavoidable ambiguity arises for student records at or near these bounds. Such records must be identified and in most cases excluded to prevent false summary counts.

The data sufficiency criterion states that student records are limited to those for which available data are sufficient to assess timely completion without biased counts of completers or non-completers. In practice, the criterion is implemented via two filters. Rows are labeled for exclusion when: 1) a student ID is extant in the non-summer lower limit of an institution's data range; or 2) a student ID has a timely completion term that exceeds the upper limit of the institution's data range.

The goal of determining data sufficiency is to refine a population, that is, obtain a data frame of IDs that satisfy our constraints. Thus data_sufficiency() yields a column of data sufficiency values and columns of supporting information keyed by ID. All other columns in dframe (if any) are dropped.

The supporting information in the output is provided so that the user can review the findings. After review, we usually delete all columns except the IDs, yielding the refined population that was our goal.
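
The two filters can be sketched in base R. The rows below are made up, the non-summer detail of the lower limit is ignored, and the precedence when both filters apply is an assumption of this sketch rather than documented behavior.

```r
# Made-up records for one institution whose data range is 19881 to 20123
rec <- data.frame(
  mcid        = c("A", "B", "C"),
  term_i      = c("19881", "19901", "20101"),
  timely_term = c("19933", "19953", "20153"),
  lower_limit = "19881",
  upper_limit = "20123",
  stringsAsFactors = FALSE
)

# Start by including everyone, then label the two exclusions.
# YYYYT strings of equal length compare correctly as text.
rec$data_sufficiency <- "include"
rec$data_sufficiency[rec$timely_term > rec$upper_limit] <- "exclude-upper"
rec$data_sufficiency[rec$term_i <= rec$lower_limit] <- "exclude-lower"
rec
```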

Value

Data frame with the following properties:

  • Data frame class is preserved. Groups and keys are not preserved.

  • Rows are filtered for unique mcid values.

  • Columns {mcid, term_i, timely_term} are retained (all other columns are dropped). New columns added:

    • institution.   Character. Institution in which the student is enrolled in the given term. Extracted from midfield_rec. The limits given in the next two columns are specific to the institution.

    • lower_limit.   Character. Initial term of an institution's data range, encoded YYYYT. Extracted from midfield_rec. Compared to term_i to determine the lower-limit exclusion.

    • upper_limit.   Character. Final term of an institution's data range, encoded YYYYT. Extracted from midfield_rec. Compared to timely_term to determine upper-limit exclusion.

    • data_sufficiency.   Character. Possible values are "include", if the data are sufficient; and "exclude-lower" or "exclude-upper" if not, indicating at which boundary of the data range the ambiguity occurs.

Examples

# Start with an excerpt from the student data set 
dframe <- toy_student[1:10, .(mcid)]

# Timely term column is required to add data sufficiency column
dframe <- timely_term(dframe, toy_term)

# Add data sufficiency column
data_sufficiency(dframe, toy_term)

# Existing data_sufficiency column, if any, is replaced
dframe[, data_sufficiency := NA_character_][]
data_sufficiency(dframe, toy_term)

Choose rows of CIP data

Description

Subset a CIP data frame, retaining rows that match or partially match any string in a vector of character strings.

Usage

filter_programs(dframe, pattern, ..., negate = FALSE)

Arguments

dframe

Data frame or data frame extension (e.g., data.table or tibble). Expected variables (or subset thereof): {cip6name, cip6, cip4name, cip4, cip2name, cip2}.

pattern

Character vector of search strings, including regular expressions.

...

Not used for passing values; forces subsequent arguments to be referable only by name.

negate

Logical (default FALSE). If TRUE, inverts the resulting Boolean vector.

Details

Each element of the pattern vector is matched row-wise to every value in dframe using grepl(). Row values are coerced to character strings if possible. If negate = FALSE (default), a match retains the full row; if negate = TRUE, a match removes the full row.
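
The matching logic can be sketched in base R. match_rows() is a hypothetical helper, the three rows are illustrative, and the case-insensitive matching is an assumption of the sketch, not a documented property of filter_programs().

```r
# Hypothetical helper: keep rows in which any column matches any pattern
match_rows <- function(df, pattern, negate = FALSE) {
  hit <- rep(FALSE, nrow(df))
  for (p in pattern) {
    for (col in names(df)) {
      # Values are coerced to character before matching
      hit <- hit | grepl(p, as.character(df[[col]]), ignore.case = TRUE)
    }
  }
  if (negate) hit <- !hit
  df[hit, , drop = FALSE]
}

# Illustrative rows, not the full cip data set
programs <- data.frame(
  cip6     = c("270101", "131311", "540101"),
  cip6name = c("Mathematics, General",
               "Mathematics Teacher Education",
               "History, General"),
  stringsAsFactors = FALSE
)

first_pass  <- match_rows(programs, "math")
second_pass <- match_rows(first_pass, "educ", negate = TRUE)
second_pass
```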

Value

Data frame with the following properties:

  • Data frame class is preserved. Groups and keys are not preserved.

  • Rows are a subset of the input and appear in the same order.

  • Columns are not modified.

Examples

# Subset using keywords
filter_programs(cip, pattern = "history")

# Subset using codes
filter_programs(cip, pattern = "^54")

# Multiple passes to narrow the results
first_pass <- filter_programs(cip, "math")[, .(cip6name, cip6)]
first_pass

second_pass <- filter_programs(first_pass, c("bio", "educ"), negate = TRUE)
second_pass

third_pass <- filter_programs(second_pass, c("^27", "^30"))
third_pass

Starting program proxies for FYE students

Description

Proxies are the degree-granting engineering programs we estimate that First-Year Engineering (FYE) students would have declared had they not been required to enroll in FYE. Keyed by student ID. Proxies are provided for all students in the midfielddata practice data who enroll in FYE in their first term.

Usage

fye_proxy

Format

data.table with 4623 rows and 2 columns keyed by student ID:

mcid

Character, de-identified student ID. Key column.

proxy

Character, 6-digit CIP code of the estimated proxy program.

Details

The proxy variable contains 6-digit CIP codes of degree-granting engineering programs, e.g., Electrical Engineering, Mechanical Engineering, etc., that are substituted for the FYE CIP code when an analysis requires degree-granting starting programs. The most common application is a graduation rate calculation.

The estimation is based on students' first post-FYE programs and a multiple imputation suitable for categorical variables using the mice package. The predictor variables are institution, race, and sex. The estimated variable is the 6-digit CIP code of a degree-granting engineering program at their institution.

fye_proxy holds only for the practice data in midfielddata—these values cannot be commingled with the MIDFIELD research database.
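
Substituting proxies for the FYE code can be sketched in base R. The IDs and rows are made up, and "140102" is used as the FYE code; the sketch only illustrates the lookup, not a complete graduation rate workflow.

```r
# Made-up starting programs; two students start in FYE (code 140102)
start <- data.frame(
  mcid = c("A", "B", "C"),
  cip6 = c("140102", "141901", "140102"),
  stringsAsFactors = FALSE
)

# Made-up proxy table with the same layout as fye_proxy
proxy <- data.frame(
  mcid  = c("A", "C"),
  proxy = c("141001", "140801"),
  stringsAsFactors = FALSE
)

# Replace the FYE code with the proxy where one is available
idx    <- match(start$mcid, proxy$mcid)
is_fye <- start$cip6 == "140102" & !is.na(idx)
start$cip6[is_fye] <- proxy$proxy[idx[is_fye]]
start
```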

See Also

Other cip-data: cip, cip2010


Grade scale

Description

Data frame of letter grades and conventional point assignments used for computing grade point averages.

Usage

grade_scale

Format

data.table with 12 rows and 2 columns:

letter_grade

Character, letter grades using the conventional US scale from A to F.

points

Numerical, 4.0 scale of points assigned to letter grades.
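
A grade point average computation using such a scale can be sketched in base R. The three-row scale is an illustrative subset, and weighting by credit hours is an assumption of the sketch.

```r
# Illustrative subset of a letter-grade scale
scale <- data.frame(
  letter_grade = c("A", "B", "C"),
  points       = c(4, 3, 2),
  stringsAsFactors = FALSE
)

# Made-up course grades and credit hours for one student
grades <- c("A", "B", "C")
hours  <- c(3, 4, 3)

# Look up points for each grade, then weight by credit hours
pts <- scale$points[match(grades, scale$letter_grade)]
gpa <- sum(pts * hours) / sum(hours)
gpa
```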

See Also

Other scales: sat_act_scale


Display structure

Description

A wrapper on base::str() with arguments set to not show attributes, to not show length, and to cut the width.

Usage

look_at(x)

Arguments

x

Any R object.

Value

Does not return anything. The side effect is to output to the terminal.
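
A wrapper of this kind can be sketched with utils::str(). show_structure() is a hypothetical name and the width of 70 is an assumed value; the package's own settings may differ.

```r
# Hypothetical wrapper: hide attributes and lengths, and cut long lines
show_structure <- function(x) {
  utils::str(
    x,
    give.attr    = FALSE,  # do not show attributes
    give.length  = FALSE,  # do not show vector lengths
    width        = 70,     # assumed display width
    strict.width = "cut"   # truncate lines that exceed the width
  )
}

show_structure(list(a = 1:3, b = "x"))
```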

Examples

# data frames
look_at(cip)
look_at(toy_degree)

# character vectors
x <- sort(unique(toy_degree$institution))
look_at(x)

Order multiway categories

Description

Transform a data frame such that two independent categorical variables are factors with levels ordered for display in a multiway dot plot.

Usage

order_multiway(
  dframe,
  quantity,
  categories,
  ...,
  method = NULL,
  ratio_of = NULL
)

Arguments

dframe

Data frame or data frame extension (e.g., data.table or tibble). Required variables: a single quantitative value for every combination of levels of two categorical variables.

quantity

Character, name of the single multiway quantitative variable

categories

Character, vector of names of the two multiway categorical variables

...

Not used for passing values; forces subsequent arguments to be referable only by name.

method

Character, "median" (default) or "percent", method of ordering the levels of the categories. The median method computes the medians of the quantitative column grouped by category. The percent method computes percentages based on the same ratio underlying the quantitative percentage variable except grouped by category.

ratio_of

Character vector with the names of the numerator and denominator columns that produced the quantitative variable, required when method is "percent". Names can be in any order; the algorithm assumes that the column with the larger sum is the denominator of the ratio.

Details

Multiway data comprise a single quantitative value (or response) for every combination of levels of two categorical variables. The ordering of the rows and panels is crucial to the perception of effects (Cleveland, 1993).

In our context, "multiway" refers to the data structure and graph design defined by Cleveland (1993), not to the methods of analysis described by Kroonenberg (2008).

Multiway data comprise three variables: a categorical variable of m levels; a second independent categorical variable of n levels; and a quantitative variable (or response) of length mn that cross-classifies the categories, that is, there is a value of the response for each combination of levels of the two categorical variables.

In a multiway dot plot, one category is encoded by the panels, the second category is encoded by the rows of each panel, and the quantitative variable is encoded along identical horizontal scales.
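
Median-based ordering can be sketched in base R with stats::reorder(). The data are made up, and ordering the levels by increasing median is an assumption of this sketch.

```r
# Made-up multiway data: one value for every program-people combination
mw <- data.frame(
  program    = rep(c("EE", "ME"), each = 3),
  people     = rep(c("X", "Y", "Z"), times = 2),
  stickiness = c(0.6, 0.1, 0.7, 0.2, 0.5, 0.4),
  stringsAsFactors = FALSE
)

# Convert each category to a factor with levels ordered by group median
mw$program <- reorder(factor(mw$program), mw$stickiness, FUN = median)
mw$people  <- reorder(factor(mw$people), mw$stickiness, FUN = median)

# Supporting columns holding the group medians
mw$program_median <- ave(mw$stickiness, mw$program, FUN = median)
mw$people_median  <- ave(mw$stickiness, mw$people, FUN = median)

levels(mw$program)
levels(mw$people)
```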

Value

Data frame with the following properties:

  • Data frame class is preserved. Groups and keys are not preserved.

  • Rows are preserved.

  • Column specified by quantity is converted to type double.

  • Columns specified by categories are converted to factors and ordered.

  • Other columns are preserved with the exception that columns added by the function overwrite existing columns of the same name (if any).

  • Two new columns CATEGORY_median added when method is "median". Numeric medians of the quantitative variable grouped by the categorical variables. The CATEGORY placeholder in the column name is replaced by a category name from the categories argument. For example, suppose categories = c("program", "people") and method = "median". The two new column names would be program_median and people_median.

  • Two new columns CATEGORY_QUANTITY added when method is "percent". Numeric percentages based on the same ratio that produces the quantitative variable except grouped by the categorical variables. The CATEGORY placeholder in the column name is replaced by a category name from the categories argument; the QUANTITY placeholder is replaced by the quantitative variable name in the quantity argument. For example, suppose quantity = "grad_rate", categories = c("program", "people"), and method = "percent". The two new column names would be program_grad_rate and people_grad_rate.
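
The percent method can be sketched in base R. The counts are made up, and the 0 to 100 scale is an assumption; the sketch shows the denominator being identified by its larger column sum and the ratio being recomputed per category.

```r
# Made-up counts for every program-people combination
d <- data.frame(
  program       = c("EE", "EE", "ME", "ME"),
  people        = c("X", "Y", "X", "Y"),
  graduates     = c(10, 30, 20, 5),
  ever_enrolled = c(50, 40, 40, 50),
  stringsAsFactors = FALSE
)

# Names may be given in any order; the larger column sum is the denominator
ratio_of <- c("graduates", "ever_enrolled")
sums     <- colSums(d[ratio_of])
den_name <- names(which.max(sums))
num_name <- setdiff(ratio_of, den_name)

# Percentages grouped by one category: sum the counts, then take the ratio
num <- tapply(d[[num_name]], d$program, sum)
den <- tapply(d[[den_name]], d$program, sum)
program_pct <- 100 * num / den
program_pct
```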

References

Cleveland WS (1993). Visualizing Data. Hobart Press, Summit, NJ.

Kroonenberg PM (2008). Applied Multiway Data Analysis. Wiley, Hoboken, NJ.

Examples

# Subset of built-in data set
dframe <- study_results[program == "EE" | program == "ME"]
dframe[, people := paste(race, sex)]
dframe[, c("race", "sex") := NULL]
data.table::setcolorder(dframe, c("program", "people"))

# Class before ordering
class(dframe$program)
class(dframe$people)

# Class and levels after ordering
mw1 <- order_multiway(dframe, 
                      quantity = "stickiness", 
                      categories = c("program", "people"))
class(mw1$program)
levels(mw1$program)
class(mw1$people)
levels(mw1$people)

# Display category medians 
mw1

# Existing factors (if any) are re-ordered
mw2 <- data.table::copy(dframe)
mw2$program <- factor(mw2$program, levels = c("ME", "EE"))

# Levels before ordering
levels(mw2$program) 

# Levels after ordering
mw2 <- order_multiway(mw2, 
                      quantity = "stickiness", 
                      categories = c("program", "people"))
levels(mw2$program) 

# Ordering using percent method
order_multiway(dframe, 
               quantity = "stickiness", 
               categories = c("program", "people"), 
               method = "percent", 
               ratio_of = c("graduates", "ever_enrolled"))

Identify post-baccalaureate terms

Description

To a data frame keyed by student ID and containing an academic term variable, add a column that clusters terms with respect to a student's first degree term. Post-baccalaureate terms are typically excluded from the term, course, and degree data tables.

Usage

post_bacc_terms(dframe, midfield_rec = degree)

Arguments

dframe

Data frame or data frame extension (e.g., data.table or tibble). Required variables: {mcid} and one of {term, term_course, term_degree}.

midfield_rec

MIDFIELD records degree data frame or data frame extension. Required variables: {mcid, term_degree}.

Details

In a typical analysis, one is interested in a student's progress up to and including the term in which they earn their first degree or degrees. Any terms later than the first baccalaureate can usually be excluded from study.
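
The clustering can be sketched in base R. The rows are made up, and leaving the cluster as NA for students with no recorded degree is an assumption of the sketch, not documented behavior.

```r
# Made-up term records: student A has three terms, student B has one
terms <- data.frame(
  mcid = c("A", "A", "A", "B"),
  term = c("20011", "20021", "20031", "20011"),
  stringsAsFactors = FALSE
)

# Made-up first degree terms, named by student ID (B has no degree)
first_degree <- c(A = "20021")

# Look up each row's first degree term; unmatched IDs give NA
terms$first_degree_term <- unname(first_degree[terms$mcid])

# Compare each term to the first degree term (YYYYT strings compare as text)
terms$term_cluster <- ifelse(
  terms$term < terms$first_degree_term,
  "pre-degree",
  ifelse(terms$term == terms$first_degree_term,
         "first-degree",
         "post-first-degree")
)
terms
```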

Value

Data frame with the following properties:

  • Data frame class is preserved. Groups and keys are not preserved.

  • Rows are not modified.

  • Columns are not modified except new columns overwrite old columns of the same name. New columns:

    • first_degree_term.   Character. Term of a student's first baccalaureate, encoded YYYYT or, if no degree recorded, NA. Joined from midfield_rec$term_degree.

    • term_cluster.   Character, indicating that a term belongs to one of three clusters: terms that are prior to ("pre-degree"), equal to ("first-degree"), or subsequent to ("post-first-degree") the student’s first degree term.

Examples

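# Label each term relative to the student's first degree term
dframe <- toy_term[, .(mcid, term)]
post_bacc_terms(dframe, toy_degree[, .(mcid, term_degree)])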

Prepare FYE data for imputation

Description

Constructs a data frame of student-level records of First-Year Engineering (FYE) programs and conditions the data for later use as an input to the mice R package for multiple imputation. Sets up three variables as predictors (institution, race/ethnicity, and sex) and one variable to be estimated (program CIP code).

Usage

prep_fye_mice(
  midfield_student = student,
  midfield_term = term,
  ...,
  fye_codes = NULL
)

Arguments

midfield_student

MIDFIELD records student data frame or data frame extension. Required variables: {mcid, race, sex}.

midfield_term

MIDFIELD records term data frame or data frame extension. Required variables: {mcid, term, cip6, institution}.

...

Not used for passing values; forces subsequent arguments to be referable only by name.

fye_codes

Optional character vector of 6-digit CIP codes to identify FYE programs, default "140102". Codes must be 6-digit strings of numbers; regular expressions are prohibited. Non-engineering codes—those that do not start with 14—produce an error.

Details

At some US institutions, engineering students are required to complete a First-Year Engineering (FYE) program as a prerequisite for declaring an engineering major. Administratively, degree-granting engineering programs such as Electrical Engineering or Mechanical Engineering treat their incoming post-FYE students as their "starting" cohorts. However, when computing a metric that requires a count of starters—graduation rate, for example—FYE records must be treated with special care to avoid a miscount.

To illustrate the potential for miscounting starters, suppose we wish to calculate a Mechanical Engineering (ME) graduation rate. Students starting in ME constitute the starting pool and the fraction of that pool graduating in ME is the graduation rate. At FYE institutions, an ME program would typically define their starting pool as the post-FYE cohort entering their program. This may be the best information available, but it invariably undercounts starters by failing to account for FYE students who do not transition (post-FYE) to degree-granting engineering programs—students who may have left the institution or switched to non-engineering majors. In either case, in the absence of the FYE requirement, some of these students would have been ME starters. By neglecting these students, the count of ME starters is artificially low resulting in an ME graduation rate that is artificially high. The same is true for every degree-granting engineering discipline in an FYE institution.

Therefore, to avoid miscounting starters at FYE institutions, we have to estimate an "FYE proxy", that is, the 6-digit CIP codes of the degree-granting engineering programs that FYE students would have declared had they not been required to enroll in FYE. The purpose of prep_fye_mice() is to prepare the data for making that estimation.

After running prep_fye_mice() but before running mice(), one can edit variables or add variables to create a custom set of predictors. The mice package expects all predictors and the proxy variables to be factors. Do not delete the institution variable because it ensures that a student's imputed program is available at their institution.

In addition, ensure that the only missing values are in the proxy column. Other variables are expected to be complete (no NA values). A value of "unknown" in a predictor column, e.g., race/ethnicity or sex, is an acceptable value, not missing data. Observations with missing or unknown values in the ID or institution columns (if any) should be removed.
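
These conditions can be checked before imputation with a few lines of base R. The data frame below is a made-up stand-in for the output of prep_fye_mice(), and the checks are a sketch rather than a required step.

```r
# Made-up stand-in for a prepared FYE data frame
fye <- data.frame(
  mcid        = c("A", "B", "C"),
  race        = factor(c("Asian", "unknown", "White")),
  sex         = factor(c("Female", "Male", "unknown")),
  institution = factor(c("Institution B", "Institution B", "Institution C")),
  proxy       = factor(c("141901", NA, "140801")),
  stringsAsFactors = FALSE
)

# Predictors and the proxy are expected to be factors
vars <- c("race", "sex", "institution", "proxy")
all_factors <- all(vapply(fye[vars], is.factor, logical(1)))

# Missing values are expected in the proxy column only
na_count <- colSums(is.na(fye))
only_proxy_missing <- all(na_count[names(na_count) != "proxy"] == 0)

all_factors
only_proxy_missing
```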

Value

A data frame in data.table format conditioned for later use as an input to the mice R package for multiple imputation. The data frame comprises one row for every FYE student, first-term and migrator. Grouping structures are not preserved. The columns returned are:

mcid

Character, anonymized student identifier. Returned as-is.

race

Factor, race/ethnicity as self-reported by the student. An imputation predictor variable.

sex

Factor, sex as self-reported by the student. An imputation predictor variable.

institution

Factor, anonymized institution name. An imputation predictor variable.

proxy

Factor, 6-digit CIP code of a student's known, post-FYE engineering program or NA representing missing values to be imputed.

Method

The function extracts all terms for all FYE students, including those who migrate to enter Engineering after their first term, and identifies the first post-FYE program in which they enroll, if any. This treatment yields two possible outcomes for values returned in the proxy column:

  1. The student completes FYE and enrolls in an engineering major. For this outcome, we know that at the student's first opportunity, they enrolled in an engineering program of their choosing. The CIP code of that program is returned as the student's FYE proxy.

  2. The student does not enroll post-FYE in an engineering major. Such students have no further records in the database or switched from Engineering to another program. For this outcome, the data provide no information regarding what engineering program the student would have declared originally had the institution not required them to enroll in FYE. For these students a proxy value of NA is returned. These are the data treated as missing values to be imputed by mice().

In cases where students enter FYE, change programs, and re-enter FYE, only the first group of FYE terms is considered. Any programs before FYE are ignored.

The resulting data frame is ready for use as input for the mice package, with all variables except mcid returned as factors.
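
The per-student logic described above can be sketched in base R. first_post_fye() is a hypothetical helper that takes one student's program codes in term order; it is an illustration of the two outcomes, not the package's implementation.

```r
# Hypothetical helper: first post-FYE engineering program for one student,
# or NA when the data provide no such program
first_post_fye <- function(cip6, fye_code = "140102") {
  fye_idx <- which(cip6 == fye_code)
  if (length(fye_idx) == 0) return(NA_character_)

  # Programs before the first FYE term are ignored
  after <- cip6[seq_along(cip6) > fye_idx[1]]

  # Skip the remaining terms of the first group of FYE terms
  after <- after[after != fye_code]
  if (length(after) == 0) return(NA_character_)

  # Outcome 1: an engineering program (code starts with 14)
  # Outcome 2: anything else is treated as missing, to be imputed
  nxt <- after[1]
  if (startsWith(nxt, "14")) nxt else NA_character_
}

first_post_fye(c("140102", "140102", "141901"))
first_post_fye(c("140102", "540101"))
first_post_fye(c("270101", "140102", "141001", "140102", "140801"))
```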

Examples

# Using toy data
prep_fye_mice(toy_student, toy_term)

# Other columns, if any, are dropped
colnames(toy_student)
colnames(prep_fye_mice(toy_student, toy_term))

# Optional argument permits multiple CIP codes for FYE
prep_fye_mice(midfield_student = toy_student, 
              midfield_term = toy_term, 
              fye_codes = c("140101", "140102"))

SAT-ACT conversion scale

Description

Data frame for converting between ACT and SAT scores. A range of SAT scores converts to a single ACT score; an ACT score converts to a single equivalent SAT score.

Usage

sat_act_scale

Format

data.table with 28 rows and 4 columns:

act_comp

Numerical, ACT composite score.

sat_lower

Numerical, total SAT, lower limit of range corresponding to the ACT composite score.

sat_equiv

Numerical, total SAT, value to use when converting ACT score to a single SAT score.

sat_upper

Numerical, total SAT, upper limit of range corresponding to the ACT composite score.
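
Conversion in both directions can be sketched in base R. The three rows are illustrative values only, laid out like the columns above; consult sat_act_scale for the actual concordance. Both helper names are hypothetical.

```r
# Illustrative rows only, with the same columns as the scale
scale <- data.frame(
  act_comp  = c(34, 35, 36),
  sat_lower = c(1490, 1530, 1570),
  sat_equiv = c(1500, 1540, 1590),
  sat_upper = c(1520, 1560, 1600)
)

# SAT to ACT: find the row whose range contains the SAT score
sat_to_act <- function(sat) {
  i <- which(sat >= scale$sat_lower & sat <= scale$sat_upper)
  if (length(i) > 0) scale$act_comp[i[1]] else NA_real_
}

# ACT to SAT: look up the single equivalent value
act_to_sat <- function(act) {
  scale$sat_equiv[match(act, scale$act_comp)]
}

sat_to_act(1550)
act_to_sat(36)
```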

Source

ACT/SAT Concordance (2018) ACT Education Corp. https://www.act.org/content/dam/act/unsecured/documents/ACT-SAT-Concordance-Tables.pdf

See Also

Other scales: grade_scale


Choose columns of student records

Description

Subset a data frame, selecting columns by matching a vector of character strings. A convenience function to reduce the dimensions of a MIDFIELD data table by selecting only those columns required by other midfieldr functions or that are required to form a composite key. Particularly useful in interactive sessions when viewing the data tables at various stages of an analysis.

Usage

select_records(dframe, type = NULL, ..., col_pattern = NULL)

Arguments

dframe

Data frame of student records from which columns are selected. Expected choices are student, term, course, degree or their equivalent.

type

Character identifying the record type. Possible values are "s", "t", "c", "d", or "a". See Details.

...

Not used for passing values; forces subsequent arguments to be referable only by name.

col_pattern

Character vector containing strings or regular expressions to be matched or partially matched to the column names of dframe.

Details

Several midfieldr functions require input data frames containing specific variables (column names) such as mcid or cip6. In addition, the MIDFIELD data tables have specific variables that act as keys or composite keys to the information in that table. The type argument determines which columns are returned, if those columns exist in dframe:

  • type = "s" (student table) returns columns mcid, race, sex

  • type = "t" (term table) returns columns mcid, term, cip6, institution, level

  • type = "c" (course table) returns columns mcid, term_course, abbrev, number

  • type = "d" (degree table) returns columns mcid, term_degree, cip6

  • type = "a" (default) returns all the above

Additional column names can be included by using the col_pattern argument. In all cases, unmatched search strings are silently ignored.
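
The selection logic can be sketched in base R. pick_cols() is a hypothetical helper and the data frame is made up; the sketch shows exact matching for the required names, pattern matching for the extras, and unmatched names being ignored.

```r
# Hypothetical helper: keep required columns plus any pattern matches
pick_cols <- function(df, required, col_pattern = NULL) {
  keep <- names(df) %in% required            # exact matching
  for (p in col_pattern) {
    keep <- keep | grepl(p, names(df))       # full or partial matching
  }
  df[, keep, drop = FALSE]                   # original column order is kept
}

# Made-up student records
student <- data.frame(
  mcid        = c("A", "B"),
  institution = c("Institution B", "Institution C"),
  transfer    = c("First-Time in College", "First-Time in College"),
  race        = c("Asian", "White"),
  sex         = c("Female", "Male"),
  stringsAsFactors = FALSE
)

# Columns for the student record type, plus one extra by pattern;
# "no_such_column" matches nothing and is ignored
picked <- pick_cols(student,
                    required = c("mcid", "race", "sex", "no_such_column"),
                    col_pattern = "^trans")
names(picked)
```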

Value

A data frame of the same type as dframe. The output has the following properties:

  • Rows are not modified.

  • Columns are a subset of the input, but appear in the same order.

  • Groups are not necessarily preserved.

  • Data frame attributes are preserved with the exception of grouped tibbles.

Examples

# Basic usage
select_records(toy_student[1:5])
select_records(toy_term[1:5])
select_records(toy_course[1:5])
select_records(toy_degree[1:5])

# Return columns by record type
select_records(toy_student[1:5], type = "s")
select_records(toy_term[1:5], "t")
select_records(toy_course[1:5], "c")
select_records(toy_degree[1:5], "d")

# With col_pattern for additional columns
DT <- toy_student[141:146]
select_records(DT, "t", col_pattern = c("transfer", "hours_transfer"))

# Using regular expressions
these_IDs <- DT$mcid
DT <- toy_term[mcid %chin% these_IDs]
select_records(DT, "t", col_pattern = c("^gpa"))

Extract unique elements and sort

Description

A strict version of sort(unique(x)) for vectors only. No additional arguments are passed through the dots (...).

Usage

sort_uniq(x, ..., na.rm = FALSE, decreasing = FALSE, na.last = FALSE)

Arguments

x

Vector of values to be sorted with any duplicate values removed.

...

Not used for passing values; forces subsequent arguments to be referable only by name.

na.rm

Logical. Indicates if missing values (including NaN) should be removed. Passed to unique().

decreasing

Logical. Should the sort be increasing or decreasing? Passed to sort().

na.last

Logical. Position of NA values. Passed to sort().

Value

A vector of unique values, sorted.

Examples

# Character vector
x <- toy_student$race
sort_uniq(x)

# Numeric vector
x <- toy_term$hours_term_attempt
sort_uniq(x)
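For readers without the package at hand, the behavior can be approximated in base R. This is a sketch under assumptions, not the package source: sort_uniq_sketch() is a hypothetical name, and the handling of na.rm (dropping missing values before calling unique()) is one plausible reading of the argument description.

```r
# Hypothetical base-R approximation of sort_uniq(); vectors only
sort_uniq_sketch <- function(x, na.rm = FALSE, decreasing = FALSE, na.last = FALSE) {
  stopifnot(is.atomic(x))
  if (na.rm) {
    x <- x[!is.na(x)]  # is.na() is also TRUE for NaN
  }
  sort(unique(x), decreasing = decreasing, na.last = na.last)
}

x <- c(3, 1, NA, 3, 2)
with_na <- sort_uniq_sketch(x)                  # NA is placed first when na.last = FALSE
without_na <- sort_uniq_sketch(x, na.rm = TRUE)
```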

Case-study observations

Description

Data table of post-processed observations of students ever enrolled in, and students graduating from, the four programs of the case study. Keyed by student ID. Provided for the convenience of vignette users.

Usage

study_observations

Format

data.table with 8919 rows and 5 columns. The variables are:

mcid

Character, de-identified student ID. Key column.

race

Character, race/ethnicity as self-reported by the student, e.g., Asian, Black, Hispanic, etc.

sex

Character, sex as self-reported by the student, possible values are Female, Male, and Unknown.

program

Character, academic program label.

bloc

Character, indicating the grouping (ever_enrolled or graduates) to which an observation belongs.

Details

Starting with the pool of students ever enrolled in the four programs of the case study (Civil, Electrical, Industrial/Systems, and Mechanical Engineering), we filtered the records for data sufficiency, degree seeking, program, and timely completion.

A data frame of "ever enrolled" students and a data frame of "timely graduates" were bound using shared column names and are distinguished by the bloc variable. This data structure facilitates grouping and summarizing by race, sex, program, and bloc.
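The binding step can be illustrated with a small base-R sketch. The records below are made up and only two of the shared columns are shown. The case-study data were prepared with data.table, so this shows the resulting structure rather than the actual code used.

```r
# Made-up records: students ever enrolled and the subset who graduated timely
ever <- data.frame(mcid = c("A", "B", "C", "D"), program = c("CE", "CE", "ME", "ME"))
grad <- data.frame(mcid = c("A", "C", "D"), program = c("CE", "ME", "ME"))

# Label each bloc, then bind by shared column names
ever$bloc <- "ever_enrolled"
grad$bloc <- "graduates"
observations <- rbind(ever, grad)

# Count observations by program and bloc
counts <- as.data.frame(
  table(program = observations$program, bloc = observations$bloc),
  responseName = "N"
)
counts
```

Because both blocs share one table, a single grouping operation yields the counts for each bloc side by side.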

See Also

Other case-study-data: baseline_mcid, study_programs, study_results


Case-study program labels and codes

Description

Data table of program CIP codes and labels of the four programs of the case study. Keyed by 6-digit CIPs. Provided for the convenience of vignette users.

Usage

study_programs

Format

data.table with 15 rows and 2 columns. The variables are:

cip6

Character, 6-digit CIP code of program in which a student is enrolled in a term.

program

Character, abbreviated labels for four engineering programs. Values are "CE" (Civil Engineering), "EE" (Electrical Engineering), "ISE" (Industrial/Systems Engineering), and "ME" (Mechanical Engineering).

Details

Starting with the midfieldr cip data set, we extracted the CIPs of the four programs of the case study and assigned them a custom label to be used for grouping and summarizing.

See Also

Other case-study-data: baseline_mcid, study_observations, study_results


Case-study results

Description

Data table of longitudinal stickiness for the four programs of the case study (Civil, Electrical, Industrial/Systems, and Mechanical Engineering) grouped by program, race/ethnicity, and sex. Provided for the convenience of vignette users.

Usage

study_results

Format

data.table with 50 rows and 6 columns:

program

Character, academic program label.

sex

Character, sex as self-reported by the student, possible values are Female, Male, and Unknown.

race

Character, race/ethnicity as self-reported by the student, e.g., Asian, Black, Hispanic, etc.

ever_enrolled

Numerical, number of students ever enrolled in a program.

graduates

Numerical, number of students completing a program.

stickiness

Numerical, program stickiness, the ratio of graduates to ever_enrolled, in percent.

Details

Longitudinal stickiness is the ratio of the number of students graduating from a program to the number of students ever enrolled in the program over the time span of available data. Results are based on data that have been filtered for data sufficiency, degree seeking, and timely completion.
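As a worked illustration of the ratio, the following sketch computes stickiness in percent from made-up counts. The numbers are not taken from study_results, and rounding to one decimal place is an assumption made for display.

```r
# Made-up counts for two programs
results <- data.frame(
  program = c("CE", "ME"),
  ever_enrolled = c(200, 400),
  graduates = c(90, 220)
)

# Stickiness: graduates as a percentage of students ever enrolled
results$stickiness <- round(100 * results$graduates / results$ever_enrolled, 1)
results
```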

See Also

Other case-study-data: baseline_mcid, study_observations, study_programs


Calculate timely completion terms

Description

To a data frame keyed by student ID, add a column indicating the student's timely completion term. Columns of supporting information are also added. Unrelated columns are dropped.

Usage

timely_term(dframe, midfield_rec = term, ..., sched_span = NULL, span = NULL)

Arguments

dframe

Data frame or data frame extension (e.g., data.table or tibble). Required variable: {mcid}.

midfield_rec

MIDFIELD records term data frame or data frame extension. Required variables: {mcid, term, level}.

...

Not used for passing values; forces subsequent arguments to be referable only by name.

sched_span

Integer scalar (default 4), the number of years an institution officially schedules for completing a program.

span

Integer scalar (default 6), number of years to define timely completion, typically 4, 6, or 8 years (100%, 150%, 200% respectively of sched_span).

Details

In many studies, students must complete their programs in a specified time span to be considered "timely", for example 4, 6, or 8 years after admission. The latest term by which program completion would be considered timely is the timely completion term. By "completion" we mean an undergraduate earning their first baccalaureate degree (or degrees, for students earning more than one degree in the same term).

The timely completion term is required for determining data sufficiency as well as timely completion status. The goal in either case is to refine a population, that is, obtain a data frame of IDs that satisfy our constraints. Thus timely_term() yields a column of timely term values and columns of supporting information keyed by ID. All other columns in dframe (if any) are dropped.

Our heuristic assigns span number of years (default 6) to every student. For students admitted at second-year level or higher, the span is reduced by one year for each full year the student is assumed to have completed. For example, a student admitted at the second-year level is assumed to have completed one year of a program, so their span is reduced by one year. The adjusted span is added to their initial term to create the timely_term values.
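The span adjustment can be sketched as follows. This is an illustration under assumptions, not the package's implementation: adjust_span_sketch() is a hypothetical name, and the sketch assumes that level labels begin with a two-digit year number, as in "01 Freshman" and "02 Sophomore". The arithmetic that adds the adjusted span to a YYYYT-encoded initial term is not shown because the text above does not specify it.

```r
# Hypothetical helper: reduce the span by one year for each full year
# a student is assumed to have completed at admission.
adjust_span_sketch <- function(level_i, span = 6L) {
  level_num <- as.integer(substr(level_i, 1, 2))  # "02 Sophomore" -> 2
  years_completed <- pmax(level_num - 1L, 0L)
  span - years_completed
}

level_i <- c("01 Freshman", "02 Sophomore", "03 Junior")
adj_span <- adjust_span_sketch(level_i)                 # default span of 6 years
adj_span_200 <- adjust_span_sketch(level_i, span = 8L)  # 200% of a 4-year schedule
```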

The supporting information in the output is provided so that the user can review the findings. Moreover, data_sufficiency() and completion_status() require one or both of the added columns {term_i, timely_term}.

Value

Data frame with the following properties:

  • Data frame class is preserved. Groups and keys are not preserved.

  • Rows are filtered for unique mcid values.

  • Column {mcid} is retained (all other columns are dropped). New columns added:

    • term_i.   Initial term of a student's longitudinal record, encoded YYYYT. Extracted from midfield_rec.

    • level_i.   Character. Student level (01 Freshman, 02 Sophomore, etc.) in their initial term. Extracted from midfield_rec.

    • adj_span.   Numeric. Integer span of years for timely completion adjusted for a student's initial level.

    • timely_term.   Character. Latest term by which program completion would be considered timely for every student. Encoded YYYYT.

Examples

# Start with an excerpt from the student data set 
dframe <- toy_student[1:10, .(mcid)]

# Add timely completion term column
timely_term(dframe, toy_term)

# Define timely completion as 200% of scheduled span (8 years)
timely_term(dframe, toy_term, span = 8)

# Existing timely_term column, if any, is overwritten
dframe[, timely_term := NA_character_][]
timely_term(dframe, toy_term)

Course data for examples

Description

Selected variables modeled on those in the course practice data for use in package examples and articles. Sampled from an early version of the practice data, the toy data are not a current practice data sample.

Usage

toy_course

Format

data.table with 5812 rows and 12 columns keyed by student ID, term, course abbreviation, and course number.

mcid

Character, de-identified student ID. Key column.

term

Character, academic year and term, format YYYYT. Key column.

abbrev

Character, course alphabetical identifier, e.g. ENGR, MATH, ENGL. Key column.

number

Character, course numeric identifier, e.g. 101, 3429. Key column.

institution

Character, de-identified institution name, e.g., Institution A, Institution B, etc.

course

Character, course name, e.g., Astrophysics III, Calculus For Social Science And Business, Corp Financial Rprtng 1, Environmental Sanitation II, Fitness and Wellness, Introductory Astronomy 2, Our Changing Environment, etc.

section

Character, course section identifier, from one to four characters, e.g., 1, 2, 01, 14, 001, 040, 785, H02, R01, 300E, 888R, etc.

type

Character, predominant delivery method for this section, e.g., Blended, Distance Education, Face-to-Face, Online, etc.

faculty_rank

Character, academic rank of the person teaching the course, e.g., Assistant Professor, Associate Professor, Graduate Assistant, Visiting Faculty, etc.

hours_course

Numeric, number of credit-hours for successful course completion.

grade

Character, course grade, e.g., A+, A, A-, B+, I, NG, etc.

discipline_midfield

Character, a variable for grouping courses by academic discipline assigned by the pre-2023 MIDFIELD data curator, e.g., Anthropology, Business, Computer Science, Engineering, Language and Literature, Mathematics, Visual and Performing Arts, etc.
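To illustrate what a composite key implies, the sketch below checks that the four key columns jointly identify each row of a small made-up course table. It uses base R only. The records are hypothetical, and the check itself is not part of midfieldr.

```r
# Made-up course records with the four key columns plus one value column
course <- data.frame(
  mcid = c("MCID001", "MCID001", "MCID002"),
  term = c("19881", "19881", "19881"),
  abbrev = c("MATH", "ENGR", "MATH"),
  number = c("101", "101", "101"),
  grade = c("A", "B", "C")
)

# A composite key is valid when no combination of key values repeats
key_cols <- c("mcid", "term", "abbrev", "number")
key_is_unique <- anyDuplicated(course[, key_cols]) == 0L
key_is_unique
```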

See Also

Other toy-data: toy_degree, toy_student, toy_term


Degree data for examples

Description

Selected variables modeled on those in the degree practice data for use in package examples and articles. Sampled from an early version of the practice data, the toy data are not a current practice data sample.

Usage

toy_degree

Format

data.table with 96 rows and 4 columns keyed by student ID, term, and program (CIP code or degree label).

mcid

Character, de-identified student ID. Key column.

term_degree

Character, academic year and term in which a student completes their program, format YYYYT.

cip6

Character, 6-digit CIP code of the program in which a student earns a degree. Key column.

institution

Character, de-identified institution name, e.g., Institution A, Institution B, etc.

degree

Character, type of degree awarded, e.g., Bachelor of Arts in Geography, Bachelor of Science in Finance, etc.

See Also

Other toy-data: toy_course, toy_student, toy_term


Student data for examples

Description

Selected variables modeled on those in the student practice data for use in package examples and articles. Sampled from an early version of the practice data, the toy data are not a current practice data sample.

Usage

toy_student

Format

data.table with 150 rows and 13 columns keyed by student ID.

mcid

Character, de-identified student ID. Key column.

race

Character, race/ethnicity as self-reported by the student, e.g., Asian, Black, Hispanic, etc.

sex

Character, sex as self-reported by the student, possible values are Female, Male, and Unknown.

institution

Character, de-identified institution name, e.g., Institution A, Institution B, etc.

transfer

Character, transfer status, possible values are First-Time in College, First-Time Transfer.

hours_transfer

Numeric, number of credit hours transferred (or NA).

age_desc

Character, age group, possible values are 25 and Older, Under 25.

us_citizen

Character, US citizenship, possible values are No, Yes.

home_zip

Character, home ZIP code (or NA), e.g., 02056, 20170, 51301, 80129, etc.

high_school

Character, code for the last high school attended before admission (or NA), e.g., 060075, 210512, 431800, 502195, etc.

sat_math

Numeric, SAT mathematics test score (or NA).

sat_verbal

Numeric, SAT reading test score (or NA).

act_comp

Numeric, ACT composite test score (or NA).

See Also

Other toy-data: toy_course, toy_degree, toy_term


Term data for examples

Description

Selected variables modeled on those in the term practice data for use in package examples and articles. Sampled from an early version of the practice data, the toy data are not a current practice data sample.

Usage

toy_term

Format

data.table with 1095 rows and 13 columns keyed by student ID and term.

mcid

Character, de-identified student ID. Key column.

term

Character, academic year and term, format YYYYT. Key column.

cip6

Character, 6-digit CIP code of program in which a student is enrolled in a term, e.g., 090101, 141201, 260901, 420101, etc.

institution

Character, de-identified institution name, e.g., Institution A, Institution B, etc.

level

Character, 01 Freshman, 02 Sophomore, etc. The equivalent values in the current practice data are 01 First-Year, 02 Second-Year, etc.

standing

Character, academic standing, e.g., Good Standing, Academic Warning, etc.

coop

Character, cooperative education term, possible values are Yes, No.

hours_term

Numeric, credit hours earned in the term.

hours_term_attempt

Numeric, credit hours attempted in the term.

hours_cumul

Numeric, cumulative credit hours earned.

hours_cumul_attempt

Numeric, cumulative credit hours attempted.

gpa_term

Numeric, term grade point average.

gpa_cumul

Numeric, cumulative grade point average.
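Because term values use the fixed-width format YYYYT, the academic year and the term digit can be separated by position. The following base-R sketch uses made-up term values and does not interpret the meaning of the term digit.

```r
# Made-up term values in YYYYT format
term <- c("19881", "19883", "20021")

# First four characters: academic year; fifth character: term digit
term_year <- as.integer(substr(term, 1, 4))
term_digit <- as.integer(substr(term, 5, 5))

data.frame(term, term_year, term_digit)
```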

See Also

Other toy-data: toy_course, toy_degree, toy_student