---
title: "Programs"
output:
  rmarkdown::html_vignette:
    css: extra.css
description: >
  Exploring the CIP codes for US academic programs. 
vignette: >
  %\VignetteIndexEntry{Programs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: ../inst/REFERENCES.bib
link-citations: yes
always_allow_html: true
toc-depth: 2
---

```{r child = "../man/rmd/common-setup.Rmd"}
```

```{r}
#| include: false
knitr::opts_chunk$set(fig.path = "../man/figures/art-040-")
```

In the US, instructional programs are encoded by 6-digit numbers curated by the US Department of Education. The US standard encoding format is a two-digit number followed by a period, followed by a four-digit number, for example, 14.0102. MIDFIELD uses the same numerals, but omits the period, i.e., 140102, and stores the variable as a  character string. 

This article in the MIDFIELD workflow. 

1. Planning  
1. [Initial processing]{.accent}
    - Data sufficiency  
    - Degree seeking  
    - [Identify programs]{.accent} 
1. Blocs  
1. Groupings  
1. Metrics  
1. Displays 


## Definitions

```{r child = "../man/rmd/define-program.Rmd"}
``` 

```{r child = "../man/rmd/define-cip.Rmd"}
``` 

```{r child = "../man/rmd/define-cip6.Rmd"}
```


## Method

We search the `cip` data set included with midfieldr using a variety of techniques to obtain the set of 6-digit CIP codes for the programs under study. We assign custom program names to codes or groups of codes. 


## Taxonomy

Academic programs have three levels of codes and names:

- 6-digit code, a specific program
- 4-digit code, a group of 6-digit programs of comparable content 
- 2-digit code, a grouping of 4-digit groups of related content 

Specialties within a discipline are encoded at the 6-digit level, the discipline itself is represented by one or more 4-digit codes (roughly  corresponding to an academic department), and a collection of disciplines  are represented by one or more 2-digit codes (roughly corresponding to an academic college).  

For example, Geotechnical Engineering (140802) is a specialty in  Civil Engineering (1408) which is a department in the college of Engineering (14). 

```{r}
#| echo: false

library("midfieldr")
df <- filter_cip("^41")
n41 <- nrow(df)
n4102 <- nrow(filter_cip("^4102", cip = df))
n4103 <- nrow(filter_cip("^4103", cip = df))
name41 <- unique(df[, cip2name])

df24 <- filter_cip("^24")
n24 <- nrow(df24)
name24 <- unique(df24[, cip2name])

df51 <- filter_cip("^51")
n51 <- nrow(df51)
name51 <- unique(df51[, cip2name])

df1313 <- filter_cip("^1313")
n1313 <- nrow(df1313)
name1313 <- unique(df1313[, cip2name])
```

To illustrate the taxonomy in a little more detail, we show in the table the programs assigned to the 2-digit code 41, "`r name41`". This 2-digit grouping is subdivided into `r length(unique(df[, cip4]))` groups at the 4-digit level (codes 4100--4199) which are further subdivided into `r n41` programs at the 6-digit level (codes 410000--419999). 

```{r}
#| echo: false
library(gt)
x <- filter_cip("^41")
x[2:9, cip2name := "\U2003\U02193"]
x[c(4, 5, 7, 8), cip4name := "\U2003\U02193"]

x |>
  gt() |>
  tab_caption("Table 1. CIP taxonomy") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )
```

A 2-digit program can include anywhere from four 4-digit programs (e.g., code 24 `r name24`) to `r n51` 4-digit programs (e.g., code 51 `r name51`). 

And 4-digit programs include anywhere from one 6-digit program (e.g., code 4100 above) to `r n1313` 6-digit programs (e.g., code 1313 `r name1313`).

Unfortunately, some disciplines can comprise more than one 4-digit code. For example, the programs that comprise the broad discipline of Industrial and Systems Engineering encompass four distinct 4-digit codes: 1427 Systems Engineering, 1435 Industrial Engineering, 1436 Manufacturing Engineering, and 1437 Operations Research. Hence the importance of being able to search all CIP data for programs of interest. 


## Load data

*Start.* &nbsp; If you are writing your own script to follow along, we use these packages in this article:

```{r}
library(midfieldr)
library(data.table)
```

*Loads with midfieldr.* &nbsp; Prepared data, adapted from [@NCES:2010]. View data dictionary via `?cip`. 

- `cip`   


## Inspect the `cip` data

First glance.

```{r}
# Loads with midfieldr
cip
```

All variables in `cip` are character strings, which protects the leading zeros of some CIP codes.

```{r}
# Names and class of the CIP variables
cip[, lapply(.SD, class)]
```

The number of unique programs.  

```{r}
# 2-digit level
sort(unique(cip$cip2))

# 4-digit level
length(unique(cip$cip4))

# 6-digit level
length(unique(cip$cip6))
```

A sample of program names uses a random number generator, so your result will differ from that shown. 

```{r}
#| echo: false
set.seed(20210613)
```

```{r}
# 2-digit name sample
sample(cip[, cip2name], 10)

# 4-digit name sample
sample(cip[, cip4name], 10)

# 6-digit name sample
sample(cip[, cip6name], 10)
```

```{r}
#| echo: false
set.seed(NULL)
```


## `filter_cip()`

Subset the `cip` data frame, retaining rows that match or partially match a vector of character strings. 

*Arguments.*

- **`keep_text`** &nbsp; Character vector of search text for retaining rows, not case-sensitive. Can be empty if `drop_text` is used. 

- **`drop_text`** &nbsp; Character vector of search text for dropping rows, not case-sensitive, default NULL. Argument to be used by name.

- **`cip`** &nbsp; Data frame to be subset, default `cip`. Argument to be used by name.

- **`select`** &nbsp; Character vector of column names to search and return, default all columns. Argument to be used by name. 

*Equivalent usage.* &nbsp; The following implementations yield identical results,

```{r}
# First argument named, CIP argument if used must be named
x <- filter_cip(keep_text = c("engineering"), cip = cip)

# First argument unnamed, use default CIP argument
y <- filter_cip("engineering")

# Demonstrate equivalence
check_equiv_frames(x, y)
```

*Output.* &nbsp; Subset of `cip` with rows matching elements of `keep_text`. Additional subsetting if optional arguments specified. Examples follow. 


## Using a keyword search

Filtering the CIP data for all programs containing the word "engineering" yields `r nrow(filter_cip("engineering"))` observations. 

```{r}
# Filter basics
filter_cip("engineering")
```

The `drop_text` and `select` arguments have to be named explicitly. Columns in `select`  are subset after filtering for `keep_text` and `drop_text`. 

```{r}
# Optional arguments drop_text and select
filter_cip("engineering",
  drop_text = c("related", "technology", "technologies"),
  select = c("cip6", "cip6name")
)
```

Suppose we want to find the CIP codes and names for all programs in Civil Engineering. The search is insensitive to case, so we start with the following code chunk. 

```{r}
#| eval: false
# Example 1 filter using keywords
filter_cip("civil")
```

```{r}
#| echo: false
filter_cip("civil") |>
  gt() |>
  tab_caption("Table 2. Search results") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )
```

The search returns some programs with Civilization in their names as well as Engineering Technology. If we wanted Civil Engineering only, we can use a sequence of function calls, where the outcome of the one operation is assigned to the first argument of the next operation.  

The following code chunk could be read as, "Start with the default  `cip` data frame, then keep any rows in which 'civil' is detected, then keep any rows in which 'engineering' is detected, then drop any rows in which 'technology' is detected." The first pass operates on `cip`, but successive passes do not. If used, the `cip` argument must be named. 

```{r}
#| results: hide
# First search
first_pass <- filter_cip("civil")

# Refine the search
second_pass <- filter_cip("engineering", cip = first_pass)

# Refine further
third_pass <- filter_cip(drop_text = "technology", cip = second_pass)
```

```{r}
#| echo: false
third_pass |>
  gt() |>
  tab_caption("Table 3. Search results") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )
```

*Equivalent usage.* &nbsp; Seeing that all Civil Engineering programs have the same `cip4name`, we could have used  `keep_text = c("civil engineering")` to narrow the search to rows that match the full phrase. The following implementations yield identical results,

```{r}
# Three passes
x <- filter_cip("civil")
x <- filter_cip("engineering", cip = x)
x <- filter_cip(drop_text = "technology", cip = x)

# Combined search
y <- filter_cip("civil engineering", drop_text = "technology")

# Demonstrate equivalence
check_equiv_frames(x, y)
```


## Using a numerical code search

Suppose we want to study programs relating to German culture, language, and literature. Using  "german" for the `keep_text` value yields 

```{r}
#| eval: false
# Search on text
filter_cip("german")
```

```{r}
#| echo: false
filter_cip("german") |>
  gt() |>
  tab_caption("Table 4. Search results") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )
```

From the 6-digit program names we find only two that are of interest, German Studies (050125) and German Language and Literature (160501). We use a character vector to assign these two codes to the `keep_text` argument. 
 
```{r}
#| eval: false
# Search on codes
filter_cip(c("050125", "160501"))
```

```{r }
#| echo: false
filter_cip(c("050125", "160501")) |>
  gt() |>
  tab_caption("Table 5. Search results") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )
```

If the 6-digit codes are entered as integers, they produce an error. 
 
```{r}
#| error: true

# Search that produces an error
filter_cip(c(050125, 160501))
```


## Using a regular expression search

Specifying 4-digit codes yields a data frame all 6-digit programs containing the 4-digit string. We use the regular expression notation `^` to match the start of the strings.

```{r}
#| eval: false
# example 3 filter using regular expressions
filter_cip(c("^1410", "^1419"))
```

```{r}
#| echo: false
filter_cip(c("^1410", "^1419")) |>
  gt() |>
  tab_caption("Table 6. Search results") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )
```

The 2-digit series represent the most general groupings of related programs. Here, the result includes all History programs. 

```{r}
#| eval: false
# Search on 2-digit code
filter_cip("^54")
```

```{r}
#| echo: false
filter_cip("^54") |>
  gt() |>
  tab_caption("Table 7. Search results") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )
```

The series argument can include any combination of 2, 4, and 6-digit codes. It can also be passed to the function as a character vector. 

```{r}
#| eval: false
# Search on vector of codes
codes_we_want <- c("^24", "^4102", "^450202")
filter_cip(codes_we_want)
```

```{r}
#| echo: false
codes_we_want <- c("^24", "^4102", "^450202")
filter_cip(codes_we_want) |>
  gt() |>
  tab_caption("Table 8. Search results") |>
  tab_options(table.font.size = "small") |>
  opt_stylize(style = 1, color = "gray") |>
  tab_style(
    style = list(cell_fill(color = "#c7eae5")),
    locations = cells_column_labels(columns = everything())
  )
```


## When search terms cannot be found

If the `keep_text` argument includes terms that cannot be found in the CIP data frame, the unsuccessful terms are identified in a message and the successful terms produce the usual output. 

For example, the following `keep_text` argument includes three search terms that are not present in the CIP data ("111111", "^55", and "Bogus") and two that are ("050125" and "160501").  

```{r}
#| message: true
# Unsuccessful terms produce a message
sub_cip <- filter_cip(c("050125", "111111", "160501", "Bogus", "^55"))

# But the successful terms are returned
sub_cip
```

However, as seen earlier, if none of the search terms are found, an error occurs. 

```{r}
#| error: true
# When none of the search terms are found
filter_cip(c("111111", "Bogus", "^55"))
```


## CIP data from another source

If you use a CIP data set from another source, it must have the same structure as `cip`: six character columns named as follows,  

```{r}
# Name and class of variables (columns) in cip
unlist(lapply(cip, FUN = class))
```


## Assigning program names

Programs in MIDFIELD data sets are encoded by 6-digit CIP codes. As we've shown, multiple 6-digit codes can be considered specialties within a larger program with a 4-digit code or even a set of distinct 4-digit codes. Thus the program names in `cip` are generally inadequate for grouping and summarizing. User-defined program names are nearly always required.  

> *Most studies require deliberate assignment of user-defined program names to CIP codes or groups of CIP codes.* 

Here we demonstrate the creation of a data frame with all 6-digit CIP codes in a study plus their user-defined names. 

By searching `cip`, we can find that the 4-digit codes for the four engineering programs are: Civil (1408), Electrical (1410), Mechanical (1419), and Industrial/Systems (1427, 1435, 1436, and 1437). 

We obtain their 6-digit CIP codes. The 4-digit names are appropriate here.  Our  task is to create a variable with custom program names. 

```{r}
# Changing the number of rows to print
options(datatable.print.nrows = 15)

# Four engineering programs
four_programs <- filter_cip(c("^1408", "^1410", "^1419", "^1427", "^1435", "^1436", "^1437"))

# Retain the needed columns
four_programs <- four_programs[, .(cip6, cip4name)]
four_programs
```

To make the assignments clear, our approach here will be to assign a new `program` column with NA values, then edit the new column values. 

```{r}
# Assign a new column
four_programs[, program := NA_character_]
four_programs
```

### 1. Use `cip4name %ilike%` to recode one value

The `%like%` function is essentially a wrapper function around the base R `grepl()` function. The `%ilike%` version is case-insensitive. You can view the help page by running  (the back-ticks facilitate a help search for terms starting with a symbol):

```{r}
#| eval: false
# Run in Console
? `%like%`
```

In this approach, we search for one distinctive term only. We're using abbreviations for compact output. 

```{r}
# Recode program using the 4-digit name
four_programs[cip4name %ilike% "electrical", program := "EE"]
four_programs
```

### 2. Use `cip6 %like%` to recode one value

In our second approach, we use the `%like%` function again, but apply it to a CIP code. Here we use the regular expression `^1408` meaning "starts with 1408." 

```{r}
# Recode program using the 4-digit code
four_programs[cip6 %like% "^1408", program := "CE"]
four_programs
```


### 3. Use `program := fcase()` to edit all values 

In this approach, we use the data.table function `fcase()`, an implementation of the SQL CASE WHEN statement. The data.table function `%chin%` is like `%in%`, but for character vectors.

```{r}
# Recode all program values
four_programs[, program := fcase(
  cip6 %like% "^1408", "CE",
  cip6 %like% "^1410", "EE",
  cip6 %like% "^1419", "ME",
  cip6 %chin% c("142701", "143501", "143601", "143701"), "ISE"
)]
four_programs <- four_programs[, .(cip6, program)]
four_programs
```

*Verify prepared data.* &nbsp; `study_programs`, included with midfieldr, contains the case study information developed above. Here we verify that the two data frames have the same content.  

```{r}
# Demonstrate equivalence
check_equiv_frames(four_programs, study_programs)
```


## Reusable code

*Preparation.* &nbsp; To provide a working example, we select the four engineering programs of the case study used throughout the articles (Civil, Electrical, Industrial/Systems, and Mechanical Engineering). We assume a prior search of `cip` yielded the relevant codes used here. Requires editing before reuse with different programs.

```{r}
# Edit as required for different programs
selected_programs <- filter_cip(c("^1408", "^1410", "^1419", "^1427", "^1435", "^1436", "^1437"))
```

*Programs.* &nbsp; A summary code chunk for ready reference. Requires editing before reuse with different programs.  

```{r}
# Recode program labels. Edit as required.
selected_programs[, program := fcase(
  cip6 %like% "^1408", "CE",
  cip6 %like% "^1410", "EE",
  cip6 %like% "^1419", "ME",
  cip6 %chin% c("142701", "143501", "143601", "143701"), "ISE"
)]
selected_programs <- selected_programs[, .(cip6, program)]
```


## References

<div id="refs"></div>

```{r child = "../man/rmd/common-closing.Rmd"}
```