2 Package Design Notes

2.1 Example Dataset Designs

This section collects and/or defines example datasets, concordances and transformations from different domains.

Types of crosswalks/concordances include:

Classification changes - HS, ISIC etc.
Geographic (administrative/survey) area changes – Queensland Government Statistician’s Office

2.1.1 INDSTAT across ISIC standards

Initial reproducible approach: https://cynthiahqy.github.io/indstat-TPP/
Demo rewrite of first transformation step using old version of {conformr}: https://github.com/cynthiahqy/conformr-indstat

2.1.2 WTO/ComTrade Data across Harmonised System versions

2.1.3 ABS labor force statistics across ANZSCO

2.2 Existing Implementations and Approaches

Collection of current approaches to harmonisation

2.2.1 hadley_data-fuel-economy

From @wickham2014 and tidy-data vignette:

A more complicated situation occurs when the dataset structure changes over time. For example, the datasets may contain different variables, the same variables with different names, different file formats, or different conventions for missing values. This may require you to tidy each file to individually (or, if you’re lucky, in small groups) and then combine them once tidied. An example of this type of tidying is illustrated in https://github.com/hadley/data-fuel-economy, which shows the tidying of epa fuel economy data for over 50,000 cars from 1978 to 2008. The raw data is available online, but each year is stored in a separate file and there are four major formats with many minor variations, making tidying this dataset a considerable challenge.

2.2.2 schott_stata-trade-concordance

See: https://sompks4.github.io/sub_data.html

2.2.3 kolczynska_trust-in-institutions

Link to code for crosswalk based harmonisation: https://osf.io/qt2eb/

From @kolczynska2022:

Ex-post harmonization of survey data creates new opportunities for research by extending the geographical and/or time coverage of analyses. Researchers increasingly combine data from different survey projects to analyze them as a single dataset, and while teams engaged in data harmonization continue to expand the information they provide to end users, there are still no commonly agreed standards for the documentation of data processing. Existing harmonization project typically opt for recode scripts that are generally hard to read, modify, and reuse, although some projects make efforts to facilitate verification and reproduction. This paper describes an alternative procedure and a set of simple tools for the exploration, recoding, and documentation of harmonization of survey data, relying on crosswalks. The presented tools are flexible and software-agnostic. The illustrative example uses the programming language R and spreadsheets—both common software choices among social scientists. Harmonization of variables on trust in institutions from four major cross-national survey projects serves as an illustration of the proposed workflow and of opportunities harmonization creates.

2.3 Existing Crosswalk Packages

A collection of R packages that (I think) offer special cases or components of the conformr workflow

2.3.1 countrycode_look-up-tables

2.3.2 concordance_look-up-tables

2.3.3 debkeepr_non-decimal-currencies

By Jesse Sadler:

The debkeepr package provides an interface for working with non-decimal currencies that use tripartite or tetrapartite systems such as that of pounds, shillings, and pence. debkeepr makes it easier to perform arithmetic operations on non-decimal values and facilitates the analysis and visualization of larger sets of non-decimal values such as those found in historical account books. This is accomplished through the implementation of the deb_lsd,deb_tetra, and deb_decimal vector types, which are based on the infrastructure provided by the vctrs package. deb_lsd, deb_tetra, and deb_decimal vectors possess additional metadata to allow them to behave like numeric vectors in many circumstances, while also conforming to the workings of non-decimal currencies.