Submit a Dataset

The goal of this repository is to provide a collection of relevant and engaging sports datasets to be used in class, handouts, assignments, projects, and SCORE modules. We are always looking for new datasets and welcome public submissions, but these datasets will need to meet the following requirements:

Two ways to submit a dataset

There are two ways to contribute a dataset:

  1. Submit a dataset with the necessary information using this Google form. This is the recommended option for contributors who are not experienced with quarto or Rmarkdown files. The form will describe what information is necessary for us to create the appropriate page for your submitted dataset.

  2. Fill out a quarto template file with the dataset information, and then submit either via email or directly with a GitHub pull request.

The rest of this page includes information about submitting a dataset using the quarto template file.

Quarto template file submission steps

This data repository is built using quarto and rendered into a website. You can get the template file in two ways:

  1. Copy the _dataset-template.qmd file from our GitHub repository and save it on your computer. Once you’re done, you can email us the file and the data (see below for email address).
  2. Fork our GitHub repository into your own GitHub account and edit it like any other Git repository. Once you’re done, you can submit a pull request.

Dataset template

The dataset template is a Quarto file. The template asks for:

  • Basic metadata to tag the dataset, such as the statistics & data science topic, a short title, and your name. (All submissions are credited with the name of the submitter.)
  • A description of the sports question to motivate analysis with the dataset.
  • A description of the dataset with documentation about each variable. The file should make it clear what a single row of the dataset represents, and clearly describe each of the variables (including units of measure when relevant).
  • References to the original source of the data. If the source is a code package, then include an appropriate citation with a link to the package. If the source is an academic paper and the data is deposited in a third-party repository (e.g., Figshare), then include references to both the paper and the archived dataset. When available, include the DOIs of the references.

The template is meant to prevent a common problem with course datasets: they get passed down from instructor to instructor, and eventually all information about the original source is lost. The dataset may be presented to students without context or important details (like units), and the instructor may not be able to find the original data or references to answer questions about the data.

Submitting template file

Once you’ve filled in the template, you can either:

  1. Email the template file, the data, and any code for preparing the data to the repository editor (currently Ron Yurko: ryurko AT stat DOT cmu DOT edu).

  2. Submit a pull request to our repository on GitHub. Save the Qmd with a meaningful name and place it in its relevant sports directory (e.g., baseball/ or tennis/). Place the data files in data/. Place any code to prepare the data in the _prep/ folder, inside another folder with a name matching the submitted dataset. Once you’ve committed all of the necessary files (without anything extraneous) then you can submit a pull request.