Submit a Dataset
The goal of this repository is to provide a collection of relevant and engaging sports datasets to be used in class, handouts, assignments, projects, and SCORE modules. We are always looking for new datasets and welcome public submissions, but these datasets will need to meet the following requirements:
- The data must be publicly shareable, and there should not be any licensing restrictions that prevent us from sharing it for educational purposes. By submitting, you certify that the dataset can be made public, and that you accept all responsibility for the publication of the data. The SCORE Network is not responsible for publication of the data.
- If you used code (such as an
R
package) to access the data then include your code with your submission. The code will be publicly available on this repository’s GitHub page for others to view. - There should be an interesting sports question coupled with a statistics and data science topic to motivate the use of the dataset in educational material.
- The dataset should be in a standard format such as a CSV file and be of reasonable size. GitHub has a file size limit of 100 MB, and large files can be inconvenient for students. We recommend compressing files larger than a few megabytes. Note that gzip compression is a good choice since common tools such as
R
andPython
feature ways to read.csv.gz
files directly.
Two ways to submit a dataset
There are two ways to contribute a dataset:
Submit a dataset with the necessary information using this Google form. This is the recommended option for contributors who are not experienced with quarto or Rmarkdown files. The form will describe what information is necessary for us to create the appropriate page for your submitted dataset.
Fill out a quarto template file with the dataset information, and then submit either via email or directly with a GitHub pull request.
The rest of this page includes information about submitting a dataset using the quarto template file.
Quarto template file submission steps
This data repository is built using quarto and rendered into a website. You can get the template file in two ways:
- Copy the
_dataset-template.qmd
file from our GitHub repository and save it on your computer. Once you’re done, you can email us the file and the data (see below for email address). - Fork our GitHub repository into your own GitHub account and edit it like any other Git repository. Once you’re done, you can submit a pull request.
Dataset template
The dataset template is a Quarto file. The template asks for:
- Basic metadata to tag the dataset, such as the statistics & data science topic, a short title, and your name. (All submissions are credited with the name of the submitter.)
- A description of the sports question to motivate analysis with the dataset.
- A description of the dataset with documentation about each variable. The file should make it clear what a single row of the dataset represents, and clearly describe each of the variables (including units of measure when relevant).
- References to the original source of the data. If the source is a code package, then include an appropriate citation with a link to the package. If the source is an academic paper and the data is deposited in a third-party repository (e.g., Figshare), then include references to both the paper and the archived dataset. When available, include the DOIs of the references.
The template is meant to prevent a common problem with course datasets: they get passed down from instructor to instructor, and eventually all information about the original source is lost. The dataset may be presented to students without context or important details (like units), and the instructor may not be able to find the original data or references to answer questions about the data.
Submitting template file
Once you’ve filled in the template, you can either:
Email the template file, the data, and any code for preparing the data to the repository editor (currently Ron Yurko: ryurko AT stat DOT cmu DOT edu).
Submit a pull request to our repository on GitHub. Save the Qmd with a meaningful name and place it in its relevant sports directory (e.g.,
baseball/
ortennis/
). Place the data files indata/
. Place any code to prepare the data in the_prep/
folder, inside another folder with a name matching the submitted dataset. Once you’ve committed all of the necessary files (without anything extraneous) then you can submit a pull request.