DATA 598 A Wi 20: Special Topics In Data Science: Reproducibility for Data Science

Instructors

Professor Ben Marwick (read about my values, ethics & expectations)
Contact details & office location

Office Hours:  make an appointment 
TA: Ms Liying Wang (liying15@uw.edu)
Office Hours: by appointment 

Course description

This course, a requirement for the UW Master of Science in Data Science, introduces students to the principles and tools for computational reproducibility in data science using R. Topics covered include acquiring, cleaning and manipulating data in a reproducible workflow using the tidyverse. Students will use literate programming tools, and explore best practices for organizing data analyses. Students will learn to write documents using R markdown, compile R markdown documents using knitr and related tools, and publish reproducible documents to various common formats. Students will learn strategies and tools for packaging research compendia, dependency management, and containerising projects to provide computational isolation. To complement the learning experience of this class, I recommend subscribing to the email list of the UW Reproducible Research Special Interest Group so you can be notified of relevant activities and visiting speakers.

Learning goals

By the end of this course you should be able to:

  1. Use R to acquire, clean and manipulate data and organise workflows in a reproducible and well-documented way
  2. Write R Markdown documents that include narrative text, code, and the usual elements of professional writing
  3. Use Git and GitHub to collaborate on writing reproducible data science
  4. Make a compendium R package to document and manage dependencies in an analysis
  5. Use Docker and related tools to provide computational isolation for a data science project, and make it accessible to others

This is a 2 credit class. We will meet once a week for 2 hours for a combination of lecture, discussion, and hands-on laboratory work. According to the UW guidelines on credit hours, you should plan to do 2-4 hours of homework per week for this class. A calendar view of the course assessments can be seen here

Schedule of topics

The last day of instruction is Monday 9 March. Any course work due after that date can be completed and submitted remotely. There is no final exam, so you do not need to be on campus after 9 March. 

Week 1

Topic: Definitions and debates about, and calls for, reproducibility, R, RStudio, Projects (slides)

Reading:

*Barba, L. A. (2018). Terminologies for reproducible research. arXiv preprint arXiv:1802.03311.

Bryan, J. (2017) Project-oriented workflow. Tidyverse Blog. https://www.tidyverse.org/articles/2017/12/workflow-vs-script/

R packages: here, renv, remotes, pak, fs, sessioninfo, conflicted

Week 2

Topic: Git and GitHub for tracking the forking paths of a project and collaboration (slides)

Reading:

*Ram, K. (2013). Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8(1), 7. DOI: 10.1186/1751-0473-8-7

Blischak, John D., Emily R. Davenport, and Greg Wilson. "A quick introduction to version control with Git and GitHub." PLoS Computational Biology 12, no. 1 (2016): e1004668.

R packages: usethis, git2r, gh, gitty

Week 3

Topic: Introduction to R Markdown for documenting a data science project (public holiday - no slides)

Reading:

Baumer, B., Cetinkaya-Rundel, M., Bray, A., Loi, L., & Horton, N. J. (2014). R Markdown: Integrating a reproducible analysis tool into introductory statistics. arXiv preprint arXiv:1402.1894.

*Baumer, B., & Udwin, D. (2015). R markdown. Wiley Interdisciplinary Reviews: Computational Statistics, 7(3), 167-177. DOI: 10.1002/wics.1348

R packages: rmarkdown, knitr, rticles, bookdown

Week 4

Topic: Reproducibly manipulating data with the tidyverse, writing functions (slides)

Reading:

*Wickham, H., Averick, M., Bryan, J. et al. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686. DOI: 10.21105/joss.01686

Wickham, H., & Grolemund, G. (2016). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly Media, Inc. https://r4ds.had.co.nz/

R packages: tidyverse: dplyr, tidyr, purrr, readr

Week 5

Topic: Advanced R Markdown documents: external code, caching, templates, etc. (slides)

Reading:

*Xie, Y., Allaire, J. J., & Grolemund, G. (2018). R markdown: The definitive guide. CRC Press. Chapter 2: Basics

Xie, Y., Hill, A. P., & Thomas, A. (2017). Blogdown: Creating websites with R markdown. Chapman and Hall/CRC.

R packages: knitr, redoc, pkgdown, blogdown

Week 6

Topic: Packaging data science projects for reproducibility (slides)

Reading:

*Marwick, Ben, Carl Boettiger, and Lincoln Mullen. Packaging data analytical work reproducibly using R (and friends). The American Statistician 72, no. 1 (2018): 80-88. DOI: 10.1080/00031305.2017.1375986

Vuorre, M., & Crump, M. J. C. (2020). Sharing and organizing research products as R packages. PsyArXiv preprint PsyArXiv:jks2u. DOI: 10.31234/osf.io/jks2u

Blischak JD, Carbonetto P, and Stephens M. Creating and sharing reproducible research code the workflowr way [version 1; peer review: 3 approved]. F1000Research 2019, 8:1749 DOI: 10.12688/f1000research.20843.1

R packages: rrtools, workflowr, drake

Week 7

Topic: Advanced packages for data science projects: testing, documentation, checking (public holiday - no slides)

Reading:

Wickham, H. (2015). R packages: Organize, test, document, and share your code. O'Reilly Media, Inc. http://r-pkgs.had.co.nz/

R packages: usethis, testthat, tinytest

Week 8

Topic: Continuous integration (slides)

Reading:

*Beaulieu-Jones, B. K., & Greene, C. S. (2017). Reproducibility of computational workflows is automated using continuous analysis. Nature Biotechnology, 35(4), 342. DOI: 10.1038/nbt.3780

R packages: usethis, actions, ghactions, travis, tic, circleci

Week 9

Topic: Containerising data science projects (slides)

Reading:

*Boettiger, C., & Eddelbuettel, D. (2017). An introduction to rocker: Docker containers for R. The R Journal 9:2, pages 527-536.

Nüst, D. et al. (2020) The Rockerverse: Packages and Applications for Containerization with R. arXiv preprint arXiv:2001.10641

R packages: containerit, liftr, holepunch

Week 10

Topic: Showcasing and archiving projects (slides)

Reading:

*Lowndes, J. S. S., Best, B. D., Scarborough, C., Afflerbach, J. et al. (2017). Our path to better science in less time using open data science tools. Nature Ecology & Evolution, 1(6), 0160. DOI: 10.1038/s41559-017-0160

Eglen, S., Marwick, B., Halchenko, Y. et al. Toward standard practices for sharing computer code and programs in neuroscience. Nat Neurosci 20, 770–773 (2017). https://doi.org/10.1038/nn.4550

R packages: Code Ocean, WholeTale, NextJournal, Zenodo, Figshare

Course assessment and expectations

The course grade consists of the following components and percentages:

There are some opportunities for extra credit:

Your lowest two scores from the Reading annotations assignment sets will be dropped from the final grade calculation. This gives you some flexibility to miss some assignments, for example if you have job interviews or busy work periods, without affecting your final grade. 
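As a rough illustration of the drop rule above, the sketch below discards the two lowest Reading annotations scores before averaging the rest. It assumes every assignment set is scored on the same scale and weighted equally, which this syllabus does not state; the function name `mean_after_dropping` and the example scores are hypothetical.

```python
def mean_after_dropping(scores, n_drop=2):
    """Average the scores that remain after removing the n_drop lowest.

    Assumption (not stated in the syllabus): all scores share one scale
    and are weighted equally.
    """
    kept = sorted(scores)[n_drop:]  # the lowest n_drop scores are discarded
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Hypothetical scores: two missed sets (0) do not lower the average.
print(mean_after_dropping([10, 0, 8, 9, 0]))
```

With these example scores, the two zeros are the ones dropped, so missing two sets costs nothing.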

Late submissions: Each student receives three free "late days", each of which allows you to submit an assignment up to 24 hours late without penalty. You will need to notify us by stating on your submitted work that you are using a late day. Once you have used up all late days, assignments will have 10% deducted from the grade per day (including weekends). Assignments will not be accepted more than seven days after the due date (which means you’ll get a zero score for that assignment) without prior arrangement. Let me know as soon as you anticipate missing a due date by more than seven days. I review late requests and circumstances on a case by case basis and make decisions accordingly. I generally accommodate unexpected family and medical circumstances, and scheduled religious activities. If you anticipate any disruptions, please let us know so we can help you to plan for success.
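To make the arithmetic of the late policy concrete, here is a minimal sketch. It is not an official calculator and rests on several assumptions: scores are on a 0-100 scale, "10% per day" means 10 percentage points per 24-hour period, each late day you declare covers one such period, and the seven-day cutoff applies with no prior arrangement. The function name `adjusted_score` and its parameters are hypothetical.

```python
def adjusted_score(score, days_late, late_days_used=0):
    """Apply the late policy to a 0-100 score.

    days_late: number of 24-hour periods after the due date (0 = on time).
    late_days_used: free late days declared on this submission (0 to 3).
    Assumption: the penalty is 10 percentage points per uncovered day.
    """
    if days_late <= 0:
        return float(score)
    if days_late > 7:
        # Not accepted more than seven days late without prior arrangement.
        return 0.0
    penalised_days = max(0, days_late - late_days_used)
    return max(0.0, score - 10.0 * penalised_days)


print(adjusted_score(90, days_late=2, late_days_used=2))  # both days covered
print(adjusted_score(90, days_late=3, late_days_used=1))  # two penalised days
print(adjusted_score(90, days_late=8))                    # past the cutoff
```

Any individual arrangement made with the instructors takes precedence over this sketch.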

You will need a free GitHub account.  You may also find other useful things on this UW Libraries Resource Page on Reproducibility

Communication

We will use Campuswire for announcements and communication about the course. Sign up for an account there with your netid@uw.edu email address. That should give you access; if not, please email me for an access code. The TA and I will post important updates and announcements there. Check often, and be sure to enable your notifications, at least for the #announcements channel. We recommend you install the Campuswire app on your phone, and ensure that you can receive notifications. 

Our Campuswire discussion forum is a shared space for you to ask any class-related questions. Nothing you post will affect your grade in any way (for better or for worse), so please freely engage by posting questions and by helping your classmates with answers and discussions. Take the time to familiarize yourself with the platform. This is a short introduction to the system: https://www.youtube.com/watch?v=Rz268j1SEq0

Do post questions about lectures, assignments, etc. You can post a reply or a comment to other students' posts. You can choose to post your question anonymously, although the TA and I will see your identity. Remember that nothing you post will ever affect your grade! Appropriate, professional images and gifs are welcome. Use the #random channel to share relevant news articles that you come across. Collaborate with your classmates to organise your group project. Use the proper category when posting questions. Use all the tools there that you find helpful. You are welcome to DM me and the TA on Campuswire with questions, comments and feedback about the class. We love to hear from you and value your comments. 

Course policies

Grading

The following grading scale will be used:

Percent = Grade

95 = 4.0    88 = 3.3    81 = 2.6    74 = 1.9    67 = 1.2
94 = 3.9    87 = 3.2    80 = 2.5    73 = 1.8    66 = 1.1
93 = 3.8    86 = 3.1    79 = 2.4    72 = 1.7    65 = 1.0
92 = 3.7    85 = 3.0    78 = 2.3    71 = 1.6    64 = 0.9
91 = 3.6    84 = 2.9    77 = 2.2    70 = 1.5    63 = 0.8
90 = 3.5    83 = 2.8    76 = 2.1    69 = 1.4    60-62 = 0.7
89 = 3.4    82 = 2.7    75 = 2.0    68 = 1.3    <60 = 0.0
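The scale above is linear between 63% and 95%: each percentage point is worth 0.1 grade points. The sketch below encodes that pattern. It assumes that percentages above 95 map to 4.0 and that fractional percentages are truncated to a whole number before lookup; neither rule is stated in the table, and the function name `grade_points` is hypothetical.

```python
def grade_points(percent):
    """Convert a course percentage to a grade on the 4.0 scale.

    Assumptions (not stated in the table): values above 95 earn 4.0,
    and fractional percentages are truncated, not rounded.
    """
    p = int(percent)
    if p >= 95:
        return 4.0
    if p >= 63:
        # Work in tenths to avoid floating-point drift: 95 -> 40 tenths.
        return (40 - (95 - p)) / 10
    if p >= 60:
        return 0.7
    return 0.0


for pct in (95, 89, 75, 63, 61, 59):
    print(pct, grade_points(pct))
```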

Academic misconduct

The university’s policy on plagiarism and academic misconduct is a part of the Student Conduct Code, which cites the definition of academic misconduct in the WAC 478-121. (WAC is an abbreviation for the Washington Administrative Code, the set of state regulations for the university. The entire chapter of the WAC on the student conduct code is here.) According to this section of the WAC, academic misconduct includes: 

“Cheating”, such as “unauthorized assistance in taking quizzes”; “Falsification”, “which is the intentional use or submission of falsified data, records, or other information including, but not limited to, records of internship or practicum experiences or attendance at any required event(s), or scholarly research”; and “Plagiarism”, which includes “[t]he use, by paraphrase or direct quotation, of the published or unpublished work of another person without full and clear acknowledgment.” 

The UW Libraries have a useful guide for students at http://www.lib.washington.edu/teaching/plagiarism 

Accommodation

Your experience in this class is important to me. If you have already established accommodations with Disability Resources for Students (DRS), please communicate your approved accommodations to me at your earliest convenience so we can discuss your needs in this course. The website for the DRS provides other resources for students and faculty for making accommodations.

Washington state law requires that UW develop a policy for accommodation of student absences or significant hardship due to reasons of faith or conscience, or for organized religious activities. The UW’s policy, including more information about how to request an accommodation, is available at Religious Accommodations Policy (https://registrar.washington.edu/staffandfaculty/religious-accommodations-policy/). Accommodations must be requested within the first two weeks of this course using the Religious Accommodations Request form (https://registrar.washington.edu/students/religious-accommodations-request/). 

Inclusivity

Among the core values of the university are inclusivity and diversity, regardless of race, gender, income, ability, beliefs, and other ways that people distinguish themselves and others. If any assignments and activities are not accessible to you, please contact me so we can make arrangements to include you by making an alternative assignment available. 

Learning often involves the exchange of ideas. To include everyone in the learning process, we expect you will demonstrate respect, politeness, reasonableness, and willingness to listen to others at all times – even when passions run high. Behaviors must support learning, understanding, and scholarship.

Preventing violence is a shared responsibility in which everyone at the UW plays a part. If you experience harassment during your studies, please report it via the SafeCampus website (anonymous reports are possible, washington.edu/safecampus/). SafeCampus provides information on counseling and safety resources, University policies, and violence reporting requirements that help us maintain a safe personal, work and learning environment.

Absences

In cases of absences that result in a student missing a course requirement (e.g. a class activity, assignment submission, or exam), and in cases of extended absences, accommodations are left to the discretion of the instructor. Accommodations might include makeup exams, alternate assignments, or alternate weighting of missed work, so long as the grades for other students in the class are not affected by the accommodation.

Technology protocol

A laptop is required in class.

Participation rubric

Levels: Exemplary (90%-100%), Proficient (80%-90%), Developing (70%-80%), Unacceptable (<70%)

Frequency of participation in class

Exemplary: Student initiates contributions more than once in each class.

Proficient: Student initiates a contribution once in each class.

Developing: Student initiates a contribution in at least half of the classes.

Unacceptable: Student does not initiate contributions & needs instructor to solicit input.

Quality of comments

Exemplary: Comments always respectful, insightful & constructive; uses appropriate terminology. Comments balanced between general impressions, opinions & specific, thoughtful criticisms or contributions.

Proficient: Comments mostly respectful, insightful & constructive; mostly uses appropriate terminology. Occasionally comments are too general or not relevant to the discussion.

Developing: Comments are sometimes respectful, constructive, with occasional signs of insight. Student does not use appropriate terminology; comments not always relevant to the discussion.

Unacceptable: Comments are disrespectful or uninformative, lacking in appropriate terminology. Heavy reliance on opinion & personal taste, e.g., “I love it”, “I hate it”, “It’s bad” etc.

Listening Skills

Exemplary: Student listens attentively when others present materials, perspectives, as indicated by comments that build on others’ remarks, i.e., student hears what others say & contributes to the dialogue.

Proficient: Student is mostly attentive when others present ideas, materials, as indicated by comments that reflect & build on others’ remarks. Occasionally needs encouragement or reminder from instructor of focus of comment.

Developing: Student is often inattentive and needs reminder of focus of class. Occasionally makes disruptive comments while others are speaking.

Unacceptable: Does not listen to others; regularly talks while others speak or does not pay attention while others speak; detracts from discussion; sleeps, etc.

 

GitHub Organisation for the class
Notes on course planning  
Instructor/TA meeting notes

CC Attribution This course content is offered under a CC Attribution license. Content in this course can be considered under this license unless otherwise noted.