Data Science in R for Students of Humanities - NPFL112
Title: Analýza dat v R pro studenty humanitních oborů
Guaranteed by: Institute of Formal and Applied Linguistics (32-UFAL)
Faculty: Faculty of Mathematics and Physics
Valid: from 2022
Semester: summer
E-Credits: 3
Hours per week, examination: summer s.:0/2, C [HT]
Capacity: unlimited
Min. number of students: unlimited
4EU+: no
Virtual mobility / capacity: no
State of the course: taught
Language: Czech, English
Teaching methods: full-time
Additional information: https://ufal.mff.cuni.cz/courses/r-for-humanities/english
Guarantor: Mgr. Silvie Cinková, Ph.D.
Annotation -
Last update: doc. RNDr. Vladislav Kuboň, Ph.D. (05.06.2018)
The humanities have seen an irreversible paradigm shift towards Digital Humanities, based on automatic quantitative analysis of (big) data. We will teach you:
  • to clean and structure data into neat tables
  • to discover trends, recurring patterns, and outliers
  • the basics of modern data visualization
We use the open-source programming language R along with the RStudio IDE and the tidyverse, a widely used collection of data science packages.
Course completion requirements -
Last update: Mgr. Silvie Cinková, Ph.D. (22.05.2023)
  • active participation in the lessons (at most 3 absences)
  • all homework submitted by the deadlines
  • if DataCamp is used (free for students), students must collect 30 000 XP during the course; these points must come from the assigned R courses

Students who have already completed the assigned courses must instead collect 20 000 XP from other R courses.

Individual exceptions are at the teachers' discretion.

Literature -
Last update: doc. RNDr. Vladislav Kuboň, Ph.D. (05.06.2018)

Hadley Wickham and Garrett Grolemund. 2017. R for Data Science. O'Reilly. Currently free online: http://r4ds.had.co.nz/

Garrett Grolemund. 2014. Hands-On Programming with R. O'Reilly.

Nina Zumel and John Mount. 2014. Practical Data Science with R. Manning.

Julia Silge and David Robinson. 2017. Text Mining with R. A tidy approach. O'Reilly.

Stefan Th. Gries. 2013. Statistics for Linguistics with R. A practical introduction. De Gruyter.

Stefan Th. Gries. 2009. Quantitative Corpus Linguistics with R. Routledge.

Matthew L. Jockers. 2014. Text Analysis with R for Students of Literature. Springer.

Natalia Levshina. 2015. How to do Linguistics with R. Data exploration and statistical analysis. John Benjamins.

Simon Munzert, Christian Rubba, Peter Meissner, and Dominic Nyhuis. 2015. Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. Wiley.

Requirements to the exam -
Last update: Mgr. Silvie Cinková, Ph.D. (22.05.2023)

Active participation in all lessons (exceptions at the teacher's discretion), timely submission of homework, and thorough study of the assigned reading in preparation for discussing it.

Syllabus -
Last update: Mgr. Silvie Cinková, Ph.D. (22.05.2023)

1. Basic concepts of R, advantages of R in data analysis as a subdiscipline of programming

2. Tables, vectors, loading a table file, vector as a table column, variable types as vector classes, selection (subsetting) of elements, rows and columns in base R

3. ggplot2 graphics library, mapping variables to aesthetic scales, types of graphs and scales (geom_, scale_ functions)

4. Data wrangling - dplyr library: selection and manipulation of rows (filter, slice, arrange) and columns (select, rename, mutate, if_else, case_when)

5. Data wrangling - groups (group_by, across, rowwise), aggregation (count, summarize)

6. Table joins (SQL-like)

7. "tidy data" concept, conversion between "wider" and "longer" table format for use with dplyr and ggplot2, tidyr (pivot_longer, pivot_wider, unite and separate)

8. Operations on strings, regular expressions incl. "look-around"

9. The concept of iteration in R: vectorization, loops, the apply family of functions, and the map family of functions from the purrr library in common user situations

10. Text mining with the help of automatic syntactic annotation, interaction with the API of the UDPipe syntactic parser

Favorite datasets: gapminder (https://www.gapminder.org/data/), built-in datasets iris, diamonds, corpora

Entry requirements -
Last update: doc. RNDr. Vladislav Kuboň, Ph.D. (05.06.2018)

English, basic computer literacy, frustration tolerance, and the discipline to do regular homework. No programming skills required.

Grade requirements: active participation in all lessons (exceptions are up to the teachers), timely submission of homework, and thorough preparation for discussion of the selected reading (3-4 papers per term)