General information
Course type | AMUPIE |
Module title | Reproducible research with Linux |
Language | English |
Module lecturer | Dr Tomasz Kowalski |
Lecturer's email | kowalski@amu.edu.pl |
Lecturer position | assistant professor |
Faculty | Faculty of Mathematics and Computer Science |
Semester | 2024/2025 (winter) |
Duration | 60 hours |
ECTS | 5 |
USOS code | 0000 |
Timetable
Each of the stages outlined in the syllabus can be completed in 2 or 3 weeks at a pace of 4 course hours per week. The timeline can be adapted to suit participants' interests and abilities.
Module aim (aims)
A scientific result is usually reported in a publication, but how do you know it is valid? How can you trust it? And if you are doing research in the same field or on a related question, how can you contribute? How do you join the effort?
It would be great if you could recreate the same result as described in the publication, wouldn't it?
Reproducibility (the repeatability of research results) poses challenges that vary from one field of research to another. However, there are commonalities across data science projects, and we will look at best practices and tools for data and code management.
And to empower your project even more, we will be working in Linux, the same computing environment you would find on most of the top one million web servers on the Internet and on every one of the five hundred fastest supercomputers in the world.
Pre-requisites in terms of knowledge, skills and social competences (where relevant)
There are two strict requirements to join this course: computer literacy and the right motivation.
This course has been designed with non-CS students in mind. You don't have to be a programmer or a systems administrator, but computer literacy is a must.
You will benefit most from this course if you want to get involved in a data science project and experience the Linux environment as a set of productivity tools for open collaboration.
Syllabus
1. More (much more) than your personal computer. A cloud services starter.
Common features of the large public cloud providers. Navigating pricing options. Instance access and operations. A computer networks primer.
2. A gentle introduction to Linux.
With Windows and OS X refugees in mind. Multi-boot and VMs. Getting software: accessing various software delivery channels. The open-source software ecosystem.
3. Moving away from GUIs: processing automation, seamless local and remote work.
Unlocking the potential of batch processing with text-console tools. Achieving reproducibility by automating user actions. Maintaining long-running jobs.
4. Managing data: various storage options and implications.
Block storage, object storage, DBMS: capabilities, limitations, pricing, impact on processing code. Data ingress and egress challenges.
5. Managing code: lessons learned from Unix data tools.
Managing project notes as code. Versioning everything. Building pipelines. Artefact tracking.
6. Script everything: manage your computing environment as code.
Dependency management up to the operating-system level. Utilising on-demand infrastructure. (See the short sketch after this list for the flavour of this approach.)
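To give a flavour of stages 3 and 6, below is a minimal sketch of the "script everything" idea: a short shell script that fetches an input file, records the environment and a checksum of the data, and writes its output to a timestamped directory so the run can be repeated and audited later. The URL and file names are placeholders chosen for illustration, not course material.

```bash
#!/usr/bin/env bash
# Minimal sketch of a reproducible batch step (URL and file names are placeholders).
set -euo pipefail                      # stop on errors, unset variables and failed pipes

RUN_DIR="run_$(date +%Y%m%d_%H%M%S)"   # every run gets its own timestamped directory
mkdir -p "$RUN_DIR"

# Record the environment so the run can be audited later.
{ uname -a; bash --version | head -n1; } > "$RUN_DIR/environment.txt"

# Fetch the input (placeholder URL) and keep a checksum of exactly what was processed.
curl -fsSL "https://example.org/data/input.csv" -o "$RUN_DIR/input.csv"
sha256sum "$RUN_DIR/input.csv" > "$RUN_DIR/input.csv.sha256"

# The actual processing: count rows per category in column 2, sorted by frequency.
cut -d',' -f2 "$RUN_DIR/input.csv" | sort | uniq -c | sort -rn > "$RUN_DIR/summary.txt"

echo "Results written to $RUN_DIR"
```

Nothing here goes beyond standard command-line tools (curl, sha256sum, cut, sort, uniq); the point is that the whole run is captured in one script rather than in a sequence of manual clicks.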
Reading list
Unfortunately, there is no single textbook that covers the whole course.
There are many textbooks on specific topics, but they tend to be rather technical, and it would be discouraging to list them all here; you do not need to know all the ins and outs of a tool in order to use it quite effectively.
You will be referred to online resources (articles, videos) and textbooks for further reading as the course progresses. This will allow you to explore specific topics in more depth.
However, there is one textbook that deserves to be highlighted here. It is "Bioinformatics Data Skills" by Vince Buffalo (ISBN: 9781449367503). It was released by O'Reilly Media a few years ago, but is far from outdated.
The author's approach to introducing the reader to data processing tools is quite distinctive. We will emulate his philosophy, but shift our focus from specific data sets (in his case, genetic data) to Linux itself.