Document config file format

2025-07-14 15:12:30 +02:00 · 2021-04-29 18:55:08 +02:00
parent f776186480
commit 9ec19be113
2 changed files with 139 additions and 0 deletions
--- a/CONFIG.md
+++ b/CONFIG.md
@@ -0,0 +1,138 @@
+# Config file format
+
+A config file consists of sections. A section begins with a `[section]` header,
+which is followed by a list of `key = value` or `key: value` pairs. Comments
+must be on their own line and start with `#` or `;`. Multiline values must be
+indented beyond their key. For more details and examples of the format, see
+the [configparser documentation][1] ([basic interpolation][2] is enabled).
+
+[1]: <https://docs.python.org/3/library/configparser.html#supported-ini-file-structure> "Supported INI File Structure"
+[2]: <https://docs.python.org/3/library/configparser.html#configparser.BasicInterpolation> "BasicInterpolation"
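+
+As a rough illustration of the syntax described above, a minimal sketch follows.
+The section and key names are placeholders chosen for this example, not options
+that PFERD defines:
+
+```
+# Comments go on their own line
+; and may also start with a semicolon
+[section]
+first_key = some value
+second_key: another value
+multiline_key = first line
+    second line, indented beyond the key
+# With basic interpolation, %(first_key)s expands to "some value".
+# A literal percent sign has to be written as %%.
+interpolated_key = %(first_key)s and more
+```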
+
+## The `DEFAULT` section
+
+This section contains global configuration values. It can also be used to set
+default values for the other sections.
+
+- `working_dir`: The directory PFERD operates in. Set to an absolute path to
+  make PFERD operate the same regardless of where it is executed. All other
+  paths in the config file are interpreted relative to this path. If this path
+  is relative, it is interpreted relative to the script's working dir. `~` is
+  expanded to the current user's home directory. (Default: `.`)
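+
+For instance, the following sketch pins PFERD to a fixed directory inside the
+current user's home directory. The path itself is only an assumed example:
+
+```
+[DEFAULT]
+working_dir = ~/pferd
+```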
+
+## The `crawl:*` sections
+
+Sections whose names start with `crawl:` are used to configure crawlers. The
+rest of the section name specifies the name of the crawler.
+
+A crawler synchronizes a remote resource to a local directory. There are
+different types of crawlers for different kinds of resources, e. g. ILIAS
+courses or lecture websites.
+
+Each crawl section represents an instance of a specific type of crawler. The
+`type` option is used to specify the crawler type. The crawler's name is usually
+used as the name for the output directory. New crawlers can be created simply by
+adding a new crawl section to the config file.
+
+Depending on a crawler's type, it may have different options. For more details,
+see the type's documentation below. The following options are common to all
+crawlers:
+
+- `type`: The crawler's type. The available types are listed in [this section](#crawler-types).
+- `output_dir`: The directory the crawler synchronizes files to. A crawler will
+  never place any files outside of this directory. (Default: crawler's name)
+- `transform`: Rules for renaming and excluding certain files and directories.
+  For more details, see [this section](#transformation-rules). (Default: empty)
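+
+A crawl section using these common options might look like the sketch below. The
+crawler name `example` and the values of `type` and `output_dir` are
+placeholders, since no crawler types are documented yet. The `transform` rules
+are the ones from the tutorial example at the end of this document:
+
+```
+[crawl:example]
+type = placeholder-type
+output_dir = lectures/example
+transform =
+    tutorials/tut_02 --> my_tut
+    tutorials -exact->
+    tutorials --> !
+```
+
+Each rule sits on its own line and is indented beyond the `transform` key, as
+required for multiline values.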
+
+## The `auth:*` sections
+
+Sections whose names start with `auth:` are used to configure authenticators. An
+authenticator provides login credentials to one or more crawlers.
+
+Authenticators work similarly to crawlers: a section represents an authenticator
+instance, whose name is the rest of the section name. The type is specified by
+the `type` option.
+
+Depending on an authenticator's type, it may have different options. For more
+details, see the type's documentation below. The only option common to all
+authenticators is `type`:
+
+- `type`: The authenticator's type. The available types are listed in [this section](#authenticator-types).
+
+## Crawler types
+
+TODO Fill in as crawlers are implemented
+
+## Authenticator types
+
+TODO Fill in as authenticators are implemented
+
+## Transformation rules
+
+Transformation rules are rules for renaming and excluding files and directories.
+They are specified line-by-line in a crawler's `transform` option. When a
+crawler needs to apply a rule to a path, it goes through this list top-to-bottom
+and chooses the first matching rule.
+
+Each line has the format `SOURCE ARROW TARGET` where `TARGET` is optional.
+`SOURCE` is either a normal path without spaces (e. g. `foo/bar`), or a string
+literal delimited by `"` or `'` (e. g. `"foo\" bar/baz"`). Python's string
+escape syntax is supported. Trailing slashes are ignored. `TARGET` can be
+formatted like `SOURCE`, but it can also be a single exclamation mark without
+quotes (`!`). `ARROW` is one of `-->`, `-exact->` and `-re->`.
+
+If a rule's target is `!`, this means that when the rule matches on a path, the
+corresponding file or directory is ignored. If a rule's target is missing, the
+path is matched but not modified.
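+
+The lines below sketch the variations of this format. All paths are made-up
+examples:
+
+```
+foo/bar --> baz
+"foo bar/baz" --> 'foo_bar/baz'
+foo/tmp --> !
+foo/keep -->
+```
+
+The first rule uses plain paths. The second uses string literals because its
+source contains a space. The third has `!` as its target, so matching paths are
+ignored. The fourth has no target, so matching paths are left unmodified.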
+
+### The `-->` arrow
+
+The `-->` arrow is a basic renaming operation. If a path begins with `SOURCE`,
+that part of the path is replaced with `TARGET`. This means that the rule
+`foo/bar --> baz` would convert `foo/bar` into `baz`, but also `foo/bar/xyz`
+into `baz/xyz`. The rule `foo --> !` would ignore a directory named `foo` as
+well as all its contents.
+
+### The `-exact->` arrow
+
+The `-exact->` arrow requires the path to match `SOURCE` exactly. This means
+that the rule `foo/bar -exact-> baz` would still convert `foo/bar` into `baz`,
+but `foo/bar/xyz` would be unaffected. Also, `foo -exact-> !` would only ignore
+`foo`, but not its contents (if it has any). The example below shows why this is
+useful.
+
+### The `-re->` arrow
+
+The `-re->` arrow uses regular expressions. `SOURCE` is a regular expression
+that must match the entire path. If this is the case, then the capturing groups
+are available in `TARGET` for formatting.
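+
+As a hedged sketch, assuming `SOURCE` follows Python's regular expression
+syntax, the rule below would ignore `tutorials/tut_01` and `tutorials/tut_03`
+together with everything inside them, while leaving `tutorials/tut_02` alone.
+The target is `!` here because the syntax for referring to capturing groups in
+`TARGET` is not specified in this document:
+
+```
+"tutorials/tut_0[13](/.*)?" -re-> !
+```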
+
+### Example: Tutorials
+
+You have an ILIAS course with lots of tutorials, but are only interested in a
+single one?
+
+```
+tutorials/
+  |- tut_01/
+  |- tut_02/
+  |- tut_03/
+  ...
+```
+
+You can use a mix of normal and exact arrows to get rid of the other ones and
+move the `tutorials/tut_02/` folder to `my_tut/`:
+
+```
+tutorials/tut_02 --> my_tut
+tutorials -exact->
+tutorials --> !
+```
+
+The second rule is required for many crawlers since they use the rules to decide
+which directories to crawl. If it were missing, the third rule would match when
+the crawler looks at `tutorials/`. This means the crawler would not crawl the
+`tutorials/` directory and thus never discover that `tutorials/tut_02/`
+exists.
+
+Since the second rule is only relevant for crawling, the `TARGET` is left out.
--- a/README.md
+++ b/README.md
@@ -4,6 +4,7 @@

 Other resources:

+- [Config file format](CONFIG.md)
 - [Changelog](CHANGELOG.md)
 - [Development Guide](DEV.md)