Document config file format

2025-12-28 17:12:30 +01:00 · 2021-04-29 18:55:08 +02:00
parent f776186480
commit 9ec19be113
2 changed files with 139 additions and 0 deletions
--- a/CONFIG.md
+++ b/CONFIG.md
@@ -0,0 +1,138 @@
 # Config file format
 A config file consists of sections. A section begins with a `[section]` header,
 which is followed by a list of `key = value` or `key: value` pairs. Comments
 must be on their own line and start with `#` or `;`. Multiline values must be
 indented beyond their key. For more details and some examples on the format, see
 the [configparser documentation][1] ([basic interpolation][2] is enabled).
 [1]: <https://docs.python.org/3/library/configparser.html#supported-ini-file-structure> "Supported INI File Structure"
 [2]: <https://docs.python.org/3/library/configparser.html#configparser.BasicInterpolation> "BasicInterpolation"
 ## The `DEFAULT` section
 This section contains global configuration values. It can also be used to set
 default values for the other sections.
 - `working_dir`: The directory PFERD operates in. Set to an absolute path to
  make PFERD operate the same regardless of where it is executed. All other
  paths in the config file are interpreted relative to this path. If this path
  is relative, it is interpreted relative to the script's working dir. `~` is
  expanded to the current user's home directory. (Default: `.`)
 ## The `crawl:*` sections
 Sections whose names start with `crawl:` are used to configure crawlers. The
 rest of the section name specifies the name of the crawler.
 A crawler synchronizes a remote resource to a local directory. There are
 different types of crawlers for different kinds of resources, e. g. ILIAS
 courses or lecture websites.
 Each crawl section represents an instance of a specific type of crawler. The
 `type` option is used to specify the crawler type. The crawler's name is usually
 used as the name for the output directory. New crawlers can be created simply by
 adding a new crawl section to the config file.
 Depending on a crawler's type, it may have different options. For more details,
 see the type's documentation below. The following options are common to all
 crawlers:
 - `type`: The types are specified in [this section](#crawler-types).
 - `output_dir`: The directory the crawler synchronizes files to. A crawler will
  never place any files outside of this directory. (Default: crawler's name)
 - `transform`: Rules for renaming and excluding certain files and directories.
  For more details, see [this section](#transformation-rules). (Default: empty)
 ## The `auth:*` sections
 Sections whose names start with `auth:` are used to configure authenticators. An
 authenticator provides login credentials to one or more crawlers.
 Authenticators work similar to crawlers: A section represents an authenticator
 instance, whose name is the rest of the section name. The type is specified by
 the `type` option.
 Depending on an authenticator's type, it may have different options. For more
 details, see the type's documentation below. The only option common to all
 authenticators is `type`:
 - `type`: The types are specified in [this section](#authenticator-types).
 ## Crawler types
 TODO Fill in as crawlers are implemented
 ## Authenticator types
 TODO Fill in as authenticators are implemented
 ## Transformation rules
 Transformation rules are rules for renaming and excluding files and directories.
 They are specified line-by-line in a crawler's `transform` option. When a
 crawler needs to apply a rule to a path, it goes through this list top-to-bottom
 and choose the first matching rule.
 Each line has the format `SOURCE ARROW TARGET` where `TARGET` is optional.
 `SOURCE` is either a normal path without spaces (e. g. `foo/bar`), or a string
 literal delimited by `"` or `'` (e. g. `"foo\" bar/baz"`). Python's string
 escape syntax is supported. Trailing slashes are ignored. `TARGET` can be
 formatted like `SOURCE`, but it can also be a single exclamation mark without
 quotes (`!`). `ARROW` is one of `-->`, `-exact->` and `-re->`.
 If a rule's target is `!`, this means that when the rule matches on a path, the
 corresponding file or directory is ignored. If a rule's target is missing, the
 path is matched but not modified.
 ### The `-->` arrow
 The `-->` arrow is a basic renaming operation. If a path begins with `SOURCE`,
 that part of the path is replaced with `TARGET`. This means that the rule
 `foo/bar --> baz` would convert `foo/bar` into `baz`, but also `foo/bar/xyz`
 into `baz/xyz`. The rule `foo --> !` would ignore a directory named `foo` as
 well as all its contents.
 ### The `-exact->` arrow
 The `-exact->` arrow requires the path to match `SOURCE` exactly. This means
 that the rule `foo/bar -exact-> baz` would still convert `foo/bar` into `baz`,
 but `foo/bar/xyz` would be unaffected. Also, `foo -exact-> !` would only ignore
 `foo`, but not its contents (if it has any). The examples below show why this is
 useful.
 ### The `-re->` arrow
 The `-re->` arrow uses regular expressions. `SOURCE` is a regular expression
 that must match the entire path. If this is the case, then the capturing groups
 are available in `TARGET` for formatting.
 ### Example: Tutorials
 You have ILIAS course with lots of tutorials, but are only interested in a
 single one?
 ```
 tutorials/
  |- tut_01/
  |- tut_02/
  |- tut_03/
  ...
 ```
 You can use a mix of normal and exact arrows to get rid of the other ones and
 move the `tutorials/tut_02/` folder to `my_tut/`:
 ```
 tutorials/tut_02 --> my_tut
 tutorials -exact->
 tutorials --> !
 ```
 The second rule is required for many crawlers since they use the rules to decide
 which directories to crawl. If it was missing when the crawler looks at
 `tutorials/`, the third rule would match. This means the crawler would not crawl
 the `tutorials/` directory and thus not discover that `tutorials/tut02/`
 existed.
 Since the second rule is only relevant for crawling, the `TARGET` is left out.
--- a/README.md
+++ b/README.md
@@ -4,6 +4,7 @@
 Other resources:
 - [Config file format](CONFIG.md)
 - [Changelog](CHANGELOG.md)
 - [Development Guide](DEV.md)