mirror of
https://github.com/Garmelon/PFERD.git
synced 2023-12-21 10:23:01 +01:00
Document config file format
parent
f776186480
commit
9ec19be113
138
CONFIG.md
Normal file
@ -0,0 +1,138 @@
# Config file format

A config file consists of sections. A section begins with a `[section]` header,
which is followed by a list of `key = value` or `key: value` pairs. Comments
must be on their own line and start with `#` or `;`. Multiline values must be
indented beyond their key. For more details and some examples on the format, see
the [configparser documentation][1] ([basic interpolation][2] is enabled).

[1]: <https://docs.python.org/3/library/configparser.html#supported-ini-file-structure> "Supported INI File Structure"
[2]: <https://docs.python.org/3/library/configparser.html#configparser.BasicInterpolation> "BasicInterpolation"
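
The format rules above can be tried out directly with Python's `configparser`. The following is a minimal sketch, not PFERD's own loading code: the section name `example` and all keys in it are made up for illustration.

```python
import configparser

# Hypothetical config text; the section and key names are illustrative only.
EXAMPLE = """\
[DEFAULT]
base = /srv/data

# Comments must sit on their own line.
[example]
equals_style = one
colon_style: two
; Multiline values are indented beyond their key.
multi =
    first line
    second line
interpolated = %(base)s/files
"""

# BasicInterpolation is the default for ConfigParser, so %(base)s is expanded.
parser = configparser.ConfigParser()
parser.read_string(EXAMPLE)
section = parser["example"]

print(section["equals_style"])
print(section["colon_style"])
print(section["multi"].strip().splitlines())
print(section["interpolated"])
```

Both `=` and `:` separate a key from its value, and `%(base)s` is resolved against the `DEFAULT` section because `example` does not define `base` itself.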

## The `DEFAULT` section

This section contains global configuration values. It can also be used to set
default values for the other sections.

- `working_dir`: The directory PFERD operates in. Set to an absolute path to
  make PFERD operate the same regardless of where it is executed. All other
  paths in the config file are interpreted relative to this path. If this path
  is relative, it is interpreted relative to the script's working dir. `~` is
  expanded to the current user's home directory. (Default: `.`)
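
The sketch below illustrates both roles of `DEFAULT` under stated assumptions: it is not PFERD's implementation, `resolve_working_dir` is a hypothetical helper, and `shared_option`, `crawl:example` and the path `sync` are invented for the example.

```python
import configparser
from pathlib import Path

# Hypothetical config; `shared_option` is not a real PFERD option.
CONFIG = """\
[DEFAULT]
working_dir = sync
shared_option = from-default

[crawl:example]
"""

def resolve_working_dir(parser: configparser.ConfigParser, cwd: Path) -> Path:
    """Resolve `working_dir` as described above (hypothetical helper)."""
    raw = parser["DEFAULT"].get("working_dir", ".")  # documented default: `.`
    # expanduser() handles a leading `~`; joining an absolute path onto `cwd`
    # yields the absolute path unchanged, so only relative paths depend on cwd.
    return cwd / Path(raw).expanduser()

parser = configparser.ConfigParser()
parser.read_string(CONFIG)

# Values from DEFAULT are visible in every other section.
print(parser["crawl:example"]["shared_option"])
print(resolve_working_dir(parser, Path("/home/user")))
```

With an absolute `working_dir`, the result no longer depends on `cwd`, which is why the text above recommends one for reproducible runs.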

## The `crawl:*` sections

Sections whose names start with `crawl:` are used to configure crawlers. The
rest of the section name specifies the name of the crawler.

A crawler synchronizes a remote resource to a local directory. There are
different types of crawlers for different kinds of resources, e. g. ILIAS
courses or lecture websites.

Each crawl section represents an instance of a specific type of crawler. The
`type` option is used to specify the crawler type. The crawler's name is usually
used as the name for the output directory. New crawlers can be created simply by
adding a new crawl section to the config file.

Depending on a crawler's type, it may have different options. For more details,
see the type's documentation below. The following options are common to all
crawlers:

- `type`: The types are specified in [this section](#crawler-types).
- `output_dir`: The directory the crawler synchronizes files to. A crawler will
  never place any files outside of this directory. (Default: crawler's name)
- `transform`: Rules for renaming and excluding certain files and directories.
  For more details, see [this section](#transformation-rules). (Default: empty)
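
As a rough illustration of how the section naming and the common options fit together, the sketch below collects the settings of every `crawl:*` section. It is an assumption-laden example rather than PFERD's code: `crawler_settings` is a hypothetical helper, and `some-type` is a placeholder because no crawler types are documented yet.

```python
import configparser

CRAWL_PREFIX = "crawl:"

# Hypothetical config; `some-type` is a placeholder, not a real crawler type.
CONFIG = """\
[crawl:lectures]
type = some-type

[crawl:exercises]
type = some-type
output_dir = sheets
"""

def crawler_settings(parser: configparser.ConfigParser) -> dict:
    """Map each crawler name to its common options, applying the defaults."""
    result = {}
    for section_name in parser.sections():
        if not section_name.startswith(CRAWL_PREFIX):
            continue
        name = section_name[len(CRAWL_PREFIX):]  # rest of the section name
        section = parser[section_name]
        result[name] = {
            "type": section.get("type"),
            "output_dir": section.get("output_dir", name),  # default: name
            "transform": section.get("transform", ""),  # default: empty
        }
    return result

parser = configparser.ConfigParser()
parser.read_string(CONFIG)
settings = crawler_settings(parser)
print(settings["lectures"]["output_dir"])
print(settings["exercises"]["output_dir"])
```

The same prefix-stripping approach would apply to any other family of sections that encodes an instance name after a colon.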

## The `auth:*` sections

Sections whose names start with `auth:` are used to configure authenticators. An
authenticator provides login credentials to one or more crawlers.

Authenticators work similarly to crawlers: A section represents an authenticator
instance, whose name is the rest of the section name. The type is specified by
the `type` option.

Depending on an authenticator's type, it may have different options. For more
details, see the type's documentation below. The only option common to all
authenticators is `type`:

- `type`: The types are specified in [this section](#authenticator-types).

## Crawler types

TODO Fill in as crawlers are implemented

## Authenticator types

TODO Fill in as authenticators are implemented

## Transformation rules

Transformation rules are rules for renaming and excluding files and directories.
They are specified line-by-line in a crawler's `transform` option. When a
crawler needs to apply a rule to a path, it goes through this list top-to-bottom
and chooses the first matching rule.

Each line has the format `SOURCE ARROW TARGET` where `TARGET` is optional.
`SOURCE` is either a normal path without spaces (e. g. `foo/bar`), or a string
literal delimited by `"` or `'` (e. g. `"foo\" bar/baz"`). Python's string
escape syntax is supported. Trailing slashes are ignored. `TARGET` can be
formatted like `SOURCE`, but it can also be a single exclamation mark without
quotes (`!`). `ARROW` is one of `-->`, `-exact->` and `-re->`.

If a rule's target is `!`, this means that when the rule matches on a path, the
corresponding file or directory is ignored. If a rule's target is missing, the
path is matched but not modified.
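
To make the line format concrete, here is a small parsing sketch. It is not PFERD's parser: `Rule`, `parse_rule` and the tokenizing regex are hypothetical, and the sketch assumes that a quoted literal can be decoded with `ast.literal_eval` and that trailing slashes are stripped from both sides.

```python
import ast
import re
from typing import NamedTuple, Optional

ARROWS = ("-->", "-exact->", "-re->")

# A token is a quoted literal (backslash escapes allowed) or a run of non-spaces.
TOKEN_RE = re.compile(r'"(?:[^"\\]|\\.)*"|' r"'(?:[^'\\]|\\.)*'|\S+")

class Rule(NamedTuple):
    source: str
    arrow: str
    target: Optional[str]  # None: no replacement path given
    ignore: bool           # True only for a bare, unquoted `!`

def _unquote(token: str) -> str:
    """Decode a quoted literal using Python's string escape syntax."""
    if len(token) >= 2 and token[0] in "\"'" and token[-1] == token[0]:
        return str(ast.literal_eval(token))
    return token

def parse_rule(line: str) -> Rule:
    tokens = TOKEN_RE.findall(line)
    if len(tokens) not in (2, 3) or tokens[1] not in ARROWS:
        raise ValueError(f"not a valid rule: {line!r}")
    source = _unquote(tokens[0]).rstrip("/")
    if len(tokens) == 2:  # TARGET missing: match without modifying
        return Rule(source, tokens[1], None, False)
    if tokens[2] == "!":  # unquoted exclamation mark: ignore the path
        return Rule(source, tokens[1], None, True)
    return Rule(source, tokens[1], _unquote(tokens[2]).rstrip("/"), False)

print(parse_rule("foo/bar --> baz"))
print(parse_rule("tutorials/ -exact->"))
print(parse_rule("foo --> !"))
```

Note that a quoted `"!"` is treated as an ordinary target name in this sketch, since only the unquoted form is described as special.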

### The `-->` arrow

The `-->` arrow is a basic renaming operation. If a path begins with `SOURCE`,
that part of the path is replaced with `TARGET`. This means that the rule
`foo/bar --> baz` would convert `foo/bar` into `baz`, but also `foo/bar/xyz`
into `baz/xyz`. The rule `foo --> !` would ignore a directory named `foo` as
well as all its contents.
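
The renaming behaviour can be sketched in a few lines. This is an illustration only: `apply_prefix_rule` is a hypothetical function, and it assumes that "begins with" means a match on whole path components, so that `foo/bar` does not match `foo/barn`. The `!` target is left out to keep the sketch short.

```python
from pathlib import PurePosixPath
from typing import Optional

def apply_prefix_rule(path: str, source: str, target: str) -> Optional[str]:
    """Return the renamed path, or None if `path` does not begin with `source`."""
    path_parts = PurePosixPath(path).parts
    source_parts = PurePosixPath(source).parts
    # Compare whole components (an assumption of this sketch).
    if path_parts[:len(source_parts)] != source_parts:
        return None
    rest = path_parts[len(source_parts):]
    # Replace the matched prefix with the target and keep the remainder.
    return str(PurePosixPath(target, *rest))

print(apply_prefix_rule("foo/bar", "foo/bar", "baz"))
print(apply_prefix_rule("foo/bar/xyz", "foo/bar", "baz"))
print(apply_prefix_rule("other/file", "foo/bar", "baz"))
```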

### The `-exact->` arrow

The `-exact->` arrow requires the path to match `SOURCE` exactly. This means
that the rule `foo/bar -exact-> baz` would still convert `foo/bar` into `baz`,
but `foo/bar/xyz` would be unaffected. Also, `foo -exact-> !` would only ignore
`foo`, but not its contents (if it has any). The examples below show why this is
useful.

### The `-re->` arrow

The `-re->` arrow uses regular expressions. `SOURCE` is a regular expression
that must match the entire path. If this is the case, then the capturing groups
are available in `TARGET` for formatting.
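
The text above does not spell out the placeholder syntax used in `TARGET`. The sketch below therefore makes an explicit assumption: the target is a `str.format` template in which `{0}` is the whole match and `{1}`, `{2}`, ... are the capturing groups. `apply_regex_rule` and the example rule are hypothetical.

```python
import re
from typing import Optional

def apply_regex_rule(path: str, source: str, target: str) -> Optional[str]:
    """Return the formatted target, or None if `source` does not match fully."""
    match = re.fullmatch(source, path)  # the regex must match the entire path
    if match is None:
        return None
    # Assumed convention: {0} is the whole match, {1}.. are capturing groups.
    return target.format(match.group(0), *match.groups())

# Hypothetical rule: tutorials/tut_(\d+)/(.*) -re-> tut{1}/{2}
SOURCE = r"tutorials/tut_(\d+)/(.*)"
TARGET = "tut{1}/{2}"

print(apply_regex_rule("tutorials/tut_02/sheet.pdf", SOURCE, TARGET))
print(apply_regex_rule("lectures/tut_02/sheet.pdf", SOURCE, TARGET))
```

Because `re.fullmatch` is used, a pattern that only matches a prefix of the path does not trigger the rule.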

### Example: Tutorials

You have an ILIAS course with lots of tutorials, but are only interested in a
single one?

```
tutorials/
  |- tut_01/
  |- tut_02/
  |- tut_03/
  ...
```

You can use a mix of normal and exact arrows to get rid of the other ones and
move the `tutorials/tut_02/` folder to `my_tut/`:

```
tutorials/tut_02 --> my_tut
tutorials -exact->
tutorials --> !
```

The second rule is required for many crawlers since they use the rules to decide
which directories to crawl. If it were missing, the third rule would match when
the crawler looks at `tutorials/`. This means the crawler would not crawl the
`tutorials/` directory and thus not discover that `tutorials/tut_02/` existed.

Since the second rule is only relevant for crawling, the `TARGET` is left out.
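
The interplay of the three rules can be checked with a small first-match simulation. This is a sketch under assumptions, not PFERD's transformer: `transform` is a hypothetical function, rules are written as plain tuples, matching works on whole path components, and a path that matches no rule is assumed to stay unchanged.

```python
from pathlib import PurePosixPath
from typing import Optional

IGNORE = "!"

# (SOURCE, ARROW, TARGET); a TARGET of None means "match but do not modify".
RULES = [
    ("tutorials/tut_02", "-->", "my_tut"),
    ("tutorials", "-exact->", None),
    ("tutorials", "-->", IGNORE),
]

def transform(path: str, rules=RULES) -> Optional[str]:
    """Apply the first matching rule; return None if the path is ignored."""
    parts = PurePosixPath(path).parts
    for source, arrow, target in rules:
        source_parts = PurePosixPath(source).parts
        if arrow == "-exact->":
            matched = parts == source_parts
        else:  # "-->": the path must begin with SOURCE
            matched = parts[:len(source_parts)] == source_parts
        if not matched:
            continue
        if target == IGNORE:
            return None
        if target is None:
            return path
        rest = parts[len(source_parts):]
        return str(PurePosixPath(target, *rest))
    return path  # assumption: unmatched paths are left as they are

print(transform("tutorials"))                   # kept, so it gets crawled
print(transform("tutorials/tut_02/sheet.pdf"))  # moved below my_tut/
print(transform("tutorials/tut_01"))            # ignored

# Without the second rule, the directory itself would already be ignored.
print(transform("tutorials", rules=[RULES[0], RULES[2]]))
```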