# Config file format
A config file consists of sections. A section begins with a `[section]` header,
which is followed by a list of `key = value` pairs. Comments must be on their
own line and start with `#`. Multiline values must be indented beyond their key.
Boolean values can be `yes` or `no`. For more details and some examples on the
format, see the [configparser documentation][1] ([interpolation][2] is
disabled).

[1]: <https://docs.python.org/3/library/configparser.html#supported-ini-file-structure> "Supported INI File Structure"
[2]: <https://docs.python.org/3/library/configparser.html#interpolation-of-values> "Interpolation of values"
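
As a rough illustration of the format (not PFERD's actual loading code), the
sketch below parses a small sample config with Python's `configparser`, passing
`interpolation=None` to mirror the disabled interpolation. The section name
`crawl:example` and all option values are made up for this example.

```python
from configparser import ConfigParser

SAMPLE = """
# Comments live on their own line.
[DEFAULT]
explain = yes

[crawl:example]
type = local
transform =
    foo --> bar
    100%_done -exact-> done
"""

# interpolation=None keeps '%' literal instead of treating it as a placeholder.
parser = ConfigParser(interpolation=None)
parser.read_string(SAMPLE)

section = parser["crawl:example"]
explain = section.getboolean("explain")  # "yes" parses as True
# A multiline value is returned as one string with embedded line breaks.
rules = section["transform"].strip().splitlines()
```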
## The `DEFAULT` section
This section contains global configuration values. It can also be used to set
default values for the other sections.
- `working_dir`: The directory PFERD operates in. Set to an absolute path to
make PFERD operate the same regardless of where it is executed from. All other
paths in the config file are interpreted relative to this path. If this path
is relative, it is interpreted relative to the script's working dir. `~` is
expanded to the current user's home directory. (Default: `.`)
- `explain`: Whether PFERD should log and explain its actions and decisions in
detail. (Default: `no`)
- `status`: Whether PFERD should print status updates (like `Crawled ...`,
`Added ...`) while running a crawler. (Default: `yes`)
- `report`: Whether PFERD should print a report of added, changed and deleted
local files for all crawlers before exiting. (Default: `yes`)
- `share_cookies`: Whether crawlers should share cookies where applicable. For
example, some crawlers share cookies if they crawl the same website using the
same account. (Default: `yes`)
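
The fallback behaviour of `DEFAULT` can be tried out with plain `configparser`.
This is only a sketch under assumptions: the crawler names `a` and `b` are
hypothetical, and the path handling merely imitates the description of
`working_dir` above rather than reproducing PFERD's code.

```python
from configparser import ConfigParser
from pathlib import Path

parser = ConfigParser(interpolation=None)
parser.read_string("""
[DEFAULT]
working_dir = ~/sync
tasks = 3

[crawl:a]
type = local

[crawl:b]
type = local
tasks = 1
""")

# Options set in [DEFAULT] are visible in every section unless overridden there.
tasks_a = parser["crawl:a"].getint("tasks")
tasks_b = parser["crawl:b"].getint("tasks")

# '~' expands to the home directory; other paths hang off working_dir.
working_dir = Path(parser["DEFAULT"].get("working_dir", ".")).expanduser()
output_dir_a = working_dir / "a"  # assumed default: the crawler's name
```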
## The `crawl:*` sections
Sections whose names start with `crawl:` are used to configure crawlers. The
rest of the section name specifies the name of the crawler.

A crawler synchronizes a remote resource to a local directory. There are
different types of crawlers for different kinds of resources, e.g. ILIAS
courses or lecture websites.

Each crawl section represents an instance of a specific type of crawler. The
`type` option is used to specify the crawler type. The crawler's name is usually
used as the output directory. New crawlers can be created simply by adding a new
crawl section to the config file.

Depending on a crawler's type, it may have different options. For more details,
see the type's [documentation](#crawler-types) below. The following options are
common to all crawlers:

- `type`: The available types are specified in [this section](#crawler-types).
- `skip`: Whether the crawler should be skipped during normal execution. The
crawler can still be executed manually using the `--crawler` or `-C` flags.
(Default: `no`)
- `output_dir`: The directory the crawler synchronizes files to. A crawler will
never place any files outside this directory. (Default: the crawler's name)
- `redownload`: When to download a file that is already present locally.
  (Default: `never-smart`)
    - `never`: If a file is present locally, it is not downloaded again.
    - `never-smart`: Like `never`, but PFERD tries to detect if an already
      downloaded file has changed via some (unreliable) heuristics.
    - `always`: All files are always downloaded, regardless of whether they are
      already present locally.
    - `always-smart`: Like `always`, but PFERD tries to avoid unnecessary
      downloads via some (unreliable) heuristics.
- `on_conflict`: What to do when the local and remote versions of a file or
  directory differ, including when a file is replaced by a directory or a
  directory by a file. (Default: `prompt`)
    - `prompt`: Always ask the user before overwriting or deleting local files
      and directories.
    - `local-first`: Always keep the local file or directory. Equivalent to
      using `prompt` and always choosing "no". Implies that `redownload` is set
      to `never`.
    - `remote-first`: Always keep the remote file or directory. Equivalent to
      using `prompt` and always choosing "yes".
    - `no-delete`: Never delete local files, but overwrite local files if the
      remote file is different.
- `transform`: Rules for renaming and excluding certain files and directories.
For more details, see [this section](#transformation-rules). (Default: empty)
- `tasks`: The maximum number of concurrent tasks (such as crawling or
downloading). (Default: `1`)
- `downloads`: How many of those tasks can be download tasks at the same time.
Must not be greater than `tasks`. (Default: Same as `tasks`)
- `task_delay`: Time (in seconds) that the crawler should wait between
subsequent tasks. Can be used as a sort of rate limit to avoid unnecessary
load for the crawl target. (Default: `0.0`)
- `windows_paths`: Whether PFERD should find alternative names for paths that
are invalid on Windows. (Default: `yes` on Windows, `no` otherwise)
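
One way to read the `on_conflict` values is as a small decision function. The
sketch below is purely illustrative: `should_apply`, its parameters and the
`"overwrite"`/`"delete"` action names are hypothetical and not part of PFERD.

```python
from typing import Callable


def should_apply(policy: str, action: str, ask: Callable[[str], bool]) -> bool:
    """Decide whether a local file may be overwritten or deleted.

    `action` is "overwrite" or "delete"; `ask` stands in for the interactive
    prompt and returns True for "yes".
    """
    if policy == "prompt":
        return ask(action)
    if policy == "local-first":
        return False  # like answering "no" every time
    if policy == "remote-first":
        return True  # like answering "yes" every time
    if policy == "no-delete":
        return action == "overwrite"  # overwrite changed files, never delete
    raise ValueError(f"unknown on_conflict value: {policy!r}")
```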

Some crawlers may also require credentials for authentication. To configure how
the crawler obtains its credentials, the `auth` option is used. It is set to the
full name of an auth section (including the `auth:` prefix).

Here is a simple example:

```ini
[auth:example]
type = simple
username = foo
password = bar

[crawl:something]
type = some-complex-crawler
auth = auth:example
on_conflict = no-delete
tasks = 3
```
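
To make the link between the two sections concrete, the following sketch loads
the example above with `configparser` and follows the `auth` option to its
section. It only demonstrates the naming scheme and is not how PFERD itself
wires crawlers and authenticators together.

```python
from configparser import ConfigParser

parser = ConfigParser(interpolation=None)
parser.read_string("""
[auth:example]
type = simple
username = foo
password = bar

[crawl:something]
type = some-complex-crawler
auth = auth:example
on_conflict = no-delete
tasks = 3
""")

# Crawler names are whatever follows the "crawl:" prefix.
prefix = "crawl:"
crawlers = [name[len(prefix):] for name in parser.sections() if name.startswith(prefix)]

# The auth option holds the full section name, prefix included.
auth_section = parser[parser["crawl:something"]["auth"]]
```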
## The `auth:*` sections
Sections whose names start with `auth:` are used to configure authenticators. An
authenticator provides a username and a password to one or more crawlers.

Authenticators work similarly to crawlers: A section represents an authenticator
instance whose name is the rest of the section name. The type is specified by
the `type` option.

Depending on an authenticator's type, it may have different options. For more
details, see the type's [documentation](#authenticator-types) below. The only
option common to all authenticators is `type`:

- `type`: The types are specified in [this section](#authenticator-types).
## Crawler types
### The `local` crawler
This crawler crawls a local directory. It is really simple and mostly useful for
testing different setups. The various delay options are meant to make the
crawler simulate a slower, network-based crawler.
- `target`: Path to the local directory to crawl. (Required)
- `crawl_delay`: Artificial delay (in seconds) to simulate for crawl requests.
(Default: `0.0`)
- `download_delay`: Artificial delay (in seconds) to simulate for download
requests. (Default: `0.0`)
- `download_speed`: Download speed (in bytes per second) to simulate. (Optional)
### The `kit-ipd` crawler
This crawler crawls a KIT IPD page by URL. The root page can be crawled from
outside the KIT network, so you will be informed about any new/deleted files,
but downloading files requires you to be inside the KIT network. Adding a short
delay between requests is likely a good idea.
### The `kit-ilias-web` crawler
This crawler crawls the KIT ILIAS instance.

ILIAS is not great at handling too many concurrent requests. To avoid
unnecessary load, please limit `tasks` to `1`.

There is a spike in ILIAS usage at the beginning of lectures, so please don't
run PFERD during those times.

If you're automatically running PFERD periodically (e. g. via cron or a systemd
timer), please randomize the start time or at least don't use the full hour. For
systemd timers, this can be accomplished using the `RandomizedDelaySec` option.
Also, please schedule the script to run in periods of low activity. Running the
script once per day should be fine.

- `target`: The ILIAS element to crawl. (Required)
    - `desktop`: Crawl your personal desktop
    - `<course id>`: Crawl the course with the given id
    - `<url>`: Crawl a given element by URL (preferably the permanent URL linked
      at the bottom of its ILIAS page)
- `auth`: Name of auth section to use for login. (Required)
- `tfa_auth`: Name of auth section to use for two-factor authentication. Only
uses the auth section's password. (Default: Anonymous `tfa` authenticator)
- `links`: How to represent external links. (Default: `fancy`)
    - `ignore`: Don't download links.
    - `plaintext`: A text file containing only the URL.
    - `fancy`: An HTML file looking like the ILIAS link element.
    - `internet-shortcut`: An internet shortcut file (`.url` file).
- `link_redirect_delay`: Time (in seconds) until `fancy` link files will
redirect to the actual URL. Set to a negative value to disable the automatic
redirect. (Default: `-1`)
- `videos`: Whether to download videos. (Default: `no`)
- `http_timeout`: The timeout (in seconds) for all HTTP requests. (Default:
`20.0`)
## Authenticator types
### The `simple` authenticator
With this authenticator, the username and password can be set directly in the
config file. If the username or password is not specified, the user is prompted
via the terminal.
- `username`: The username. (Optional)
- `password`: The password. (Optional)
### The `credential-file` authenticator
This authenticator reads a username and a password from a credential file.
- `path`: Path to the credential file. (Required)

The credential file has exactly two lines (trailing newline optional). The first
line starts with `username=` and contains the username, the second line starts
with `password=` and contains the password. The username and password may
contain any characters except a line break.

```
username=AzureDiamond
password=hunter2
```
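
The file format described above is simple enough to parse by hand. The function
below is a minimal sketch of such a parser; `parse_credentials` is a
hypothetical helper written for this example, not PFERD's implementation.

```python
from typing import Tuple


def parse_credentials(text: str) -> Tuple[str, str]:
    """Parse the two-line credential format into (username, password)."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # the trailing newline is optional
    if len(lines) != 2:
        raise ValueError("expected exactly two lines")

    user_line, pass_line = lines
    if not user_line.startswith("username="):
        raise ValueError("first line must start with 'username='")
    if not pass_line.startswith("password="):
        raise ValueError("second line must start with 'password='")

    # Everything after the prefix belongs to the value, including '=' signs.
    return user_line[len("username="):], pass_line[len("password="):]
```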
### The `keyring` authenticator
This authenticator uses the system keyring to store passwords. The username can
be set directly in the config file. If the username is not specified, the user
is prompted via the terminal. If the keyring contains no entry or the entry is
incorrect, the user is prompted for a password via the terminal and the password
is stored in the keyring.
- `username`: The username. (Optional)
- `keyring_name`: The service name PFERD uses for storing credentials. (Default:
`PFERD`)
### The `tfa` authenticator
This authenticator prompts the user on the console for a two-factor
authentication token. The token is provided as password and it is not cached.
This authenticator does not support usernames.
## Transformation rules
Transformation rules are rules for renaming and excluding files and directories.
They are specified line-by-line in a crawler's `transform` option. When a
crawler needs to apply a rule to a path, it goes through this list top-to-bottom
and applies the first matching rule.

To see this process in action, you can use the `--debug-transforms` flag or
the `--explain` flag.

Each rule has the format `SOURCE ARROW TARGET` (e. g. `foo/bar --> foo/baz`).
The arrow specifies how the source and target are interpreted. The different
kinds of arrows are documented below.

`SOURCE` and `TARGET` are either a bunch of characters without spaces (e. g.
`foo/bar`) or string literals (e. g. `"foo/b a r"`). The former syntax has no
concept of escaping characters, so the backslash is just another character. The
string literals however support Python's escape syntax (e. g.
`"foo\\bar\tbaz"`). This also means that in string literals, backslashes must be
escaped.
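
The difference between the two syntaxes can be checked in Python itself. The
sketch uses `ast.literal_eval` only because it decodes Python string literals;
PFERD has its own rule parser, so treat this as an illustration of the escape
rules rather than of the implementation.

```python
import ast

# The string literal exactly as it would appear in the config file.
literal = r'"foo\\bar\tbaz"'
decoded = ast.literal_eval(literal)
# decoded consists of: foo, one backslash, bar, one tab character, baz

# The unquoted form has no escaping: the backslash is an ordinary character.
bare = r"foo\bar"
```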

`TARGET` can additionally be a single exclamation mark `!` (*not* `"!"`). When a
rule with a `!` as target matches a path, the corresponding file or directory is
ignored by the crawler instead of renamed.

`TARGET` can also be omitted entirely. When a rule without target matches a
path, the path is returned unmodified. This is useful to prevent rules further
down from matching instead.

Each arrow's behaviour can be modified slightly by changing the arrow's head
from `>` to `>>`. When a rule with a `>>` arrow head matches a path, it doesn't
return immediately like a normal arrow. Instead, it replaces the current path
with its output and continues on to the next rule. In effect, this means that
multiple rules can be applied sequentially.
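
The control flow of `>` versus `>>` can be modelled in a few lines. This is a
simplified sketch: rules are plain Python callables, ignoring via `!` is left
out, and returning the path unchanged when nothing matches is an assumption of
the example. All names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A rule is (function, continues). The function returns the new path, or None
# if the rule does not match. `continues` is True for a `>>` arrow head.
Rule = Tuple[Callable[[str], Optional[str]], bool]


def apply_rules(rules: List[Rule], path: str) -> str:
    for transform, continues in rules:
        result = transform(path)
        if result is None:
            continue  # no match: try the next rule
        if not continues:
            return result  # `>` head: the first match ends the search
        path = result  # `>>` head: keep going with the new path
    return path


rules: List[Rule] = [
    # Plays the role of a `>>` rule that normalizes every path.
    (lambda p: p.lower().replace(" ", "_"), True),
    # Plays the role of an ordinary `>` rule matching one exact path.
    (lambda p: "renamed" if p == "my_file" else None, False),
]
```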
### The `-->` arrow
The `-->` arrow is a basic renaming operation for files and directories. If a
path matches `SOURCE`, it is renamed to `TARGET`.

Example: `foo/bar --> baz`
- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
- Converts `foo/bar` into `baz`
- Converts `foo/bar/wargl` into `baz/wargl`

Example: `foo/bar --> !`
- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
- Ignores `foo/bar` and any of its children
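
The matching behaviour in these examples amounts to "the path is `SOURCE` or
lies below it". A minimal sketch with `pathlib`, where `rename` and the `IGNORE`
sentinel are hypothetical names for this example:

```python
from pathlib import PurePosixPath

IGNORE = object()  # stand-in for the `!` target


def rename(source: str, target, path: str):
    """Return the new path, IGNORE, or None if the rule does not match."""
    src, p = PurePosixPath(source), PurePosixPath(path)
    if p != src and src not in p.parents:
        return None  # neither SOURCE itself nor one of its children
    if target is IGNORE:
        return IGNORE
    # Swap the SOURCE prefix for TARGET and keep the rest of the path.
    return str(PurePosixPath(target) / p.relative_to(src))
```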
### The `-name->` arrow
The `-name->` arrow lets you rename files and directories by their name,
regardless of where they appear in the file tree. Because of this, its `SOURCE`
must not contain multiple path segments, only a single name. This restriction
does not apply to its `TARGET`.

Example: `foo -name-> bar/baz`
- Doesn't match `a/foobar/b` or `x/Foo/y/z`
- Converts `hello/foo` into `hello/bar/baz`
- Converts `foo/world` into `bar/baz/world`
- Converts `a/foo/b/c/foo` into `a/bar/baz/b/c/bar/baz`

Example: `foo -name-> !`
- Doesn't match `a/foobar/b` or `x/Foo/y/z`
- Ignores any path containing a segment `foo`
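
Renaming by name can be pictured as replacing every path segment that equals
`SOURCE`. The helper below is a hypothetical sketch of that idea (the `!` target
is omitted for brevity):

```python
from pathlib import PurePosixPath
from typing import Optional


def rename_by_name(source: str, target: str, path: str) -> Optional[str]:
    """Replace each segment equal to `source`; None if no segment matches."""
    parts = PurePosixPath(path).parts
    if source not in parts:
        return None
    # TARGET may span several segments, so rebuild the path from the pieces.
    new_parts = [target if part == source else part for part in parts]
    return str(PurePosixPath(*new_parts))
```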
### The `-exact->` arrow
The `-exact->` arrow requires the path to match `SOURCE` exactly. The examples
below show why this is useful.

Example: `foo/bar -exact-> baz`
- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
- Converts `foo/bar` into `baz`
- Doesn't match `foo/bar/wargl`

Example: `foo/bar -exact-> !`
- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
- Ignores only `foo/bar`, not its children
### The `-re->` arrow
The `-re->` arrow is like the `-->` arrow but with regular expressions. `SOURCE`
is a regular expression and `TARGET` an f-string based template. If a path
matches `SOURCE`, the output path is created using `TARGET` as template.
`SOURCE` is automatically anchored.

`TARGET` uses Python's [format string syntax][3]. The *n*-th capturing group can
be referred to as `{g<n>}` (e.g. `{g3}`). `{g0}` refers to the original path.
If capturing group *n*'s contents are a valid integer, the integer value is
available as `{i<n>}` (e.g. `{i3}`). If capturing group *n*'s contents are a
valid float, the float value is available as `{f<n>}` (e.g. `{f3}`). If a
capturing group is not present (e.g. when matching the string `cd` with the
regex `(ab)?cd`), the corresponding variables are not defined.

Python's format string syntax has rich options for formatting its arguments. For
example, to left-pad the capturing group 3 with the digit `0` to width 5, you
can use `{i3:05}`.

PFERD even allows you to write entire expressions inside the curly braces, for
example `{g2.lower()}` or `{g3.replace(' ', '_')}`.

Example: `f(oo+)/be?ar -re-> B{g1.upper()}H/fear`
- Doesn't match `a/foo/bar`, `foo/abc/bar`, `afoo/bar` or `foo/bars`
- Converts `foo/bar` into `BOOH/fear`
- Converts `fooooo/bear` into `BOOOOOH/fear`
- Converts `foo/bar/baz` into `BOOH/fear/baz`

[3]: <https://docs.python.org/3/library/string.html#format-string-syntax> "Format String Syntax"
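
The following sketch reproduces the example above in plain Python. It rests on
two assumptions that are not taken from PFERD's source: the regex is tried
against the whole path and then against ever shorter leading parts, and the
template is evaluated with `eval` as an f-string (which should never be done
with untrusted input). The function names are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import Optional


def template_vars(match: re.Match, original: str) -> dict:
    """Build the g<n>/i<n>/f<n> variables; absent groups define nothing."""
    variables = {"g0": original}
    for n, text in enumerate(match.groups(), start=1):
        if text is None:
            continue
        variables[f"g{n}"] = text
        for prefix, convert in (("i", int), ("f", float)):
            try:
                variables[f"{prefix}{n}"] = convert(text)
            except ValueError:
                pass  # not a valid int/float: leave the variable undefined
    return variables


def re_rename(source: str, target: str, path: str) -> Optional[str]:
    parts = PurePosixPath(path).parts
    for end in range(len(parts), 0, -1):
        head = "/".join(parts[:end])
        match = re.fullmatch(source, head)  # fullmatch anchors both ends
        if match:
            # Evaluate TARGET as an f-string so that expressions such as
            # g1.upper() work, not just plain placeholders.
            new_head = eval("f" + repr(target), {}, template_vars(match, path))
            return "/".join((new_head, *parts[end:]))
    return None
```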
### The `-name-re->` arrow
The `-name-re->` arrow is like a combination of the `-name->` and `-re->` arrows.

Example: `(.*)\.jpeg -name-re-> {g1}.jpg`
- Doesn't match `foo/bar.png`, `baz.JPEG` or `hello,jpeg`
- Converts `foo/bar.jpeg` into `foo/bar.jpg`
- Converts `foo.jpeg/bar/baz.jpeg` into `foo.jpg/bar/baz.jpg`

Example: `\..+ -name-re-> !`
- Doesn't match `.`, `test`, `a.b`
- Ignores all files and directories starting with `.`.
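
Applied per segment, the same idea looks like the sketch below. To keep it
short it uses `str.format`, which handles plain placeholders like `{g1}` but not
full expressions, and it leaves out the `!` target. `name_re_rename` is a
hypothetical helper.

```python
import re
from pathlib import PurePosixPath
from typing import Optional


def name_re_rename(source: str, target: str, path: str) -> Optional[str]:
    """Rewrite every segment fully matched by `source`; None if none match."""
    changed = False
    new_parts = []
    for part in PurePosixPath(path).parts:
        match = re.fullmatch(source, part)
        if match:
            groups = {
                f"g{n}": text
                for n, text in enumerate(match.groups(), start=1)
                if text is not None
            }
            part = target.format(**groups)
            changed = True
        new_parts.append(part)
    return "/".join(new_parts) if changed else None
```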
### The `-exact-re->` arrow
The `-exact-re->` arrow is like a combination of the `-exact->` and `-re->`
arrows.

Example: `f(oo+)/be?ar -exact-re-> B{g1.upper()}H/fear`
- Doesn't match `a/foo/bar`, `foo/abc/bar`, `afoo/bar` or `foo/bars`
- Converts `foo/bar` into `BOOH/fear`
- Converts `fooooo/bear` into `BOOOOOH/fear`
- Doesn't match `foo/bar/baz`
### Example: Tutorials
You have an ILIAS course with lots of tutorials, but are only interested in a
single one.
```
tutorials/
|- tut_01/
|- tut_02/
|- tut_03/
...
```
You can use a mix of normal and exact arrows to get rid of the other ones and
move the `tutorials/tut_02/` folder to `my_tut/`:
```
tutorials/tut_02 --> my_tut
tutorials -exact->
tutorials --> !
```
The second rule is required for many crawlers since they use the rules to decide
which directories to crawl. If it were missing, the third rule would match when
the crawler looks at `tutorials/`. This means the crawler would not crawl the
`tutorials/` directory and thus not discover that `tutorials/tut_02/` exists.

Since the second rule is only relevant for crawling, the `TARGET` is left out.
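
To see how the three rules interact, here they are hard-coded as a small Python
function. This is a hand-written model of the rule list above, not PFERD's rule
engine; `None` stands for "ignored", and the file name `sheet.pdf` in the usage
note is made up.

```python
from pathlib import PurePosixPath
from typing import Optional


def transform(path: str) -> Optional[str]:
    p = PurePosixPath(path)
    root = PurePosixPath("tutorials")
    tut = PurePosixPath("tutorials/tut_02")

    # tutorials/tut_02 --> my_tut
    if p == tut or tut in p.parents:
        return str(PurePosixPath("my_tut") / p.relative_to(tut))
    # tutorials -exact->   (keeps the directory itself so it still gets crawled)
    if p == root:
        return path
    # tutorials --> !
    if p == root or root in p.parents:
        return None
    return path  # no rule matched
```

With this model, `transform("tutorials/tut_02/sheet.pdf")` yields
`my_tut/sheet.pdf`, while `transform("tutorials/tut_01")` yields `None`.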
### Example: Lecture slides
You have a course with slides like `Lecture 3: Linear functions.PDF` and you
would like to rename them to `03_linear_functions.pdf`.
```
Lectures/
|- Lecture 1: Introduction.PDF
|- Lecture 2: Vectors and matrices.PDF
|- Lecture 3: Linear functions.PDF
...
```
To do this, you can use the most powerful of arrows: The regex arrow.
```
"Lectures/Lecture (\\d+): (.*)\\.PDF" -re-> "Lectures/{i1:02}_{g2.lower().replace(' ', '_')}.pdf"
```
Note the escaped backslashes on the `SOURCE` side.
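
The rule can be checked outside of PFERD with Python's `re` module and an
ordinary f-string. This is just a worked example of the regex and the template;
the helper name `rename_slide` is hypothetical.

```python
import re
from typing import Optional

# What the SOURCE string literal decodes to: Lectures/Lecture (\d+): (.*)\.PDF
SOURCE = "Lectures/Lecture (\\d+): (.*)\\.PDF"


def rename_slide(path: str) -> Optional[str]:
    match = re.fullmatch(SOURCE, path)
    if match is None:
        return None
    i1 = int(match.group(1))  # {i1}: group 1 as an integer
    g2 = match.group(2)       # {g2}: group 2 as a string
    return f"Lectures/{i1:02}_{g2.lower().replace(' ', '_')}.pdf"
```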
### Example: Crawl a Python project
You are crawling a Python project and want to ignore all hidden files (files
whose name starts with a `.`), all `__pycache__` directories and all markdown
files (for some weird reason).
```
.gitignore
.mypy_cache/
.venv/
CONFIG.md
PFERD/
|- __init__.py
|- __main__.py
|- __pycache__/
|- authenticator.py
|- config.py
...
README.md
...
```
For this task, the name arrows can be used.
```
\..* -name-re-> !
__pycache__ -name-> !
.*\.md -name-re-> !
```
### Example: Clean up names
You want to convert all paths into lowercase and replace spaces with underscores
before applying any rules. This can be achieved using the `>>` arrow heads.
```
(.*) -re->> "{g1.lower().replace(' ', '_')}"
<other rules go here>
```