# Config file format

A config file consists of sections. A section begins with a `[section]` header,
which is followed by a list of `key = value` pairs. Comments must be on their
own line and start with `#`. Multiline values must be indented beyond their key.
Boolean values can be `yes` or `no`. For more details and some examples on the
format, see the [configparser documentation][1] ([basic interpolation][2] is
enabled).

[1]: <https://docs.python.org/3/library/configparser.html#supported-ini-file-structure> "Supported INI File Structure"
[2]: <https://docs.python.org/3/library/configparser.html#configparser.BasicInterpolation> "BasicInterpolation"

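
To get a feel for the format, the following sketch parses a small config with
Python's `configparser` directly. The section and key names (`base`, `enabled`,
`notes`) are made up for illustration, and PFERD may construct its parser with
different arguments; this only demonstrates the format rules described above.

```python
import configparser

# Hypothetical config text; section and key names are only for illustration.
CONFIG_TEXT = """
# Comments sit on their own line and start with '#'.
[DEFAULT]
base = /tmp/sync

[crawl:example]
enabled = yes
output_dir = %(base)s/example
notes = first line
    second line
"""

parser = configparser.ConfigParser(
    interpolation=configparser.BasicInterpolation()
)
parser.read_string(CONFIG_TEXT)

section = parser["crawl:example"]
enabled = section.getboolean("enabled")  # "yes" and "no" are parsed as booleans
output_dir = section["output_dir"]       # %(base)s is replaced via interpolation
notes = section["notes"]                 # indented continuation lines are joined

print(enabled, output_dir, repr(notes))
```

Note that a literal `%` in a value has to be written as `%%` while basic
interpolation is enabled.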
## The `DEFAULT` section

This section contains global configuration values. It can also be used to set
default values for the other sections.

- `working_dir`: The directory PFERD operates in. Set to an absolute path to
  make PFERD operate the same regardless of where it is executed from. All other
  paths in the config file are interpreted relative to this path. If this path
  is relative, it is interpreted relative to the script's working dir. `~` is
  expanded to the current user's home directory. (Default: `.`)
- `explain`: Whether PFERD should log and explain its actions and decisions in
  detail. (Default: `no`)
- `status`: Whether PFERD should print status updates (like `Crawled ...`,
  `Added ...`) while running a crawler. (Default: `yes`)
- `report`: Whether PFERD should print a report of added, changed and deleted
  local files for all crawlers before exiting. (Default: `yes`)
- `share_cookies`: Whether crawlers should share cookies where applicable. For
  example, some crawlers share cookies if they crawl the same website using the
  same account. (Default: `yes`)

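
The fallback behaviour of `DEFAULT` comes from `configparser` itself. The sketch
below shows a value set in `DEFAULT` being inherited by one section and
overridden by another, and how a `working_dir` containing `~` could be expanded.
The crawler names are hypothetical, and the `expanduser` call is an assumption
about how the expansion might be done, not PFERD's actual code.

```python
import configparser
from pathlib import Path

# Hypothetical config; the crawler names are only for illustration.
CONFIG_TEXT = """
[DEFAULT]
working_dir = ~/sync
tasks = 2

[crawl:first]
type = local

[crawl:second]
type = local
tasks = 5
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Keys missing from a section fall back to the DEFAULT section.
first_tasks = parser["crawl:first"].getint("tasks")
# Keys set in a section take precedence over DEFAULT.
second_tasks = parser["crawl:second"].getint("tasks")

# Assumed handling of `~`: expand it to the current user's home directory.
working_dir = Path(parser["DEFAULT"].get("working_dir", ".")).expanduser()

print(first_tasks, second_tasks, working_dir)
```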
## The `crawl:*` sections

Sections whose names start with `crawl:` are used to configure crawlers. The
rest of the section name specifies the name of the crawler.

A crawler synchronizes a remote resource to a local directory. There are
different types of crawlers for different kinds of resources, e. g. ILIAS
courses or lecture websites.

Each crawl section represents an instance of a specific type of crawler. The
`type` option is used to specify the crawler type. The crawler's name is usually
used as the output directory. New crawlers can be created simply by adding a new
crawl section to the config file.

Depending on a crawler's type, it may have different options. For more details,
see the type's [documentation](#crawler-types) below. The following options are
common to all crawlers:

- `type`: The available types are specified in [this section](#crawler-types).
- `output_dir`: The directory the crawler synchronizes files to. A crawler will
  never place any files outside of this directory. (Default: the crawler's name)
- `redownload`: When to download a file that is already present locally.
  (Default: `never-smart`)
    - `never`: If a file is present locally, it is not downloaded again.
    - `never-smart`: Like `never`, but PFERD tries to detect if an already
      downloaded file has changed via some (unreliable) heuristics.
    - `always`: All files are always downloaded, regardless of whether they are
      already present locally.
    - `always-smart`: Like `always`, but PFERD tries to avoid unnecessary
      downloads via some (unreliable) heuristics.
- `on_conflict`: What to do when the local and remote versions of a file or
  directory differ, including when a file is replaced by a directory or a
  directory by a file. (Default: `prompt`)
    - `prompt`: Always ask the user before overwriting or deleting local files
      and directories.
    - `local-first`: Always keep the local file or directory. Equivalent to
      using `prompt` and always choosing "no". Implies that `redownload` is set
      to `never`.
    - `remote-first`: Always keep the remote file or directory. Equivalent to
      using `prompt` and always choosing "yes".
    - `no-delete`: Never delete local files, but overwrite local files if the
      remote file is different.
- `transform`: Rules for renaming and excluding certain files and directories.
  For more details, see [this section](#transformation-rules). (Default: empty)
- `tasks`: The maximum number of concurrent tasks (such as crawling or
  downloading). (Default: `1`)
- `downloads`: How many of those tasks can be download tasks at the same time.
  Must not be greater than `tasks`. (Default: Same as `tasks`)
- `task_delay`: Time (in seconds) that the crawler should wait between
  subsequent tasks. Can be used as a sort of rate limit to avoid unnecessary
  load for the crawl target. (Default: `0.0`)
- `windows_paths`: Whether PFERD should find alternative names for paths that
  are invalid on Windows. (Default: `yes` on Windows, `no` otherwise)

Some crawlers may also require credentials for authentication. To configure how
the crawler obtains its credentials, the `auth` option is used. It is set to the
full name of an auth section (including the `auth:` prefix).

Here is a simple example:

```ini
[auth:example]
type = simple
username = foo
password = bar

[crawl:something]
type = some-complex-crawler
auth = auth:example
on_conflict = no-delete
tasks = 3
```

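
As a rough illustration of how these options relate to each other, the sketch
below reads the example config with `configparser`, derives each crawler's name
from its section name, applies the documented defaults for `tasks` and
`downloads`, and looks up the referenced auth section. The resulting dictionary
layout and the error messages are assumptions made for this example and do not
reflect PFERD's internals.

```python
import configparser

CONFIG_TEXT = """
[auth:example]
type = simple
username = foo
password = bar

[crawl:something]
type = some-complex-crawler
auth = auth:example
on_conflict = no-delete
tasks = 3
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

crawlers = {}
for section_name in parser.sections():
    if not section_name.startswith("crawl:"):
        continue
    section = parser[section_name]

    # Documented defaults: `tasks` is 1, `downloads` is the same as `tasks`.
    tasks = section.getint("tasks", fallback=1)
    downloads = section.getint("downloads", fallback=tasks)
    if downloads > tasks:
        raise ValueError(f"{section_name}: downloads must not exceed tasks")

    # `auth` holds the full name of an auth section, including the prefix.
    auth_name = section.get("auth")
    username = None
    if auth_name is not None:
        if auth_name not in parser:
            raise ValueError(f"{section_name}: unknown section {auth_name!r}")
        username = parser[auth_name].get("username")

    # The rest of the section name is the crawler's name.
    crawler_name = section_name[len("crawl:"):]
    crawlers[crawler_name] = {
        "type": section["type"],
        "tasks": tasks,
        "downloads": downloads,
        "username": username,
    }

print(crawlers)
```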
## The `auth:*` sections

Sections whose names start with `auth:` are used to configure authenticators. An
authenticator provides a username and a password to one or more crawlers.

Authenticators work similarly to crawlers: A section represents an authenticator
instance whose name is the rest of the section name. The type is specified by
the `type` option.

Depending on an authenticator's type, it may have different options. For more
details, see the type's [documentation](#authenticator-types) below. The only
option common to all authenticators is `type`:

- `type`: The types are specified in [this section](#authenticator-types).

## Crawler types

### The `local` crawler

This crawler crawls a local directory. It is really simple and mostly useful for
testing different setups. The various delay options are meant to make the
crawler simulate a slower, network-based crawler.

- `target`: Path to the local directory to crawl. (Required)
- `crawl_delay`: Artificial delay (in seconds) to simulate for crawl requests.
  (Default: `0.0`)
- `download_delay`: Artificial delay (in seconds) to simulate for download
  requests. (Default: `0.0`)
- `download_speed`: Download speed (in bytes per second) to simulate. (Optional)

### The `kit-ilias-web` crawler

This crawler crawls the KIT ILIAS instance.

ILIAS is not great at handling too many concurrent requests. To avoid
unnecessary load, please limit `tasks` to `1`.

There is a spike in ILIAS usage at the beginning of lectures, so please don't
run PFERD during those times.

If you're automatically running PFERD periodically (e. g. via cron or a systemd
timer), please randomize the start time or at least don't use the full hour. For
systemd timers, this can be accomplished using the `RandomizedDelaySec` option.
Also, please schedule the script to run in periods of low activity. Running the
script once per day should be fine.

- `target`: The ILIAS element to crawl. (Required)
    - `desktop`: Crawl your personal desktop
    - `<course id>`: Crawl the course with the given id
    - `<url>`: Crawl a given element by URL (preferably the permanent URL linked
      at the bottom of its ILIAS page)
- `auth`: Name of auth section to use for login. (Required)
- `tfa_auth`: Name of auth section to use for two-factor authentication. Only
  uses the auth section's password. (Default: Anonymous `tfa` authenticator)
- `links`: How to represent external links. (Default: `fancy`)
    - `ignore`: Don't download links.
    - `plaintext`: A text file containing only the URL.
    - `fancy`: An HTML file looking like the ILIAS link element.
    - `internet-shortcut`: An internet shortcut file (`.url` file).
- `link_redirect_delay`: Time (in seconds) until `fancy` link files will
  redirect to the actual URL. Set to a negative value to disable the automatic
  redirect. (Default: `-1`)
- `videos`: Whether to download videos. (Default: `no`)
- `http_timeout`: The timeout (in seconds) for all HTTP requests. (Default:
  `20.0`)

## Authenticator types

### The `simple` authenticator

With this authenticator, the username and password can be set directly in the
config file. If the username or password is not specified, the user is prompted
via the terminal.

- `username`: The username. (Optional)
- `password`: The password. (Optional)

### The `keyring` authenticator

This authenticator uses the system keyring to store passwords. The username can
be set directly in the config file. If the username is not specified, the user
is prompted via the terminal. If the keyring contains no entry or the entry is
incorrect, the user is prompted for a password via the terminal and the password
is stored in the keyring.

- `username`: The username. (Optional)
- `keyring_name`: The service name PFERD uses for storing credentials. (Default:
  `PFERD`)

### The `tfa` authenticator

This authenticator prompts the user on the console for a two-factor
authentication token. The token is provided as the password and is not cached.
This authenticator does not support usernames.

## Transformation rules

Transformation rules are rules for renaming and excluding files and directories.
They are specified line-by-line in a crawler's `transform` option. When a
crawler needs to apply a rule to a path, it goes through this list top-to-bottom
and chooses the first matching rule.

To see this process in action, you can use the `--debug-transforms` flag or
the `--explain` flag.

Each line has the format `SOURCE ARROW TARGET` where `TARGET` is optional.
`SOURCE` is either a normal path without spaces (e. g. `foo/bar`), or a string
literal delimited by `"` or `'` (e. g. `"foo\" bar/baz"`). Python's string
escape syntax is supported. Trailing slashes are ignored. `TARGET` can be
formatted like `SOURCE`, but it can also be a single exclamation mark without
quotes (`!`). `ARROW` is one of `-->`, `-name->`, `-exact->`, `-re->` and
`-name-re->`.

If a rule's target is `!`, this means that when the rule matches on a path, the
corresponding file or directory is ignored. If a rule's target is missing, the
path is matched but not modified.

### The `-->` arrow

The `-->` arrow is a basic renaming operation. If a path begins with `SOURCE`,
that part of the path is replaced with `TARGET`. This means that the rule
`foo/bar --> baz` would convert `foo/bar` into `baz`, but also `foo/bar/xyz`
into `baz/xyz`. The rule `foo --> !` would ignore a directory named `foo` as
well as all its contents.

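
The prefix replacement can be pictured with the following sketch. The function
`apply_prefix_rule` is a hypothetical helper written for this document, not part
of PFERD, and it assumes that "begins with" is checked per path segment rather
than per character. The `!` target is not modelled here.

```python
from pathlib import PurePosixPath
from typing import Optional


def apply_prefix_rule(path: str, source: str, target: str) -> Optional[str]:
    """Rename `path` if it begins with `source`; return None if it doesn't."""
    path_parts = PurePosixPath(path).parts
    # PurePosixPath drops trailing slashes, so `foo/bar/` equals `foo/bar`.
    source_parts = PurePosixPath(source).parts

    # Assumption: the rule matches whole path segments only.
    if path_parts[:len(source_parts)] != source_parts:
        return None

    # Keep whatever follows the matched prefix and attach it to the target.
    rest = path_parts[len(source_parts):]
    return str(PurePosixPath(target, *rest))


print(apply_prefix_rule("foo/bar", "foo/bar", "baz"))
print(apply_prefix_rule("foo/bar/xyz", "foo/bar", "baz"))
```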
### The `-name->` arrow

The `-name->` arrow lets you rename files and directories by their name,
regardless of where they appear in the file tree. Because of this, its `SOURCE`
must not contain multiple path segments, only a single name. This restriction
does not apply to its `TARGET`. The `-name->` arrow is not applied recursively
to its own output to prevent infinite loops.

For example, the rule `foo -name-> bar/baz` would convert `a/foo` into
`a/bar/baz` and `a/foo/b/c/foo` into `a/bar/baz/b/c/bar/baz`. The rule
`foo -name-> !` would ignore all directories and files named `foo`.

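
One way to picture this is a single pass over the path's segments, replacing
every segment that equals `SOURCE`. The helper `apply_name_rule` below is a
hypothetical illustration and not PFERD's implementation. Because the output is
never scanned again, a target that itself contains the source name does not
cause a loop.

```python
from pathlib import PurePosixPath
from typing import Optional


def apply_name_rule(path: str, source: str, target: str) -> Optional[str]:
    """Replace every segment named `source`; return None if there is none."""
    parts = PurePosixPath(path).parts
    if source not in parts:
        return None

    # Single pass: each matching segment is replaced exactly once.
    new_parts = [target if part == source else part for part in parts]
    # A target with several segments (e.g. `bar/baz`) is spliced in as-is.
    return str(PurePosixPath(*new_parts))


print(apply_name_rule("a/foo", "foo", "bar/baz"))
print(apply_name_rule("a/foo/b/c/foo", "foo", "bar/baz"))
```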
### The `-exact->` arrow

The `-exact->` arrow requires the path to match `SOURCE` exactly. This means
that the rule `foo/bar -exact-> baz` would still convert `foo/bar` into `baz`,
but `foo/bar/xyz` would be unaffected. Also, `foo -exact-> !` would only ignore
`foo`, but not its contents (if it has any). The examples below show why this is
useful.

### The `-re->` arrow

The `-re->` arrow uses regular expressions. `SOURCE` is a regular expression
that must match the entire path. If this is the case, then the capturing groups
are available in `TARGET` for formatting.

`TARGET` uses Python's [format string syntax][3]. The *n*-th capturing group can
be referred to as `{g<n>}` (e. g. `{g3}`). `{g0}` refers to the original path.
If capturing group *n*'s contents are a valid integer, the integer value is
available as `{i<n>}` (e. g. `{i3}`). If capturing group *n*'s contents are a
valid float, the float value is available as `{f<n>}` (e. g. `{f3}`).

Python's format string syntax has rich options for formatting its arguments. For
example, to left-pad capturing group 3 with the digit `0` to width 5, you
can use `{i3:05}`.

PFERD even allows you to write entire expressions inside the curly braces, for
example `{g2.lower()}` or `{g3.replace(' ', '_')}`.

[3]: <https://docs.python.org/3/library/string.html#format-string-syntax> "Format String Syntax"

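
The naming scheme for `{g<n>}`, `{i<n>}` and `{f<n>}` can be approximated with
`re.fullmatch` and `str.format`, as in the sketch below. The helper
`apply_regex_rule` is hypothetical and only covers plain format fields. Plain
`str.format` does not evaluate expressions such as `{g2.lower()}`, so that part
of PFERD's behaviour is deliberately left out.

```python
import re
from typing import Optional


def apply_regex_rule(path: str, source: str, target: str) -> Optional[str]:
    """Format `target` with the groups of `source` if it matches all of `path`."""
    match = re.fullmatch(source, path)
    if match is None:
        return None

    # g0 is the whole matched path, g1, g2, ... are the capturing groups.
    fields = {"g0": match.group(0)}
    for n, text in enumerate(match.groups(), start=1):
        if text is None:  # group did not participate in the match
            continue
        fields[f"g{n}"] = text
        try:
            fields[f"i{n}"] = int(text)  # only set for valid integers
        except ValueError:
            pass
        try:
            fields[f"f{n}"] = float(text)  # only set for valid floats
        except ValueError:
            pass

    return target.format(**fields)


print(apply_regex_rule("slides/7.pdf", r"slides/(\d+)\.pdf", "slides/{i1:05}.pdf"))
```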
### The `-name-re->` arrow

The `-name-re->` arrow is like a combination of the `-name->` and `-re->` arrows.
Instead of the `SOURCE` being the name of a directory or file, it's a regex that
is matched against the names of directories and files. `TARGET` works like the
`-re->` arrow's target.

For example, the rule `(.*)\.jpeg -name-re-> {g1}.jpg` will rename all `.jpeg`
extensions to `.jpg`. The rule `\..+ -name-re-> !` will ignore all files and
directories starting with `.`.

### Example: Tutorials

You have an ILIAS course with lots of tutorials, but are only interested in a
single one.

```
tutorials/
  |- tut_01/
  |- tut_02/
  |- tut_03/
  ...
```

You can use a mix of normal and exact arrows to get rid of the other ones and
move the `tutorials/tut_02/` folder to `my_tut/`:

```
tutorials/tut_02 --> my_tut
tutorials -exact->
tutorials --> !
```

The second rule is required for many crawlers since they use the rules to decide
which directories to crawl. If it were missing when the crawler looks at
`tutorials/`, the third rule would match. This means the crawler would not crawl
the `tutorials/` directory and thus not discover that `tutorials/tut_02/`
existed.

Since the second rule is only relevant for crawling, the `TARGET` is left out.

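
The effect of the rule order can be checked with a small simulation of the
first-match lookup. The `transform` function below is a hypothetical sketch that
only knows the `-->` and `-exact->` arrows. It assumes segment-wise prefix
matching and that paths matched by no rule are kept unchanged. It returns `None`
for ignored paths.

```python
from pathlib import PurePosixPath
from typing import List, Optional, Tuple

# (SOURCE, ARROW, TARGET); a TARGET of None means "matched but not modified".
Rule = Tuple[str, str, Optional[str]]

RULES: List[Rule] = [
    ("tutorials/tut_02", "-->", "my_tut"),
    ("tutorials", "-exact->", None),
    ("tutorials", "-->", "!"),
]


def transform(path: str, rules: List[Rule]) -> Optional[str]:
    """Apply the first matching rule; return None if the path is ignored."""
    parts = PurePosixPath(path).parts
    for source, arrow, target in rules:
        source_parts = PurePosixPath(source).parts
        if arrow == "-exact->":
            matched = parts == source_parts
        else:  # "-->": the path must begin with SOURCE
            matched = parts[:len(source_parts)] == source_parts
        if not matched:
            continue
        if target == "!":
            return None  # ignored
        if target is None:
            return path  # matched, left as-is
        rest = parts[len(source_parts):]
        return str(PurePosixPath(target, *rest))
    return path  # assumption: unmatched paths are kept unchanged


print(transform("tutorials", RULES))
print(transform("tutorials/tut_02/sheet.pdf", RULES))
print(transform("tutorials/tut_01", RULES))
```

Dropping the second rule from `RULES` makes `transform("tutorials", ...)` return
`None`, which corresponds to the crawler never descending into `tutorials/`.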
### Example: Lecture slides

You have a course with slides like `Lecture 3: Linear functions.PDF` and you
would like to rename them to `03_linear_functions.pdf`.

```
Lectures/
  |- Lecture 1: Introduction.PDF
  |- Lecture 2: Vectors and matrices.PDF
  |- Lecture 3: Linear functions.PDF
  ...
```

To do this, you can use the most powerful of arrows: The regex arrow.

```
"Lectures/Lecture (\\d+): (.*)\\.PDF" -re-> "Lectures/{i1:02}_{g2.lower().replace(' ', '_')}.pdf"
```

Note the escaped backslashes on the `SOURCE` side.

### Example: Crawl a Python project

You are crawling a Python project and want to ignore all hidden files (files
whose name starts with a `.`), all `__pycache__` directories and all markdown
files (for some weird reason).

```
.gitignore
.mypy_cache/
.venv/
CONFIG.md
PFERD/
  |- __init__.py
  |- __main__.py
  |- __pycache__/
  |- authenticator.py
  |- config.py
  ...
README.md
...
```

For this task, the name arrows can be used. They are variants of the normal
arrows that only look at the file name instead of the entire path.

```
\..* -name-re-> !
__pycache__ -name-> !
.*\.md -name-re-> !
```

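
To check which paths of the tree above these three rules would exclude, the
rules can be mimicked with a few lines of Python. The function `is_ignored` is
a hypothetical sketch. It assumes that a `-name-re->` pattern has to match an
entire name and that a rule applies to every segment of a path, so everything
below an ignored directory is ignored as well.

```python
import re
from pathlib import PurePosixPath

# Patterns of the two `-name-re->` rules and the name of the `-name->` rule.
IGNORED_NAME_PATTERNS = [r"\..*", r".*\.md"]
IGNORED_NAMES = ["__pycache__"]


def is_ignored(path: str) -> bool:
    """Return True if any segment of `path` is hit by one of the rules."""
    for name in PurePosixPath(path).parts:
        if name in IGNORED_NAMES:
            return True
        # Assumption: the regex must match the whole name, not just a part.
        if any(re.fullmatch(p, name) for p in IGNORED_NAME_PATTERNS):
            return True
    return False


for candidate in [".gitignore", "README.md", "PFERD/__pycache__", "PFERD/config.py"]:
    print(candidate, is_ignored(candidate))
```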