mirror of
https://github.com/Garmelon/PFERD.git
synced 2023-12-21 10:23:01 +01:00
# Config file format

A config file consists of sections. A section begins with a `[section]` header,
which is followed by a list of `key = value` pairs. Comments must be on their
own line and start with `#`. Multiline values must be indented beyond their key.
Boolean values can be `yes` or `no`. For more details and some examples on the
format, see the [configparser documentation][1] ([interpolation][2] is
disabled).

[1]: <https://docs.python.org/3/library/configparser.html#supported-ini-file-structure> "Supported INI File Structure"
[2]: <https://docs.python.org/3/library/configparser.html#interpolation-of-values> "Interpolation of values"

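As a rough illustration of these format rules, a fragment might look like the
following. The section name and all values are placeholders chosen for this
sketch; the options themselves are described in the sections below.

```ini
# Comments must be on their own line.
[crawl:example]
type = local
target = ./some/directory
skip = no
# A multiline value: every continuation line is indented beyond its key.
transform =
  foo --> bar
  baz --> !
```
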
## The `DEFAULT` section

This section contains global configuration values. It can also be used to set
default values for the other sections.

- `working_dir`: The directory PFERD operates in. Set to an absolute path to
  make PFERD operate the same regardless of where it is executed from. All other
  paths in the config file are interpreted relative to this path. If this path
  is relative, it is interpreted relative to the script's working dir. `~` is
  expanded to the current user's home directory. (Default: `.`)
- `explain`: Whether PFERD should log and explain its actions and decisions in
  detail. (Default: `no`)
- `status`: Whether PFERD should print status updates (like `Crawled ...`,
  `Added ...`) while running a crawler. (Default: `yes`)
- `report`: Whether PFERD should print a report of added, changed and deleted
  local files for all crawlers before exiting. (Default: `yes`)
- `share_cookies`: Whether crawlers should share cookies where applicable. For
  example, some crawlers share cookies if they crawl the same website using the
  same account. (Default: `yes`)

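A minimal sketch of a `DEFAULT` section using the options above. The path is
made up for illustration; the remaining values simply spell out the documented
defaults.

```ini
[DEFAULT]
# Hypothetical directory; all other paths are interpreted relative to it.
working_dir = ~/university
explain = no
status = yes
report = yes
share_cookies = yes
```
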
## The `crawl:*` sections

Sections whose names start with `crawl:` are used to configure crawlers. The
rest of the section name specifies the name of the crawler.

A crawler synchronizes a remote resource to a local directory. There are
different types of crawlers for different kinds of resources, e.g. ILIAS
courses or lecture websites.

Each crawl section represents an instance of a specific type of crawler. The
`type` option is used to specify the crawler type. The crawler's name is usually
used as the output directory. New crawlers can be created simply by adding a new
crawl section to the config file.

Depending on a crawler's type, it may have different options. For more details,
see the type's [documentation](#crawler-types) below. The following options are
common to all crawlers:

- `type`: The available types are specified in [this section](#crawler-types).
- `skip`: Whether the crawler should be skipped during normal execution. The
  crawler can still be executed manually using the `--crawler` or `-C` flags.
  (Default: `no`)
- `output_dir`: The directory the crawler synchronizes files to. A crawler will
  never place any files outside this directory. (Default: the crawler's name)
- `redownload`: When to download a file that is already present locally.
  (Default: `never-smart`)
    - `never`: If a file is present locally, it is not downloaded again.
    - `never-smart`: Like `never`, but PFERD tries to detect if an already
      downloaded file has changed via some (unreliable) heuristics.
    - `always`: All files are always downloaded, regardless of whether they are
      already present locally.
    - `always-smart`: Like `always`, but PFERD tries to avoid unnecessary
      downloads via some (unreliable) heuristics.
- `on_conflict`: What to do when the local and remote versions of a file or
  directory differ, including when a file is replaced by a directory or a
  directory by a file. (Default: `prompt`)
    - `prompt`: Always ask the user before overwriting or deleting local files
      and directories.
    - `local-first`: Always keep the local file or directory. Equivalent to
      using `prompt` and always choosing "no". Implies that `redownload` is set
      to `never`.
    - `remote-first`: Always keep the remote file or directory. Equivalent to
      using `prompt` and always choosing "yes".
    - `no-delete`: Never delete local files, but overwrite local files if the
      remote file is different.
- `transform`: Rules for renaming and excluding certain files and directories.
  For more details, see [this section](#transformation-rules). (Default: empty)
- `tasks`: The maximum number of concurrent tasks (such as crawling or
  downloading). (Default: `1`)
- `downloads`: How many of those tasks can be download tasks at the same time.
  Must not be greater than `tasks`. (Default: Same as `tasks`)
- `task_delay`: Time (in seconds) that the crawler should wait between
  subsequent tasks. Can be used as a sort of rate limit to avoid unnecessary
  load for the crawl target. (Default: `0.0`)
- `windows_paths`: Whether PFERD should find alternative names for paths that
  are invalid on Windows. (Default: `yes` on Windows, `no` otherwise)
- `alias`: List of strings that are considered aliases when invoking PFERD with
  the `--crawler` or `-C` flag. If more than one crawl section has the same
  alias, all of them are selected. This lets you group different crawlers.

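To show how several of these common options fit together, here is a sketch of a
crawl section. The crawler name, the paths and the chosen values are
assumptions made for this example; `local` is one of the crawler types
documented below.

```ini
[crawl:lectures]
type = local
# Hypothetical source and output directories.
target = ./remote-mirror
output_dir = Lectures
redownload = never-smart
on_conflict = no-delete
tasks = 4
# Must not be greater than `tasks`.
downloads = 2
# Wait half a second between subsequent tasks.
task_delay = 0.5
```
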
Some crawlers may also require credentials for authentication. To configure how
the crawler obtains its credentials, the `auth` option is used. It is set to the
full name of an auth section (including the `auth:` prefix).

Here is a simple example:

```ini
[auth:example]
type = simple
username = foo
password = bar

[crawl:something]
alias = [sth, some]
type = some-complex-crawler
auth = auth:example
on_conflict = no-delete
tasks = 3
```

## The `auth:*` sections

Sections whose names start with `auth:` are used to configure authenticators. An
authenticator provides a username and a password to one or more crawlers.

Authenticators work similarly to crawlers: A section represents an authenticator
instance whose name is the rest of the section name. The type is specified by
the `type` option.

Depending on an authenticator's type, it may have different options. For more
details, see the type's [documentation](#authenticator-types) below. The only
option common to all authenticators is `type`:

- `type`: The types are specified in [this section](#authenticator-types).

## Crawler types

### The `local` crawler

This crawler crawls a local directory. It is really simple and mostly useful for
testing different setups. The various delay options are meant to make the
crawler simulate a slower, network-based crawler.

- `target`: Path to the local directory to crawl. (Required)
- `crawl_delay`: Artificial delay (in seconds) to simulate for crawl requests.
  (Default: `0.0`)
- `download_delay`: Artificial delay (in seconds) to simulate for download
  requests. (Default: `0.0`)
- `download_speed`: Download speed (in bytes per second) to simulate. (Optional)

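A sketch of a `local` crawler that imitates a slow network source. The crawler
name, the directory and the numbers are arbitrary choices for this example.

```ini
[crawl:local-test]
type = local
# Hypothetical directory to crawl.
target = ./test-data
crawl_delay = 0.5
download_delay = 1.0
# 1048576 bytes per second, i.e. 1 MiB/s.
download_speed = 1048576
```
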
### The `kit-ipd` crawler

This crawler crawls a KIT IPD page by URL. The root page can be crawled from
outside the KIT network so you will be informed about any new/deleted files,
but downloading files requires you to be within. Adding a short delay between
requests is likely a good idea.

### The `kit-ilias-web` crawler

This crawler crawls the KIT ILIAS instance.

ILIAS is not great at handling too many concurrent requests. To avoid
unnecessary load, please limit `tasks` to `1`.

There is a spike in ILIAS usage at the beginning of lectures, so please don't
run PFERD during those times.

If you're automatically running PFERD periodically (e. g. via cron or a systemd
timer), please randomize the start time or at least don't use the full hour. For
systemd timers, this can be accomplished using the `RandomizedDelaySec` option.
Also, please schedule the script to run in periods of low activity. Running the
script once per day should be fine.

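For the systemd case, a timer unit along the following lines is one way to
follow this advice. This is only a sketch: the unit name `pferd.timer`, the
start time and the delay are assumptions, and the matching service unit that
actually runs PFERD is not shown.

```ini
# Hypothetical ~/.config/systemd/user/pferd.timer
[Unit]
Description=Run PFERD once per day at a randomized time

[Timer]
# Start at night and deliberately not on the full hour.
OnCalendar=*-*-* 03:17:00
# Delay each run by a random amount of up to one hour.
RandomizedDelaySec=1h

[Install]
WantedBy=timers.target
```
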
- `target`: The ILIAS element to crawl. (Required)
    - `desktop`: Crawl your personal desktop
    - `<course id>`: Crawl the course with the given id
    - `<url>`: Crawl a given element by URL (preferably the permanent URL linked
      at the bottom of its ILIAS page)
- `auth`: Name of auth section to use for login. (Required)
- `tfa_auth`: Name of auth section to use for two-factor authentication. Only
  uses the auth section's password. (Default: Anonymous `tfa` authenticator)
- `links`: How to represent external links. (Default: `fancy`)
    - `ignore`: Don't download links.
    - `plaintext`: A text file containing only the URL.
    - `fancy`: An HTML file looking like the ILIAS link element.
    - `internet-shortcut`: An internet shortcut file (`.url` file).
- `link_redirect_delay`: Time (in seconds) until `fancy` link files will
  redirect to the actual URL. Set to a negative value to disable the automatic
  redirect. (Default: `-1`)
- `videos`: Whether to download videos. (Default: `no`)
- `http_timeout`: The timeout (in seconds) for all HTTP requests. (Default:
  `20.0`)

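Putting these options together, a `kit-ilias-web` crawler could be configured
roughly as follows. The section names, the username and the course id are
invented for this sketch. Since the `simple` authenticator (described below) is
given no password here, it would prompt for one.

```ini
[auth:ilias]
type = simple
# Hypothetical account name; the password is left out on purpose.
username = foo

[crawl:my-course]
type = kit-ilias-web
auth = auth:ilias
# Hypothetical course id.
target = 1234567
links = plaintext
videos = no
# Keep the load on ILIAS low, as recommended above.
tasks = 1
http_timeout = 30.0
```
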
## Authenticator types

### The `simple` authenticator

With this authenticator, the username and password can be set directly in the
config file. If the username or password is not specified, the user is prompted
via the terminal.

- `username`: The username. (Optional)
- `password`: The password. (Optional)

### The `credential-file` authenticator

This authenticator reads a username and a password from a credential file.

- `path`: Path to the credential file. (Required)

The credential file has exactly two lines (trailing newline optional). The first
line starts with `username=` and contains the username, the second line starts
with `password=` and contains the password. The username and password may
contain any characters except a line break.

```
username=AzureDiamond
password=hunter2
```

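The matching auth section could look like this sketch. The section name and
file name are placeholders; like other paths in the config file, `path` is
interpreted relative to `working_dir`.

```ini
[auth:from-file]
type = credential-file
# Hypothetical file containing the two lines shown above.
path = credentials.txt
```
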
### The `keyring` authenticator

This authenticator uses the system keyring to store passwords. The username can
be set directly in the config file. If the username is not specified, the user
is prompted via the terminal. If the keyring contains no entry or the entry is
incorrect, the user is prompted for a password via the terminal and the password
is stored in the keyring.

- `username`: The username. (Optional)
- `keyring_name`: The service name PFERD uses for storing credentials. (Default:
  `PFERD`)

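A short sketch of a `keyring` auth section. The section name and username are
placeholders, and `keyring_name` is set to its documented default only to make
the option visible.

```ini
[auth:keyring-example]
type = keyring
# Hypothetical account name.
username = foo
keyring_name = PFERD
```
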
### The `tfa` authenticator

This authenticator prompts the user on the console for a two-factor
authentication token. The token is provided as password and it is not cached.
This authenticator does not support usernames.

## Transformation rules

Transformation rules are rules for renaming and excluding files and directories.
They are specified line-by-line in a crawler's `transform` option. When a
crawler needs to apply a rule to a path, it goes through this list top-to-bottom
and applies the first matching rule.

To see this process in action, you can use the `--debug-transforms` flag or
the `--explain` flag.

Each rule has the format `SOURCE ARROW TARGET` (e. g. `foo/bar --> foo/baz`).
The arrow specifies how the source and target are interpreted. The different
kinds of arrows are documented below.

`SOURCE` and `TARGET` are either a bunch of characters without spaces (e. g.
`foo/bar`) or string literals (e. g. `"foo/b a r"`). The former syntax has no
concept of escaping characters, so the backslash is just another character. The
string literals however support Python's escape syntax (e. g.
`"foo\\bar\tbaz"`). This also means that in string literals, backslashes must be
escaped.

`TARGET` can additionally be a single exclamation mark `!` (*not* `"!"`). When a
rule with a `!` as target matches a path, the corresponding file or directory is
ignored by the crawler instead of renamed.

`TARGET` can also be omitted entirely. When a rule without target matches a
path, the path is returned unmodified. This is useful to prevent rules further
down from matching instead.

Each arrow's behaviour can be modified slightly by changing the arrow's head
from `>` to `>>`. When a rule with a `>>` arrow head matches a path, it doesn't
return immediately like a normal arrow. Instead, it replaces the current path
with its output and continues on to the next rule. In effect, this means that
multiple rules can be applied sequentially.

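The pieces above can be combined in a crawler's `transform` option roughly as
in the following sketch. The crawler and the directory names are made up; the
`-->` arrow used here is described in the next section.

```ini
[crawl:example]
type = local
# Hypothetical directory to crawl.
target = ./remote
# One rule per line, indented beyond the key.
transform =
  "Course Material" -->> material
  material/old --> !
  material/slides --> slides
```

Going by the description above, a path such as `Course Material/old/a.pdf`
would first be rewritten to `material/old/a.pdf` by the `>>` rule and then be
ignored by the second rule.
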
### The `-->` arrow

The `-->` arrow is a basic renaming operation for files and directories. If a
path matches `SOURCE`, it is renamed to `TARGET`.

Example: `foo/bar --> baz`
- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
- Converts `foo/bar` into `baz`
- Converts `foo/bar/wargl` into `baz/wargl`

Example: `foo/bar --> !`
- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
- Ignores `foo/bar` and any of its children

### The `-name->` arrow

The `-name->` arrow lets you rename files and directories by their name,
regardless of where they appear in the file tree. Because of this, its `SOURCE`
must not contain multiple path segments, only a single name. This restriction
does not apply to its `TARGET`.

Example: `foo -name-> bar/baz`
- Doesn't match `a/foobar/b` or `x/Foo/y/z`
- Converts `hello/foo` into `hello/bar/baz`
- Converts `foo/world` into `bar/baz/world`
- Converts `a/foo/b/c/foo` into `a/bar/baz/b/c/bar/baz`

Example: `foo -name-> !`
- Doesn't match `a/foobar/b` or `x/Foo/y/z`
- Ignores any path containing a segment `foo`

### The `-exact->` arrow

The `-exact->` arrow requires the path to match `SOURCE` exactly. The examples
below show why this is useful.

Example: `foo/bar -exact-> baz`
- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
- Converts `foo/bar` into `baz`
- Doesn't match `foo/bar/wargl`

Example: `foo/bar -exact-> !`
- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
- Ignores only `foo/bar`, not its children

### The `-re->` arrow

The `-re->` arrow is like the `-->` arrow but with regular expressions. `SOURCE`
is a regular expression and `TARGET` an f-string based template. If a path
matches `SOURCE`, the output path is created using `TARGET` as template.
`SOURCE` is automatically anchored.

`TARGET` uses Python's [format string syntax][3]. The *n*-th capturing group can
be referred to as `{g<n>}` (e.g. `{g3}`). `{g0}` refers to the original path.
If capturing group *n*'s contents are a valid integer, the integer value is
available as `{i<n>}` (e.g. `{i3}`). If capturing group *n*'s contents are a
valid float, the float value is available as `{f<n>}` (e.g. `{f3}`). If a
capturing group is not present (e.g. when matching the string `cd` with the
regex `(ab)?cd`), the corresponding variables are not defined.

Python's format string syntax has rich options for formatting its arguments. For
example, to left-pad the capturing group 3 with the digit `0` to width 5, you
can use `{i3:05}`.

PFERD even allows you to write entire expressions inside the curly braces, for
example `{g2.lower()}` or `{g3.replace(' ', '_')}`.

Example: `f(oo+)/be?ar -re-> B{g1.upper()}H/fear`
- Doesn't match `a/foo/bar`, `foo/abc/bar`, `afoo/bar` or `foo/bars`
- Converts `foo/bar` into `BOOH/fear`
- Converts `fooooo/bear` into `BOOOOOH/fear`
- Converts `foo/bar/baz` into `BOOH/fear/baz`

[3]: <https://docs.python.org/3/library/string.html#format-string-syntax> "Format String Syntax"

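As a small illustration of the integer variables and the padding syntax, a rule
along these lines could zero-pad sheet numbers. The crawler and file names are
invented for this sketch, and the unquoted syntax is used so that the
backslashes need no escaping.

```ini
[crawl:sheets]
type = local
# Hypothetical directory to crawl.
target = ./remote
transform =
  sheet(\d+)\.pdf -re-> sheet_{i1:03}.pdf
```

Under the rules described above, `sheet7.pdf` should become `sheet_007.pdf`,
since `{i1:03}` formats the integer `7` with a width of 3 and leading zeros.
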
### The `-name-re->` arrow

The `-name-re->` arrow is like a combination of the `-name->` and `-re->` arrows.

Example: `(.*)\.jpeg -name-re-> {g1}.jpg`
- Doesn't match `foo/bar.png`, `baz.JPEG` or `hello,jpeg`
- Converts `foo/bar.jpeg` into `foo/bar.jpg`
- Converts `foo.jpeg/bar/baz.jpeg` into `foo.jpg/bar/baz.jpg`

Example: `\..+ -name-re-> !`
- Doesn't match `.`, `test`, `a.b`
- Ignores all files and directories starting with `.`.

### The `-exact-re->` arrow

The `-exact-re->` arrow is like a combination of the `-exact->` and `-re->`
arrows.

Example: `f(oo+)/be?ar -exact-re-> B{g1.upper()}H/fear`
- Doesn't match `a/foo/bar`, `foo/abc/bar`, `afoo/bar` or `foo/bars`
- Converts `foo/bar` into `BOOH/fear`
- Converts `fooooo/bear` into `BOOOOOH/fear`
- Doesn't match `foo/bar/baz`

### Example: Tutorials

You have an ILIAS course with lots of tutorials, but are only interested in a
single one.

```
tutorials/
  |- tut_01/
  |- tut_02/
  |- tut_03/
  ...
```

You can use a mix of normal and exact arrows to get rid of the other ones and
move the `tutorials/tut_02/` folder to `my_tut/`:

```
tutorials/tut_02 --> my_tut
tutorials -exact->
tutorials --> !
```

The second rule is required for many crawlers since they use the rules to decide
which directories to crawl. If it were missing when the crawler looks at
`tutorials/`, the third rule would match. This means the crawler would not crawl
the `tutorials/` directory and thus not discover that `tutorials/tut_02/` exists.

Since the second rule is only relevant for crawling, the `TARGET` is left out.

### Example: Lecture slides

You have a course with slides like `Lecture 3: Linear functions.PDF` and you
would like to rename them to `03_linear_functions.pdf`.

```
Lectures/
  |- Lecture 1: Introduction.PDF
  |- Lecture 2: Vectors and matrices.PDF
  |- Lecture 3: Linear functions.PDF
  ...
```

To do this, you can use the most powerful of arrows: The regex arrow.

```
"Lectures/Lecture (\\d+): (.*)\\.PDF" -re-> "Lectures/{i1:02}_{g2.lower().replace(' ', '_')}.pdf"
```

Note the escaped backslashes on the `SOURCE` side.

### Example: Crawl a Python project

You are crawling a Python project and want to ignore all hidden files (files
whose name starts with a `.`), all `__pycache__` directories and all markdown
files (for some weird reason).

```
.gitignore
.mypy_cache/
.venv/
CONFIG.md
PFERD/
  |- __init__.py
  |- __main__.py
  |- __pycache__/
  |- authenticator.py
  |- config.py
  ...
README.md
...
```

For this task, the name arrows can be used.

```
\..* -name-re-> !
__pycache__ -name-> !
.*\.md -name-re-> !
```

### Example: Clean up names

You want to convert all paths into lowercase and replace spaces with underscores
before applying any rules. This can be achieved using the `>>` arrow heads.

```
(.*) -re->> "{g1.lower().replace(' ', '_')}"

<other rules go here>
```
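
For context, this is roughly how such a rule list could sit inside a complete
crawl section. The crawler name and target directory are assumptions for this
sketch, and the rules after the first one are taken from the tutorials example
above. Because the first rule uses a `>>` arrow head, the later rules see the
cleaned-up path.

```ini
[crawl:course]
type = local
# Hypothetical directory to crawl.
target = ./remote
transform =
  (.*) -re->> "{g1.lower().replace(' ', '_')}"
  tutorials/tut_02 --> my_tut
  tutorials -exact->
  tutorials --> !
```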