mirror of https://github.com/Garmelon/PFERD.git synced 2023-12-21 10:23:01 +01:00

Joscha afbd03f777 Fix docs

2022-05-05 14:35:42 +02:00

Config file format

A config file consists of sections. A section begins with a [section] header, which is followed by a list of key = value pairs. Comments must be on their own line and start with #. Multiline values must be indented beyond their key. Boolean values can be yes or no. For more details and some examples on the format, see the configparser documentation (interpolation is disabled).
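As a rough illustration of these format rules, the following sketch parses a small config with Python's configparser, with interpolation disabled as stated above. The section name crawl:example and all of its values are made up for this example, and PFERD's own loading code may configure the parser differently.

```python
import configparser

SAMPLE = """\
# Comments sit on their own line and start with '#'.
[crawl:example]
type = local
skip = yes
transform =
    foo --> bar
    baz --> !
"""

# interpolation=None disables interpolation, so characters like '%' stay literal.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)

section = parser["crawl:example"]

# "yes" and "no" are understood by getboolean().
skip = section.getboolean("skip")

# A multiline value arrives as one string; split it into its lines.
rules = [line.strip() for line in section["transform"].strip().splitlines()]

print(skip)
print(rules)
```

The continuation lines under transform are only recognised because they are indented further than their key.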

The `DEFAULT` section

This section contains global configuration values. It can also be used to set default values for the other sections.

working_dir: The directory PFERD operates in. Set to an absolute path to make PFERD operate the same regardless of where it is executed from. All other paths in the config file are interpreted relative to this path. If this path is relative, it is interpreted relative to the script's working dir. ~ is expanded to the current user's home directory. (Default: .)
explain: Whether PFERD should log and explain its actions and decisions in detail. (Default: no)
status: Whether PFERD should print status updates (like Crawled ..., Added ...) while running a crawler. (Default: yes)
report: Whether PFERD should print a report of added, changed and deleted local files for all crawlers before exiting. (Default: yes)
share_cookies: Whether crawlers should share cookies where applicable. For example, some crawlers share cookies if they crawl the same website using the same account. (Default: yes)
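The fallback behaviour of the DEFAULT section can be tried out with plain configparser. This is only a sketch: the crawler names first and second and the chosen values are hypothetical.

```python
import configparser

SAMPLE = """\
[DEFAULT]
working_dir = ~/sync
tasks = 2

[crawl:first]
type = local

[crawl:second]
type = local
tasks = 5
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)

# crawl:first does not set tasks, so the lookup falls back to DEFAULT.
first_tasks = parser["crawl:first"].getint("tasks")

# crawl:second overrides the default with its own value.
second_tasks = parser["crawl:second"].getint("tasks")

print(first_tasks, second_tasks)
```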

The `crawl:*` sections

Sections whose names start with crawl: are used to configure crawlers. The rest of the section name specifies the name of the crawler.

A crawler synchronizes a remote resource to a local directory. There are different types of crawlers for different kinds of resources, e.g. ILIAS courses or lecture websites.

Each crawl section represents an instance of a specific type of crawler. The type option is used to specify the crawler type. The crawler's name is usually used as the output directory. New crawlers can be created simply by adding a new crawl section to the config file.

Depending on a crawler's type, it may have different options. For more details, see the type's documentation below. The following options are common to all crawlers:

type: The crawler type. The available types are listed under "Crawler types" below.
skip: Whether the crawler should be skipped during normal execution. The crawler can still be executed manually using the --crawler or -C flags. (Default: no)
output_dir: The directory the crawler synchronizes files to. A crawler will never place any files outside this directory. (Default: the crawler's name)
redownload: When to download a file that is already present locally. (Default: never-smart)
- never: If a file is present locally, it is not downloaded again.
- never-smart: Like never, but PFERD tries to detect if an already downloaded file has changed via some (unreliable) heuristics.
- always: All files are always downloaded, regardless of whether they are already present locally.
- always-smart: Like always, but PFERD tries to avoid unnecessary downloads via some (unreliable) heuristics.
on_conflict: What to do when the local and remote versions of a file or directory differ, including when a file is replaced by a directory or a directory by a file. (Default: prompt)
- prompt: Always ask the user before overwriting or deleting local files and directories.
- local-first: Always keep the local file or directory. Equivalent to using prompt and always choosing "no". Implies that redownload is set to never.
- remote-first: Always keep the remote file or directory. Equivalent to using prompt and always choosing "yes".
- no-delete: Never delete local files, but overwrite local files if the remote file is different.
transform: Rules for renaming and excluding certain files and directories. For more details, see the "Transformation rules" section below. (Default: empty)
tasks: The maximum number of concurrent tasks (such as crawling or downloading). (Default: 1)
downloads: How many of those tasks can be download tasks at the same time. Must not be greater than tasks. (Default: Same as tasks)
task_delay: Time (in seconds) that the crawler should wait between subsequent tasks. Can be used as a sort of rate limit to avoid unnecessary load for the crawl target. (Default: 0.0)
windows_paths: Whether PFERD should find alternative names for paths that are invalid on Windows. (Default: yes on Windows, no otherwise)
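The relationship between tasks, downloads and task_delay can be sketched as a small check. The helper load_limits is hypothetical and not part of PFERD. It only mirrors the defaults and the constraint described in the list above.

```python
import configparser


def load_limits(section):
    """Read tasks, downloads and task_delay with the documented defaults."""
    tasks = section.getint("tasks", fallback=1)
    # downloads defaults to the same value as tasks.
    downloads = section.getint("downloads", fallback=tasks)
    task_delay = section.getfloat("task_delay", fallback=0.0)
    if downloads > tasks:
        raise ValueError("downloads must not be greater than tasks")
    return tasks, downloads, task_delay


parser = configparser.ConfigParser(interpolation=None)
parser.read_string("""\
[crawl:ok]
tasks = 3

[crawl:bad]
tasks = 2
downloads = 5
""")

print(load_limits(parser["crawl:ok"]))
```

Calling load_limits on the crawl:bad section raises a ValueError because downloads exceeds tasks.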

Some crawlers may also require credentials for authentication. To configure how the crawler obtains its credentials, the auth option is used. It is set to the full name of an auth section (including the auth: prefix).

Here is a simple example:

[auth:example]
type = simple
username = foo
password = bar

[crawl:something]
type = some-complex-crawler
auth = auth:example
on_conflict = no-delete
tasks = 3
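To show how the auth option points at another section, here is a minimal sketch that parses the example with configparser and follows the reference. It only demonstrates the lookup by full section name. How PFERD resolves the reference internally is not shown here.

```python
import configparser

SAMPLE = """\
[auth:example]
type = simple
username = foo
password = bar

[crawl:something]
type = some-complex-crawler
auth = auth:example
on_conflict = no-delete
tasks = 3
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)

crawl = parser["crawl:something"]

# The option holds the full section name, including the "auth:" prefix.
auth_name = crawl["auth"]
if not auth_name.startswith("auth:") or auth_name not in parser:
    raise KeyError(f"no such auth section: {auth_name}")

auth = parser[auth_name]
print(auth["type"], auth["username"])
```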

The `auth:*` sections

Sections whose names start with auth: are used to configure authenticators. An authenticator provides a username and a password to one or more crawlers.

Authenticators work similarly to crawlers: A section represents an authenticator instance whose name is the rest of the section name. The type is specified by the type option.

Depending on an authenticator's type, it may have different options. For more details, see the type's documentation below. The only option common to all authenticators is type:

type: The authenticator type. The available types are listed under "Authenticator types" below.

Crawler types

The `local` crawler

This crawler crawls a local directory. It is really simple and mostly useful for testing different setups. The various delay options are meant to make the crawler simulate a slower, network-based crawler.

target: Path to the local directory to crawl. (Required)
crawl_delay: Artificial delay (in seconds) to simulate for crawl requests. (Default: 0.0)
download_delay: Artificial delay (in seconds) to simulate for download requests. (Default: 0.0)
download_speed: Download speed (in bytes per second) to simulate. (Optional)

The `kit-ipd` crawler

This crawler crawls a KIT-IPD page by URL. The root page can be crawled from outside the KIT network so you will be informed about any new/deleted files, but downloading files requires you to be within. Adding a short delay between requests is likely a good idea.

target: URL to a KIT-IPD page
link_regex: A regex that is matched against the href part of links. If it matches, the given link is downloaded as a file. This is used to extract files from KIT-IPD pages. (Default: ^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$)
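To get a feel for what the default link_regex accepts, it can be tried against a few href values with Python's re module. The hrefs below are invented, and the sketch assumes the pattern is applied without any regex flags (so matching is case-sensitive).

```python
import re

# The default link_regex from above.
LINK_REGEX = re.compile(r"^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$")

hrefs = [
    "/lehre/slides/01-intro.pdf",
    "https://example.org/code/Main.java",
    "/lehre/index.html",         # wrong extension
    "/lehre/uebung/sheet1.PDF",  # upper-case extension
    "notes.zip",                 # no '/' anywhere in the href
]

# Keep only the links that would be treated as downloadable files.
downloads = [href for href in hrefs if LINK_REGEX.search(href)]
print(downloads)
```

Note that the pattern requires at least one / in the href, so a bare file name like notes.zip is not matched.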

The `kit-ilias-web` crawler

This crawler crawls the KIT ILIAS instance.

ILIAS is not great at handling too many concurrent requests. To avoid unnecessary load, please limit tasks to 1.

There is a spike in ILIAS usage at the beginning of lectures, so please don't run PFERD during those times.

If you're automatically running PFERD periodically (e. g. via cron or a systemd timer), please randomize the start time or at least don't use the full hour. For systemd timers, this can be accomplished using the RandomizedDelaySec option. Also, please schedule the script to run in periods of low activity. Running the script once per day should be fine.

target: The ILIAS element to crawl. (Required)
- desktop: Crawl your personal desktop
- <course id>: Crawl the course with the given id
- <url>: Crawl a given element by URL (preferably the permanent URL linked at the bottom of its ILIAS page)
auth: Name of auth section to use for login. (Required)
tfa_auth: Name of auth section to use for two-factor authentication. Only uses the auth section's password. (Default: Anonymous tfa authenticator)
links: How to represent external links. (Default: fancy)
- ignore: Don't download links.
- plaintext: A text file containing only the URL.
- fancy: An HTML file looking like the ILIAS link element.
- internet-shortcut: An internet shortcut file (.url file).
link_redirect_delay: Time (in seconds) until fancy link files will redirect to the actual URL. Set to a negative value to disable the automatic redirect. (Default: -1)
videos: Whether to download videos. (Default: no)
http_timeout: The timeout (in seconds) for all HTTP requests. (Default: 20.0)

Authenticator types

The `simple` authenticator

With this authenticator, the username and password can be set directly in the config file. If the username or password are not specified, the user is prompted via the terminal.

username: The username. (Optional)
password: The password. (Optional)

The `credential-file` authenticator

This authenticator reads a username and a password from a credential file.

path: Path to the credential file. (Required)

The credential file has exactly two lines (trailing newline optional). The first line starts with username= and contains the username, the second line starts with password= and contains the password. The username and password may contain any characters except a line break.

username=AzureDiamond
password=hunter2
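The credential file format is simple enough to parse by hand. The function parse_credentials below is a hypothetical sketch of such a parser, written only from the description above. It is not PFERD's actual code.

```python
def parse_credentials(text):
    """Return (username, password) from the contents of a credential file."""
    # The trailing newline is optional.
    if text.endswith("\n"):
        text = text[:-1]
    lines = text.split("\n")
    if len(lines) != 2:
        raise ValueError("credential file must have exactly two lines")
    user_line, pass_line = lines
    if not user_line.startswith("username="):
        raise ValueError("first line must start with 'username='")
    if not pass_line.startswith("password="):
        raise ValueError("second line must start with 'password='")
    # Everything after the prefix belongs to the value, including any '='.
    return user_line[len("username="):], pass_line[len("password="):]


print(parse_credentials("username=AzureDiamond\npassword=hunter2\n"))
```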

The `keyring` authenticator

This authenticator uses the system keyring to store passwords. The username can be set directly in the config file. If the username is not specified, the user is prompted via the terminal. If the keyring contains no entry or the entry is incorrect, the user is prompted for a password via the terminal and the password is stored in the keyring.

username: The username. (Optional)
keyring_name: The service name PFERD uses for storing credentials. (Default: PFERD)

The `tfa` authenticator

This authenticator prompts the user on the console for a two-factor authentication token. The token is provided as password and it is not cached. This authenticator does not support usernames.

Transformation rules

Transformation rules are rules for renaming and excluding files and directories. They are specified line-by-line in a crawler's transform option. When a crawler needs to apply a rule to a path, it goes through this list top-to-bottom and applies the first matching rule.

To see this process in action, you can use the --debug-transforms flag or the --explain flag.

Each rule has the format SOURCE ARROW TARGET (e. g. foo/bar --> foo/baz). The arrow specifies how the source and target are interpreted. The different kinds of arrows are documented below.

SOURCE and TARGET are either a bunch of characters without spaces (e. g. foo/bar) or string literals (e. g. "foo/b a r"). The former syntax has no concept of escaping characters, so the backslash is just another character. The string literals however support Python's escape syntax (e. g. "foo\\bar\tbaz"). This also means that in string literals, backslashes must be escaped.

TARGET can additionally be a single exclamation mark ! (not "!"). When a rule with a ! as target matches a path, the corresponding file or directory is ignored by the crawler instead of renamed.

TARGET can also be omitted entirely. When a rule without target matches a path, the path is returned unmodified. This is useful to prevent rules further down from matching instead.

Each arrow's behaviour can be modified slightly by changing the arrow's head from > to >>. When a rule with a >> arrow head matches a path, it doesn't return immediately like a normal arrow. Instead, it replaces the current path with its output and continues on to the next rule. In effect, this means that multiple rules can be applied sequentially.
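The difference between the > and >> arrow heads can be pictured as a small loop. This is a simplified model, not PFERD's implementation: rules are plain Python functions that return the new path or None if they don't match, ! targets are left out, and returning the path unchanged when no rule matches is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# A matcher returns the transformed path, or None if the rule does not match.
Matcher = Callable[[str], Optional[str]]
# (matcher, continues): continues is True for a ">>" head, False for ">".
Rule = Tuple[Matcher, bool]


def apply_rules(rules: List[Rule], path: str) -> str:
    for matcher, continues in rules:
        result = matcher(path)
        if result is None:
            continue       # rule did not match, try the next one
        if not continues:
            return result  # ">" head: the first matching rule wins
        path = result      # ">>" head: keep going with the new path
    return path


def clean(path: str) -> Optional[str]:
    # Always matches: lowercase the path and replace spaces.
    return path.lower().replace(" ", "_")


def rename_slides(path: str) -> Optional[str]:
    return "slides" if path == "lecture_slides" else None


def rename_all(path: str) -> Optional[str]:
    return "other"


RULES = [(clean, True), (rename_slides, False), (rename_all, False)]
print(apply_rules(RULES, "Lecture Slides"))
print(apply_rules(RULES, "Exercise Sheets"))
```

Because clean uses a >> head, the later rules see the cleaned-up path. rename_all never gets a chance when rename_slides has already matched.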

The `-->` arrow

The --> arrow is a basic renaming operation for files and directories. If a path matches SOURCE, it is renamed to TARGET.

Example: foo/bar --> baz

Doesn't match foo, a/foo/bar or foo/baz
Converts foo/bar into baz
Converts foo/bar/wargl into baz/wargl

Example: foo/bar --> !

Doesn't match foo, a/foo/bar or foo/baz
Ignores foo/bar and any of its children
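The renaming behaviour of the --> arrow can be approximated with pathlib. The function normal_arrow is a hypothetical sketch that only covers renaming (not the ! target). It treats SOURCE as a path prefix that is replaced by TARGET.

```python
from pathlib import PurePosixPath
from typing import Optional


def normal_arrow(source: str, target: str, path: str) -> Optional[str]:
    """Return the renamed path, or None if SOURCE does not match."""
    try:
        # Succeeds if path is SOURCE itself or lies somewhere below it.
        rest = PurePosixPath(path).relative_to(source)
    except ValueError:
        return None
    return str(PurePosixPath(target) / rest)


for candidate in ["foo", "a/foo/bar", "foo/baz", "foo/bar", "foo/bar/wargl"]:
    print(candidate, "->", normal_arrow("foo/bar", "baz", candidate))
```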

The `-name->` arrow

The -name-> arrow lets you rename files and directories by their name, regardless of where they appear in the file tree. Because of this, its SOURCE must not contain multiple path segments, only a single name. This restriction does not apply to its TARGET.

Example: foo -name-> bar/baz

Doesn't match a/foobar/b or x/Foo/y/z
Converts hello/foo into hello/bar/baz
Converts foo/world into bar/baz/world
Converts a/foo/b/c/foo into a/bar/baz/b/c/bar/baz

Example: foo -name-> !

Doesn't match a/foobar/b or x/Foo/y/z
Ignores any path containing a segment foo
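A minimal model of the -name-> arrow replaces every path segment that equals SOURCE. The function name_arrow is a hypothetical sketch for the renaming case only, and is not taken from PFERD.

```python
from pathlib import PurePosixPath
from typing import Optional


def name_arrow(source: str, target: str, path: str) -> Optional[str]:
    """Replace each segment equal to SOURCE, or return None if there is none."""
    parts = PurePosixPath(path).parts
    if source not in parts:
        return None
    result = PurePosixPath()
    for part in parts:
        # TARGET may itself consist of several segments, e.g. bar/baz.
        result = result / (target if part == source else part)
    return str(result)


for candidate in ["a/foobar/b", "x/Foo/y/z", "hello/foo", "foo/world", "a/foo/b/c/foo"]:
    print(candidate, "->", name_arrow("foo", "bar/baz", candidate))
```

Segments are compared as a whole and case-sensitively, which is why foobar and Foo are left alone.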

The `-exact->` arrow

The -exact-> arrow requires the path to match SOURCE exactly. The examples below show why this is useful.

Example: foo/bar -exact-> baz

Doesn't match foo, a/foo/bar or foo/baz
Converts foo/bar into baz
Doesn't match foo/bar/wargl

Example: foo/bar -exact-> !

Doesn't match foo, a/foo/bar or foo/baz
Ignores only foo/bar, not its children

The `-re->` arrow

The -re-> arrow is like the --> arrow but with regular expressions. SOURCE is a regular expression and TARGET an f-string based template. If a path matches SOURCE, the output path is created using TARGET as template. SOURCE is automatically anchored.

TARGET uses Python's format string syntax. The n-th capturing group can be referred to as {g<n>} (e.g. {g3}). {g0} refers to the original path. If capturing group n's contents are a valid integer, the integer value is available as {i<n>} (e.g. {i3}). If capturing group n's contents are a valid float, the float value is available as {f<n>} (e.g. {f3}). If a capturing group is not present (e.g. when matching the string cd with the regex (ab)?cd), the corresponding variables are not defined.

Python's format string syntax has rich options for formatting its arguments. For example, to left-pad the capturing group 3 with the digit 0 to width 5, you can use {i3:05}.

PFERD even allows you to write entire expressions inside the curly braces, for example {g2.lower()} or {g3.replace(' ', '_')}.

Example: f(oo+)/be?ar -re-> B{g1.upper()}H/fear

Doesn't match a/foo/bar, foo/abc/bar, afoo/bar or foo/bars
Converts foo/bar into BOOH/fear
Converts fooooo/bear into BOOOOOH/fear
Converts foo/bar/baz into BOOH/fear/baz
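The following sketch imitates the -re-> arrow in plain Python. It rests on several assumptions: SOURCE is tried with re.fullmatch against the path and then against its parent prefixes (longest first), TARGET is evaluated as an f-string via eval, and the names render and re_arrow are made up. PFERD's real implementation may differ in all of these points.

```python
import re
from pathlib import PurePosixPath
from typing import Optional


def render(template: str, match) -> str:
    # g0 is the text matched by SOURCE. Groups that did not take part stay undefined.
    names = {"g0": match.group(0)}
    for n, text in enumerate(match.groups(), start=1):
        if text is None:
            continue
        names[f"g{n}"] = text
        for prefix, convert in (("i", int), ("f", float)):
            try:
                names[f"{prefix}{n}"] = convert(text)
            except ValueError:
                pass  # not a valid number, leave the variable undefined
    # Evaluate the template as an f-string. Only do this with trusted input.
    return eval("f" + repr(template), {}, names)


def re_arrow(source: str, target: str, path: str) -> Optional[str]:
    parts = PurePosixPath(path).parts
    # Try the whole path first, then shorter and shorter parent prefixes.
    for i in range(len(parts), 0, -1):
        match = re.fullmatch(source, "/".join(parts[:i]))
        if match:
            return str(PurePosixPath(render(target, match), *parts[i:]))
    return None


SOURCE, TARGET = r"f(oo+)/be?ar", "B{g1.upper()}H/fear"
for candidate in ["afoo/bar", "foo/bars", "foo/bar", "fooooo/bear", "foo/bar/baz"]:
    print(candidate, "->", re_arrow(SOURCE, TARGET, candidate))
```

Since the template is evaluated as code, rules of this kind should only ever come from a config file you wrote yourself.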

The `-name-re->` arrow

The -name-re-> arrow is like a combination of the -name-> and -re-> arrows.

Example: (.*)\.jpeg -name-re-> {g1}.jpg

Doesn't match foo/bar.png, baz.JPEG or hello,jpeg
Converts foo/bar.jpeg into foo/bar.jpg
Converts foo.jpeg/bar/baz.jpeg into foo.jpg/bar/baz.jpg

Example: \..+ -name-re-> !

Doesn't match ., test, a.b
Ignores all files and directories whose names start with a dot (.) followed by at least one more character.

The `-exact-re->` arrow

The -exact-re-> arrow is like a combination of the -exact-> and -re-> arrows.

Example: f(oo+)/be?ar -exact-re-> B{g1.upper()}H/fear

Doesn't match a/foo/bar, foo/abc/bar, afoo/bar or foo/bars
Converts foo/bar into BOOH/fear
Converts fooooo/bear into BOOOOOH/fear
Doesn't match foo/bar/baz

Example: Tutorials

You have an ILIAS course with lots of tutorials, but are only interested in a single one.

tutorials/
  |- tut_01/
  |- tut_02/
  |- tut_03/
  ...

You can use a mix of normal and exact arrows to get rid of the other ones and move the tutorials/tut_02/ folder to my_tut/:

tutorials/tut_02 --> my_tut
tutorials -exact->
tutorials --> !

The second rule is required for many crawlers since they use the rules to decide which directories to crawl. If it were missing when the crawler looks at tutorials/, the third rule would match. This means the crawler would not crawl the tutorials/ directory and thus not discover that tutorials/tut_02/ exists.

Since the second rule is only relevant for crawling, the TARGET is left out.
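The interplay of the three rules can be replayed with a small simulation. The helpers normal and exact are simplified stand-ins for the --> and -exact-> arrows, written only for this example. A result of None means the path is ignored.

```python
from pathlib import PurePosixPath

IGNORE = object()  # stands for the "!" target


def normal(source, target):
    def rule(path):
        try:
            rest = path.relative_to(source)  # matches source and its children
        except ValueError:
            return None
        if target is IGNORE:
            return IGNORE
        return path if target is None else PurePosixPath(target) / rest
    return rule


def exact(source, target):
    def rule(path):
        if path != PurePosixPath(source):    # children do not match
            return None
        if target is IGNORE:
            return IGNORE
        return path if target is None else PurePosixPath(target)
    return rule


RULES = [
    normal("tutorials/tut_02", "my_tut"),  # tutorials/tut_02 --> my_tut
    exact("tutorials", None),              # tutorials -exact->
    normal("tutorials", IGNORE),           # tutorials --> !
]


def transform(path):
    p = PurePosixPath(path)
    for rule in RULES:
        result = rule(p)
        if result is not None:               # the first matching rule wins
            return None if result is IGNORE else str(result)
    return str(p)


for candidate in ["tutorials", "tutorials/tut_01", "tutorials/tut_02/sheet.pdf"]:
    print(candidate, "->", transform(candidate))
```

Removing the second entry from RULES makes transform("tutorials") return None, which is the situation where a crawler would never descend into the directory.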

Example: Lecture slides

You have a course with slides like Lecture 3: Linear functions.PDF and you would like to rename them to 03_linear_functions.pdf.

Lectures/
  |- Lecture 1: Introduction.PDF
  |- Lecture 2: Vectors and matrices.PDF
  |- Lecture 3: Linear functions.PDF
  ...

To do this, you can use the most powerful of arrows: The regex arrow.

"Lectures/Lecture (\\d+): (.*)\\.PDF" -re-> "Lectures/{i1:02}_{g2.lower().replace(' ', '_')}.pdf"

Note the escaped backslashes on the SOURCE side.
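To check such a rule before putting it into the config, the same transformation can be written out in plain Python. In this sketch, ast.literal_eval stands in for PFERD's string literal parsing (an assumption), and the target template is spelled as an ordinary f-string.

```python
import ast
import re

# The SOURCE exactly as written in the config, including the quotes.
source_literal = r'"Lectures/Lecture (\\d+): (.*)\\.PDF"'

# Decoding the literal turns each \\ into a single backslash.
pattern = ast.literal_eval(source_literal)
print(pattern)


def rename(path):
    match = re.fullmatch(pattern, path)
    if match is None:
        return None
    number = int(match.group(1))  # corresponds to {i1}
    title = match.group(2)        # corresponds to {g2}
    return f"Lectures/{number:02}_{title.lower().replace(' ', '_')}.pdf"


print(rename("Lectures/Lecture 3: Linear functions.PDF"))
```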

Example: Crawl a Python project

You are crawling a Python project and want to ignore all hidden files (files whose name starts with a .), all __pycache__ directories and all markdown files (for some weird reason).

.gitignore
.mypy_cache/
.venv/
CONFIG.md
PFERD/
  |- __init__.py
  |- __main__.py
  |- __pycache__/
  |- authenticator.py
  |- config.py
  ...
README.md
...

For this task, the name arrows can be used.

\..*        -name-re-> !
__pycache__ -name->    !
.*\.md      -name-re-> !
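The effect of these three rules on the file tree above can be checked with a few lines of Python. The function is_ignored is a hypothetical helper that assumes each name pattern has to match a whole path segment.

```python
import re
from pathlib import PurePosixPath

NAME_REGEXES = [r"\..*", r".*\.md"]  # the two -name-re-> rules
NAME_LITERALS = {"__pycache__"}      # the -name-> rule


def is_ignored(path):
    """True if any segment of the path is matched by one of the rules."""
    for part in PurePosixPath(path).parts:
        if part in NAME_LITERALS:
            return True
        if any(re.fullmatch(regex, part) for regex in NAME_REGEXES):
            return True
    return False


paths = [
    ".gitignore",
    ".venv/lib/site.py",
    "CONFIG.md",
    "PFERD/__init__.py",
    "PFERD/__pycache__/config.pyc",
    "PFERD/config.py",
]
kept = [path for path in paths if not is_ignored(path)]
print(kept)
```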

Example: Clean up names

You want to convert all paths into lowercase and replace spaces with underscores before applying any rules. This can be achieved using the >> arrow heads.

(.*) -re->> "{g1.lower().replace(' ', '_')}"

<other rules go here>
