Update transform rule documentation

It's still missing an example that uses rules with ">>" arrows.
This commit is contained in:
Joscha 2021-06-09 22:22:40 +02:00
parent 61d902d715
commit f28bbe6b0c

128
CONFIG.md
View File

@ -222,56 +222,87 @@ This authenticator does not support usernames.
Transformation rules are rules for renaming and excluding files and directories. Transformation rules are rules for renaming and excluding files and directories.
They are specified line-by-line in a crawler's `transform` option. When a They are specified line-by-line in a crawler's `transform` option. When a
crawler needs to apply a rule to a path, it goes through this list top-to-bottom crawler needs to apply a rule to a path, it goes through this list top-to-bottom
and choose the first matching rule. and applies the first matching rule.
To see this process in action, you can use the `--debug-transforms` or flag or To see this process in action, you can use the `--debug-transforms` or flag or
the `--explain` flag. the `--explain` flag.
Each line has the format `SOURCE ARROW TARGET` where `TARGET` is optional. Each rule has the format `SOURCE ARROW TARGET` (e. g. `foo/bar --> foo/baz`).
`SOURCE` is either a normal path without spaces (e. g. `foo/bar`), or a string The arrow specifies how the source and target are interpreted. The different
literal delimited by `"` or `'` (e. g. `"foo\" bar/baz"`). Python's string kinds of arrows are documented below.
escape syntax is supported. Trailing slashes are ignored. `TARGET` can be
formatted like `SOURCE`, but it can also be a single exclamation mark without
quotes (`!`). `ARROW` is one of `-->`, `-name->`, `-exact->`, `-re->` and
`-name-re->`
If a rule's target is `!`, this means that when the rule matches on a path, the `SOURCE` and `TARGET` are either a bunch of characters without spaces (e. g.
corresponding file or directory is ignored. If a rule's target is missing, the `foo/bar`) or string literals (e. g, `"foo/b a r"`). The former syntax has no
path is matched but not modified. concept of escaping characters, so the backslash is just another character. The
string literals however support Python's escape syntax (e. g.
`"foo\\bar\tbaz"`). This also means that in string literals, backslashes must be
escaped.
`TARGET` can additionally be a single exclamation mark `!` (*not* `"!"`). When a
rule with a `!` as target matches a path, the corresponding file or directory is
ignored by the crawler instead of renamed.
`TARGET` can also be omitted entirely. When a rule without target matches a
path, the path is returned unmodified. This is useful to prevent rules further
down from matching instead.
Each arrow's behaviour can be modified slightly by changing the arrow's head
from `>` to `>>`. When a rule with a `>>` arrow head matches a path, it doesn't
return immediately like a normal arrow. Instead, it replaces the current path
with its output and continues on to the next rule. In effect, this means that
multiple rules can be applied sequentially.
### The `-->` arrow ### The `-->` arrow
The `-->` arrow is a basic renaming operation. If a path begins with `SOURCE`, The `-->` arrow is a basic renaming operation for files and directories. If a
that part of the path is replaced with `TARGET`. This means that the rule path matches `SOURCE`, it is renamed to `TARGET`.
`foo/bar --> baz` would convert `foo/bar` into `baz`, but also `foo/bar/xyz`
into `baz/xyz`. The rule `foo --> !` would ignore a directory named `foo` as Example: `foo/bar --> baz`
well as all its contents. - Doesn't match `foo`, `a/foo/bar` or `foo/baz`
- Converts `foo/bar` into `baz`
- Converts `foo/bar/wargl` into `bar/wargl`
Example: `foo/bar --> !`
- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
- Ignores `foo/bar` and any of its children
### The `-name->` arrow ### The `-name->` arrow
The `-name->` arrow lets you rename files and directories by their name, The `-name->` arrow lets you rename files and directories by their name,
regardless of where they appear in the file tree. Because of this, its `SOURCE` regardless of where they appear in the file tree. Because of this, its `SOURCE`
must not contain multiple path segments, only a single name. This restriction must not contain multiple path segments, only a single name. This restriction
does not apply to its `TARGET`. The `-name->` arrow is not applied recursively does not apply to its `TARGET`.
to its own output to prevent infinite loops.
For example, the rule `foo -name-> bar/baz` would convert `a/foo` into Example: `foo -name-> bar/baz`
`a/bar/baz` and `a/foo/b/c/foo` into `a/bar/baz/b/c/bar/baz`. The rule `foo - Doesn't match `a/foobar/b` or `x/Foo/y/z`
-name-> !` would ignore all directories and files named `foo`. - Converts `hello/foo` into `hello/bar/baz`
- Converts `foo/world` into `bar/baz/world`
- Converts `a/foo/b/c/foo` into `a/bar/baz/b/c/bar/baz`
Example: `foo -name-> !`
- Doesn't match `a/foobar/b` or `x/Foo/y/z`
- Ignores any path containing a segment `foo`
### The `-exact->` arrow ### The `-exact->` arrow
The `-exact->` arrow requires the path to match `SOURCE` exactly. This means The `-exact->` arrow requires the path to match `SOURCE` exactly. The examples
that the rule `foo/bar -exact-> baz` would still convert `foo/bar` into `baz`, below show why this is useful.
but `foo/bar/xyz` would be unaffected. Also, `foo -exact-> !` would only ignore
`foo`, but not its contents (if it has any). The examples below show why this is Example: `foo/bar -exact-> baz`
useful. - Doesn't match `foo`, `a/foo/bar` or `foo/baz`
- Converts `foo/bar` into `baz`
- Doesn't match `foo/bar/wargl`
Example: `foo/bar -exact-> !`
- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
- Ignores only `foo/bar`, not its children
### The `-re->` arrow ### The `-re->` arrow
The `-re->` arrow uses regular expressions. `SOURCE` is a regular expression The `-re->` arrow is like the `-->` arrow but with regular expressions. `SOURCE`
that must match the entire path. If this is the case, then the capturing groups is a regular expression and `TARGET` an f-string based template. If a path
are available in `TARGET` for formatting. matches `SOURCE`, the output path is created using `TARGET` as template.
`SOURCE` is automatically anchored.
`TARGET` uses Python's [format string syntax][3]. The *n*-th capturing group can `TARGET` uses Python's [format string syntax][3]. The *n*-th capturing group can
be referred to as `{g<n>}` (e. g. `{g3}`). `{g0}` refers to the original path. be referred to as `{g<n>}` (e. g. `{g3}`). `{g0}` refers to the original path.
@ -288,18 +319,36 @@ can use `{i3:05}`.
PFERD even allows you to write entire expressions inside the curly braces, for PFERD even allows you to write entire expressions inside the curly braces, for
example `{g2.lower()}` or `{g3.replace(' ', '_')}`. example `{g2.lower()}` or `{g3.replace(' ', '_')}`.
Example: `f(oo+)/be?ar -re-> B{g1.upper()}H/fear`
- Doesn't match `a/foo/bar`, `foo/abc/bar`, `afoo/bar` or `foo/bars`
- Converts `foo/bar` into `BOOH/fear`
- Converts `fooooo/bear` into `BOOOOOH/fear`
- Converts `foo/bar/baz` into `BOOH/fear/baz`
[3]: <https://docs.python.org/3/library/string.html#format-string-syntax> "Format String Syntax" [3]: <https://docs.python.org/3/library/string.html#format-string-syntax> "Format String Syntax"
### The `-name-re->` arrow ### The `-name-re->` arrow
The `-name-re>` arrow is like a combination of the `-name->` and `-re->` arrows. The `-name-re>` arrow is like a combination of the `-name->` and `-re->` arrows.
Instead of the `SOURCE` being the name of a directory or file, it's a regex that
is matched against the names of directories and files. `TARGET` works like the
`-re->` arrow's target.
For example, the arrow `(.*)\.jpeg -name-re-> {g1}.jpg` will rename all `.jpeg` Example: `(.*)\.jpeg -name-re-> {g1}.jpg`
extensions into `.jpg`. The arrow `\..+ -name-re-> !` will ignore all files and - Doesn't match `foo/bar.png`, `baz.JPEG` or `hello,jpeg`
directories starting with `.`. - Converts `foo/bar.jpeg` into `foo/bar.jpg`
- Converts `foo.jpeg/bar/baz.jpeg` into `foo.jpg/bar/baz.jpg`
Example: `\..+ -name-re-> !`
- Doesn't match `.`, `test`, `a.b`
- Ignores all files and directories starting with `.`.
### The `-exact-re->` arrow
The `-exact-re>` arrow is like a combination of the `-exact->` and `-re->` arrows.
Example: `f(oo+)/be?ar -exactre-> B{g1.upper()}H/fear`
- Doesn't match `a/foo/bar`, `foo/abc/bar`, `afoo/bar` or `foo/bars`
- Converts `foo/bar` into `BOOH/fear`
- Converts `fooooo/bear` into `BOOOOOH/fear`
- Doesn't match `foo/bar/baz`
### Example: Tutorials ### Example: Tutorials
@ -327,7 +376,7 @@ The second rule is required for many crawlers since they use the rules to decide
which directories to crawl. If it was missing when the crawler looks at which directories to crawl. If it was missing when the crawler looks at
`tutorials/`, the third rule would match. This means the crawler would not crawl `tutorials/`, the third rule would match. This means the crawler would not crawl
the `tutorials/` directory and thus not discover that `tutorials/tut02/` the `tutorials/` directory and thus not discover that `tutorials/tut02/`
existed. exists.
Since the second rule is only relevant for crawling, the `TARGET` is left out. Since the second rule is only relevant for crawling, the `TARGET` is left out.
@ -352,9 +401,9 @@ To do this, you can use the most powerful of arrows: The regex arrow.
Note the escaped backslashes on the `SOURCE` side. Note the escaped backslashes on the `SOURCE` side.
### Example: Crawl a python project ### Example: Crawl a Python project
You are crawling a python project and want to ignore all hidden files (files You are crawling a Python project and want to ignore all hidden files (files
whose name starts with a `.`), all `__pycache__` directories and all markdown whose name starts with a `.`), all `__pycache__` directories and all markdown
files (for some weird reason). files (for some weird reason).
@ -374,8 +423,7 @@ README.md
... ...
``` ```
For this task, the name arrows can be used. They are variants of the normal For this task, the name arrows can be used.
arrows that only look at the file name instead of the entire path.
``` ```
\..* -name-re-> ! \..* -name-re-> !