Bump version to 3.1.0

Crawl all video stages in one crawl bar
This ensures folders are not renamed, as they are crawled twice
2023-12-21 10:23:01 +01:00 · 2021-06-13 17:23:18 +02:00 · 2021-06-13 17:18:45 +02:00 · 2021-06-13 16:33:29 +02:00 · 2021-06-13 15:44:47 +02:00 · 2021-06-13 15:06:50 +02:00
14 changed files with 512 additions and 324 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -22,6 +22,27 @@ ambiguous situations.

 ## Unreleased

+## 3.1.0 - 2021-06-13
+
+If your config file doesn't do weird things with transforms, it should continue
+to work. If your `-re->` arrows behave weirdly, try replacing them with
+`-exact-re->` arrows. If you're on Windows, you might need to switch from `\`
+path separators to `/` in your regex rules.
+
+### Added
+- `skip` option for crawlers
+- Rules with `>>` instead of `>` as arrow head
+- `-exact-re->` arrow (behaves like `-re->` did previously)
+
+### Changed
+- The `-re->` arrow can now rename directories (like `-->`)
+- Use `/` instead of `\` as path separator for (regex) rules on Windows
+- Use the label to the left for exercises instead of the button name to
+  determine the folder name
+
+### Fixed
+- Video pagination handling in ILIAS crawler
+
 ## 3.0.1 - 2021-06-01

 ### Added
--- a/CONFIG.md
+++ b/CONFIG.md
@@ -49,6 +49,9 @@ see the type's [documentation](#crawler-types) below. The following options are
 common to all crawlers:

 - `type`: The available types are specified in [this section](#crawler-types).
+- `skip`: Whether the crawler should be skipped during normal execution. The
+  crawler can still be executed manually using the `--crawler` or `-C` flags.
+  (Default: `no`)
 - `output_dir`: The directory the crawler synchronizes files to. A crawler will
  never place any files outside of this directory. (Default: the crawler's name)
 - `redownload`: When to download a file that is already present locally.
@@ -182,8 +185,11 @@ via the terminal.

 ### The `credential-file` authenticator

-This authenticator reads a username and a password from a credential file. The
-credential file has exactly two lines (trailing newline optional). The first
+This authenticator reads a username and a password from a credential file.
+
+- `path`: Path to the credential file. (Required)
+
+The credential file has exactly two lines (trailing newline optional). The first
 line starts with `username=` and contains the username, the second line starts
 with `password=` and contains the password. The username and password may
 contain any characters except a line break.
@@ -216,56 +222,87 @@ This authenticator does not support usernames.
 Transformation rules are rules for renaming and excluding files and directories.
 They are specified line-by-line in a crawler's `transform` option. When a
 crawler needs to apply a rule to a path, it goes through this list top-to-bottom
-and choose the first matching rule.
+and applies the first matching rule.

 To see this process in action, you can use the `--debug-transforms` flag or
 the `--explain` flag.

-Each line has the format `SOURCE ARROW TARGET` where `TARGET` is optional.
-`SOURCE` is either a normal path without spaces (e. g. `foo/bar`), or a string
-literal delimited by `"` or `'` (e. g. `"foo\" bar/baz"`). Python's string
-escape syntax is supported. Trailing slashes are ignored. `TARGET` can be
-formatted like `SOURCE`, but it can also be a single exclamation mark without
-quotes (`!`). `ARROW` is one of `-->`, `-name->`, `-exact->`, `-re->` and
-`-name-re->`
+Each rule has the format `SOURCE ARROW TARGET` (e. g. `foo/bar --> foo/baz`).
+The arrow specifies how the source and target are interpreted. The different
+kinds of arrows are documented below.

-If a rule's target is `!`, this means that when the rule matches on a path, the
-corresponding file or directory is ignored. If a rule's target is missing, the
-path is matched but not modified.
+`SOURCE` and `TARGET` are either a bunch of characters without spaces (e. g.
+`foo/bar`) or string literals (e. g. `"foo/b a r"`). The former syntax has no
+concept of escaping characters, so the backslash is just another character. The
+string literals however support Python's escape syntax (e. g.
+`"foo\\bar\tbaz"`). This also means that in string literals, backslashes must be
+escaped.
+
+`TARGET` can additionally be a single exclamation mark `!` (*not* `"!"`). When a
+rule with a `!` as target matches a path, the corresponding file or directory is
+ignored by the crawler instead of renamed.
+
+`TARGET` can also be omitted entirely. When a rule without target matches a
+path, the path is returned unmodified. This is useful to prevent rules further
+down from matching instead.
+
+Each arrow's behaviour can be modified slightly by changing the arrow's head
+from `>` to `>>`. When a rule with a `>>` arrow head matches a path, it doesn't
+return immediately like a normal arrow. Instead, it replaces the current path
+with its output and continues on to the next rule. In effect, this means that
+multiple rules can be applied sequentially.

 ### The `-->` arrow

-The `-->` arrow is a basic renaming operation. If a path begins with `SOURCE`,
-that part of the path is replaced with `TARGET`. This means that the rule
-`foo/bar --> baz` would convert `foo/bar` into `baz`, but also `foo/bar/xyz`
-into `baz/xyz`. The rule `foo --> !` would ignore a directory named `foo` as
-well as all its contents.
+The `-->` arrow is a basic renaming operation for files and directories. If a
+path matches `SOURCE`, it is renamed to `TARGET`.
+
+Example: `foo/bar --> baz`
+- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
+- Converts `foo/bar` into `baz`
+- Converts `foo/bar/wargl` into `baz/wargl`
+
+Example: `foo/bar --> !`
+- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
+- Ignores `foo/bar` and any of its children

 ### The `-name->` arrow

 The `-name->` arrow lets you rename files and directories by their name,
 regardless of where they appear in the file tree. Because of this, its `SOURCE`
 must not contain multiple path segments, only a single name. This restriction
-does not apply to its `TARGET`. The `-name->` arrow is not applied recursively
-to its own output to prevent infinite loops.
+does not apply to its `TARGET`.

-For example, the rule `foo -name-> bar/baz` would convert `a/foo` into
-`a/bar/baz` and `a/foo/b/c/foo` into `a/bar/baz/b/c/bar/baz`. The rule `foo
-name-> !` would ignore all directories and files named `foo`.
+Example: `foo -name-> bar/baz`
+- Doesn't match `a/foobar/b` or `x/Foo/y/z`
+- Converts `hello/foo` into `hello/bar/baz`
+- Converts `foo/world` into `bar/baz/world`
+- Converts `a/foo/b/c/foo` into `a/bar/baz/b/c/bar/baz`
+
+Example: `foo -name-> !`
+- Doesn't match `a/foobar/b` or `x/Foo/y/z`
+- Ignores any path containing a segment `foo`

 ### The `-exact->` arrow

-The `-exact->` arrow requires the path to match `SOURCE` exactly. This means
-that the rule `foo/bar -exact-> baz` would still convert `foo/bar` into `baz`,
-but `foo/bar/xyz` would be unaffected. Also, `foo -exact-> !` would only ignore
-`foo`, but not its contents (if it has any). The examples below show why this is
-useful.
+The `-exact->` arrow requires the path to match `SOURCE` exactly. The examples
+below show why this is useful.
+
+Example: `foo/bar -exact-> baz`
+- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
+- Converts `foo/bar` into `baz`
+- Doesn't match `foo/bar/wargl`
+
+Example: `foo/bar -exact-> !`
+- Doesn't match `foo`, `a/foo/bar` or `foo/baz`
+- Ignores only `foo/bar`, not its children

 ### The `-re->` arrow

-The `-re->` arrow uses regular expressions. `SOURCE` is a regular expression
-that must match the entire path. If this is the case, then the capturing groups
-are available in `TARGET` for formatting.
+The `-re->` arrow is like the `-->` arrow but with regular expressions. `SOURCE`
+is a regular expression and `TARGET` an f-string based template. If a path
+matches `SOURCE`, the output path is created using `TARGET` as template.
+`SOURCE` is automatically anchored.

 `TARGET` uses Python's [format string syntax][3]. The *n*-th capturing group can
 be referred to as `{g<n>}` (e. g. `{g3}`). `{g0}` refers to the original path.
@@ -282,18 +319,37 @@ can use `{i3:05}`.
 PFERD even allows you to write entire expressions inside the curly braces, for
 example `{g2.lower()}` or `{g3.replace(' ', '_')}`.

+Example: `f(oo+)/be?ar -re-> B{g1.upper()}H/fear`
+- Doesn't match `a/foo/bar`, `foo/abc/bar`, `afoo/bar` or `foo/bars`
+- Converts `foo/bar` into `BOOH/fear`
+- Converts `fooooo/bear` into `BOOOOOH/fear`
+- Converts `foo/bar/baz` into `BOOH/fear/baz`
+
 [3]: <https://docs.python.org/3/library/string.html#format-string-syntax> "Format String Syntax"

 ### The `-name-re->` arrow

 The `-name-re->` arrow is like a combination of the `-name->` and `-re->` arrows.
-Instead of the `SOURCE` being the name of a directory or file, it's a regex that
-is matched against the names of directories and files. `TARGET` works like the
-`-re->` arrow's target.

-For example, the arrow `(.*)\.jpeg -name-re-> {g1}.jpg` will rename all `.jpeg`
-extensions into `.jpg`. The arrow `\..+ -name-re-> !` will ignore all files and
-directories starting with `.`.
+Example: `(.*)\.jpeg -name-re-> {g1}.jpg`
+- Doesn't match `foo/bar.png`, `baz.JPEG` or `hello,jpeg`
+- Converts `foo/bar.jpeg` into `foo/bar.jpg`
+- Converts `foo.jpeg/bar/baz.jpeg` into `foo.jpg/bar/baz.jpg`
+
+Example: `\..+ -name-re-> !`
+- Doesn't match `.`, `test`, `a.b`
+- Ignores all files and directories starting with `.`.
+
+### The `-exact-re->` arrow
+
+The `-exact-re->` arrow is like a combination of the `-exact->` and `-re->`
+arrows.
+
+Example: `f(oo+)/be?ar -exact-re-> B{g1.upper()}H/fear`
+- Doesn't match `a/foo/bar`, `foo/abc/bar`, `afoo/bar` or `foo/bars`
+- Converts `foo/bar` into `BOOH/fear`
+- Converts `fooooo/bear` into `BOOOOOH/fear`
+- Doesn't match `foo/bar/baz`

 ### Example: Tutorials

@@ -320,8 +376,7 @@ tutorials --> !
 The second rule is required for many crawlers since they use the rules to decide
 which directories to crawl. If it was missing when the crawler looks at
 `tutorials/`, the third rule would match. This means the crawler would not crawl
-the `tutorials/` directory and thus not discover that `tutorials/tut02/`
-existed.
+the `tutorials/` directory and thus not discover that `tutorials/tut02/` exists.

 Since the second rule is only relevant for crawling, the `TARGET` is left out.

@@ -346,9 +401,9 @@ To do this, you can use the most powerful of arrows: The regex arrow.

 Note the escaped backslashes on the `SOURCE` side.

-### Example: Crawl a python project
+### Example: Crawl a Python project

-You are crawling a python project and want to ignore all hidden files (files
+You are crawling a Python project and want to ignore all hidden files (files
 whose name starts with a `.`), all `__pycache__` directories and all markdown
 files (for some weird reason).

@@ -368,11 +423,21 @@ README.md
 ...
 ```

-For this task, the name arrows can be used. They are variants of the normal
-arrows that only look at the file name instead of the entire path.
+For this task, the name arrows can be used.

 ```
 \..*        -name-re-> !
 __pycache__ -name->    !
 .*\.md      -name-re-> !
 ```
+
+### Example: Clean up names
+
+You want to convert all paths into lowercase and replace spaces with underscores
+before applying any rules. This can be achieved using the `>>` arrow heads.
+
+```
+(.*) -re->> "{g1.lower().replace(' ', '_')}"
+
+<other rules go here>
+```
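
Note on the `>` and `>>` arrow heads documented in CONFIG.md above: the following is a minimal Python sketch of the control flow only, not PFERD's actual implementation. The names `apply_rules`, `clean` and `rename` are hypothetical, the `!` target is not modelled, and returning the current path when no rule matches is an assumption made for the sake of the example.

```python
from typing import Callable, List, Optional, Tuple

# A rule is modelled as (transform, is_sequence_head).
# transform returns the new path, or None if the rule does not match.
Rule = Tuple[Callable[[str], Optional[str]], bool]


def apply_rules(rules: List[Rule], path: str) -> str:
    for transform, is_sequence_head in rules:
        result = transform(path)
        if result is None:
            continue  # rule did not match, try the next one
        if is_sequence_head:
            path = result  # `>>` head: replace the path and keep going
        else:
            return result  # `>` head: the first matching rule wins
    return path  # assumption: no (terminal) rule matched


def clean(path: str) -> Optional[str]:
    # Stand-in for: (.*) -re->> "{g1.lower().replace(' ', '_')}"
    return path.lower().replace(" ", "_")


def rename(path: str) -> Optional[str]:
    # Stand-in for: foo_bar -exact-> baz
    return "baz" if path == "foo_bar" else None


rules: List[Rule] = [(clean, True), (rename, False)]
print(apply_rules(rules, "Foo Bar"))
print(apply_rules(rules, "Other Dir"))
```

Because the clean-up rule uses a `>>` head, the rename rule sees the already normalised path; with a plain `>` head the clean-up rule would match every path and no later rule would ever run.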
--- a/PFERD/main.py
+++ b/PFERD/main.py
@@ -147,7 +147,6 @@ def main() -> None:
        log.unlock()
        log.explain_topic("Interrupted, exiting immediately")
        log.explain("Open files and connections are left for the OS to clean up")
-        log.explain("Temporary files are not cleaned up")
        pferd.print_report()
        # TODO Clean up tmp files
        # And when those files *do* actually get cleaned up properly,
--- a/PFERD/auth/authenticator.py
+++ b/PFERD/auth/authenticator.py
@@ -13,7 +13,11 @@ class AuthError(Exception):


 class AuthSection(Section):
-    pass
+    def type(self) -> str:
+        value = self.s.get("type")
+        if value is None:
+            self.missing_value("type")
+        return value


 class Authenticator(ABC):
--- a/PFERD/crawl/__init__.py
+++ b/PFERD/crawl/__init__.py
@@ -3,7 +3,7 @@ from typing import Callable, Dict

 from ..auth import Authenticator
 from ..config import Config
-from .crawler import Crawler, CrawlError  # noqa: F401
+from .crawler import Crawler, CrawlError, CrawlerSection  # noqa: F401
 from .ilias import KitIliasWebCrawler, KitIliasWebCrawlerSection
 from .local_crawler import LocalCrawler, LocalCrawlerSection

--- a/PFERD/crawl/crawler.py
+++ b/PFERD/crawl/crawler.py
@@ -132,6 +132,15 @@ class DownloadToken(ReusableAsyncContextManager[Tuple[ProgressBar, FileSink]]):


 class CrawlerSection(Section):
+    def type(self) -> str:
+        value = self.s.get("type")
+        if value is None:
+            self.missing_value("type")
+        return value
+
+    def skip(self) -> bool:
+        return self.s.getboolean("skip", fallback=False)
+
    def output_dir(self, name: str) -> Path:
        # TODO Use removeprefix() after switching to 3.9
        if name.startswith("crawl:"):
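
Note on the new `skip` option: `CrawlerSection.skip()` above relies on `getboolean` with `fallback=False`. The sketch below shows that behaviour with the standard library's `configparser` alone. The section names, the `type = local` values and the helper `crawlers_to_run` are made up for illustration and are not taken from PFERD.

```python
from configparser import ConfigParser
from typing import List

# Hypothetical config: two crawler sections, one of them marked as skipped.
CONFIG = """
[crawl:lectures]
type = local

[crawl:archive]
type = local
skip = yes
"""

parser = ConfigParser()
parser.read_string(CONFIG)


def crawlers_to_run(config: ConfigParser) -> List[str]:
    # Keep every crawl section whose `skip` option is absent or false.
    return [
        name
        for name in config.sections()
        if name.startswith("crawl:")
        and not config[name].getboolean("skip", fallback=False)
    ]


print(crawlers_to_run(parser))
```

`getboolean` accepts `yes`/`no`, `true`/`false`, `on`/`off` and `1`/`0` (case-insensitively) and raises `ValueError` for anything else, so a typo in the option value fails loudly instead of silently disabling the skip.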
--- a/PFERD/crawl/ilias/kit_ilias_html.py
+++ b/PFERD/crawl/ilias/kit_ilias_html.py
@@ -62,9 +62,11 @@ class IliasPage:
        log.explain("Page is a normal folder, searching for elements")
        return self._find_normal_entries()

-    def get_next_stage_url(self) -> Optional[str]:
+    def get_next_stage_element(self) -> Optional[IliasPageElement]:
        if self._is_ilias_opencast_embedding():
-            return self.get_child_elements()[0].url
+            return self.get_child_elements()[0]
+        if self._page_type == IliasElementType.VIDEO_FOLDER_MAYBE_PAGINATED:
+            return self._find_video_entries_paginated()[0]
        return None

    def _is_video_player(self) -> bool:
@@ -293,7 +295,13 @@ class IliasPage:

            # Add each listing as a new
            for listing in file_listings:
-                file_name = _sanitize_path_name(listing.getText().strip())
+                parent_container: Tag = listing.findParent(
+                    "div", attrs={"class": lambda x: x and "form-group" in x}
+                )
+                label_container: Tag = parent_container.find(
+                    attrs={"class": lambda x: x and "control-label" in x}
+                )
+                file_name = _sanitize_path_name(label_container.getText().strip())
                url = self._abs_url_from_link(listing)
                log.explain(f"Found exercise detail {file_name!r} at {url}")
                results.append(IliasPageElement(
@@ -474,7 +482,7 @@ class IliasPage:
            return None

        if "opencast" in str(img_tag["alt"]).lower():
-            return IliasElementType.VIDEO_FOLDER
+            return IliasElementType.VIDEO_FOLDER_MAYBE_PAGINATED

        if str(img_tag["src"]).endswith("icon_exc.svg"):
            return IliasElementType.EXERCISE
--- a/PFERD/crawl/ilias/kit_ilias_web_crawler.py
+++ b/PFERD/crawl/ilias/kit_ilias_web_crawler.py
@@ -248,13 +248,18 @@ instance's greatest bottleneck.
            elements.clear()
            async with cl:
                next_stage_url: Optional[str] = url
+                current_parent = parent

                while next_stage_url:
                    soup = await self._get_page(next_stage_url)
                    log.explain_topic(f"Parsing HTML page for {fmt_path(path)}")
                    log.explain(f"URL: {next_stage_url}")
-                    page = IliasPage(soup, url, parent)
-                    next_stage_url = page.get_next_stage_url()
+                    page = IliasPage(soup, next_stage_url, current_parent)
+                    if next_element := page.get_next_stage_element():
+                        current_parent = next_element
+                        next_stage_url = next_element.url
+                    else:
+                        next_stage_url = None

                elements.extend(page.get_child_elements())

--- a/PFERD/pferd.py
+++ b/PFERD/pferd.py
@@ -3,9 +3,9 @@ from typing import Dict, List, Optional

 from rich.markup import escape

-from .auth import AUTHENTICATORS, Authenticator, AuthError
+from .auth import AUTHENTICATORS, Authenticator, AuthError, AuthSection
 from .config import Config, ConfigOptionError
-from .crawl import CRAWLERS, Crawler, CrawlError, KitIliasWebCrawler
+from .crawl import CRAWLERS, Crawler, CrawlError, CrawlerSection, KitIliasWebCrawler
 from .logging import log
 from .utils import fmt_path

@@ -26,19 +26,22 @@ class Pferd:
        self._authenticators: Dict[str, Authenticator] = {}
        self._crawlers: Dict[str, Crawler] = {}

-    def _find_crawlers_to_run(self, config: Config, cli_crawlers: Optional[List[str]]) -> List[str]:
-        log.explain_topic("Deciding which crawlers to run")
-        crawl_sections = [name for name, _ in config.crawl_sections()]
+    def _find_config_crawlers(self, config: Config) -> List[str]:
+        crawl_sections = []
+
+        for name, section in config.crawl_sections():
+            if CrawlerSection(section).skip():
+                log.explain(f"Skipping {name!r}")
+            else:
+                crawl_sections.append(name)

-        if cli_crawlers is None:
-            log.explain("No crawlers specified on CLI")
-            log.explain("Running all crawlers specified in config")
        return crawl_sections

+    def _find_cli_crawlers(self, config: Config, cli_crawlers: List[str]) -> List[str]:
        if len(cli_crawlers) != len(set(cli_crawlers)):
            raise PferdLoadError("Some crawlers were selected multiple times")

-        log.explain("Crawlers specified on CLI")
+        crawl_sections = [name for name, _ in config.crawl_sections()]

        crawlers_to_run = []  # With crawl: prefix
        unknown_names = []  # Without crawl: prefix
@@ -62,10 +65,22 @@ class Pferd:

        return crawlers_to_run

+    def _find_crawlers_to_run(self, config: Config, cli_crawlers: Optional[List[str]]) -> List[str]:
+        log.explain_topic("Deciding which crawlers to run")
+
+        if cli_crawlers is None:
+            log.explain("No crawlers specified on CLI")
+            log.explain("Running crawlers specified in config")
+            return self._find_config_crawlers(config)
+        else:
+            log.explain("Crawlers specified on CLI")
+            return self._find_cli_crawlers(config, cli_crawlers)
+
    def _load_authenticators(self) -> None:
        for name, section in self._config.auth_sections():
            log.print(f"[bold bright_cyan]Loading[/] {escape(name)}")
-            auth_type = section.get("type")
+
+            auth_type = AuthSection(section).type()
            authenticator_constructor = AUTHENTICATORS.get(auth_type)
            if authenticator_constructor is None:
                raise ConfigOptionError(name, "type", f"Unknown authenticator type: {auth_type!r}")
@@ -80,7 +95,7 @@ class Pferd:
        for name, section in self._config.crawl_sections():
            log.print(f"[bold bright_cyan]Loading[/] {escape(name)}")

-            crawl_type = section.get("type")
+            crawl_type = CrawlerSection(section).type()
            crawler_constructor = CRAWLERS.get(crawl_type)
            if crawler_constructor is None:
                raise ConfigOptionError(name, "type", f"Unknown crawler type: {crawl_type!r}")
--- a/PFERD/transformer.py
+++ b/PFERD/transformer.py
@@ -1,151 +1,164 @@
-# I'm sorry that this code has become a bit dense and unreadable. While
-# reading, it is important to remember what True and False mean. I'd love to
-# have some proper sum-types for the inputs and outputs, they'd make this code
-# a lot easier to understand.
-
 import ast
 import re
 from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from enum import Enum
 from pathlib import PurePath
-from typing import Dict, Optional, Sequence, Union
+from typing import Callable, Dict, List, Optional, Sequence, TypeVar, Union

 from .logging import log
-from .utils import fmt_path
+from .utils import fmt_path, str_path


-class Rule(ABC):
-    @abstractmethod
-    def transform(self, path: PurePath) -> Union[PurePath, bool]:
-        """
-        Try to apply this rule to the path. Returns another path if the rule
-        was successfully applied, True if the rule matched but resulted in an
-        exclamation mark, and False if the rule didn't match at all.
-        """
+class ArrowHead(Enum):
+    NORMAL = 0
+    SEQUENCE = 1

+
+class Ignore:
    pass


-# These rules all use a Union[T, bool] for their right side. They are passed a
-# T if the arrow's right side was a normal string, True if it was an
-# exclamation mark and False if it was missing entirely.
+class Empty:
+    pass

-class NormalRule(Rule):
-    def __init__(self, left: PurePath, right: Union[PurePath, bool]):

-        self._left = left
-        self._right = right
+RightSide = Union[str, Ignore, Empty]

-    def _match_prefix(self, path: PurePath) -> Optional[PurePath]:
-        left_parts = list(reversed(self._left.parts))
-        path_parts = list(reversed(path.parts))

-        if len(left_parts) > len(path_parts):
+@dataclass
+class Transformed:
+    path: PurePath
+
+
+class Ignored:
+    pass
+
+
+TransformResult = Optional[Union[Transformed, Ignored]]
+
+
+@dataclass
+class Rule:
+    left: str
+    name: str
+    head: ArrowHead
+    right: RightSide
+
+    def right_result(self, path: PurePath) -> Union[str, Transformed, Ignored]:
+        if isinstance(self.right, str):
+            return self.right
+        elif isinstance(self.right, Ignore):
+            return Ignored()
+        elif isinstance(self.right, Empty):
+            return Transformed(path)
+        else:
+            raise RuntimeError(f"Right side has invalid type {type(self.right)}")
+
+
+class Transformation(ABC):
+    def __init__(self, rule: Rule):
+        self.rule = rule
+
+    @abstractmethod
+    def transform(self, path: PurePath) -> TransformResult:
+        pass
+
+
+class ExactTf(Transformation):
+    def transform(self, path: PurePath) -> TransformResult:
+        if path != PurePath(self.rule.left):
            return None

-        while left_parts and path_parts:
-            left_part = left_parts.pop()
-            path_part = path_parts.pop()
+        right = self.rule.right_result(path)
+        if not isinstance(right, str):
+            return right

-            if left_part != path_part:
+        return Transformed(PurePath(right))
+
+
+class ExactReTf(Transformation):
+    def transform(self, path: PurePath) -> TransformResult:
+        match = re.fullmatch(self.rule.left, str_path(path))
+        if not match:
            return None

-        if left_parts:
-            return None
+        right = self.rule.right_result(path)
+        if not isinstance(right, str):
+            return right

-        path_parts.reverse()
-        return PurePath(*path_parts)
-
-    def transform(self, path: PurePath) -> Union[PurePath, bool]:
-        if rest := self._match_prefix(path):
-            if isinstance(self._right, bool):
-                return self._right or path
-            else:
-                return self._right / rest
-
-        return False
-
-
-class ExactRule(Rule):
-    def __init__(self, left: PurePath, right: Union[PurePath, bool]):
-        self._left = left
-        self._right = right
-
-    def transform(self, path: PurePath) -> Union[PurePath, bool]:
-        if path == self._left:
-            if isinstance(self._right, bool):
-                return self._right or path
-            else:
-                return self._right
-
-        return False
-
-
-class NameRule(Rule):
-    def __init__(self, subrule: Rule):
-        self._subrule = subrule
-
-    def transform(self, path: PurePath) -> Union[PurePath, bool]:
-        matched = False
-        result = PurePath()
-
-        for part in path.parts:
-            part_result = self._subrule.transform(PurePath(part))
-            if isinstance(part_result, PurePath):
-                matched = True
-                result /= part_result
-            elif part_result:
-                # If any subrule call ignores its path segment, the entire path
-                # should be ignored
-                return True
-            else:
-                # The subrule doesn't modify this segment, but maybe other
-                # segments
-                result /= part
-
-        if matched:
-            return result
-        else:
-            # The subrule has modified no segments, so this name version of it
-            # doesn't match
-            return False
-
-
-class ReRule(Rule):
-    def __init__(self, left: str, right: Union[str, bool]):
-        self._left = left
-        self._right = right
-
-    def transform(self, path: PurePath) -> Union[PurePath, bool]:
-        if match := re.fullmatch(self._left, str(path)):
-            if isinstance(self._right, bool):
-                return self._right or path
-
-            vars: Dict[str, Union[str, int, float]] = {}
-
-            # For some reason, mypy thinks that "groups" has type List[str].
-            # But since elements of "match.groups()" can be None, mypy is
-            # wrong.
+        # For some reason, mypy thinks that "groups" has type List[str]. But
+        # since elements of "match.groups()" can be None, mypy is wrong.
        groups: Sequence[Optional[str]] = [match[0]] + list(match.groups())
+
+        locals_dir: Dict[str, Union[str, int, float]] = {}
        for i, group in enumerate(groups):
            if group is None:
                continue

-                vars[f"g{i}"] = group
+            locals_dir[f"g{i}"] = group

            try:
-                    vars[f"i{i}"] = int(group)
+                locals_dir[f"i{i}"] = int(group)
            except ValueError:
                pass

            try:
-                    vars[f"f{i}"] = float(group)
+                locals_dir[f"f{i}"] = float(group)
            except ValueError:
                pass

-            result = eval(f"f{self._right!r}", vars)
-            return PurePath(result)
+        result = eval(f"f{right!r}", {}, locals_dir)
+        return Transformed(PurePath(result))

-        return False
+
+class RenamingParentsTf(Transformation):
+    def __init__(self, sub_tf: Transformation):
+        super().__init__(sub_tf.rule)
+        self.sub_tf = sub_tf
+
+    def transform(self, path: PurePath) -> TransformResult:
+        for i in range(len(path.parts), -1, -1):
+            parent = PurePath(*path.parts[:i])
+            child = PurePath(*path.parts[i:])
+
+            transformed = self.sub_tf.transform(parent)
+            if not transformed:
+                continue
+            elif isinstance(transformed, Transformed):
+                return Transformed(transformed.path / child)
+            elif isinstance(transformed, Ignored):
+                return transformed
+            else:
+                raise RuntimeError(f"Invalid transform result of type {type(transformed)}: {transformed}")
+
+        return None
+
+
+class RenamingPartsTf(Transformation):
+    def __init__(self, sub_tf: Transformation):
+        super().__init__(sub_tf.rule)
+        self.sub_tf = sub_tf
+
+    def transform(self, path: PurePath) -> TransformResult:
+        result = PurePath()
+        any_part_matched = False
+        for part in path.parts:
+            transformed = self.sub_tf.transform(PurePath(part))
+            if not transformed:
+                result /= part
+            elif isinstance(transformed, Transformed):
+                result /= transformed.path
+                any_part_matched = True
+            elif isinstance(transformed, Ignored):
+                return transformed
+            else:
+                raise RuntimeError(f"Invalid transform result of type {type(transformed)}: {transformed}")
+
+        if any_part_matched:
+            return Transformed(result)
+        else:
+            return None


 class RuleParseError(Exception):
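
Note on `ExactReTf` in the hunk above: the target template is evaluated as an f-string whose local variables are the capture groups (`g<n>`) plus their `int` (`i<n>`) and `float` (`f<n>`) conversions where those succeed. The standalone sketch below mirrors that logic outside of PFERD. The function name `render_target` and the second example rule are hypothetical, and plain strings are used instead of `PurePath` to keep it short.

```python
import re
from typing import Dict, Optional, Union


def render_target(pattern: str, template: str, path: str) -> Optional[str]:
    match = re.fullmatch(pattern, path)
    if match is None:
        return None  # the rule does not match this path

    variables: Dict[str, Union[str, int, float]] = {}
    # Group 0 is the whole match, followed by the capturing groups.
    for i, group in enumerate([match[0], *match.groups()]):
        if group is None:
            continue  # optional group that did not participate
        variables[f"g{i}"] = group
        for prefix, convert in (("i", int), ("f", float)):
            try:
                variables[f"{prefix}{i}"] = convert(group)
            except ValueError:
                pass  # group is not numeric, so no i<n>/f<n> variable

    # Build the source of an f-string literal from the template and evaluate
    # it. This runs arbitrary expressions, so templates must be trusted.
    return eval(f"f{template!r}", {}, variables)


print(render_target(r"f(oo+)/be?ar", "B{g1.upper()}H/fear", "foo/bar"))
print(render_target(r"week(\d+)/(.*)", "w{i1:02}/{g2}", "week3/notes.pdf"))
```

Passing an empty globals dict and the group variables as locals, as the new code does, keeps the template from seeing names of the surrounding module; it is not a sandbox, since builtins remain reachable.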
@@ -162,18 +175,15 @@ class RuleParseError(Exception):
        log.error_contd(f"{spaces}^--- {self.reason}")


+T = TypeVar("T")
+
+
 class Line:
    def __init__(self, line: str, line_nr: int):
        self._line = line
        self._line_nr = line_nr
        self._index = 0

-    def get(self) -> Optional[str]:
-        if self._index < len(self._line):
-            return self._line[self._index]
-
-        return None
-
    @property
    def line(self) -> str:
        return self._line
@@ -190,155 +200,192 @@ class Line:
    def index(self, index: int) -> None:
        self._index = index

-    def advance(self) -> None:
-        self._index += 1
+    @property
+    def rest(self) -> str:
+        return self.line[self.index:]

-    def expect(self, string: str) -> None:
-        for char in string:
-            if self.get() == char:
-                self.advance()
+    def peek(self, amount: int = 1) -> str:
+        return self.rest[:amount]
+
+    def take(self, amount: int = 1) -> str:
+        string = self.peek(amount)
+        self.index += len(string)
+        return string
+
+    def expect(self, string: str) -> str:
+        if self.peek(len(string)) == string:
+            return self.take(len(string))
        else:
-                raise RuleParseError(self, f"Expected {char!r}")
+            raise RuleParseError(self, f"Expected {string!r}")
+
+    def expect_with(self, string: str, value: T) -> T:
+        self.expect(string)
+        return value
+
+    def one_of(self, parsers: List[Callable[[], T]], description: str) -> T:
+        for parser in parsers:
+            index = self.index
+            try:
+                return parser()
+            except RuleParseError:
+                self.index = index
+
+        raise RuleParseError(self, description)
+
+
+# RULE = LEFT SPACE '-' NAME '-' HEAD (SPACE RIGHT)?
+# SPACE = ' '+
+# NAME = '' | 'exact' | 'name' | 're' | 'exact-re' | 'name-re'
+# HEAD = '>' | '>>'
+# LEFT = STR | QUOTED_STR
+# RIGHT = STR | QUOTED_STR | '!'
+
+
+def parse_zero_or_more_spaces(line: Line) -> None:
+    while line.peek() == " ":
+        line.take()
+
+
+def parse_one_or_more_spaces(line: Line) -> None:
+    line.expect(" ")
+    parse_zero_or_more_spaces(line)
+
+
+def parse_str(line: Line) -> str:
+    result = []
+    while c := line.peek():
+        if c == " ":
+            break
+        else:
+            line.take()
+            result.append(c)
+
+    if result:
+        return "".join(result)
+    else:
+        raise RuleParseError(line, "Expected non-space character")


 QUOTATION_MARKS = {'"', "'"}


-def parse_string_literal(line: Line) -> str:
+def parse_quoted_str(line: Line) -> str:
    escaped = False

    # Points to first character of string literal
    start_index = line.index

-    quotation_mark = line.get()
+    quotation_mark = line.peek()
    if quotation_mark not in QUOTATION_MARKS:
-        # This should never happen as long as this function is only called from
-        # parse_string.
-        raise RuleParseError(line, "Invalid quotation mark")
-    line.advance()
+        raise RuleParseError(line, "Expected quotation mark")
+    line.take()

-    while c := line.get():
+    while c := line.peek():
        if escaped:
            escaped = False
-            line.advance()
+            line.take()
        elif c == quotation_mark:
-            line.advance()
+            line.take()
            stop_index = line.index
            literal = line.line[start_index:stop_index]
+            try:
                return ast.literal_eval(literal)
+            except SyntaxError as e:
+                line.index = start_index
+                raise RuleParseError(line, str(e)) from e
        elif c == "\\":
            escaped = True
-            line.advance()
+            line.take()
        else:
-            line.advance()
+            line.take()

    raise RuleParseError(line, "Expected end of string literal")


-def parse_until_space_or_eol(line: Line) -> str:
-    result = []
-    while c := line.get():
-        if c == " ":
-            break
-        result.append(c)
-        line.advance()
-
-    return "".join(result)
-
-
-def parse_string(line: Line) -> Union[str, bool]:
-    if line.get() in QUOTATION_MARKS:
-        return parse_string_literal(line)
+def parse_left(line: Line) -> str:
+    if line.peek() in QUOTATION_MARKS:
+        return parse_quoted_str(line)
    else:
-        string = parse_until_space_or_eol(line)
+        return parse_str(line)
+
+
+def parse_right(line: Line) -> Union[str, Ignore]:
+    c = line.peek()
+    if c in QUOTATION_MARKS:
+        return parse_quoted_str(line)
+    else:
+        string = parse_str(line)
        if string == "!":
-            return True
+            return Ignore()
        return string


-def parse_arrow(line: Line) -> str:
-    line.expect("-")
-
-    name = []
-    while True:
-        c = line.get()
-        if not c:
-            raise RuleParseError(line, "Expected rest of arrow")
-        elif c == "-":
-            line.advance()
-            c = line.get()
-            if not c:
-                raise RuleParseError(line, "Expected rest of arrow")
-            elif c == ">":
-                line.advance()
-                break  # End of arrow
-            else:
-                name.append("-")
-                continue
-        else:
-            name.append(c)
-
-        line.advance()
-
-    return "".join(name)
+def parse_arrow_name(line: Line) -> str:
+    return line.one_of([
+        lambda: line.expect("exact-re"),
+        lambda: line.expect("exact"),
+        lambda: line.expect("name-re"),
+        lambda: line.expect("name"),
+        lambda: line.expect("re"),
+        lambda: line.expect(""),
+    ], "Expected arrow name")


-def parse_whitespace(line: Line) -> None:
-    line.expect(" ")
-    while line.get() == " ":
-        line.advance()
+def parse_arrow_head(line: Line) -> ArrowHead:
+    return line.one_of([
+        lambda: line.expect_with(">>", ArrowHead.SEQUENCE),
+        lambda: line.expect_with(">", ArrowHead.NORMAL),
+    ], "Expected arrow head")


 def parse_eol(line: Line) -> None:
-    if line.get() is not None:
+    if line.peek():
        raise RuleParseError(line, "Expected end of line")


 def parse_rule(line: Line) -> Rule:
-    # Parse left side
-    leftindex = line.index
-    left = parse_string(line)
-    if isinstance(left, bool):
-        line.index = leftindex
-        raise RuleParseError(line, "Left side can't be '!'")
-    leftpath = PurePath(left)
+    parse_zero_or_more_spaces(line)
+    left = parse_left(line)

-    # Parse arrow
-    parse_whitespace(line)
-    arrowindex = line.index
-    arrowname = parse_arrow(line)
+    parse_one_or_more_spaces(line)

-    # Parse right side
-    if line.get():
-        parse_whitespace(line)
-        right = parse_string(line)
-    else:
-        right = False
-    rightpath: Union[PurePath, bool]
-    if isinstance(right, bool):
-        rightpath = right
-    else:
-        rightpath = PurePath(right)
+    line.expect("-")
+    name = parse_arrow_name(line)
+    line.expect("-")
+    head = parse_arrow_head(line)

+    index = line.index
+    right: RightSide
+    try:
+        parse_zero_or_more_spaces(line)
+        parse_eol(line)
+        right = Empty()
+    except RuleParseError:
+        line.index = index
+        parse_one_or_more_spaces(line)
+        right = parse_right(line)
        parse_eol(line)

-    # Dispatch
-    if arrowname == "":
-        return NormalRule(leftpath, rightpath)
-    elif arrowname == "name":
-        if len(leftpath.parts) > 1:
-            line.index = leftindex
-            raise RuleParseError(line, "SOURCE must be a single name, not multiple segments")
-        return NameRule(ExactRule(leftpath, rightpath))
-    elif arrowname == "exact":
-        return ExactRule(leftpath, rightpath)
-    elif arrowname == "re":
-        return ReRule(left, right)
-    elif arrowname == "name-re":
-        return NameRule(ReRule(left, right))
+    return Rule(left, name, head, right)
+
+
+def parse_transformation(line: Line) -> Transformation:
+    rule = parse_rule(line)
+
+    if rule.name == "":
+        return RenamingParentsTf(ExactTf(rule))
+    elif rule.name == "exact":
+        return ExactTf(rule)
+    elif rule.name == "name":
+        return RenamingPartsTf(ExactTf(rule))
+    elif rule.name == "re":
+        return RenamingParentsTf(ExactReTf(rule))
+    elif rule.name == "exact-re":
+        return ExactReTf(rule)
+    elif rule.name == "name-re":
+        return RenamingPartsTf(ExactReTf(rule))
    else:
-        line.index = arrowindex + 1  # For nicer error message
-        raise RuleParseError(line, f"Invalid arrow name {arrowname!r}")
+        raise RuntimeError(f"Invalid arrow name {rule.name!r}")


 class Transformer:
@@ -347,32 +394,40 @@ class Transformer:
        May throw a RuleParseError.
        """

-        self._rules = []
+        self._tfs = []
        for i, line in enumerate(rules.split("\n")):
            line = line.strip()
            if line:
-                rule = parse_rule(Line(line, i))
-                self._rules.append((line, rule))
+                tf = parse_transformation(Line(line, i))
+                self._tfs.append((line, tf))

    def transform(self, path: PurePath) -> Optional[PurePath]:
-        for i, (line, rule) in enumerate(self._rules):
+        for i, (line, tf) in enumerate(self._tfs):
            log.explain(f"Testing rule {i+1}: {line}")

            try:
-                result = rule.transform(path)
+                result = tf.transform(path)
            except Exception as e:
                log.warn(f"Error while testing rule {i+1}: {line}")
                log.warn_contd(str(e))
                continue

-            if isinstance(result, PurePath):
-                log.explain(f"Match found, transformed path to {fmt_path(result)}")
-                return result
-            elif result:  # Exclamation mark
-                log.explain("Match found, path ignored")
-                return None
-            else:
+            if not result:
                continue

-        log.explain("No rule matched, path is unchanged")
+            if isinstance(result, Ignored):
+                log.explain("Match found, path ignored")
+                return None
+
+            if tf.rule.head == ArrowHead.NORMAL:
+                log.explain(f"Match found, transformed path to {fmt_path(result.path)}")
+                path = result.path
+                break
+            elif tf.rule.head == ArrowHead.SEQUENCE:
+                log.explain(f"Match found, updated path to {fmt_path(result.path)}")
+                path = result.path
+            else:
+                raise RuntimeError(f"Invalid transform result of type {type(result)}: {result}")
+
+        log.explain(f"Final result: {fmt_path(path)}")
        return path
--- a/PFERD/utils.py
+++ b/PFERD/utils.py
@@ -91,8 +91,14 @@ def url_set_query_params(url: str, params: Dict[str, str]) -> str:
    return result


+def str_path(path: PurePath) -> str:
+    if not path.parts:
+        return "."
+    return "/".join(path.parts)
+
+
 def fmt_path(path: PurePath) -> str:
-    return repr(str(path))
+    return repr(str_path(path))


 def fmt_real_path(path: Path) -> str:
--- a/PFERD/version.py
+++ b/PFERD/version.py
@@ -1,2 +1,2 @@
 NAME = "PFERD"
-VERSION = "3.0.1"
+VERSION = "3.1.0"
--- a/README.md
+++ b/README.md
@@ -28,9 +28,9 @@ The use of [venv](https://docs.python.org/3/library/venv.html) is recommended.

 ## Basic usage

-PFERD can be run directly from the command line with no config file.
-Run `pferd -h` to get an overview of available commands and options.
-Run `pferd <command> -h` to see which options a command has.
+PFERD can be run directly from the command line with no config file. Run `pferd
+-h` to get an overview of available commands and options. Run `pferd <command>
+-h` to see which options a command has.

 For example, you can download your personal desktop from the KIT ILIAS like
 this:
@@ -116,17 +116,18 @@ transform =
  Online-Tests --> !
  Vorlesungswerbung --> !

+  # Rename folders
+  Lehrbücher --> Vorlesung
+  # Note the ">>" arrow head which lets us apply further rules to files moved to "Übung"
+  Übungsunterlagen -->> Übung
+
  # Move exercises to own folder. Rename them to "Blatt-XX.pdf" to make them sort properly
-  "Übungsunterlagen/(\d+). Übungsblatt.pdf" -re-> Blätter/Blatt-{i1:02}.pdf
+  "Übung/(\d+). Übungsblatt.pdf" -re-> Blätter/Blatt-{i1:02}.pdf
  # Move solutions to own folder. Rename them to "Blatt-XX-Lösung.pdf" to make them sort properly
-  "Übungsunterlagen/(\d+). Übungsblatt.*Musterlösung.pdf" -re-> Blätter/Blatt-{i1:02}-Lösung.pdf
+  "Übung/(\d+). Übungsblatt.*Musterlösung.pdf" -re-> Blätter/Blatt-{i1:02}-Lösung.pdf

  # The course has nested folders with the same name - flatten them
-  "Übungsunterlagen/(.+?)/\\1/(.*)" -re-> Übung/{g1}/{g2}
-
-  # Rename remaining folders
-  Übungsunterlagen --> Übung
-  Lehrbücher --> Vorlesung
+  "Übung/(.+?)/\\1" -re-> Übung/{g1}

 [crawl:Bar]
 type = kit-ilias-web
--- a/scripts/setup
+++ b/scripts/setup
@@ -12,6 +12,6 @@ pip install --upgrade setuptools
 # Installing PFERD itself
 pip install --editable .

-# Installing various tools
-pip install --upgrade mypy flake8 autopep8 isort
-pip install --upgrade pyinstaller
+# Installing tools and type hints
+pip install --upgrade mypy flake8 autopep8 isort pyinstaller
+pip install --upgrade types-chardet types-certifi
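
The grammar comment in the transformer diff above (`RULE = LEFT SPACE '-' NAME '-' HEAD (SPACE RIGHT)?`) can be approximated with a single regular expression. The following is only a rough sketch, not the parser from the diff: it assumes both sides are unquoted `STR` tokens (no `QUOTED_STR` support), returns `"!"` as a plain string instead of an `Ignore` object, and the names `RULE_RE`, `ParsedRule` and `split_rule` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Simplified grammar: LEFT and RIGHT must be unquoted and contain no spaces.
# Longer arrow names come first so that "exact-re" is not cut short to "exact".
RULE_RE = re.compile(
    r"""
    ^[ ]*
    (?P<left>[^ ]+)                              # LEFT (unquoted STR only)
    [ ]+                                         # SPACE
    -(?P<name>exact-re|exact|name-re|name|re|)-  # '-' NAME '-'
    (?P<head>>>|>)                               # HEAD
    (?:[ ]+(?P<right>[^ ]+))?                    # (SPACE RIGHT)?
    [ ]*$
    """,
    re.VERBOSE,
)


class ParsedRule(NamedTuple):
    left: str
    name: str
    head: str
    right: Optional[str]  # None when the right side is omitted


def split_rule(line: str) -> ParsedRule:
    """Split one unquoted rule line into its four grammar components."""
    match = RULE_RE.match(line)
    if match is None:
        raise ValueError(f"Not a valid (unquoted) rule: {line!r}")
    return ParsedRule(match["left"], match["name"], match["head"], match["right"])
```

For example, `split_rule("Übungsunterlagen -->> Übung")` yields an empty arrow name and the `>>` head, while a line such as `foo -bar-> x` is rejected because `bar` is not one of the six arrow names.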
Author	SHA1	Message	Date
Joscha	75fde870c2	Bump version to 3.1.0	2021-06-13 17:23:18 +02:00
I-Al-Istannen	6e4d423c81	Crawl all video stages in one crawl bar. This ensures folders are not renamed, as they are crawled twice.	2021-06-13 17:18:45 +02:00
Joscha	57aef26217	Fix name arrows. I seem to have (re-)implemented them incorrectly and never tested them.	2021-06-13 16:33:29 +02:00
I-Al-Istannen	70ec64a48b	Fix wrong base URL for multi-stage pages	2021-06-13 15:44:47 +02:00
Joscha	70b33ecfd9	Add migration notes to changelog. Also clean up some other formatting for consistency.	2021-06-13 15:06:50 +02:00
Joscha	601e4b936b	Use new arrow logic in README example config	2021-06-12 15:00:52 +02:00
Joscha	a292c4c437	Add example for ">>" arrow heads	2021-06-12 14:57:29 +02:00
Joscha	bc65ea7ab6	Fix mypy complaining about missing type hints	2021-06-09 22:45:52 +02:00
Joscha	f28bbe6b0c	Update transform rule documentation. It's still missing an example that uses rules with ">>" arrows.	2021-06-09 22:45:52 +02:00
Joscha	61d902d715	Overhaul transform logic. -re-> arrows now rename their parent directories (like -->) and don't require a full match (like -exact->). Their old behaviour is available as -exact-re->. Also, this change adds the ">>" arrow head, which modifies the current path and continues to the next rule when it matches.	2021-06-09 22:45:52 +02:00
I-Al-Istannen	8ab462fb87	Use the exercise label instead of the button name as path	2021-06-04 19:24:23 +02:00
Joscha	df3ad3d890	Add 'skip' option to crawlers	2021-06-04 18:47:13 +02:00
Joscha	fc31100a0f	Always use '/' as path separator for regex rules. Previously, regex-matching paths on windows would, in some cases, require four backslashes ('\\\\') to escape a single path separator. That's just too much. With this commit, regex transforms now use '/' instead of '\' as path separator, meaning rules can more easily be shared between platforms (although they are not guaranteed to be 100% compatible since on Windows, '\' is still recognized as a path separator). To make rules more intuitive to write, local relative paths are now also printed with '/' as path separator on Windows. Since Windows also accepts '/' as path separator, this change doesn't really affect other rules that parse their sides as paths.	2021-06-04 18:12:45 +02:00
Joscha	31b6311e99	Remove incorrect tmp file explain message	2021-06-01 19:03:06 +02:00
Joscha	1fc8e9eb7a	Document credential file authenticator config options	2021-06-01 10:01:14 +00:00
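
Commit 61d902d715 describes the `>>` arrow head as one that "modifies the current path and continues to the next rule when it matches", whereas a rule with a `>` head ends the search on its first match. The sketch below illustrates only that control flow and is not PFERD's implementation: it assumes every rule is a plain prefix rename (roughly what `-->` does for directories), uses `None` on the right side to stand in for `!`, and the names `SketchRule` and `apply_rules` are hypothetical.

```python
from pathlib import PurePath
from typing import List, Optional, Tuple

# (left, head, right): head is ">" or ">>", right=None stands in for "!".
SketchRule = Tuple[str, str, Optional[str]]


def apply_rules(rules: List[SketchRule], path: PurePath) -> Optional[PurePath]:
    """Apply prefix-rename rules in order, honouring the arrow head."""
    for left, head, right in rules:
        left_parts = PurePath(left).parts
        if path.parts[:len(left_parts)] != left_parts:
            continue  # Rule does not match, try the next one

        if right is None:
            return None  # "!" on the right side: the path is ignored

        # Replace the matched prefix and keep the remaining segments
        path = PurePath(right, *path.parts[len(left_parts):])

        if head == ">":
            break  # Normal head: the first match ends the search
        # ">>" head: continue with the updated path

    return path
```

With the rules `("Übungsunterlagen", ">>", "Übung")` followed by `("Übung/Blatt1.pdf", ">", "Blätter/Blatt-01.pdf")`, the path `Übungsunterlagen/Blatt1.pdf` is first renamed to `Übung/Blatt1.pdf` and then matched by the second rule. If the first rule used a `>` head instead, the second rule would never be tested.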