Simplify IPD crawler link regex

2025-07-15 15:32:36 +02:00 · 2022-05-08 17:39:18 +02:00
parent 2f0e04ce13
commit 616b0480f7
3 changed files with 5 additions and 4 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -24,11 +24,12 @@ ambiguous situations.

 ### Changed
 - Add `cpp` extension to default `link_regex` of IPD crawler
- Mention hrefs in IPD crawler for users of `link_regex` option
+- Mention hrefs in IPD crawler's `--explain` output for users of `link_regex` option
+- Simplify default IPD crawler `link_regex`

 ### Fixed
 - IPD crawler crashes on some sites
- Meeting name normalization for yesterday, today and tomorrow fails
+- Meeting name normalization for yesterday, today and tomorrow
 - Crawling of meeting file previews

 ## 3.4.0 - 2022-05-01
--- a/CONFIG.md
+++ b/CONFIG.md
@@ -146,7 +146,7 @@ requests is likely a good idea.
 - `target`: URL to a KIT-IPD page
 - `link_regex`: A regex that is matched against the `href` part of links. If it
  matches, the given link is downloaded as a file. This is used to extract
-  files from KIT-IPD pages. (Default: `^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$`)
+  files from KIT-IPD pages. (Default: `^.*?[^/]+\.(pdf|zip|c|cpp|java)$`)

 ### The `kit-ilias-web` crawler

--- a/PFERD/crawl/kit_ipd_crawler.py
+++ b/PFERD/crawl/kit_ipd_crawler.py
@@ -27,7 +27,7 @@ class KitIpdCrawlerSection(HttpCrawlerSection):
        return target

    def link_regex(self) -> Pattern[str]:
-        regex = self.s.get("link_regex", r"^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$")
+        regex = self.s.get("link_regex", r"^.*?[^/]+\.(pdf|zip|c|cpp|java)$")
        return re.compile(regex)
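
The old and new defaults are not strictly equivalent, which a quick check can show. The sketch below copies both patterns from the diff and runs them against a few hrefs. The sample hrefs and the `compare` helper are hypothetical and exist only for illustration. Using `search` here is also an assumption, since the diff does not show how the crawler applies the compiled pattern.

```python
from __future__ import annotations

import re

# Default patterns copied verbatim from the diff above.
OLD_DEFAULT = re.compile(r"^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$")
NEW_DEFAULT = re.compile(r"^.*?[^/]+\.(pdf|zip|c|cpp|java)$")

# Hypothetical hrefs, chosen only to illustrate where the patterns differ.
SAMPLE_HREFS = [
    "https://example.com/files/slides.pdf",  # absolute URL with a file name
    "slides.pdf",                            # relative href without any slash
    "/files/.pdf",                           # empty file name before the dot
    "notes.txt",                             # extension not in the list
]


def compare(href: str) -> tuple[bool, bool]:
    """Return (old_matches, new_matches) for a single href."""
    return (
        OLD_DEFAULT.search(href) is not None,
        NEW_DEFAULT.search(href) is not None,
    )


if __name__ == "__main__":
    for href in SAMPLE_HREFS:
        old, new = compare(href)
        print(f"{href!r}: old={old} new={new}")
```

Under these assumptions, the new default also accepts relative hrefs that contain no `/`, and it rejects hrefs whose file name is empty before the extension. Its extension group is now capturing, so `group(1)` yields the matched extension.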