diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4249287..e2d3840 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -24,11 +24,12 @@ ambiguous situations.
 
 ### Changed
 - Add `cpp` extension to default `link_regex` of IPD crawler
-- Mention hrefs in IPD crawler for users of `link_regex` option
+- Mention hrefs in IPD crawler's `--explain` output for users of `link_regex` option
+- Simplify default IPD crawler `link_regex`
 
 ### Fixed
 - IPD crawler crashes on some sites
-- Meeting name normalization for yesterday, today and tomorrow fails
+- Meeting name normalization for yesterday, today and tomorrow
 - Crawling of meeting file previews
 
 ## 3.4.0 - 2022-05-01
diff --git a/CONFIG.md b/CONFIG.md
index 1355c34..f572a80 100644
--- a/CONFIG.md
+++ b/CONFIG.md
@@ -146,7 +146,7 @@ requests is likely a good idea.
 - `target`: URL to a KIT-IPD page
 - `link_regex`: A regex that is matched against the `href` part of links. If it
   matches, the given link is downloaded as a file. This is used to extract
-  files from KIT-IPD pages. (Default: `^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$`)
+  files from KIT-IPD pages. (Default: `^.*?[^/]+\.(pdf|zip|c|cpp|java)$`)
 
 ### The `kit-ilias-web` crawler
 
diff --git a/PFERD/crawl/kit_ipd_crawler.py b/PFERD/crawl/kit_ipd_crawler.py
index 78fe0b1..d9fac32 100644
--- a/PFERD/crawl/kit_ipd_crawler.py
+++ b/PFERD/crawl/kit_ipd_crawler.py
@@ -27,7 +27,7 @@ class KitIpdCrawlerSection(HttpCrawlerSection):
         return target
 
     def link_regex(self) -> Pattern[str]:
-        regex = self.s.get("link_regex", r"^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$")
+        regex = self.s.get("link_regex", r"^.*?[^/]+\.(pdf|zip|c|cpp|java)$")
         return re.compile(regex)
 
 
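Reviewer note: the new default `link_regex` is not strictly equivalent to the old one, so a quick comparison may help. The sketch below copies both patterns from the diff and runs them against a few hypothetical hrefs. The sample hrefs and the `compare` helper are illustrative assumptions, not part of PFERD. The diff also does not show how the crawler applies the compiled pattern, so `re.search` is assumed here; because both patterns are anchored with `^` and `$`, `re.match` would give the same results for these inputs.

```python
import re
from typing import Tuple

# Default patterns copied verbatim from the diff.
OLD_DEFAULT = re.compile(r"^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$")
NEW_DEFAULT = re.compile(r"^.*?[^/]+\.(pdf|zip|c|cpp|java)$")

# Hypothetical hrefs, chosen only to show where the two patterns differ.
SAMPLE_HREFS = [
    "https://example.com/files/sheet01.pdf",  # absolute URL with a file name
    "sheet01.pdf",                            # relative href without any slash
    "/files/.pdf",                            # empty file name before the extension
    "https://example.com/index.html",         # extension not in the list
]


def compare(href: str) -> Tuple[bool, bool]:
    """Return (old_matches, new_matches) for a single href."""
    return (
        OLD_DEFAULT.search(href) is not None,
        NEW_DEFAULT.search(href) is not None,
    )


if __name__ == "__main__":
    for href in SAMPLE_HREFS:
        old, new = compare(href)
        print(f"{href!r}: old={old} new={new}")
```

Under these assumptions, the old pattern requires at least one `/` in the href and tolerates an empty file name, while the new one accepts slash-free relative hrefs and requires at least one non-slash character before the extension. The new pattern also uses a capturing group, so `group(1)` on a match yields the extension.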