Simplify IPD crawler link regex

Joscha 2022-05-08 17:39:18 +02:00
parent 2f0e04ce13
commit 616b0480f7
3 changed files with 5 additions and 4 deletions

@@ -24,11 +24,12 @@ ambiguous situations.
 ### Changed
 - Add `cpp` extension to default `link_regex` of IPD crawler
-- Mention hrefs in IPD crawler for users of `link_regex` option
+- Mention hrefs in IPD crawler's `--explain` output for users of `link_regex` option
+- Simplify default IPD crawler `link_regex`
 ### Fixed
 - IPD crawler crashes on some sites
-- Meeting name normalization for yesterday, today and tomorrow fails
+- Meeting name normalization for yesterday, today and tomorrow
 - Crawling of meeting file previews
 ## 3.4.0 - 2022-05-01

@@ -146,7 +146,7 @@ requests is likely a good idea.
 - `target`: URL to a KIT-IPD page
 - `link_regex`: A regex that is matched against the `href` part of links. If it
   matches, the given link is downloaded as a file. This is used to extract
-  files from KIT-IPD pages. (Default: `^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$`)
+  files from KIT-IPD pages. (Default: `^.*?[^/]+\.(pdf|zip|c|cpp|java)$`)
 ### The `kit-ilias-web` crawler

@@ -27,7 +27,7 @@ class KitIpdCrawlerSection(HttpCrawlerSection):
         return target
     def link_regex(self) -> Pattern[str]:
-        regex = self.s.get("link_regex", r"^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$")
+        regex = self.s.get("link_regex", r"^.*?[^/]+\.(pdf|zip|c|cpp|java)$")
         return re.compile(regex)
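
The old and new default patterns do not accept exactly the same hrefs, which is easy to miss when reading the diff. The following is a minimal sketch, not part of the commit, that compares the two. The sample hrefs and the `compare` helper are hypothetical, and it assumes the compiled pattern is applied with `Pattern.match`.

```python
import re

# Default patterns copied verbatim from the old and new sides of the diff.
OLD_DEFAULT = re.compile(r"^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$")
NEW_DEFAULT = re.compile(r"^.*?[^/]+\.(pdf|zip|c|cpp|java)$")

# Hypothetical hrefs, chosen only to illustrate where the patterns differ.
SAMPLE_HREFS = [
    "https://example.org/lecture/slides.pdf",  # absolute URL with a file name
    "slides.pdf",                              # relative href without any slash
    "lecture/.pdf",                            # extension only, empty file name
    "https://example.org/lecture/index.html",  # extension not in the list
]


def compare(hrefs):
    """Return (href, old_matches, new_matches) for each href."""
    return [
        (href, bool(OLD_DEFAULT.match(href)), bool(NEW_DEFAULT.match(href)))
        for href in hrefs
    ]


if __name__ == "__main__":
    for href, old, new in compare(SAMPLE_HREFS):
        print(f"{href!r}: old={old} new={new}")
```

Under these assumptions, the new default no longer requires a `/` before the file name, so a bare relative href such as `slides.pdf` now matches. It also requires at least one non-slash character before the extension, so `lecture/.pdf` no longer matches. The extension group is now capturing, so `match.group(1)` returns the extension for the new pattern.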