mirror of https://github.com/Garmelon/PFERD.git
synced 2023-12-21 10:23:01 +01:00
Simplify IPD crawler link regex
parent 2f0e04ce13
commit 616b0480f7
@@ -24,11 +24,12 @@ ambiguous situations.

 ### Changed
 - Add `cpp` extension to default `link_regex` of IPD crawler
-- Mention hrefs in IPD crawler for users of `link_regex` option
+- Mention hrefs in IPD crawler's `--explain` output for users of `link_regex` option
+- Simplify default IPD crawler `link_regex`

 ### Fixed
 - IPD crawler crashes on some sites
-- Meeting name normalization for yesterday, today and tomorrow fails
+- Meeting name normalization for yesterday, today and tomorrow
 - Crawling of meeting file previews

 ## 3.4.0 - 2022-05-01
@@ -146,7 +146,7 @@ requests is likely a good idea.
 - `target`: URL to a KIT-IPD page
 - `link_regex`: A regex that is matched against the `href` part of links. If it
   matches, the given link is downloaded as a file. This is used to extract
-  files from KIT-IPD pages. (Default: `^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$`)
+  files from KIT-IPD pages. (Default: `^.*?[^/]+\.(pdf|zip|c|cpp|java)$`)

 ### The `kit-ilias-web` crawler

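Review note: the two defaults are not strictly equivalent. The sketch below compares them on a few made-up `href` values. The sample hrefs, the helper `matching_hrefs`, and the use of `re.search` are assumptions for illustration; the crawler's actual matching code is not part of this diff.

```python
import re

# The two default patterns, copied from the documentation change.
OLD_DEFAULT = re.compile(r"^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$")
NEW_DEFAULT = re.compile(r"^.*?[^/]+\.(pdf|zip|c|cpp|java)$")

# Hypothetical href values, chosen to exercise the differences.
SAMPLE_HREFS = [
    "https://example.com/files/sheet01.pdf",  # absolute URL
    "sheet01.pdf",                            # relative href without any slash
    "files/.pdf",                             # empty file stem before the dot
    "src/main.c",                             # relative href with a directory
    "notes.txt",                              # extension not in the list
]


def matching_hrefs(pattern, hrefs):
    """Return the hrefs accepted by the pattern, in their original order."""
    # Assumption: the pattern is applied with re.search. Both defaults
    # carry their own ^ and $ anchors.
    return [href for href in hrefs if pattern.search(href)]


old_matches = matching_hrefs(OLD_DEFAULT, SAMPLE_HREFS)
new_matches = matching_hrefs(NEW_DEFAULT, SAMPLE_HREFS)

for href in SAMPLE_HREFS:
    print(f"{href!r}: old={href in old_matches}, new={href in new_matches}")
```

On these samples the old default requires at least one `/` in the href and tolerates an empty file stem, while the new default accepts a bare file name and requires at least one character before the extension. The extension group is also capturing now, so `NEW_DEFAULT.search("src/main.c").group(1)` yields the extension.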
@@ -27,7 +27,7 @@ class KitIpdCrawlerSection(HttpCrawlerSection):
         return target

     def link_regex(self) -> Pattern[str]:
-        regex = self.s.get("link_regex", r"^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$")
+        regex = self.s.get("link_regex", r"^.*?[^/]+\.(pdf|zip|c|cpp|java)$")
         return re.compile(regex)

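Review note: the fallback pattern in `link_regex` can be sketched in isolation. The example assumes that `self.s` behaves like a `configparser.SectionProxy`, whose `get` takes the fallback as its second argument. The section names, the `target` URLs, and the free function `link_regex` below are hypothetical stand-ins, not code from the repository.

```python
import configparser
import re
from typing import Pattern

# New default from this commit.
DEFAULT_LINK_REGEX = r"^.*?[^/]+\.(pdf|zip|c|cpp|java)$"

# Hypothetical config: one section relies on the default, one overrides it.
CONFIG_TEXT = """
[crawl:with-default]
target = https://example.com/lecture

[crawl:pdf-only]
target = https://example.com/lecture
link_regex = ^.*[.]pdf$
"""


def link_regex(section: configparser.SectionProxy) -> Pattern[str]:
    """Compile the section's link_regex, falling back to the default."""
    regex = section.get("link_regex", DEFAULT_LINK_REGEX)
    return re.compile(regex)


# Interpolation is disabled so regex text is read verbatim from the file.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)

default_pattern = link_regex(parser["crawl:with-default"])
pdf_only_pattern = link_regex(parser["crawl:pdf-only"])

print(default_pattern.pattern)
print(pdf_only_pattern.pattern)
```

A section that sets `link_regex` replaces the default entirely, so users who narrowed or widened the pattern themselves are unaffected by the change to the default.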