mirror of https://github.com/Garmelon/PFERD.git (synced 2023-12-21 10:23:01 +01:00)
Simplify IPD crawler link regex
parent 2f0e04ce13
commit 616b0480f7
@@ -24,11 +24,12 @@ ambiguous situations.
 
 ### Changed
 - Add `cpp` extension to default `link_regex` of IPD crawler
-- Mention hrefs in IPD crawler for users of `link_regex` option
+- Mention hrefs in IPD crawler's `--explain` output for users of `link_regex` option
+- Simplify default IPD crawler `link_regex`
 
 ### Fixed
 - IPD crawler crashes on some sites
-- Meeting name normalization for yesterday, today and tomorrow fails
+- Meeting name normalization for yesterday, today and tomorrow
 - Crawling of meeting file previews
 
 ## 3.4.0 - 2022-05-01
@@ -146,7 +146,7 @@ requests is likely a good idea.
 - `target`: URL to a KIT-IPD page
 - `link_regex`: A regex that is matched against the `href` part of links. If it
   matches, the given link is downloaded as a file. This is used to extract
-  files from KIT-IPD pages. (Default: `^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$`)
+  files from KIT-IPD pages. (Default: `^.*?[^/]+\.(pdf|zip|c|cpp|java)$`)
 
 ### The `kit-ilias-web` crawler
 
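To illustrate the documented behaviour of `link_regex`, here is a minimal sketch that filters a list of `href` values with the new default pattern. The helper `select_downloads` and the sample hrefs are hypothetical and not part of PFERD; the use of `search` is an assumption, since the documentation only says the regex is "matched against" the `href`.

```python
import re

# New default pattern, copied from the documentation in this commit.
DEFAULT_LINK_REGEX = r"^.*?[^/]+\.(pdf|zip|c|cpp|java)$"


def select_downloads(hrefs, link_regex=DEFAULT_LINK_REGEX):
    """Return the hrefs that the pattern accepts (hypothetical helper)."""
    pattern = re.compile(link_regex)
    # Assumption: an href qualifies if the pattern is found in it. Because the
    # default is anchored with ^ and $, this amounts to matching the whole href.
    return [href for href in hrefs if pattern.search(href)]


# Made-up hrefs as they might appear on a page.
hrefs = [
    "slides/lecture01.pdf",
    "index.html",
    "src/Main.java",
    "exercise.zip",
    "mailto:tutor@example.org",
]

print(select_downloads(hrefs))
# ['slides/lecture01.pdf', 'src/Main.java', 'exercise.zip']
```

A custom `link_regex` could be passed as the second argument in the same way, for example `r"^.*\.pdf$"` to keep only PDF links.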
@@ -27,7 +27,7 @@ class KitIpdCrawlerSection(HttpCrawlerSection):
         return target
 
     def link_regex(self) -> Pattern[str]:
-        regex = self.s.get("link_regex", r"^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$")
+        regex = self.s.get("link_regex", r"^.*?[^/]+\.(pdf|zip|c|cpp|java)$")
         return re.compile(regex)
 
 
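The change above swaps only the fallback value of the `link_regex` option. The following sketch shows the same get-with-fallback pattern on a plain `configparser` section and compares the two defaults on a couple of inputs. It assumes that `self.s` behaves like a `configparser.SectionProxy`; the section names and sample hrefs are made up for illustration.

```python
import configparser
import re
from typing import Pattern

OLD_DEFAULT = r"^.*/[^/]*\.(?:pdf|zip|c|cpp|java)$"
NEW_DEFAULT = r"^.*?[^/]+\.(pdf|zip|c|cpp|java)$"


def link_regex(section: configparser.SectionProxy,
               default: str = NEW_DEFAULT) -> Pattern[str]:
    # Use the configured value if present, otherwise fall back to the default.
    return re.compile(section.get("link_regex", default))


# Hypothetical config: one section relies on the default, one overrides it.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(r"""
[crawl:defaults]

[crawl:custom]
link_regex = ^.*\.pdf$
""")

print(link_regex(parser["crawl:defaults"]).pattern == NEW_DEFAULT)  # True
print(link_regex(parser["crawl:custom"]).pattern)                   # ^.*\.pdf$

# The two defaults differ on hrefs without a slash and on empty file stems.
for href in ["slides/lecture01.pdf", "lecture01.pdf", "files/.pdf"]:
    old = re.search(OLD_DEFAULT, href) is not None
    new = re.search(NEW_DEFAULT, href) is not None
    print(f"{href!r}: old={old}, new={new}")
# 'slides/lecture01.pdf': old=True, new=True
# 'lecture01.pdf': old=False, new=True
# 'files/.pdf': old=True, new=False
```

In short, the old pattern requires a `/` somewhere before the file name, while the new one only requires at least one non-slash character directly before the extension dot.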