Commit Graph

18 Commits

Author SHA1 Message Date
I-Al-Istannen
79efdb56f7 Adjust ILIAS html explain messages 2021-05-23 18:24:25 +02:00
I-Al-Istannen
59f13bb8d6 Explain ILIAS HTML parsing and add some warnings 2021-05-23 18:14:54 +02:00
I-Al-Istannen
05ad06fbc1 Only enclose get_page in iorepeat in ILIAS crawler
We previously also gathered in there, which could lead to some more
surprises when the method was retried.
2021-05-23 18:14:51 +02:00
Joscha
29d5a40c57 Replace asyncio.gather with custom Crawler function 2021-05-23 17:25:16 +02:00
I-Al-Istannen
e1d18708b3 Rename "no_videos" to videos 2021-05-23 13:30:42 +02:00
I-Al-Istannen
ecdedfa1cf Add no-videos flag to ILIAS crawler 2021-05-23 12:37:01 +02:00
I-Al-Istannen
3d4b997d4a Retry crawl_url and work around Python's closure handling
Closures capture the scope and not the variables. Therefore, any
type-narrowing performed by mypy on captured variables is lost inside
the closure.
2021-05-23 12:28:15 +02:00
I-Al-Istannen
33a81a5f5c Document authentication in HTTP crawler and rename prepare_request 2021-05-23 11:55:34 +02:00
Joscha
803e5628a2 Clean up logging
Paths are now (hopefully) logged consistently across all crawlers
2021-05-23 11:37:19 +02:00
I-Al-Istannen
53e031d9f6 Reuse dl/cl for I/O retries in ILIAS crawler 2021-05-23 00:28:27 +02:00
I-Al-Istannen
3053278721 Move HTTP crawler to own file 2021-05-22 23:23:21 +02:00
I-Al-Istannen
4d07de0d71 Adjust forum log message in ilias crawler 2021-05-22 23:20:21 +02:00
I-Al-Istannen
953a1bba93 Adjust to new crawl / download names 2021-05-22 23:18:05 +02:00
I-Al-Istannen
4b104b6252 Try out some HTTP authentication handling
This is by no means final yet and will change a bit once the dl and cl
are changed, but it might serve as a first try. It is also wholly
untested.
2021-05-21 12:02:51 +02:00
I-Al-Istannen
83d12fcf2d Add some explains to ilias crawler and use crawler exceptions 2021-05-20 14:58:54 +02:00
I-Al-Istannen
e4f9560655 Only retry on aiohttp errors in ILIAS crawler
This patch removes quite a few retries and now only retries the ilias
element method. Every other HTTP-interacting method (except for the root
requests) is called from there and should be covered.

In the future we also want to retry the root a few times, but that
will be done after the download sink API is adjusted.
2021-05-19 22:01:09 +02:00
I-Al-Istannen
8cfa818f04 Only call should_crawl once 2021-05-19 21:57:55 +02:00
I-Al-Istannen
81301f3a76 Rename the ilias crawler to ilias web crawler 2021-05-19 21:41:17 +02:00