I-Al-Istannen
79efdb56f7
Adjust ILIAS HTML explain messages
2021-05-23 18:24:25 +02:00
I-Al-Istannen
59f13bb8d6
Explain ILIAS HTML parsing and add some warnings
2021-05-23 18:14:54 +02:00
I-Al-Istannen
05ad06fbc1
Only enclose get_page in iorepeat in ILIAS crawler
...
We previously also gathered in there, which could lead to some more
surprises when the method was retried.
2021-05-23 18:14:51 +02:00
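
To illustrate the kind of surprise the message above alludes to, here is a minimal sketch, not the project's actual code. It assumes a hypothetical `retry` helper (standing in for whatever `iorepeat` does), a hypothetical `get_page` that returns child names, and a hypothetical `handle` that fails once for child "b". When the gather sits inside the retried block, a failure in one child re-runs every child; when only the page fetch is retried and each child gets its own retry, the healthy child runs once.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple, TypeVar

T = TypeVar("T")


async def retry(attempts: int, func: Callable[[], Awaitable[T]]) -> T:
    """Hypothetical helper: call func again when it raises OSError."""
    last_error = None
    for _ in range(attempts):
        try:
            return await func()
        except OSError as error:
            last_error = error
    assert last_error is not None
    raise last_error


async def get_page() -> List[str]:
    # Stand-in for fetching a page and extracting its children.
    return ["a", "b"]


async def handle(child: str, log: List[str], failures: Dict[str, int]) -> None:
    # Record every invocation, then fail as often as `failures` demands.
    log.append(child)
    if failures.get(child, 0) > 0:
        failures[child] -= 1
        raise OSError("transient error")


async def demo() -> Tuple[List[str], List[str]]:
    wide: List[str] = []    # gather inside the retried block
    narrow: List[str] = []  # only get_page inside the retried block
    wide_failures = {"b": 1}
    narrow_failures = {"b": 1}

    async def fetch_and_handle() -> None:
        children = await get_page()
        await asyncio.gather(*(handle(c, wide, wide_failures) for c in children))

    # A failure in "b" restarts the whole block, so "a" is handled twice.
    await retry(3, fetch_and_handle)

    # Retry the fetch alone; each child is retried on its own afterwards.
    children = await retry(3, get_page)
    await asyncio.gather(
        *(retry(3, lambda c=c: handle(c, narrow, narrow_failures)) for c in children)
    )
    return wide, narrow
```

Running `asyncio.run(demo())` returns both logs, so the duplicated work in the first variant can be compared directly with the second.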
Joscha
29d5a40c57
Replace asyncio.gather with custom Crawler function
2021-05-23 17:25:16 +02:00
I-Al-Istannen
e1d18708b3
Rename "no_videos" to videos
2021-05-23 13:30:42 +02:00
I-Al-Istannen
ecdedfa1cf
Add no-videos flag to ILIAS crawler
2021-05-23 12:37:01 +02:00
I-Al-Istannen
3d4b997d4a
Retry crawl_url and work around Python's closure handling
...
Closures capture the scope and not the variables. Therefore, any
type-narrowing performed by mypy on captured variables is lost inside
the closure.
2021-05-23 12:28:15 +02:00
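
The closure behaviour described above can be sketched as follows. This is an illustrative example with hypothetical function names, not code from the commit. The first two functions show that a closure looks a variable up in its enclosing scope when it is called, rather than copying its value when it is defined. The third shows a common workaround for the lost narrowing: binding the narrowed value to a fresh, non-optional local before defining the closure. Whether mypy actually reports an error without that extra local depends on the mypy version, so treat that part as an assumption.

```python
from typing import Callable, List, Optional


def late_bound() -> List[int]:
    # Each lambda reads `i` from the enclosing scope when it is called,
    # so all of them see the final value of the loop variable.
    callbacks = [lambda: i for i in range(3)]
    return [callback() for callback in callbacks]  # [2, 2, 2]


def early_bound() -> List[int]:
    # A default argument is evaluated at definition time and pins the value.
    callbacks = [lambda i=i: i for i in range(3)]
    return [callback() for callback in callbacks]  # [0, 1, 2]


def make_length_getter(value: Optional[str]) -> Callable[[], int]:
    if value is None:
        return lambda: 0

    # `value` is narrowed to `str` here, but a type checker may treat it as
    # Optional[str] again inside the closure, because the captured variable
    # could be reassigned before the closure runs. Copying it into a new
    # local with a non-optional type keeps the narrowing.
    narrowed: str = value

    def get_length() -> int:
        return len(narrowed)

    return get_length
```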
I-Al-Istannen
33a81a5f5c
Document authentication in HTTP crawler and rename prepare_request
2021-05-23 11:55:34 +02:00
Joscha
803e5628a2
Clean up logging
...
Paths are now (hopefully) logged consistently across all crawlers
2021-05-23 11:37:19 +02:00
I-Al-Istannen
53e031d9f6
Reuse dl/cl for I/O retries in ILIAS crawler
2021-05-23 00:28:27 +02:00
I-Al-Istannen
3053278721
Move HTTP crawler to own file
2021-05-22 23:23:21 +02:00
I-Al-Istannen
4d07de0d71
Adjust forum log message in ilias crawler
2021-05-22 23:20:21 +02:00
I-Al-Istannen
953a1bba93
Adjust to new crawl / download names
2021-05-22 23:18:05 +02:00
I-Al-Istannen
4b104b6252
Try out some HTTP authentication handling
...
This is by no means final yet and will change a bit once the dl and cl
are changed, but it might serve as a first try. It is also wholly
untested.
2021-05-21 12:02:51 +02:00
I-Al-Istannen
83d12fcf2d
Add some explains to ilias crawler and use crawler exceptions
2021-05-20 14:58:54 +02:00
I-Al-Istannen
e4f9560655
Only retry on aiohttp errors in ILIAS crawler
...
This patch removes quite a few retries and now only retries the ilias
element method. Every other HTTP-interacting method (except for the root
requests) is called from there and should be covered.
In the future we also want to retry the root a few times, but that
will be done after the download sink API is adjusted.
2021-05-19 22:01:09 +02:00
I-Al-Istannen
8cfa818f04
Only call should_crawl once
2021-05-19 21:57:55 +02:00
I-Al-Istannen
81301f3a76
Rename the ilias crawler to ilias web crawler
2021-05-19 21:41:17 +02:00