59f13bb8d6
Explain ILIAS HTML parsing and add some warnings
2021-05-23 18:14:54 +02:00
463f8830d7
Add warn_contd
2021-05-23 18:14:54 +02:00
05ad06fbc1
Only enclose get_page in iorepeat in ILIAS crawler
...
Previously, gathering also happened inside iorepeat, which could lead to
surprises when the method was retried.
2021-05-23 18:14:51 +02:00
29d5a40c57
Replace asyncio.gather with custom Crawler function
2021-05-23 17:25:16 +02:00
c0cecf8363
Log crawl and download actions more extensively
2021-05-23 16:25:44 +02:00
b998339002
Fix cleanup logging of paths
2021-05-23 16:25:44 +02:00
245c9c3dcc
Explain output dir decisions and steps
2021-05-23 16:25:44 +02:00
d8f26a789e
Implement CLI Command for ilias crawler
2021-05-23 13:30:42 +02:00
e1d18708b3
Rename "no_videos" to videos
2021-05-23 13:30:42 +02:00
b44b49476d
Fix noncritical and anoncritical decorators
...
I must've forgotten to update the anoncritical decorator when I last changed the
noncritical decorator. Also, every exception should make the crawler not
error_free, not just CrawlErrors.
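
A minimal sketch of the behaviour described above, not the actual PFERD code. The `Crawler` and `CrawlError` classes here are simplified stand-ins, and it is an assumption that expected errors are swallowed while unexpected ones propagate. The point it illustrates is that both decorators clear `error_free` on every exception:

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable


class CrawlError(Exception):
    """Stand-in for an expected, recoverable crawl failure."""


class Crawler:
    """Stand-in crawler that only tracks whether anything went wrong."""

    def __init__(self) -> None:
        self.error_free = True


def noncritical(f: Callable[..., None]) -> Callable[..., None]:
    @functools.wraps(f)
    def wrapper(self: Crawler, *args: Any, **kwargs: Any) -> None:
        try:
            f(self, *args, **kwargs)
        except CrawlError:
            # Expected failure: record it and carry on.
            self.error_free = False
        except Exception:
            # Unexpected failure: record it as well, then propagate.
            self.error_free = False
            raise

    return wrapper


def anoncritical(f: Callable[..., Awaitable[None]]) -> Callable[..., Awaitable[None]]:
    # Async twin of noncritical; the two must be kept in sync.
    @functools.wraps(f)
    async def wrapper(self: Crawler, *args: Any, **kwargs: Any) -> None:
        try:
            await f(self, *args, **kwargs)
        except CrawlError:
            self.error_free = False
        except Exception:
            self.error_free = False
            raise

    return wrapper
```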
2021-05-23 13:24:53 +02:00
7e0bb06259
Clean up TODOs
2021-05-23 12:47:30 +02:00
ecdedfa1cf
Add no-videos flag to ILIAS crawler
2021-05-23 12:37:01 +02:00
3d4b997d4a
Retry crawl_url and work around Python's closure handling
...
Closures capture the scope and not the variables. Therefore, any
type-narrowing performed by mypy on captured variables is lost inside
the closure.
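
A small illustration of the problem, using hypothetical function names. The first function shows the runtime side (all closures see the final value of the shared name), the second shows the workaround for the typing side: binding the narrowed value to a fresh local that is never reassigned. How much narrowing mypy keeps inside nested functions depends on the mypy version, so treat this as a sketch:

```python
from typing import Callable, List, Optional


def late_binding_demo() -> List[int]:
    # Each lambda captures the enclosing scope, not the value of `i` at
    # creation time, so all of them see the last value of `i`.
    callbacks = [lambda: i for i in range(3)]
    return [cb() for cb in callbacks]


def make_length_getter(value: Optional[str]) -> Callable[[], int]:
    if value is None:
        raise ValueError("value must not be None")

    # mypy has narrowed `value` to `str` at this point. Inside a nested
    # function the name is looked up again when the function runs, so the
    # narrowing may not carry over. A separate, never-reassigned local
    # keeps the narrowed type.
    narrowed: str = value

    def inner() -> int:
        return len(narrowed)

    return inner
```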
2021-05-23 12:28:15 +02:00
e81005ae4b
Fix CLI arguments
2021-05-23 12:24:21 +02:00
33a81a5f5c
Document authentication in HTTP crawler and rename prepare_request
2021-05-23 11:55:34 +02:00
25e2abdb03
Improve transformer explain wording
2021-05-23 11:45:14 +02:00
803e5628a2
Clean up logging
...
Paths are now (hopefully) logged consistently across all crawlers
2021-05-23 11:37:19 +02:00
c88f20859a
Explain config file dumping
2021-05-23 11:04:50 +02:00
ec3767c545
Create crawler base dir at start of crawl
2021-05-23 10:52:02 +02:00
729ff0a4c7
Fix simple authenticator output
2021-05-23 10:45:37 +02:00
6fe51e258f
Number rules starting at 1
2021-05-23 10:45:37 +02:00
44ecb2fbe7
Fix cleanup deleting crawler's base directory
2021-05-23 10:45:37 +02:00
53e031d9f6
Reuse dl/cl for I/O retries in ILIAS crawler
2021-05-23 00:28:27 +02:00
8ac85ea0bd
Fix a few typos in HttpCrawler
2021-05-22 23:37:34 +02:00
adfdc302d7
Save cookies after successful authentication in HTTP crawler
2021-05-22 23:30:32 +02:00
3053278721
Move HTTP crawler to own file
2021-05-22 23:23:21 +02:00
4d07de0d71
Adjust forum log message in ilias crawler
2021-05-22 23:20:21 +02:00
953a1bba93
Adjust to new crawl / download names
2021-05-22 23:18:05 +02:00
e724ff7c93
Fix normal arrow
2021-05-22 20:44:59 +00:00
62f0f7bfc5
Explain crawling and partially explain downloading
2021-05-22 20:39:57 +00:00
9cb2b68f09
Fix arrow parsing error messages
2021-05-22 20:39:29 +00:00
1bbc0b705f
Improve transformer error handling
2021-05-22 20:38:56 +00:00
662191eca9
Fix crash as soon as first cl or dl token was acquired
2021-05-22 20:25:58 +00:00
ae3d80664c
Update local crawler to new crawler structure
2021-05-22 21:46:36 +02:00
e21795ee35
Make file cleanup part of default crawler behaviour
2021-05-22 21:45:51 +02:00
ec95dda18f
Unify crawling and downloading steps
...
Now, the progress bar, limiter etc. for downloading and crawling are all handled
via the reusable CrawlToken and DownloadToken context managers.
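
A rough sketch of the idea, not the real CrawlToken or DownloadToken. The `Limiter` class, its `token` method and the log list are hypothetical; the log stands in for the progress bar. It shows how one async context manager can bundle limiting and progress reporting for both kinds of step:

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List


class Limiter:
    """Hypothetical helper; create it inside a running event loop."""

    def __init__(self, max_tasks: int) -> None:
        self._semaphore = asyncio.Semaphore(max_tasks)
        self.log: List[str] = []

    @asynccontextmanager
    async def token(self, kind: str, path: str) -> AsyncIterator[None]:
        # Wait for a free slot, report the start, and always report the
        # end, even if the body raises.
        async with self._semaphore:
            self.log.append(f"{kind} start {path}")
            try:
                yield
            finally:
                self.log.append(f"{kind} done {path}")


async def demo() -> List[str]:
    limiter = Limiter(max_tasks=1)
    async with limiter.token("Crawl", "a"):
        pass
    async with limiter.token("Download", "b"):
        pass
    return limiter.log
```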
2021-05-22 21:36:53 +02:00
098ac45758
Remove deprecated repeat decorators
2021-05-22 21:13:25 +02:00
9889ce6b57
Improve PFERD error handling
2021-05-22 21:13:25 +02:00
b4d97cd545
Improve output dir and report error handling
2021-05-22 20:54:42 +02:00
afac22c562
Handle abort in exclusive output state correctly
...
If the event loop is stopped while something holds the exclusive output, the
"log" singleton is now reset so the main thread can print a few more messages
before exiting.
2021-05-22 18:58:19 +02:00
552cd82802
Run async input and password getters in daemon thread
...
Previously, they ran in the event loop's default executor, which would block
until all its workers were done working.
If Ctrl+C was pressed while input or a password was being read, the
asyncio.run() call in the main thread would be interrupted, but not the input
thread. This meant that multiple key presses (either enter or a second Ctrl+C)
were necessary to stop a running PFERD in some circumstances.
This change instead runs the input functions in daemon threads so they exit as
soon as the main thread exits.
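
A minimal sketch of the approach, assuming a hypothetical helper named `in_daemon_thread`. It runs a blocking function in a daemon thread and hands the result back to the event loop through a future, so the thread never keeps the process alive on its own:

```python
import asyncio
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


async def in_daemon_thread(func: Callable[[], T]) -> T:
    loop = asyncio.get_running_loop()
    future: "asyncio.Future[T]" = loop.create_future()

    def set_result(result: T) -> None:
        # The awaiting task may have been cancelled in the meantime.
        if not future.done():
            future.set_result(result)

    def set_exception(error: BaseException) -> None:
        if not future.done():
            future.set_exception(error)

    def thread_func() -> None:
        try:
            result = func()
        except BaseException as e:
            # Futures are not thread-safe, so hop back onto the loop.
            loop.call_soon_threadsafe(set_exception, e)
        else:
            loop.call_soon_threadsafe(set_result, result)

    # daemon=True: the interpreter does not wait for this thread on exit.
    threading.Thread(target=thread_func, daemon=True).start()
    return await future
```

With this helper, a blocking prompt could be awaited as `await in_daemon_thread(lambda: input("Username: "))`.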
2021-05-22 18:37:53 +02:00
dfde0e2310
Improve reporting of unexpected exceptions
2021-05-22 18:36:25 +02:00
54dd2f8337
Clean up main and improve error handling
2021-05-22 16:47:24 +02:00
b5785f260e
Extract CLI argument parsing to separate module
2021-05-22 15:03:45 +02:00
98b8ca31fa
Add some todos
2021-05-22 14:45:46 +02:00
4b104b6252
Try out some HTTP authentication handling
...
This is by no means final yet and will change a bit once the dl and cl
are changed, but it might serve as a first try. It is also wholly
untested.
2021-05-21 12:02:51 +02:00
83d12fcf2d
Add some explains to ilias crawler and use crawler exceptions
2021-05-20 14:58:54 +02:00
e4f9560655
Only retry on aiohttp errors in ILIAS crawler
...
This patch removes quite a few retries and now only retries the ilias
element method. Every other HTTP-interacting method (except for the root
requests) is called from there and should be covered.
In the future we also want to retry the root a few times, but that
will be done after the download sink API is adjusted.
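
As a sketch of retrying only on a specific error class: `retry_on` below is a hypothetical helper, not the crawler's own code. In the ILIAS crawler the retried class would be aiohttp's client errors; the helper takes the exception types as a parameter so that anything else propagates immediately:

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_on(
    errors: Tuple[Type[BaseException], ...],
    attempts: int,
    func: Callable[[], Awaitable[T]],
) -> T:
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error: Optional[BaseException] = None
    for _ in range(attempts):
        try:
            return await func()
        except errors as e:
            # Only the listed error types are retried.
            last_error = e

    assert last_error is not None
    raise last_error
```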
2021-05-19 22:01:09 +02:00
8cfa818f04
Only call should_crawl once
2021-05-19 21:57:55 +02:00
81301f3a76
Rename the ilias crawler to ilias web crawler
2021-05-19 21:41:17 +02:00