Commit Graph

387 Commits

Author SHA1 Message Date
Joscha
6ca0ecdf05 Load and store reports 2021-05-23 20:46:29 +02:00
I-Al-Istannen
6e9f8fd391 Add a keyring authenticator 2021-05-23 19:44:12 +02:00
Joscha
2fdf24495b Restructure crawling and auth related modules 2021-05-23 19:16:42 +02:00
Joscha
bbf9f8f130 Add -C as alias for --crawler 2021-05-23 19:06:09 +02:00
I-Al-Istannen
37f8d84a9c Output total amount of http requests in HTTP Crawler 2021-05-23 19:00:01 +02:00
Joscha
5edd868d5b Fix always-smart redownloading the wrong files 2021-05-23 18:49:34 +02:00
Joscha
e4e5e83be6 Fix downloader using crawl bar
Looks like I made a dumb copy-paste error. Now the download bar shows the proper
progress and speed again.
2021-05-23 18:39:43 +02:00
Joscha
74c7b39dc8 Clean up files in alphabetical order 2021-05-23 18:39:25 +02:00
Joscha
445dffc987 Reword some explanations 2021-05-23 18:35:32 +02:00
I-Al-Istannen
d97d6bf147 Fix handling nested ILIAS folders 2021-05-23 18:29:28 +02:00
I-Al-Istannen
79efdb56f7 Adjust ILIAS html explain messages 2021-05-23 18:24:25 +02:00
Joscha
a9af56a5e9 Improve specifying crawlers via CLI
Instead of removing the sections of unselected crawlers from the config file,
crawler selection now happens in the Pferd after loading the crawlers and is
more sophisticated. It also has better error messages.
2021-05-23 18:18:50 +02:00
I-Al-Istannen
59f13bb8d6 Explain ILIAS HTML parsing and add some warnings 2021-05-23 18:14:54 +02:00
I-Al-Istannen
463f8830d7 Add warn_contd 2021-05-23 18:14:54 +02:00
I-Al-Istannen
05ad06fbc1 Only enclose get_page in iorepeat in ILIAS crawler
We previously also gathered in there, which could lead to some more
surprises when the method was retried.
2021-05-23 18:14:51 +02:00
Joscha
29d5a40c57 Replace asyncio.gather with custom Crawler function 2021-05-23 17:25:16 +02:00
Joscha
c0cecf8363 Log crawl and download actions more extensively 2021-05-23 16:25:44 +02:00
Joscha
b998339002 Fix cleanup logging of paths 2021-05-23 16:25:44 +02:00
Joscha
245c9c3dcc Explain output dir decisions and steps 2021-05-23 16:25:44 +02:00
I-Al-Istannen
d8f26a789e Implement CLI Command for ilias crawler 2021-05-23 13:30:42 +02:00
I-Al-Istannen
e1d18708b3 Rename "no_videos" to videos 2021-05-23 13:30:42 +02:00
Joscha
b44b49476d Fix noncritical and anoncritical decorators
I must've forgotten to update the anoncritical decorator when I last changed the
noncritical decorator. Also, every exception should make the crawler not
error_free, not just CrawlErrors.
2021-05-23 13:24:53 +02:00
Joscha
7e0bb06259 Clean up TODOs 2021-05-23 12:47:30 +02:00
I-Al-Istannen
ecdedfa1cf Add no-videos flag to ILIAS crawler 2021-05-23 12:37:01 +02:00
I-Al-Istannen
3d4b997d4a Retry crawl_url and work around Python's closure handling
Closures capture the scope and not the variables. Therefore, any
type-narrowing performed by mypy on captured variables is lost inside
the closure.
2021-05-23 12:28:15 +02:00
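The closure behaviour this commit message describes can be illustrated with a small sketch. The function names (`late_binding`, `make_greeter`) are hypothetical and not taken from PFERD. The workaround in the second function, copying the narrowed value into a fresh local name before defining the closure, is one common approach and only an assumption about what the commit does.

```python
from typing import Callable, List, Optional


def late_binding() -> List[int]:
    # Each lambda looks up `i` in the enclosing scope when it is called,
    # not when it is created, so all of them see the final value of `i`.
    funcs = [lambda: i for i in range(3)]
    return [f() for f in funcs]


def make_greeter(name: Optional[str]) -> Callable[[], str]:
    if name is None:
        name = "anonymous"
    # A type checker has narrowed `name` to `str` at this point, but it may
    # treat `name` as Optional[str] again inside a nested function, because
    # the closure refers to the variable's scope rather than its value.
    # Hypothetical workaround: copy the value into a new, non-optional name.
    narrowed: str = name

    def greet() -> str:
        return "Hello, " + narrowed.upper()

    return greet
```

Whether a given mypy version reports an error for the un-copied variant depends on the version, so the sketch only shows the pattern and not a specific diagnostic.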
Joscha
e81005ae4b Fix CLI arguments 2021-05-23 12:24:21 +02:00
I-Al-Istannen
33a81a5f5c Document authentication in HTTP crawler and rename prepare_request 2021-05-23 11:55:34 +02:00
Joscha
25e2abdb03 Improve transformer explain wording 2021-05-23 11:45:14 +02:00
Joscha
803e5628a2 Clean up logging
Paths are now (hopefully) logged consistently across all crawlers.
2021-05-23 11:37:19 +02:00
Joscha
c88f20859a Explain config file dumping 2021-05-23 11:04:50 +02:00
Joscha
ec3767c545 Create crawler base dir at start of crawl 2021-05-23 10:52:02 +02:00
Joscha
729ff0a4c7 Fix simple authenticator output 2021-05-23 10:45:37 +02:00
Joscha
6fe51e258f Number rules starting at 1 2021-05-23 10:45:37 +02:00
Joscha
44ecb2fbe7 Fix cleanup deleting crawler's base directory 2021-05-23 10:45:37 +02:00
I-Al-Istannen
53e031d9f6 Reuse dl/cl for I/O retries in ILIAS crawler 2021-05-23 00:28:27 +02:00
I-Al-Istannen
8ac85ea0bd Fix a few typos in HttpCrawler 2021-05-22 23:37:34 +02:00
I-Al-Istannen
adfdc302d7 Save cookies after successful authentication in HTTP crawler 2021-05-22 23:30:32 +02:00
I-Al-Istannen
3053278721 Move HTTP crawler to own file 2021-05-22 23:23:21 +02:00
I-Al-Istannen
4d07de0d71 Adjust forum log message in ilias crawler 2021-05-22 23:20:21 +02:00
I-Al-Istannen
953a1bba93 Adjust to new crawl / download names 2021-05-22 23:18:05 +02:00
Joscha
e724ff7c93 Fix normal arrow 2021-05-22 20:44:59 +00:00
Joscha
62f0f7bfc5 Explain crawling and partially explain downloading 2021-05-22 20:39:57 +00:00
Joscha
9cb2b68f09 Fix arrow parsing error messages 2021-05-22 20:39:29 +00:00
Joscha
1bbc0b705f Improve transformer error handling 2021-05-22 20:38:56 +00:00
Joscha
662191eca9 Fix crash as soon as first cl or dl token was acquired 2021-05-22 20:25:58 +00:00
Joscha
ae3d80664c Update local crawler to new crawler structure 2021-05-22 21:46:36 +02:00
Joscha
e21795ee35 Make file cleanup part of default crawler behaviour 2021-05-22 21:45:51 +02:00
Joscha
ec95dda18f Unify crawling and downloading steps
Now, the progress bar, limiter etc. for downloading and crawling are all handled
via the reusable CrawlToken and DownloadToken context managers.
2021-05-22 21:36:53 +02:00
Joscha
098ac45758 Remove deprecated repeat decorators 2021-05-22 21:13:25 +02:00
Joscha
9889ce6b57 Improve PFERD error handling 2021-05-22 21:13:25 +02:00
Joscha
b4d97cd545 Improve output dir and report error handling 2021-05-22 20:54:42 +02:00
Joscha
afac22c562 Handle abort in exclusive output state correctly
If the event loop is stopped while something holds the exclusive output, the
"log" singleton is now reset so the main thread can print a few more messages
before exiting.
2021-05-22 18:58:19 +02:00
Joscha
552cd82802 Run async input and password getters in daemon thread
Previously, they ran in the event loop's default executor, which would block
until all its workers were done working.

If Ctrl+C was pressed while input or a password was being read, the
asyncio.run() call in the main thread would be interrupted, but not the
input thread. This meant that multiple key presses (either enter or a second
Ctrl+C) were necessary to stop a running PFERD in some circumstances.

This change instead runs the input functions in daemon threads so they exit as
soon as the main thread exits.
2021-05-22 18:37:53 +02:00
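The approach in this commit message, running a blocking getter in a daemon thread instead of the default executor, might look roughly like the sketch below. The helper name `run_in_daemon_thread` is hypothetical; this is not PFERD's actual code, only a minimal illustration built from `threading.Thread(daemon=True)` and `loop.call_soon_threadsafe`.

```python
import asyncio
import threading
from typing import Any, Callable, TypeVar

T = TypeVar("T")


async def run_in_daemon_thread(func: Callable[[], T]) -> T:
    """Run a blocking function in a daemon thread and await its result."""
    loop = asyncio.get_running_loop()
    future: "asyncio.Future[T]" = loop.create_future()

    def deliver(setter: Callable[[Any], None], value: Any) -> None:
        # Runs inside the event loop; skip if the future was cancelled.
        if not future.done():
            setter(value)

    def worker() -> None:
        try:
            result = func()
        except Exception as e:
            loop.call_soon_threadsafe(deliver, future.set_exception, e)
        else:
            loop.call_soon_threadsafe(deliver, future.set_result, result)

    # A daemon thread does not keep the interpreter alive, so a blocked
    # input() call cannot delay shutdown once the main thread exits.
    threading.Thread(target=worker, daemon=True).start()
    return await future
```

A caller could then write something like `answer = await run_in_daemon_thread(lambda: input("Username: "))`.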
Joscha
dfde0e2310 Improve reporting of unexpected exceptions 2021-05-22 18:36:25 +02:00
Joscha
54dd2f8337 Clean up main and improve error handling 2021-05-22 16:47:24 +02:00
Joscha
b5785f260e Extract CLI argument parsing to separate module 2021-05-22 15:03:45 +02:00
Joscha
98b8ca31fa Add some todos 2021-05-22 14:45:46 +02:00
I-Al-Istannen
4b104b6252 Try out some HTTP authentication handling
This is by no means final yet and will change a bit once the dl and cl
are changed, but it might serve as a first try. It is also wholly
untested.
2021-05-21 12:02:51 +02:00
I-Al-Istannen
83d12fcf2d Add some explains to ilias crawler and use crawler exceptions 2021-05-20 14:58:54 +02:00
I-Al-Istannen
e4f9560655 Only retry on aiohttp errors in ILIAS crawler
This patch removes quite a few retries and now only retries the ilias
element method. Every other HTTP-interacting method (except for the root
requests) is called from there and should be covered.

In the future we also want to retry the root a few times, but that
will be done after the download sink API is adjusted.
2021-05-19 22:01:09 +02:00
I-Al-Istannen
8cfa818f04 Only call should_crawl once 2021-05-19 21:57:55 +02:00
I-Al-Istannen
81301f3a76 Rename the ilias crawler to ilias web crawler 2021-05-19 21:41:17 +02:00
I-Al-Istannen
2976b4d352 Move ILIAS file templates to own file 2021-05-19 21:37:10 +02:00
I-Al-Istannen
9f03702e69 Split up ilias crawler in multiple files
The ilias crawler contained a crawler and an HTML parser, now they are
split in two.
2021-05-19 21:34:36 +02:00
Joscha
3300886120 Explain config file loading 2021-05-19 18:11:43 +02:00
Joscha
0d10752b5a Configure explain log level via cli and config file 2021-05-19 17:50:10 +02:00
Joscha
92886fb8d8 Implement --version flag 2021-05-19 17:33:36 +02:00
Joscha
5916626399 Make noqa comment more specific 2021-05-19 17:16:59 +02:00
Joscha
a7c025fd86 Implement reusable FileSinkToken for OutputDirectory 2021-05-19 17:16:23 +02:00
Joscha
b7a999bc2e Clean up crawler exceptions and (a)noncritical 2021-05-19 13:25:57 +02:00
Joscha
3851065500 Fix local crawler's download bars
Display the pure path instead of the local path.
2021-05-18 23:23:40 +02:00
Joscha
4b68fa771f Move logging logic to singleton
- Renamed module and class because "conductor" didn't make a lot of sense
- Used singleton approach (there's only one stdout after all)
- Redesigned progress bars (now with download speed!)
2021-05-18 22:45:19 +02:00
I-Al-Istannen
1525aa15a6 Fix link template error and use indeterminate progress bar 2021-05-18 22:40:28 +02:00
I-Al-Istannen
db1219d4a9 Create a link file in ILIAS crawler
This allows us to crawl links and represent them in the file system.
Users can choose between an ILIAS-imitation (that optionally
auto-redirects) and a plain text variant.
2021-05-17 21:44:54 +02:00
I-Al-Istannen
b8efcc2ca5 Respect filters in ILIAS crawler 2021-05-17 21:30:26 +02:00
Joscha
0bae009189 Run formatting tools 2021-05-16 14:32:53 +02:00
I-Al-Istannen
8b76ebb3ef Rename IliasCrawler to KitIliasCrawler 2021-05-16 13:28:06 +02:00
I-Al-Istannen
2b6235dc78 Fix pylint warnings (and 2 found bugs) in ILIAS crawler 2021-05-16 13:17:12 +02:00
I-Al-Istannen
1c226c31aa Add some repeat annotations to the ILIAS crawler 2021-05-16 13:01:56 +02:00
I-Al-Istannen
9ec0d3e16a Implement date-demangling in ILIAS crawler 2021-05-16 13:01:56 +02:00
I-Al-Istannen
cf6903d109 Retry crawling on I/O failure 2021-05-16 13:01:56 +02:00
Joscha
9fd356d290 Ensure tmp files are deleted
This doesn't seem to fix the case where an exception bubbles up to the top of
the event loop. It also doesn't seem to fix the case when a KeyboardInterrupt is
thrown, since that never makes its way into the event loop in the first place.

Both of these cases lead to the event loop stopping, which means that the tmp
file cleanup doesn't get executed even though it's inside a "with" or "finally".
2021-05-15 23:00:40 +02:00
Joscha
989032fe0c Fix cookies getting deleted 2021-05-15 22:25:48 +02:00
Joscha
05573ccc53 Add fancy CLI options 2021-05-15 22:22:01 +02:00
I-Al-Istannen
c454fabc9d Add support for exercises in ILIAS crawler 2021-05-15 21:40:17 +02:00
I-Al-Istannen
7d323ec62b Implement video downloads in ilias crawler 2021-05-15 21:32:32 +02:00
I-Al-Istannen
c7494e32ce Start implementing crawling in ILIAS crawler
The ilias crawler can now crawl quite a few filetypes, splits off
folders and crawls them concurrently.
2021-05-15 20:42:18 +02:00
I-Al-Istannen
1123c8884d Implement an IliasPage
This allows PFERD to semantically understand ILIAS HTML and is the
foundation for the ILIAS crawler. This patch extends the ILIAS crawler
to crawl the personal desktop and print the elements on it.
2021-05-15 18:59:23 +02:00
Joscha
e1104f888d Add tfa authenticator 2021-05-15 18:27:16 +02:00
Joscha
8c32da7f19 Let authenticators provide username and password separately 2021-05-15 18:27:03 +02:00
Joscha
d63494908d Properly invalidate exceptions
The simple authenticator now properly invalidates its credentials. Also, the
invalidation functions have been given better names and documentation.
2021-05-15 17:37:05 +02:00
Joscha
b70b62cef5 Make crawler sections start with "crawl:"
Also, use only the part of the section name after the "crawl:" as the crawler's
output directory. Now, the implementation matches the documentation again.
2021-05-15 17:24:37 +02:00
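As a rough illustration of the naming rule in this commit message, the sketch below picks out config sections that start with "crawl:" and derives an output directory from the remainder of the name. The function `crawler_output_dirs` and the example section names are hypothetical; PFERD's real config loading is not shown in this log.

```python
import configparser
from pathlib import PurePath
from typing import Dict

CRAWL_PREFIX = "crawl:"


def crawler_output_dirs(parser: configparser.ConfigParser) -> Dict[str, PurePath]:
    """Map each "crawl:" section name to the directory named after the prefix."""
    result: Dict[str, PurePath] = {}
    for name in parser.sections():
        if name.startswith(CRAWL_PREFIX):
            # Only the part after "crawl:" becomes the output directory.
            result[name] = PurePath(name[len(CRAWL_PREFIX):])
    return result


# Example config with one authenticator section and two crawler sections.
parser = configparser.ConfigParser()
parser.read_string("[auth:example]\n\n[crawl:lectures]\n\n[crawl:exercises]\n")
dirs = crawler_output_dirs(parser)
```

The full section name (including the prefix) stays the key, which matches the earlier commit stating that "auth:" and "crawl:" are considered part of the name.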
Joscha
868f486922 Rename local crawler path to target 2021-05-15 17:12:25 +02:00
I-Al-Istannen
b2a2b5999b Implement ILIAS auth and crawl home page
This commit introduces the necessary machinery to authenticate with
ILIAS and crawl the home page.

It can't do much yet and just silently fetches the homepage.
2021-05-15 15:25:05 +02:00
Joscha
595de88d96 Fix authenticator and crawler names
Now, the "auth:" and "crawl:" parts are considered part of the name. This fixes
crawlers not being able to find their authenticators.
2021-05-15 15:25:05 +02:00
Joscha
a6fdf05ee9 Allow variable whitespace in arrow rules 2021-05-15 15:25:05 +02:00
Joscha
f897d7c2e1 Add name variants for all arrows 2021-05-15 15:25:05 +02:00
Joscha
b0f731bf84 Make crawlers use transformers 2021-05-15 15:25:05 +02:00
Joscha
302b8c0c34 Fix errors loading local crawler config
Apparently getint and getfloat may return a None even though this is not
mentioned in their type annotations.
2021-05-15 15:25:05 +02:00
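The pitfall mentioned in this commit message can be reproduced with the standard library alone: calling `getint` on a section proxy for a missing option returns `None` instead of raising. The section and option names below, as well as the helper `getint_or`, are hypothetical and only serve the illustration.

```python
import configparser

parser = configparser.ConfigParser()
parser.read_string("[crawl:example]\ntarget = ./files\n")
section = parser["crawl:example"]

# "max_tasks" is not set, so the section proxy falls back to None.
missing = section.getint("max_tasks")


def getint_or(section: configparser.SectionProxy, key: str, default: int) -> int:
    """Read an integer option, substituting a default when it is absent."""
    value = section.getint(key)
    return default if value is None else value


tasks = getint_or(section, "max_tasks", 5)
```

Checking for `None` explicitly, rather than using `value or default`, keeps a configured value of `0` intact.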
Joscha
acd674f0a0 Change limiter logic
Now download tasks are a subset of all tasks.
2021-05-15 15:25:05 +02:00
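One way to read "download tasks are a subset of all tasks" is that a download must hold a general task slot and, in addition, a download slot. The sketch below models that reading with two semaphores. The class `Limiter`, its method names, and the limits are assumptions for illustration and are not taken from PFERD's source.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator


class Limiter:
    def __init__(self, max_tasks: int, max_downloads: int) -> None:
        self._tasks = asyncio.Semaphore(max_tasks)
        self._downloads = asyncio.Semaphore(max_downloads)

    @asynccontextmanager
    async def limit_crawl(self) -> AsyncIterator[None]:
        # Crawling only needs a general task slot.
        async with self._tasks:
            yield

    @asynccontextmanager
    async def limit_download(self) -> AsyncIterator[None]:
        # A download counts as a task and also needs a download slot.
        async with self._tasks:
            async with self._downloads:
                yield


async def demo() -> int:
    """Start five downloads and return the peak number running at once."""
    limiter = Limiter(max_tasks=3, max_downloads=1)
    active = 0
    peak = 0

    async def download() -> None:
        nonlocal active, peak
        async with limiter.limit_download():
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)
            active -= 1

    await asyncio.gather(*(download() for _ in range(5)))
    return peak
```

The limiter is created inside the coroutine so that the semaphores belong to the running event loop on older Python versions as well.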