509 Commits

Author SHA1 Message Date
Joscha
fc31100a0f Always use '/' as path separator for regex rules
Previously, regex-matching paths on Windows would, in some cases, require four
backslashes ('\\\\') to escape a single path separator. That's just too much.

With this commit, regex transforms now use '/' instead of '\' as path separator,
meaning rules can more easily be shared between platforms (although they are not
guaranteed to be 100% compatible since on Windows, '\' is still recognized as a
path separator).

To make rules more intuitive to write, local relative paths are now also printed
with '/' as path separator on Windows. Since Windows also accepts '/' as path
separator, this change doesn't really affect other rules that parse their sides
as paths.
2021-06-04 18:12:45 +02:00
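
The separator change described in the commit above can be illustrated with a minimal sketch. It assumes paths are rendered with '/' before a regex rule is applied; the `to_rule_path` helper and the `RULE` pattern are hypothetical and not taken from the project.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

def to_rule_path(path):
    """Render a pure path with '/' as separator, regardless of platform."""
    return path.as_posix()

# Hypothetical rule: capture the lecture number from a PDF path.
RULE = re.compile(r"lectures/(\d+)\.pdf")

windows_path = PureWindowsPath("lectures\\03.pdf")
posix_path = PurePosixPath("lectures/03.pdf")

# Both flavours produce the same string, so one pattern covers both.
windows_match = RULE.fullmatch(to_rule_path(windows_path))
posix_match = RULE.fullmatch(to_rule_path(posix_path))
```

Because the pattern only ever sees '/', it needs no escaped backslashes and can be shared between platforms.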
Joscha
31b6311e99 Remove incorrect tmp file explain message 2021-06-01 19:03:06 +02:00
Joscha
85b9f45085 Bump version to 3.0.1 2021-06-01 09:49:30 +00:00
Joscha
f656e3ff34 Fix credential parsing 2021-06-01 09:18:17 +00:00
Joscha
e1bda94329 Load credential file from correct path 2021-06-01 09:18:08 +00:00
Joscha
f6b26f4ead Fix unexpected exception when credential file not found 2021-06-01 09:10:58 +00:00
Joscha
722970a255 Store cookies in text-based format
Using the stdlib's http.cookies module, cookies are now stored as one
"Set-Cookie" header per line. Previously, the aiohttp.CookieJar's save() and
load() methods were used (which use pickling).
2021-05-31 20:18:20 +00:00
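
A minimal sketch of the text-based format described in the commit above, using only the stdlib's `http.cookies`. The helper names `dump_cookies` and `load_cookies` are hypothetical, and the project's actual file handling may differ.

```python
from http.cookies import SimpleCookie

HEADER = "Set-Cookie:"

def dump_cookies(cookies):
    """Serialize cookies as one 'Set-Cookie: ...' line per cookie."""
    return cookies.output(header=HEADER, sep="\n")

def load_cookies(text):
    """Rebuild a SimpleCookie from text written by dump_cookies()."""
    cookies = SimpleCookie()
    for line in text.splitlines():
        if line.startswith(HEADER):
            # Strip the header name and parse the remaining 'name=value; attrs'.
            cookies.load(line[len(HEADER):].strip())
    return cookies

jar = SimpleCookie()
jar["session"] = "abc123"
jar["session"]["path"] = "/"
jar["lang"] = "de"

text = dump_cookies(jar)
restored = load_cookies(text)
```

Unlike a pickled cookie jar, the resulting file can be read and edited by hand.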
Joscha
f40820c41f Warn if using concurrent tasks with kit-ilias-web 2021-05-31 20:18:20 +00:00
Joscha
49ad1b6e46 Clean up authenticator code formatting 2021-05-31 18:45:06 +02:00
Joscha
1ce32d2f18 Add CLI option for credential file auth to kit-ilias-web 2021-05-31 18:45:06 +02:00
Joscha
9d5ec84b91 Add credential file authenticator 2021-05-31 18:33:34 +02:00
I-Al-Istannen
1fba96abcb Fix exercise date parsing for non-group submissions
ILIAS apparently changes the order of the fields as it sees fit, so we
now try to parse *every* column, starting from the right, as a date.
The first column that parses successfully is then used.
2021-05-31 18:15:12 +02:00
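
The right-to-left scan described in the commit above could look roughly like this. The date format and the `find_date` helper are assumptions made for illustration; the real ILIAS date format is not stated in the commit.

```python
from datetime import datetime
from typing import Optional, Sequence

# Assumed format for illustration only.
DATE_FORMAT = "%Y-%m-%d %H:%M"

def find_date(columns: Sequence[str]) -> Optional[datetime]:
    """Return the first column, scanning from the right, that parses as a date."""
    for text in reversed(columns):
        try:
            return datetime.strptime(text.strip(), DATE_FORMAT)
        except ValueError:
            continue  # Not a date, try the next column to the left.
    return None

row = ["sheet01.pdf", "2021-05-31 18:15", "Download"]
found = find_date(row)
```

Scanning every column makes the parser independent of the column order, at the cost of accepting any column that happens to look like a date.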
Joscha
7b062883f6 Use raw paths for --debug-transforms
Previously, the already-transformed paths were used, which meant that
--debug-transforms was cumbersome to use (as you had to remove all transforms
and crawl once before getting useful results).
2021-05-31 12:33:37 +02:00
Joscha
64a2960751 Align paths in status messages and progress bars
Also print "Ignored" when paths are ignored due to transforms
2021-05-31 12:32:42 +02:00
Joscha
17879a7f69 Print box around message for unexpected exceptions 2021-05-31 12:05:49 +02:00
Joscha
1dd24551a5 Add link to repo in --version output 2021-05-31 11:44:17 +02:00
Joscha
84f775013f Use event loop workaround only on windows
This avoids an unnecessary one-second sleep on other platforms. However, a
better "fix" for this sleep would be a less ugly workaround on Windows.
2021-05-31 11:41:52 +02:00
I-Al-Istannen
1ca6740e05 Improve log messages when parsing ILIAS HTML
Previously some logs were split around an "await", which isn't a great
idea.
2021-05-27 17:59:22 +02:00
Joscha
474aa7e1cc Use sorted path order when debugging transforms 2021-05-27 15:41:00 +00:00
I-Al-Istannen
5beb4d9a2d Fix renaming conflict with multi-stage video elements 2021-05-27 15:41:00 +02:00
I-Al-Istannen
19eed5bdff Fix authentication logic conflicts with videos 2021-05-27 15:41:00 +02:00
Joscha
6fa9cfd4c3 Fix error when capturing group is None 2021-05-27 15:41:00 +02:00
Joscha
80acc4b50d Implement new name arrows 2021-05-27 13:43:02 +02:00
Joscha
533f75ea71 Add --debug-transforms flag 2021-05-26 11:37:32 +02:00
Joscha
adb5d4ade3 Print files that are *not* deleted by cleanup
These are files that are not present on the remote source any more, but still
present locally. They also show up in the report.
2021-05-26 10:58:19 +02:00
Joscha
a879c6ab6e Fix function being printed 2021-05-26 10:54:01 +02:00
Joscha
915e42fd07 Fix report not being printed if pferd exits normally 2021-05-26 10:53:54 +02:00
I-Al-Istannen
2d8dcc87ff Send CSRF token in TFA request 2021-05-25 22:50:40 +02:00
I-Al-Istannen
66f0e398a1 Await result in tfa authenticate path 2021-05-25 19:19:51 +02:00
Joscha
30be4e29fa Add workaround for RuntimeError after program finishes on Windows 2021-05-25 16:34:22 +00:00
I-Al-Istannen
263780e6a3 Use certifi to ensure CA certificates are bundled in pyinstaller 2021-05-25 18:24:06 +02:00
Joscha
07a75a37c3 Fix FileNotFoundError on Windows 2021-05-25 15:57:03 +00:00
Joscha
f85b75df8c Switch from exit() to sys.exit()
Pyinstaller doesn't recognize exit().
2021-05-25 17:33:38 +02:00
Joscha
519a7ef435 Split --dump-config into two options
--dump-config with its optional argument tended to consume the command name, so
it had to be split up.
2021-05-25 17:17:35 +02:00
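
The problem described in the commit above can be reproduced with plain argparse. The sketch below is an assumption about the general shape of the parser, not the project's actual CLI code: the `local` subcommand and the `--dump-config-to` option name are illustrative.

```python
import argparse

def build_parser(split: bool) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo")
    if split:
        # Two options: a plain flag and one that always takes a path.
        parser.add_argument("--dump-config", action="store_true")
        parser.add_argument("--dump-config-to", metavar="PATH")
    else:
        # One option with an optional argument (nargs="?").
        parser.add_argument("--dump-config", nargs="?", const=True, metavar="PATH")
    commands = parser.add_subparsers(dest="command")
    commands.add_parser("local")
    return parser

argv = ["--dump-config", "local"]
greedy = build_parser(split=False).parse_args(argv)
fixed = build_parser(split=True).parse_args(argv)
```

With the single option, argparse treats `local` as the value of `--dump-config` and no command is selected. With the split options, `--dump-config` is a flag and `local` is parsed as the command.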
I-Al-Istannen
a848194601 Rename plaintext link option to "plaintext" 2021-05-25 17:15:13 +02:00
Joscha
aabce764ac Clean up TODOs 2021-05-25 15:54:01 +02:00
Joscha
5a331663e4 Rename functions for consistency 2021-05-25 15:49:06 +02:00
Joscha
40144f8bd8 Fix rule error messages 2021-05-25 15:47:09 +02:00
Joscha
f68849c65f Fix rules not being parsed entirely 2021-05-25 15:42:46 +02:00
Joscha
edb52a989e Print report even if exiting due to Ctrl+C 2021-05-25 15:35:36 +02:00
Joscha
980578d05a Avoid downloading in some cases
Depending on how on_conflict is set, we can determine a few situations where
downloading is never necessary.
2021-05-25 15:20:30 +02:00
I-Al-Istannen
486699cef3 Create anonymous TFA authenticator in ilias crawler
This ensures that *some* TFA authenticator is always present when
authenticating, even if none is specified in the config.

The TfaAuthenticator does not depend on any configured values, so it can
be created on-demand.
2021-05-25 15:11:52 +02:00
I-Al-Istannen
0096a0c077 Remove section and config parameter from Authenticator 2021-05-25 15:11:33 +02:00
I-Al-Istannen
d905e95dbb Allow invalidation of keyring authenticator 2021-05-25 15:02:35 +02:00
Joscha
61430c8739 Overhaul config and CLI option names 2021-05-25 14:23:38 +02:00
Joscha
eb8b915813 Fix path prefix on windows
Previously, the path prefix was only set if "windows_paths" was true, regardless
of OS. Now the path prefix is always set on Windows and never set on other OSes.
2021-05-25 14:23:38 +02:00
Joscha
22c2259adb Clean up authenticator exceptions
- Renamed to *Error for consistency
- Treating AuthError like CrawlError
2021-05-25 14:23:38 +02:00
Joscha
c15a1aecdf Rename keyring authenticator file for consistency 2021-05-25 14:20:26 +02:00
I-Al-Istannen
651b087932 Use cl/dl deduplication mechanism for ILIAS crawler 2021-05-25 12:15:38 +02:00
Joscha
bce3dc384d Deduplicate path names in crawler
Also rename files so they follow the restrictions for Windows file names if
we're on Windows.
2021-05-25 12:11:15 +02:00
I-Al-Istannen
c21ddf225b Add a CLI option to configure ILIAS links behaviour 2021-05-25 11:58:41 +02:00
I-Al-Istannen
4fefb98d71 Add a wrapper to pretty-print ValueErrors in argparse parsers 2021-05-25 11:57:59 +02:00
I-Al-Istannen
ffda4e43df Add extension to link files 2021-05-25 11:41:57 +02:00
I-Al-Istannen
69cb2a7734 Add Links option to ilias crawler
This allows you to configure what type the link files should have and
whether to create them at all.
2021-05-25 11:41:57 +02:00
I-Al-Istannen
85f89a7ff3 Interpret accordions and expandable headers as virtual folders
This allows us to find a file named "Test" in an accordion "Acc" as "Acc/Test".
2021-05-24 18:54:26 +02:00
I-Al-Istannen
9ce20216b5 Do not set a timeout for whole HTTP request
Downloads might take longer!
2021-05-24 18:54:26 +02:00
Joscha
86ba47541b Fix cookie loading and saving 2021-05-24 16:55:11 +02:00
I-Al-Istannen
492ec6a932 Detect and skip ILIAS tests 2021-05-24 16:36:15 +02:00
I-Al-Istannen
342076ee0e Handle exercise detail containers in ILIAS html parser 2021-05-24 16:22:51 +02:00
I-Al-Istannen
d44f6966c2 Log authentication attempts in HTTP crawler 2021-05-24 16:22:11 +02:00
Joscha
1c1f781be4 Reword some log messages 2021-05-24 13:17:28 +02:00
Joscha
c687d4a51a Implement cookie sharing 2021-05-24 13:10:44 +02:00
I-Al-Istannen
fca62541ca De-duplicate element names in ILIAS crawler
This prevents any conflicts caused by multiple files with the same name.
Conflicts may still arise due to transforms, but that is out of our
control and a user error.
2021-05-24 00:24:31 +02:00
I-Al-Istannen
3ab3581f84 Add timeout for HTTP connection 2021-05-23 23:41:05 +02:00
I-Al-Istannen
8dd0689420 Add keyring authentication to ILIAS CLI 2021-05-23 23:04:18 +02:00
Joscha
79be6e1dc5 Switch some other options to BooleanOptionalAction 2021-05-23 22:49:09 +02:00
Joscha
edbd92dbbf Add --status and --report flags 2021-05-23 22:41:59 +02:00
Joscha
27b5a8e490 Rename log.action to log.status 2021-05-23 22:40:33 +02:00
Joscha
1f400d5964 Implement BooleanOptionalAction 2021-05-23 22:26:59 +02:00
Joscha
0ca0680165 Simplify --version 2021-05-23 21:40:48 +02:00
Joscha
ce1dbda5b4 Overhaul colours
"Crawled" and "Downloaded" are now printed less bright than "Crawling" and
"Downloading" as they're not as important. Explain topics are printed in yellow
to stand out a bit more from the cyan action messages.
2021-05-23 21:33:04 +02:00
Joscha
9cce78669f Print report after all crawlers have finished 2021-05-23 21:17:13 +02:00
Joscha
6ca0ecdf05 Load and store reports 2021-05-23 20:46:29 +02:00
I-Al-Istannen
6e9f8fd391 Add a keyring authenticator 2021-05-23 19:44:12 +02:00
Joscha
2fdf24495b Restructure crawling and auth related modules 2021-05-23 19:16:42 +02:00
Joscha
bbf9f8f130 Add -C as alias for --crawler 2021-05-23 19:06:09 +02:00
I-Al-Istannen
37f8d84a9c Output total amount of http requests in HTTP Crawler 2021-05-23 19:00:01 +02:00
Joscha
5edd868d5b Fix always-smart redownloading the wrong files 2021-05-23 18:49:34 +02:00
Joscha
e4e5e83be6 Fix downloader using crawl bar
Looks like I made a dumb copy-paste error. Now the download bar shows the proper
progress and speed again.
2021-05-23 18:39:43 +02:00
Joscha
74c7b39dc8 Clean up files in alphabetical order 2021-05-23 18:39:25 +02:00
Joscha
445dffc987 Reword some explanations 2021-05-23 18:35:32 +02:00
I-Al-Istannen
d97d6bf147 Fix handling nested ILIAS folders 2021-05-23 18:29:28 +02:00
I-Al-Istannen
79efdb56f7 Adjust ILIAS html explain messages 2021-05-23 18:24:25 +02:00
Joscha
a9af56a5e9 Improve specifying crawlers via CLI
Instead of removing the sections of unselected crawlers from the config file,
crawler selection now happens in the Pferd after loading the crawlers and is
more sophisticated. It also has better error messages.
2021-05-23 18:18:50 +02:00
I-Al-Istannen
59f13bb8d6 Explain ILIAS HTML parsing and add some warnings 2021-05-23 18:14:54 +02:00
I-Al-Istannen
463f8830d7 Add warn_contd 2021-05-23 18:14:54 +02:00
I-Al-Istannen
05ad06fbc1 Only enclose get_page in iorepeat in ILIAS crawler
We previously also gathered in there, which could lead to some more
surprises when the method was retried.
2021-05-23 18:14:51 +02:00
Joscha
29d5a40c57 Replace asyncio.gather with custom Crawler function 2021-05-23 17:25:16 +02:00
Joscha
c0cecf8363 Log crawl and download actions more extensively 2021-05-23 16:25:44 +02:00
Joscha
b998339002 Fix cleanup logging of paths 2021-05-23 16:25:44 +02:00
Joscha
245c9c3dcc Explain output dir decisions and steps 2021-05-23 16:25:44 +02:00
I-Al-Istannen
d8f26a789e Implement CLI Command for ilias crawler 2021-05-23 13:30:42 +02:00
I-Al-Istannen
e1d18708b3 Rename "no_videos" to videos 2021-05-23 13:30:42 +02:00
Joscha
b44b49476d Fix noncritical and anoncritical decorators
I must've forgotten to update the anoncritical decorator when I last changed the
noncritical decorator. Also, every exception should make the crawler not
error_free, not just CrawlErrors.
2021-05-23 13:24:53 +02:00
Joscha
7e0bb06259 Clean up TODOs 2021-05-23 12:47:30 +02:00
I-Al-Istannen
ecdedfa1cf Add no-videos flag to ILIAS crawler 2021-05-23 12:37:01 +02:00
I-Al-Istannen
3d4b997d4a Retry crawl_url and work around Python's closure handling
Closures capture the scope and not the variables. Therefore, any
type-narrowing performed by mypy on captured variables is lost inside
the closure.
2021-05-23 12:28:15 +02:00
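
A small sketch of the closure behaviour mentioned in the commit above. The function names are hypothetical. The first function shows that a closure reads the variable's current value from the enclosing scope; the second shows a common workaround, binding the narrowed value to a new, non-optional name before defining the closure. Whether a given mypy version keeps the narrowing inside a nested function may vary.

```python
from typing import Callable, Optional

def late_binding() -> str:
    value = "first"

    def read() -> str:
        # Looks up 'value' in the enclosing scope when called.
        return value

    value = "second"
    return read()

def make_fetcher(url: Optional[str]) -> Callable[[], str]:
    if url is None:
        raise ValueError("url is required")
    # 'url' is narrowed to str here, but a type checker may treat it as
    # Optional[str] again inside 'fetch'. A fresh name avoids that.
    checked_url: str = url

    def fetch() -> str:
        return "GET " + checked_url

    return fetch

result = late_binding()
request_line = make_fetcher("https://example.com")()
```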
Joscha
e81005ae4b Fix CLI arguments 2021-05-23 12:24:21 +02:00
I-Al-Istannen
33a81a5f5c Document authentication in HTTP crawler and rename prepare_request 2021-05-23 11:55:34 +02:00
Joscha
25e2abdb03 Improve transformer explain wording 2021-05-23 11:45:14 +02:00
Joscha
803e5628a2 Clean up logging
Paths are now (hopefully) logged consistently across all crawlers
2021-05-23 11:37:19 +02:00
Joscha
c88f20859a Explain config file dumping 2021-05-23 11:04:50 +02:00
Joscha
ec3767c545 Create crawler base dir at start of crawl 2021-05-23 10:52:02 +02:00
Joscha
729ff0a4c7 Fix simple authenticator output 2021-05-23 10:45:37 +02:00
Joscha
6fe51e258f Number rules starting at 1 2021-05-23 10:45:37 +02:00
Joscha
44ecb2fbe7 Fix cleanup deleting crawler's base directory 2021-05-23 10:45:37 +02:00
I-Al-Istannen
53e031d9f6 Reuse dl/cl for I/O retries in ILIAS crawler 2021-05-23 00:28:27 +02:00
I-Al-Istannen
8ac85ea0bd Fix a few typos in HttpCrawler 2021-05-22 23:37:34 +02:00
I-Al-Istannen
adfdc302d7 Save cookies after successful authentication in HTTP crawler 2021-05-22 23:30:32 +02:00
I-Al-Istannen
3053278721 Move HTTP crawler to own file 2021-05-22 23:23:21 +02:00
I-Al-Istannen
4d07de0d71 Adjust forum log message in ilias crawler 2021-05-22 23:20:21 +02:00
I-Al-Istannen
953a1bba93 Adjust to new crawl / download names 2021-05-22 23:18:05 +02:00
Joscha
e724ff7c93 Fix normal arrow 2021-05-22 20:44:59 +00:00
Joscha
62f0f7bfc5 Explain crawling and partially explain downloading 2021-05-22 20:39:57 +00:00
Joscha
9cb2b68f09 Fix arrow parsing error messages 2021-05-22 20:39:29 +00:00
Joscha
1bbc0b705f Improve transformer error handling 2021-05-22 20:38:56 +00:00
Joscha
662191eca9 Fix crash as soon as first cl or dl token was acquired 2021-05-22 20:25:58 +00:00
Joscha
ae3d80664c Update local crawler to new crawler structure 2021-05-22 21:46:36 +02:00
Joscha
e21795ee35 Make file cleanup part of default crawler behaviour 2021-05-22 21:45:51 +02:00
Joscha
ec95dda18f Unify crawling and downloading steps
Now, the progress bar, limiter etc. for downloading and crawling are all handled
via the reusable CrawlToken and DownloadToken context managers.
2021-05-22 21:36:53 +02:00
Joscha
098ac45758 Remove deprecated repeat decorators 2021-05-22 21:13:25 +02:00
Joscha
9889ce6b57 Improve PFERD error handling 2021-05-22 21:13:25 +02:00
Joscha
b4d97cd545 Improve output dir and report error handling 2021-05-22 20:54:42 +02:00
Joscha
afac22c562 Handle abort in exclusive output state correctly
If the event loop is stopped while something holds the exclusive output, the
"log" singleton is now reset so the main thread can print a few more messages
before exiting.
2021-05-22 18:58:19 +02:00
Joscha
552cd82802 Run async input and password getters in daemon thread
Previously, it ran in the event loop's default executor, which would block until
all its workers were done working.

If Ctrl+C was pressed while input or a password was being read, the
asyncio.run() call in the main thread would be interrupted, but not the
input thread. This meant that multiple key presses (either enter or a second
Ctrl+C) were necessary to stop a running PFERD in some circumstances.

This change instead runs the input functions in daemon threads so they exit as
soon as the main thread exits.
2021-05-22 18:37:53 +02:00
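
The approach described in the commit above can be sketched as follows. This is a minimal illustration under the assumption that the blocking call is wrapped in a daemon thread and its result is handed back to the event loop through a future; `run_in_daemon_thread` is a hypothetical name, not the project's function.

```python
import asyncio
import threading

def _finish(future, setter, value):
    # Ignore the result if the awaiting task was cancelled in the meantime.
    if not future.done():
        setter(value)

async def run_in_daemon_thread(func):
    """Run a blocking function in a daemon thread and await its result."""
    loop = asyncio.get_running_loop()
    future = loop.create_future()

    def worker():
        try:
            result = func()
        except Exception as exc:
            loop.call_soon_threadsafe(_finish, future, future.set_exception, exc)
        else:
            loop.call_soon_threadsafe(_finish, future, future.set_result, result)

    # Daemon threads do not keep the interpreter alive on exit.
    threading.Thread(target=worker, daemon=True).start()
    return await future

answer = asyncio.run(run_in_daemon_thread(lambda: 42))
```

In real use, `func` would be something like `input` or `getpass.getpass`. Since the thread is a daemon, a pending prompt does not block the program from exiting once the main thread stops.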
Joscha
dfde0e2310 Improve reporting of unexpected exceptions 2021-05-22 18:36:25 +02:00
Joscha
54dd2f8337 Clean up main and improve error handling 2021-05-22 16:47:24 +02:00
Joscha
b5785f260e Extract CLI argument parsing to separate module 2021-05-22 15:03:45 +02:00
Joscha
98b8ca31fa Add some todos 2021-05-22 14:45:46 +02:00
I-Al-Istannen
4b104b6252 Try out some HTTP authentication handling
This is by no means final yet and will change a bit once the dl and cl
are changed, but it might serve as a first try. It is also wholly
untested.
2021-05-21 12:02:51 +02:00
I-Al-Istannen
83d12fcf2d Add some explains to ilias crawler and use crawler exceptions 2021-05-20 14:58:54 +02:00
I-Al-Istannen
e4f9560655 Only retry on aiohttp errors in ILIAS crawler
This patch removes quite a few retries and now only retries the ilias
element method. Every other HTTP-interacting method (except for the root
requests) is called from there and should be covered.

In the future we also want to retry the root a few times, but that
will be done after the download sink API is adjusted.
2021-05-19 22:01:09 +02:00
I-Al-Istannen
8cfa818f04 Only call should_crawl once 2021-05-19 21:57:55 +02:00
I-Al-Istannen
81301f3a76 Rename the ilias crawler to ilias web crawler 2021-05-19 21:41:17 +02:00
I-Al-Istannen
2976b4d352 Move ILIAS file templates to own file 2021-05-19 21:37:10 +02:00
I-Al-Istannen
9f03702e69 Split up ilias crawler in multiple files
The ilias crawler contained a crawler and an HTML parser, now they are
split in two.
2021-05-19 21:34:36 +02:00
Joscha
3300886120 Explain config file loading 2021-05-19 18:11:43 +02:00
Joscha
0d10752b5a Configure explain log level via cli and config file 2021-05-19 17:50:10 +02:00
Joscha
92886fb8d8 Implement --version flag 2021-05-19 17:33:36 +02:00
Joscha
5916626399 Make noqa comment more specific 2021-05-19 17:16:59 +02:00
Joscha
a7c025fd86 Implement reusable FileSinkToken for OutputDirectory 2021-05-19 17:16:23 +02:00
Joscha
b7a999bc2e Clean up crawler exceptions and (a)noncritical 2021-05-19 13:25:57 +02:00
Joscha
3851065500 Fix local crawler's download bars
Display the pure path instead of the local path.
2021-05-18 23:23:40 +02:00
Joscha
4b68fa771f Move logging logic to singleton
- Renamed module and class because "conductor" didn't make a lot of sense
- Used singleton approach (there's only one stdout after all)
- Redesigned progress bars (now with download speed!)
2021-05-18 22:45:19 +02:00
I-Al-Istannen
1525aa15a6 Fix link template error and use indeterminate progress bar 2021-05-18 22:40:28 +02:00
I-Al-Istannen
db1219d4a9 Create a link file in ILIAS crawler
This allows us to crawl links and represent them in the file system.
Users can choose between an ILIAS-imitation (that optionally
auto-redirects) and a plain text variant.
2021-05-17 21:44:54 +02:00
I-Al-Istannen
b8efcc2ca5 Respect filters in ILIAS crawler 2021-05-17 21:30:26 +02:00
Joscha
0bae009189 Run formatting tools 2021-05-16 14:32:53 +02:00
I-Al-Istannen
8b76ebb3ef Rename IliasCrawler to KitIliasCrawler 2021-05-16 13:28:06 +02:00
I-Al-Istannen
2b6235dc78 Fix pylint warnings (and 2 found bugs) in ILIAS crawler 2021-05-16 13:17:12 +02:00