Commit Graph

495 Commits

Author SHA1 Message Date
I-Al-Istannen
10d9d74528 Bail out when crawling recursive courses 2022-01-08 20:28:30 +01:00
I-Al-Istannen
43c5453e10 Correctly crawl files on desktop
The files on the desktop do not include a download link, so we need to
rewrite it.
2022-01-08 20:00:53 +01:00
I-Al-Istannen
eb4de8ae0c Ignore 1970 dates as windows crashes when calling .timestamp() 2022-01-08 18:14:43 +01:00
I-Al-Istannen
e32c1f000f Fix mtime for single streams 2022-01-08 18:05:48 +01:00
I-Al-Istannen
5f527bc697 Remove Python 3.9 Pattern typehints 2022-01-08 17:14:40 +01:00
I-Al-Istannen
ced8b9a2d0 Fix some accordions 2022-01-08 16:58:30 +01:00
I-Al-Istannen
6f3cfd4396 Fix personal desktop crawling 2022-01-08 16:58:15 +01:00
I-Al-Istannen
462d993fbc Fix local video path cache (hopefully) 2022-01-08 00:27:48 +01:00
I-Al-Istannen
a99356f2a2 Fix video stream extraction 2022-01-08 00:27:34 +01:00
I-Al-Istannen
eac2e34161 Fix is_logged_in for ILIAS 7 2022-01-07 23:32:31 +01:00
I-Al-Istannen
a82a0b19c2 Collect crawler warnings/errors and include them in the report 2021-11-07 21:48:55 +01:00
I-Al-Istannen
90cb6e989b Do not download single videos if cache does not exist 2021-11-06 23:21:15 +01:00
I-Al-Istannen
6289938d7c Do not stop crawling files when encountering a CrawlWarning 2021-11-06 12:09:51 +01:00
I-Al-Istannen
13b8c3d9c6 Add regex option to config and CLI parser 2021-11-02 09:30:46 +01:00
I-Al-Istannen
88afe64a92 Refactor IPD crawler a bit 2021-11-02 01:25:01 +00:00
Julius Rüberg
6b2a657573 Fix IPD crawler for different subpages (#42)
This patch reworks the IPD crawler to support subpages which do not use
"/intern" for links and fetches the folder names from table headings.
2021-11-02 01:25:01 +00:00
I-Al-Istannen
e42ab83d32 Add support for ILIAS cards 2021-10-30 18:13:44 +02:00
I-Al-Istannen
f9a3f9b9f2 Handle multi-stream videos 2021-10-30 18:12:29 +02:00
I-Al-Istannen
ef7d5ea2d3 Allow storing crawler-specific data in reports 2021-10-30 18:09:05 +02:00
lukasprobst
55ea304ff3 Disable interpolation of ConfigParser 2021-10-25 23:37:42 +02:00
I-Al-Istannen
6673077397 Add kit-ipd crawler 2021-10-21 13:20:21 +02:00
Joscha
742632ed8d Bump version to 3.2.0 2021-08-04 18:27:26 +00:00
Joscha
544d45cbc5 Catch non-critical exceptions at crawler top level 2021-07-13 15:42:11 +02:00
I-Al-Istannen
ee67f9f472 Sort elements by ILIAS id to ensure deterministic ordering 2021-07-06 17:45:48 +02:00
I-Al-Istannen
8ec3f41251 Crawl ilias booking objects as links 2021-07-06 16:15:25 +02:00
I-Al-Istannen
89be07d4d3 Use final crawl path in HTML parsing message 2021-07-03 17:05:48 +02:00
I-Al-Istannen
91200f3684 Fix nondeterministic name deduplication 2021-07-03 12:09:55 +02:00
Joscha
9ffd603357 Error when using multiple segments with -name->
Previously, PFERD just silently never matched the -name-> arrow. Now, it errors
when loading the config file.
2021-07-01 11:14:50 +02:00
Joscha
80eeb8fe97 Add --skip option 2021-07-01 11:02:21 +02:00
Joscha
75fde870c2 Bump version to 3.1.0 2021-06-13 17:23:18 +02:00
I-Al-Istannen
6e4d423c81 Crawl all video stages in one crawl bar
This ensures folders are not renamed, as they are crawled twice
2021-06-13 17:18:45 +02:00
Joscha
57aef26217 Fix name arrows
I seem to have (re-)implemented them incorrectly and never tested them.
2021-06-13 16:33:29 +02:00
I-Al-Istannen
70ec64a48b Fix wrong base URL for multi-stage pages 2021-06-13 15:44:47 +02:00
Joscha
61d902d715 Overhaul transform logic
-re-> arrows now rename their parent directories (like -->) and don't require a
full match (like -exact->). Their old behaviour is available as -exact-re->.

Also, this change adds the ">>" arrow head, which modifies the current path and
continues to the next rule when it matches.
2021-06-09 22:45:52 +02:00
I-Al-Istannen
8ab462fb87 Use the exercise label instead of the button name as path 2021-06-04 19:24:23 +02:00
Joscha
df3ad3d890 Add 'skip' option to crawlers 2021-06-04 18:47:13 +02:00
Joscha
fc31100a0f Always use '/' as path separator for regex rules
Previously, regex-matching paths on windows would, in some cases, require four
backslashes ('\\\\') to escape a single path separator. That's just too much.

With this commit, regex transforms now use '/' instead of '\' as path separator,
meaning rules can more easily be shared between platforms (although they are not
guaranteed to be 100% compatible since on Windows, '\' is still recognized as a
path separator).

To make rules more intuitive to write, local relative paths are now also printed
with '/' as path separator on Windows. Since Windows also accepts '/' as path
separator, this change doesn't really affect other rules that parse their sides
as paths.
2021-06-04 18:12:45 +02:00
Joscha
31b6311e99 Remove incorrect tmp file explain message 2021-06-01 19:03:06 +02:00
Joscha
85b9f45085 Bump version to 3.0.1 2021-06-01 09:49:30 +00:00
Joscha
f656e3ff34 Fix credential parsing 2021-06-01 09:18:17 +00:00
Joscha
e1bda94329 Load credential file from correct path 2021-06-01 09:18:08 +00:00
Joscha
f6b26f4ead Fix unexpected exception when credential file not found 2021-06-01 09:10:58 +00:00
Joscha
722970a255 Store cookies in text-based format
Using the stdlib's http.cookie module, cookies are now stored as one
"Set-Cookie" header per line. Previously, the aiohttp.CookieJar's save() and
load() methods were used (which use pickling).
2021-05-31 20:18:20 +00:00
Joscha
f40820c41f Warn if using concurrent tasks with kit-ilias-web 2021-05-31 20:18:20 +00:00
Joscha
49ad1b6e46 Clean up authenticator code formatting 2021-05-31 18:45:06 +02:00
Joscha
1ce32d2f18 Add CLI option for credential file auth to kit-ilias-web 2021-05-31 18:45:06 +02:00
Joscha
9d5ec84b91 Add credential file authenticator 2021-05-31 18:33:34 +02:00
I-Al-Istannen
1fba96abcb Fix exercise date parsing for non-group submissions
ILIAS apparently changes the order of the fields as it sees fit, so we
now try to parse *every* column, starting at from the right, as a date.
The first column that parses successfully is then used.
2021-05-31 18:15:12 +02:00
Joscha
7b062883f6 Use raw paths for --debug-transforms
Previously, the already-transformed paths were used, which meant that
--debug-transforms was cumbersome to use (as you had to remove all transforms
and crawl once before getting useful results).
2021-05-31 12:33:37 +02:00
Joscha
64a2960751 Align paths in status messages and progress bars
Also print "Ignored" when paths are ignored due to transforms
2021-05-31 12:32:42 +02:00
Joscha
17879a7f69 Print box around message for unexpected exceptions 2021-05-31 12:05:49 +02:00
Joscha
1dd24551a5 Add link to repo in --version output 2021-05-31 11:44:17 +02:00
Joscha
84f775013f Use event loop workaround only on windows
This avoids an unnecessary one-second sleep on other platforms. However, a
better "fix" for this sleep would be a less ugly workaround on windows.
2021-05-31 11:41:52 +02:00
I-Al-Istannen
1ca6740e05 Improve log messages when parsing ILIAS HTML
Previously some logs were split around an "await", which isn't a great
idea.
2021-05-27 17:59:22 +02:00
Joscha
474aa7e1cc Use sorted path order when debugging transforms 2021-05-27 15:41:00 +00:00
I-Al-Istannen
5beb4d9a2d Fix renaming conflict with multi-stage video elements 2021-05-27 15:41:00 +02:00
I-Al-Istannen
19eed5bdff Fix authentication logic conflicts with videos 2021-05-27 15:41:00 +02:00
Joscha
6fa9cfd4c3 Fix error when capturing group is None 2021-05-27 15:41:00 +02:00
Joscha
80acc4b50d Implement new name arrows 2021-05-27 13:43:02 +02:00
Joscha
533f75ea71 Add --debug-transforms flag 2021-05-26 11:37:32 +02:00
Joscha
adb5d4ade3 Print files that are *not* deleted by cleanup
These are files that are not present on the remote source any more, but still
present locally. They also show up in the report.
2021-05-26 10:58:19 +02:00
Joscha
a879c6ab6e Fix function being printed 2021-05-26 10:54:01 +02:00
Joscha
915e42fd07 Fix report not being printed if pferd exits normally 2021-05-26 10:53:54 +02:00
I-Al-Istannen
2d8dcc87ff Send CSRF token in TFA request 2021-05-25 22:50:40 +02:00
I-Al-Istannen
66f0e398a1 Await result in tfa authenticate path 2021-05-25 19:19:51 +02:00
Joscha
30be4e29fa Add workaround for RuntimeError after program finishes on Windows 2021-05-25 16:34:22 +00:00
I-Al-Istannen
263780e6a3 Use certifi to ensure CA certificates are bundled in pyinstaller 2021-05-25 18:24:06 +02:00
Joscha
07a75a37c3 Fix FileNotFoundError on Windows 2021-05-25 15:57:03 +00:00
Joscha
f85b75df8c Switch from exit() to sys.exit()
Pyinstaller doesn't recognize exit().
2021-05-25 17:33:38 +02:00
Joscha
519a7ef435 Split --dump-config into two options
--dump-config with its optional argument tended to consume the command name, so
it had to be split up.
2021-05-25 17:17:35 +02:00
I-Al-Istannen
a848194601 Rename plaintext link option to "plaintext" 2021-05-25 17:15:13 +02:00
Joscha
aabce764ac Clean up TODOs 2021-05-25 15:54:01 +02:00
Joscha
5a331663e4 Rename functions for consistency 2021-05-25 15:49:06 +02:00
Joscha
40144f8bd8 Fix rule error messages 2021-05-25 15:47:09 +02:00
Joscha
f68849c65f Fix rules not being parsed entirely 2021-05-25 15:42:46 +02:00
Joscha
edb52a989e Print report even if exiting due to Ctrl+C 2021-05-25 15:35:36 +02:00
Joscha
980578d05a Avoid downloading in some cases
Depending on how on_conflict is set, we can determine a few situations where
downloading is never necessary.
2021-05-25 15:20:30 +02:00
I-Al-Istannen
486699cef3 Create anonymous TFA authenticator in ilias crawler
This ensures that *some* TFA authenticator is always present when
authenticating, even if none is specified in the config.

The TfaAuthenticator does not depend on any configured values, so it can
be created on-demand.
2021-05-25 15:11:52 +02:00
I-Al-Istannen
0096a0c077 Remove section and config parameter from Authenticator 2021-05-25 15:11:33 +02:00
I-Al-Istannen
d905e95dbb Allow invalidation of keyring authenticator 2021-05-25 15:02:35 +02:00
Joscha
61430c8739 Overhaul config and CLI option names 2021-05-25 14:23:38 +02:00
Joscha
eb8b915813 Fix path prefix on windows
Previously, the path prefix was only set if "windows_paths" was true, regardless
of OS. Now the path prefix is always set on windows and never set on other OSes.
2021-05-25 14:23:38 +02:00
Joscha
22c2259adb Clean up authenticator exceptions
- Renamed to *Error for consistency
- Treating AuthError like CrawlError
2021-05-25 14:23:38 +02:00
Joscha
c15a1aecdf Rename keyring authenticator file for consistency 2021-05-25 14:20:26 +02:00
I-Al-Istannen
651b087932 Use cl/dl deduplication mechanism for ILIAS crawler 2021-05-25 12:15:38 +02:00
Joscha
bce3dc384d Deduplicate path names in crawler
Also rename files so they follow the restrictions for windows file names if
we're on windows.
2021-05-25 12:11:15 +02:00
I-Al-Istannen
c21ddf225b Add a CLI option to configure ILIAS links behaviour 2021-05-25 11:58:41 +02:00
I-Al-Istannen
4fefb98d71 Add a wrapper to pretty-print ValueErrors in argparse parsers 2021-05-25 11:57:59 +02:00
I-Al-Istannen
ffda4e43df Add extension to link files 2021-05-25 11:41:57 +02:00
I-Al-Istannen
69cb2a7734 Add Links option to ilias crawler
This allows you to configure what type the link files should have and
whether to create them at all.
2021-05-25 11:41:57 +02:00
I-Al-Istannen
85f89a7ff3 Interpret accordions and expandable headers as virtual folders
This allows us to find a file named "Test" in an accordion "Acc" as "Acc/Test".
2021-05-24 18:54:26 +02:00
I-Al-Istannen
9ce20216b5 Do not set a timeout for whole HTTP request
Downloads might take longer!
2021-05-24 18:54:26 +02:00
Joscha
86ba47541b Fix cookie loading and saving 2021-05-24 16:55:11 +02:00
I-Al-Istannen
492ec6a932 Detect and skip ILIAS tests 2021-05-24 16:36:15 +02:00
I-Al-Istannen
342076ee0e Handle exercise detail containers in ILIAS html parser 2021-05-24 16:22:51 +02:00
I-Al-Istannen
d44f6966c2 Log authentication attempts in HTTP crawler 2021-05-24 16:22:11 +02:00
Joscha
1c1f781be4 Reword some log messages 2021-05-24 13:17:28 +02:00
Joscha
c687d4a51a Implement cookie sharing 2021-05-24 13:10:44 +02:00
I-Al-Istannen
fca62541ca De-duplicate element names in ILIAS crawler
This prevents any conflicts caused by multiple files with the same name.
Conflicts may still arise due to transforms, but that is out of our
control and a user error.
2021-05-24 00:24:31 +02:00
I-Al-Istannen
3ab3581f84 Add timeout for HTTP connection 2021-05-23 23:41:05 +02:00