Develop a tool that reads in the web logs of an ERDDAP server to analyse how the server is being used. This would include:

  • Filtering out bots/crawlers/spam
  • Analysing and visualising which datasets receive the most requests
  • Examining geographical/temporal distribution of users
  • Investigating which user agents are making requests (browsers, erddapy, other ERDDAP servers, etc.); see the parsing sketch after this list

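As a rough illustration of the parsing these analyses rest on, the sketch below pulls the IP address, timestamp, request path, and user agent out of one line in the "combined" access-log format that both Apache/Tomcat and nginx can produce. The regex and field names here are illustrative assumptions, not the erddaplogs implementation:

```python
import re

# Typical "combined" access-log format shared by Apache and nginx:
# IP - - [timestamp] "METHOD path HTTP/x.x" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line: str) -> dict | None:
    """Return the fields of one access-log line, or None if it doesn't match."""
    match = COMBINED_RE.match(line)
    return match.groupdict() if match else None

example = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /erddap/tabledap/myDataset.csv HTTP/1.1" 200 5120 "-" "python-requests/2.31"'
)
print(parse_line(example)["path"])  # /erddap/tabledap/myDataset.csv
```

Named groups keep the downstream analysis readable: each field of interest is addressed by name rather than position.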
Initially this will be a static tool that is run on the Tomcat/nginx logs of an ERDDAP server. The next step will be to run it alongside the ERDDAP server, perhaps with a tool like Prometheus.

Expected Outcomes:

A Python-based tool that ERDDAP admins can use to quickly and easily establish how data from their server are being used.

This tool will be rapidly iterated on by the community (current ERDDAP operators). Once the tool is stable, it may be integrated into core ERDDAP, or changes may be made to core ERDDAP so the tool can be used more easily.

Skills required:

  • Python
  • Some knowledge of servers/web logs useful
  • Access to logs from an active ERDDAP server useful

Difficulty:

Novice

Relevant links:

Work in progress: https://github.com/callumrollo/erddaplogs

Functioning Prototype

Workflow

  1. Read in Apache and nginx logs and combine them into one consistent dataframe (first sketch below)
  2. Find the IPs that made the greatest number of requests and fetch their location info from ip-api.com (second sketch below)
  3. Remove suspected spam/bot requests (third sketch below)
  4. Perform basic analysis to graph the number of requests and users over time, the most popular datasets/datatypes, the geographic distribution of users, etc. (fourth sketch below)
  5. Output anonymized data for sharing via ERDDAP (fifth sketch below)
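For step 1, one plausible approach (a sketch under assumptions, not necessarily what erddaplogs does) is to run each log file through a line parser like the earlier `parse_line` sketch and stack the results into a single pandas DataFrame:

```python
from pathlib import Path

import pandas as pd

def logs_to_dataframe(log_dir: str, pattern: str = "*.log") -> pd.DataFrame:
    """Parse every matching log file under log_dir into one DataFrame.

    Assumes a parse_line() helper like the earlier sketch; lines that
    fail to parse are dropped.
    """
    rows = []
    for log_file in sorted(Path(log_dir).glob(pattern)):
        with open(log_file, encoding="utf-8", errors="replace") as f:
            rows.extend(fields for fields in map(parse_line, f) if fields)
    df = pd.DataFrame(rows)
    # Access-log timestamps look like "10/Oct/2023:13:55:36 +0000"
    df["time"] = pd.to_datetime(df["time"], format="%d/%b/%Y:%H:%M:%S %z")
    return df
```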
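For step 2, ip-api.com offers a batch JSON endpoint alongside its single-IP one; the free tier is HTTP-only and rate-limited, so results should be cached and only the heaviest requesters looked up. A minimal sketch:

```python
import requests

def lookup_ips(ips: list[str]) -> list[dict]:
    """Geolocate IPs via ip-api.com's batch endpoint (max 100 per call)."""
    info = []
    for start in range(0, len(ips), 100):
        resp = requests.post("http://ip-api.com/batch", json=ips[start:start + 100])
        resp.raise_for_status()
        info.extend(resp.json())
    return info

# df comes from the step-1 sketch; look up only the 50 busiest IPs
# to stay well inside the free-tier rate limit.
top_ips = df["ip"].value_counts().head(50).index.tolist()
ip_info = pd.DataFrame(lookup_ips(top_ips))  # country, city, isp, ... per the API docs
```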
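For step 3, a simple first-pass heuristic (one of many possible rules, assumed here rather than taken from the project) is to drop rows whose user-agent string contains common crawler keywords:

```python
import pandas as pd

# Case-insensitive substrings that commonly identify crawlers.
BOT_KEYWORDS = ("bot", "crawler", "spider", "scraper")

def drop_suspected_bots(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows whose user agent looks like a crawler, keeping genuine
    clients such as browsers and erddapy."""
    agent = df["agent"].fillna("").str.lower()
    return df[~agent.str.contains("|".join(BOT_KEYWORDS))]
```

Keyword filtering misses bots that spoof browser user agents, so a per-IP request-rate threshold is a useful complement.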
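For step 4, daily request counts and the most-requested datasets fall out of short pandas/matplotlib recipes. The dataset-ID regex below assumes standard ERDDAP request paths of the form /erddap/tabledap/<datasetID>.<filetype>:

```python
import matplotlib.pyplot as plt

# Requests per day (df from the earlier sketches, with a parsed "time" column)
df.set_index("time").resample("D").size().plot(title="Requests per day")
plt.ylabel("requests")
plt.savefig("requests_per_day.png")

# Top datasets: pull the dataset ID out of griddap/tabledap request paths
plt.figure()
dataset_ids = df["path"].str.extract(r"/erddap/(?:grid|table)dap/([^./?]+)")[0]
dataset_ids.value_counts().head(10).plot.barh(title="Top 10 datasets")
plt.savefig("top_datasets.png")
```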
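For step 5, one common anonymization approach (assumed here; the project may choose differently) is to replace each IP with a salted hash, so per-user request counts survive in the shared data but identities do not:

```python
import hashlib
import secrets

# Fresh salt per export, so hashes can't be joined across published files.
SALT = secrets.token_hex(16)

def anonymize_ip(ip: str) -> str:
    """Map an IP address to a short salted hash."""
    return hashlib.sha256((SALT + ip).encode()).hexdigest()[:12]

df["ip"] = df["ip"].map(anonymize_ip)
df.to_csv("erddap_usage_anonymized.csv", index=False)
```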