Java EE Timer-based JAX-RS client to cloud file store application to facilitate analysis of text content obtained from APIs (proof of concept).
Source code: https://github.com/UCL/wsclient-cloudstore-poc
Background
Context
Currently, data collection relies on physical and virtual infrastructure provided by CS UCL. These are Linux machines where components for monitoring and logging are custom-made or missing. Also, some of the business logic relies on CRON tasks. Given the variety of cloud-based resources and platforms available, this proof-of-concept implements a very simple process: periodically retrieve JSON data from a public REST API and save it in a cloud storage facility, with the business logic carried out entirely by the application.
Objectives
- Develop a deployable artifact as a proof-of-concept of data collection on the cloud
- The application should run on PaaS, on IaaS such as a cloud server, or locally on one of the servers in CS UCL
- The application must be configurable to allow users to use their own credentials
Resources and platforms
Data sources - Twitter API
This proof-of-concept application is designed to retrieve JSON data from a REST endpoint such as Twitter and Instagram APIs.
NOTE: This implementation does not support the use of "Streaming" APIs.
Twitter’s standard search API is used by the application. While not meant to be an exhaustive source of tweets, it offers search terms and operators that can be combined with geolocation parameters and language, as well as date limits within the last 7-day range. This makes it possible to run a query that extracts the tweets created on one specific date.
- Resource URL
For more information on Twitter Standard search API:
PaaS - Heroku
While the wsclient-app module can be deployed in an existing installation of a Java EE 8 server like Glassfish 5, this proof-of-concept uses the public PaaS Heroku.
Heroku uses dynos, lightweight Linux application containers pre-configured for specific process types. The free account allows 1 worker/1 web process types.
This proof-of-concept uses a worker dyno.
For more information on Heroku https://devcenter.heroku.com/articles/how-heroku-works
Deployment with Git
Heroku allows the deployment of applications from sources in an existing GitHub repository. With its CLI, the process would be:
heroku login
git clone git@github.com:UCL/wsclient-cloudstore-poc.git
cd wsclient-cloudstore-poc
heroku create wsclient-cloudstore-poc-ly4y5z
The last command creates an instance on Heroku with its own Git repository, which is added as a remote to the user's local clone:
$ git remote -v
heroku https://git.heroku.com/wsclient-cloudstore-poc-ly4y5z.git (fetch)
heroku https://git.heroku.com/wsclient-cloudstore-poc-ly4y5z.git (push)
origin git@github.com:UCL/wsclient-cloudstore-poc.git (fetch)
origin git@github.com:UCL/wsclient-cloudstore-poc.git (push)
Deployment of the application is done with:
git push heroku master
This instructs Heroku to build the application using Maven, unless Maven Wrapper is detected, in which case mvnw will be used instead.
It is possible to run the application with Heroku locally, for testing before pushing to the heroku remote:
heroku local
Deployment with Heroku Maven Plugin
It is also possible to build and package the application locally and deploy the artifacts on Heroku using the Maven plugin. This method is also recommended when deploying from a CI server like Travis CI.
The application module wsclient-main is configured to use this plugin in its pom.xml:
<plugin>
  <groupId>com.heroku.sdk</groupId>
  <artifactId>heroku-maven-plugin</artifactId>
  <version>${plu.heroku.version}</version>
  <configuration>
    <appName>wsclient-cloudstore-poc-ly4y5z</appName>
    <processTypes>
      <worker>sh target/app/bin/glassfish5</worker>
    </processTypes>
  </configuration>
</plugin>
Then the application can be deployed with:
./mvnw clean heroku:deploy -pl wsclient-main
UCL Office365 OneDrive
The cloud-based Office 365 suite used by UCL includes OneDrive, which can be accessed through a REST API. This API is part of Microsoft Graph, a common API for Microsoft resources.
The endpoint used in this proof-of-concept application is:
PUT /me/drive/items/{parent-id}:/{filename}:/content
where {parent-id} could point to the personal drive root or a shared drive.
A successful upload (PUT request) returns a 201 HTTP code (created).
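As an illustration, the upload request could be assembled as in the minimal sketch below. To stay self-contained it uses the JDK's java.net.http API (Java 11+) rather than the JAX-RS client the application relies on. The class and method names, the Graph base URL, the Content-Type value and the bearer-token header are assumptions that do not come from the project sources.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class OneDriveUploadSketch {

    // Assumed Microsoft Graph base URL; it is not stated in this document.
    static final String GRAPH_BASE = "https://graph.microsoft.com/v1.0";

    /** Builds the URI for PUT /me/drive/items/{parent-id}:/{filename}:/content. */
    static URI uploadUri(String parentId, String filename) {
        // URLEncoder produces form encoding, so '+' is mapped back to '%20' for a path segment.
        String encodedName = URLEncoder.encode(filename, StandardCharsets.UTF_8).replace("+", "%20");
        return URI.create(GRAPH_BASE + "/me/drive/items/" + parentId + ":/" + encodedName + ":/content");
    }

    /** Builds (but does not send) the PUT request that carries the ZIP bytes. */
    static HttpRequest uploadRequest(String parentId, String filename, byte[] zipBytes, String accessToken) {
        return HttpRequest.newBuilder(uploadUri(parentId, filename))
                .header("Authorization", "Bearer " + accessToken) // hypothetical token handling
                .header("Content-Type", "application/zip")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(zipBytes))
                .build();
    }

    /** The upload is treated as successful when the service answers 201 (Created). */
    static boolean isCreated(int statusCode) {
        return statusCode == 201;
    }
}
```

The request would then be sent with an HttpClient, and the response status passed to isCreated to decide whether the upload succeeded.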
For more information on OneDrive API: https://docs.microsoft.com/en-gb/onedrive/developer/
Constraints
When using Heroku with a free account, the application must run in the following environment:
- RAM 512MB
- Compressed size of application sources 500MB
- Compilation limited to 15 min
- 1 worker/1 web max
Core components
wsclient-app
The wsclient-app application is an EJB module based on Java EE 8 technologies (Glassfish 5):
- EJB Timer
- CDI events
- JSON-B and JSON-P
- JAX-RS client (NB: at the moment it is implemented using Jersey-specific features)
Business logic
The complete process should be transactional and must complete the following tasks:
- Download JSON data from public REST endpoints such as Twitter Search API
- Package the serialised JSON data in a temporary compressed ZIP file
- Upload to UCL OneDrive using a MS Graph REST endpoint
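The packaging step in the list above can be sketched with the JDK's java.util.zip classes. This is only an illustration under stated assumptions: the class name, the method name, the temporary file prefix and the entry name passed by the caller are hypothetical and are not taken from the project sources.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class JsonZipSketch {

    /** Writes the serialised JSON as a single entry of a temporary ZIP file and returns its path. */
    static Path zipJson(String json, String entryName) throws IOException {
        // Hypothetical prefix; the file is created in the default temporary directory.
        Path zip = Files.createTempFile("wsclient-", ".zip");
        try (OutputStream out = Files.newOutputStream(zip);
             ZipOutputStream zos = new ZipOutputStream(out, StandardCharsets.UTF_8)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(json.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return zip;
    }
}
```

The caller is responsible for deleting the temporary file once the upload has finished.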
The process is orchestrated by the EJB Timer service. In this proof of concept, the interval is 1 hour. Once the timeout has been reached, a CDI event with a qualifier specific to a REST client is fired.
The complete process is represented in the following sequence diagram:
Since the source endpoint is queried to return data for a specific day, the application could store a flag in a cache object to indicate when the transaction has completed successfully, preventing the REST client from calling the source endpoint repeatedly every hour. This is represented in the following activity diagram:
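One possible shape for such a flag is sketched below in plain Java. It is not the application's implementation: the class name, the method names and the use of an in-memory set keyed by date are assumptions made for illustration.

```java
import java.time.LocalDate;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

final class CompletedDaysSketch {

    // Days for which the whole download/zip/upload transaction has already succeeded.
    private final Set<LocalDate> completed = ConcurrentHashMap.newKeySet();

    /**
     * Runs the transfer for the given day unless it is already flagged as completed.
     * The transfer returns true on success; only then is the day flagged.
     * Returns true if the transfer was attempted, false if it was skipped.
     */
    boolean runIfPending(LocalDate day, Predicate<LocalDate> transfer) {
        if (completed.contains(day)) {
            return false; // already done: do not call the source endpoint again
        }
        if (transfer.test(day)) {
            completed.add(day);
        }
        return true;
    }

    boolean isCompleted(LocalDate day) {
        return completed.contains(day);
    }
}
```

With this shape, each hourly timeout would call runIfPending for the target day; a failed attempt leaves the day pending so that the next timeout retries it.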
wsclient-main
The wsclient-main module allows the deployment of wsclient-app in environments where a Java EE 8 server is not available. wsclient-main uses Glassfish 5 embedded to run wsclient-app. The build process generates a shell script that can be used to configure and run the application in standalone mode in Glassfish embedded.
sh wsclient-main/target/app/bin/glassfish5
This script allows the deployment of wsclient-app in Heroku as a worker dyno, as defined in the Procfile:
worker: sh wsclient-main/target/app/bin/glassfish5
Data model
Query parameters
The application uses the following base parameters to call the Twitter Standard search API endpoint /1.1/search/tweets.json:
| result_type | recent |
| lang | en |
| include_entities | false |
| include_user_entities | false |
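A minimal sketch of how these base parameters could be combined with the search terms into a request path follows. The class and method names are hypothetical, and the q parameter name comes from Twitter's search API rather than from this document.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class SearchQuerySketch {

    static final String PATH = "/1.1/search/tweets.json";

    /** Base parameters listed in the table above, in insertion order. */
    static Map<String, String> baseParams() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("result_type", "recent");
        params.put("lang", "en");
        params.put("include_entities", "false");
        params.put("include_user_entities", "false");
        return params;
    }

    /** Returns the path and query string for the given search terms. */
    static String pathAndQuery(String terms) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", terms); // assumed name of the search-terms parameter
        params.putAll(baseParams());
        return PATH + "?" + params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

For example, SearchQuerySketch.pathAndQuery("flu") yields /1.1/search/tweets.json?q=flu&result_type=recent&lang=en&include_entities=false&include_user_entities=false.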
Query terms
At the moment the application is configured to search for the word "flu". This query string can be refined by adding more words and operators to a maximum of 500 characters including operators.
NOTE: Information on standard search operators is available at https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators
Location
The application is set to the geocode of the Met Office station in Islington with a 10mi radius. This leaves open the option of adding other geolocated datasets in the future, such as the Met Office’s.
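Twitter's standard search API expects the geocode value in the form latitude,longitude,radius. The sketch below only shows how such a value could be formatted; the class and method names are hypothetical, and the coordinates in the usage note are placeholders, not the actual location of the Met Office station.

```java
import java.util.Locale;

final class GeocodeSketch {

    /** Formats a geocode value such as "51.5000,-0.1000,10mi". */
    static String geocode(double latitude, double longitude, int radiusMiles) {
        // Locale.ROOT keeps '.' as the decimal separator regardless of the server locale.
        return String.format(Locale.ROOT, "%.4f,%.4f,%dmi", latitude, longitude, radiusMiles);
    }
}
```

Calling GeocodeSketch.geocode(51.5, -0.1, 10) with these placeholder coordinates returns 51.5000,-0.1000,10mi.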
Time frame
Twitter Standard search API allows access to tweets created in the last 7 days. This limit can be used as a start date for queries, allowing the application to retrieve JSON data for a specific date, provided it falls within the 7-day window. This is done with the until query parameter.
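The window check and the until value can be sketched with java.time. Twitter documents until as returning tweets created before the given date (formatted YYYY-MM-DD), so this sketch passes the day after the target date; whether the application does the same is an assumption, as are the class name, the method names and the exact treatment of the window boundary.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class SearchWindowSketch {

    static final int WINDOW_DAYS = 7;

    /** True if the target date is not in the future and not older than the 7-day window. */
    static boolean isWithinWindow(LocalDate target, LocalDate today) {
        return !target.isAfter(today) && !target.isBefore(today.minusDays(WINDOW_DAYS));
    }

    /** Returns the value for the until parameter that bounds the results to the target date. */
    static String untilFor(LocalDate target, LocalDate today) {
        if (!isWithinWindow(target, today)) {
            throw new IllegalArgumentException("Target date outside the 7-day window: " + target);
        }
        // until is an exclusive upper bound, hence the day after the target.
        return target.plusDays(1).format(DateTimeFormatter.ISO_LOCAL_DATE);
    }
}
```

Note that until only sets the upper bound; restricting results to a single day also needs a lower bound, which this sketch does not cover.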
JSON object
Here’s an example of the structure of the JSON response provided by Twitter Standard search API:
It could be beneficial to filter out elements such as user from the JSON response before serialising and uploading to OneDrive. This will help reduce the size of the files.
Data flow
The following diagram shows the overall flow of data and boundaries of components deployed in the cloud and components within CS UCL. If the cloud components help with the data collection and processing, the next step would be to develop an API (in Python) to download files from OneDrive that can be used by members of the team.
Along with a database, it would be beneficial to explore object storage for archival, instead of the current file storage.
Future features
Open to discussion:
- Configuration of query parameters, tokens and OneDrive id via ejb-jar.xml
- Configure logging system to send messages/notifications to a server within UCL CS. Alternatively, Outlook API could potentially be used to send email notifications.
- Python API to download files from OneDrive