Java EE Timer-based JAX-RS client to cloud file store application to facilitate analysis of text content obtained from APIs (proof of concept).

DRAFT

This document is only a rough draft. These recommendations are subject to change.

Background

Context

Currently, the data collection relies on physical and virtual infrastructure provided by CS UCL. These are Linux machines where components for monitoring and logging are custom-made or missing, and some of the business logic relies on CRON tasks. Given the variety of cloud-based resources and platforms available, this proof of concept implements a very simple process: periodically retrieve JSON data from a public REST API and save it to a cloud storage facility, with the business logic carried out completely by the application.

Objectives

  • Develop a deployable artifact as a proof-of-concept of data collection on the cloud

  • The application should run on a PaaS, on IaaS such as a cloud server, or locally on one of the servers in CS UCL

  • The application must be configurable to allow users to use their own credentials

Resources and platforms

Data sources - Twitter API

This proof-of-concept application is designed to retrieve JSON data from a REST endpoint such as the Twitter and Instagram APIs.

This implementation does not support the use of "Streaming" APIs.

Twitter’s standard search API is used by the application. While not meant to be an exhaustive source of tweets, it offers search terms and operators that can be combined with geolocation and language parameters, as well as date limits within the last 7 days. This allows running a query to extract the tweets created on one specific date.

For more information, refer to the Twitter Standard search API documentation.

PaaS - Heroku

While the wsclient-app module can be deployed in an existing installation of a Java EE 8 server such as Glassfish 5, this proof-of-concept uses the public PaaS Heroku.

Heroku uses dynos, lightweight Linux application containers pre-configured for specific process types. The free account allows 1 worker/1 web process types. This proof-of-concept uses a worker dyno.

Deployment with Git

Heroku allows the deployment of applications from the sources in an existing GitHub repository. With its CLI, the process would be:

heroku login
git clone git@github.com:UCL/wsclient-cloudstore-poc.git
cd wsclient-cloudstore-poc
heroku create wsclient-cloudstore-poc-ly4y5z

The last command creates an application instance on Heroku with its own Git repository, which is added as a remote (heroku) to the user's local clone:

$ git remote -v
heroku	https://git.heroku.com/wsclient-cloudstore-poc-ly4y5z.git (fetch)
heroku	https://git.heroku.com/wsclient-cloudstore-poc-ly4y5z.git (push)
origin	git@github.com:UCL/wsclient-cloudstore-poc.git (fetch)
origin	git@github.com:UCL/wsclient-cloudstore-poc.git (push)

Deployment of the application is done with:

git push heroku master

This instructs Heroku to build the application using Maven, unless Maven Wrapper is detected, in which case mvnw will be used instead.

It is possible to run the application locally with the Heroku CLI, for testing before pushing to the heroku remote:

heroku local

Deployment with Heroku Maven Plugin

It is also possible to build and package the application locally and deploy the artifacts on Heroku using the Maven plugin. This method is also recommended when deploying from a CI server like Travis CI.

The application module wsclient-main is configured to use this plugin in its pom.xml:

<plugin>
  <groupId>com.heroku.sdk</groupId>
  <artifactId>heroku-maven-plugin</artifactId>
  <version>${plu.heroku.version}</version>
  <configuration>
    <appName>wsclient-cloudstore-poc-ly4y5z</appName>
    <processTypes>
      <worker>sh target/app/bin/glassfish5</worker>
    </processTypes>
  </configuration>
</plugin>

Then the application can be deployed with:

./mvnw clean heroku:deploy -pl wsclient-main

OPENSHIFT

In addition to Heroku, OpenShift is another PaaS system that should be considered. OpenShift is a RedHat product based on Docker and Kubernetes available in 3 configurations: hosted (online), dedicated (public cloud) and on-premise. This last option could be worth exploring for a local installation within CS UCL.

For more information on OpenShift https://www.openshift.com/

UCL Office365 OneDrive

The cloud-based Office 365 suite used by UCL includes OneDrive, which can be accessed through a REST API. This API is part of Microsoft Graph, a common API for Microsoft resources.

The endpoint used in this proof-of-concept application is:

PUT /me/drive/items/{parent-id}:/{filename}:/content

where {parent-id} could point to the personal drive root or a shared drive.

A successful upload (PUT request) returns a 201 HTTP code (created).
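
As an illustration, a minimal JAX-RS client sketch of this upload could look as follows (the class name, token handling and file handling are assumptions, not the actual implementation):

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;
import javax.ws.rs.core.Response;
import java.nio.file.Files;
import java.nio.file.Path;

public class OneDriveUploadSketch {

    private static final String GRAPH_BASE = "https://graph.microsoft.com/v1.0";

    // accessToken, parentId and fileName are placeholders supplied by the caller
    public int upload(String accessToken, String parentId, String fileName, Path zipFile) throws Exception {
        Client client = ClientBuilder.newClient();
        try {
            Response response = client
                    .target(GRAPH_BASE)
                    .path("/me/drive/items/" + parentId + ":/" + fileName + ":/content")
                    .request()
                    .header("Authorization", "Bearer " + accessToken)
                    .put(Entity.entity(Files.readAllBytes(zipFile), "application/zip"));
            return response.getStatus(); // 201 (Created) expected for a newly uploaded file
        } finally {
            client.close();
        }
    }
}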

For more information on OneDrive API: https://docs.microsoft.com/en-gb/onedrive/developer/

AZURE

Microsoft also offers storage as part of its range of Azure products. As with Google Cloud and Amazon AWS, it provides a range of cloud computing tools and resources that go beyond the scope of this proof-of-concept application.

For more information on storage options in Azure https://azure.microsoft.com/en-gb/services/storage/

Constraints

When using Heroku with a free account, the application must run in the following environment:

  • RAM: 512 MB

  • Compressed size of application sources: 500 MB

  • Compilation time: 15 min maximum

  • Process types: 1 worker/1 web maximum

Core components

wsclient-app

The wsclient-app application is an EJB module based on Java EE 8 technologies (Glassfish 5):

  • EJB Timer

  • CDI events

  • JSONB and JSONP

  • JAX-RS client (NB: at the moment it is implemented using Jersey-specific features)

Business logic

The complete process should be transactional and must complete the following tasks:

  1. Download JSON data from public REST endpoints such as Twitter Search API

  2. Package the serialised JSON data in a temporary compressed ZIP file

  3. Upload to UCL OneDrive using a MS Graph REST endpoint

The process is orchestrated by the EJB Timer service. In this proof of concept, the interval used is 1 hour. Once the timeout is reached, a CDI event with a qualifier specific to a REST client is fired.
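
A minimal sketch of this orchestration, with illustrative names (the actual qualifier and bean names in wsclient-app may differ):

import javax.ejb.Schedule;
import javax.ejb.Singleton;
import javax.ejb.Startup;
import javax.enterprise.event.Event;
import javax.inject.Inject;
import javax.inject.Qualifier;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.time.LocalDate;

// Hypothetical CDI qualifier tying the event to the Twitter REST client
@Qualifier
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD, ElementType.TYPE})
@interface TwitterSearch {
}

@Singleton
@Startup
public class CollectionScheduler {

    @Inject
    @TwitterSearch
    private Event<LocalDate> collectionRequested;

    // Automatic timer firing once every hour; observers of the qualified event run the REST client
    @Schedule(hour = "*", minute = "0", persistent = false)
    public void onTimeout() {
        collectionRequested.fire(LocalDate.now()); // payload: the date to collect (illustrative)
    }
}

// The REST client would then observe the qualified event, e.g.:
// public void collect(@Observes @TwitterSearch LocalDate day) { ... }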

The complete process is represented in the following sequence diagram:

wsclient sequence

Since the source endpoint is queried to return data for a specific day, the application could store in a cache object a flag indicating when the transaction has completed successfully, to prevent the REST client from calling the source endpoint repeatedly every hour. This is represented in the following activity diagram:

control iterations
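
One possible shape for such a cache is a singleton bean keeping the set of dates already processed (a sketch; names are illustrative):

import javax.ejb.Lock;
import javax.ejb.LockType;
import javax.ejb.Singleton;
import java.time.LocalDate;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory cache recording which dates have already been collected and uploaded
@Singleton
@Lock(LockType.READ)
public class CompletedDaysCache {

    private final Set<LocalDate> completed = ConcurrentHashMap.newKeySet();

    public boolean isCompleted(LocalDate day) {
        return completed.contains(day);
    }

    public void markCompleted(LocalDate day) {
        completed.add(day);
    }
}

The REST client would check isCompleted(day) before calling the source endpoint and call markCompleted(day) once the upload has succeeded. Being in-memory, the cache is lost on restart, so it only limits repeated calls within the lifetime of the application.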

wsclient-main

The wsclient-main module allows the deployment of wsclient-app in environments where a Java EE 8 server is not available. wsclient-main uses Glassfish 5 embedded to run wsclient-app. The build process generates a shell script that can be used to configure and run the application in standalone mode in Glassfish embedded.

sh wsclient-main/target/app/bin/glassfish5

This script allows the deployment of wsclient-app in Heroku as a worker dyno, as defined in the Procfile:

worker: sh wsclient-main/target/app/bin/glassfish5

Data model

Query parameters

The application uses the following base parameters to call Twitter Standard search API endpoint
/1.1/search/tweets.json:

Parameter                Value
result_type              recent
lang                     en
include_entities         false
include_user_entities    false

Query terms

At the moment the application is configured to search for the word "flu". This query string can be refined by adding more words and operators to a maximum of 500 characters including operators.

Information on standard search operators is available at
https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators

Location

The application is set to use the geocode of the Met Office station in Islington with a 10 mi radius. This would allow other geolocated datasets, such as the Met Office's, to be added in the future.

Time frame

The Twitter Standard search API allows access to tweets created in the last 7 days. This limit defines the earliest date that can be queried; within that window, the application can retrieve JSON data for one specific date using the until query parameter.
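
Combining the base parameters, query term, geocode and date limit described above, the request could be assembled with the JAX-RS client along these lines (a sketch; the bearer token, geocode value and class name are placeholders, and result paging is omitted):

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.core.MediaType;
import java.time.LocalDate;

public class TwitterSearchSketch {

    private static final String API_BASE = "https://api.twitter.com/1.1";

    // bearerToken and geocode are placeholders; paging of results is not handled here
    public String search(String bearerToken, String geocode, LocalDate until) {
        Client client = ClientBuilder.newClient();
        try {
            return client
                    .target(API_BASE)
                    .path("/search/tweets.json")
                    .queryParam("q", "flu")
                    .queryParam("geocode", geocode)        // e.g. "<lat>,<lon>,10mi"
                    .queryParam("until", until.toString()) // a date within the 7-day window
                    .queryParam("result_type", "recent")
                    .queryParam("lang", "en")
                    .queryParam("include_entities", "false")
                    .queryParam("include_user_entities", "false")
                    .request(MediaType.APPLICATION_JSON)
                    .header("Authorization", "Bearer " + bearerToken)
                    .get(String.class);
        } finally {
            client.close();
        }
    }
}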

JSON object

Here is an abbreviated sketch of the structure of the JSON response provided by the Twitter Standard search API (values and most fields elided):
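
{
  "statuses": [
    {
      "created_at": "...",
      "id_str": "...",
      "text": "...",
      "user": { ... },
      "entities": { ... },
      "lang": "en"
    },
    ...
  ],
  "search_metadata": {
    "completed_in": ...,
    "query": "...",
    "count": ...,
    "next_results": "..."
  }
}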

It could be beneficial to filter out elements such as the user object from the JSON response before serialising and uploading to OneDrive. This would help reduce the size of the files.
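
With JSON-P (already used in wsclient-app), such filtering could look roughly like this (a sketch; the class and method names are illustrative):

import javax.json.Json;
import javax.json.JsonArrayBuilder;
import javax.json.JsonObject;
import javax.json.JsonObjectBuilder;
import javax.json.JsonReader;
import javax.json.JsonValue;
import java.io.StringReader;

public class ResponseFilterSketch {

    // Removes the "user" object from each status before the payload is serialised
    public JsonObject stripUsers(String rawResponse) {
        try (JsonReader reader = Json.createReader(new StringReader(rawResponse))) {
            JsonObject response = reader.readObject();
            JsonArrayBuilder statuses = Json.createArrayBuilder();
            for (JsonValue value : response.getJsonArray("statuses")) {
                JsonObjectBuilder status = Json.createObjectBuilder(value.asJsonObject());
                status.remove("user");
                statuses.add(status);
            }
            return Json.createObjectBuilder(response)
                       .add("statuses", statuses)
                       .build();
        }
    }
}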

Data flow

The following diagram shows the overall flow of data and the boundaries between components deployed in the cloud and components within CS UCL. If the cloud components prove useful for data collection and processing, the next step would be to develop an API (in Python) to download files from OneDrive for use by members of the team.

Along with a database, it would be beneficial to explore object storage for archival, instead of the current file storage.

dataflow

Future features

Open to discussion:

  • Configuration of query parameters, tokens and OneDrive id via ejb-jar.xml

  • Configure the logging system to send messages/notifications to a server within UCL CS. Alternatively, the Outlook API could potentially be used to send email notifications.

  • Python API to download files from OneDrive