Handling Statistics

The statistics that we gather from the profiling runs serve several purposes. Some, such as the CPU time or memory usage, are collected and plotted across multiple profiling runs. Others are specific to a particular profiling run and record how to replicate its results: the commit SHA of the model that was profiled, the workflow trigger that set the profiling run off, and the time the run was triggered.

The statistics that are to be read from the profiling outputs are stored as profiling_statistics.Statistic objects (these can then be grouped into a profiling_statistics.StatisticCollection, which is a convenient wrapper class). The profiling_statistics.py file then defines a static variable, profiling_statistics.STATS, which corresponds to the statistics that the profiling workflow on the main repository produces and which should be readable from the output files. If the profiling workflow is edited to produce different or additional statistics, or changes the format in which they are saved, the corresponding entries in profiling_statistics.STATS should be updated or added accordingly. They will then automatically be handled by the builder.Builder.

The Statistic class

class profiling_statistics.Statistic(key_in_stats_file: str, plot_title: str | None = None, dtype: type = float, dataframe_col_name: str | None = None, plot_y_label: str | None = None, plot_svg_name: str | None = None, converter: Callable[[Any], Any] | None = None, default_value: Any | None = None)

A general class for tracking information about a statistic captured by the profiling run.

An instance of this class must have a key_in_stats_file corresponding to a key which appears in the statistics files produced by the profiling workflow. This key will be used to extract the value of that statistic from the files.

If a plot_title is specified, the statistic is flagged to be plotted across multiple profiling runs. The corresponding webpage will include a plot of the value of this statistic across all available profiling runs.

The dtype member can be used to ensure correct type casting occurs when reading statistics from files. Similarly, the converter attribute can be passed a function which acts on the value read from the statistics file, and saves the result as the value of the statistic. This can be used to avoid manual steps in the build process, such as converting timestamps (floats) to datetimes.

dataframe_col_name is the name displayed in the lookup table for the statistic, and also the column name used internally by the builder.Builder DataFrame.

A plot_y_label and plot_svg_name can be configured to change elements of the plot that is produced.

The default_value will be used when the statistic cannot be read from a file.

Parameters:
  • key_in_stats_file (str) – Key in the statistics files produced by the profiling workflow that holds the value of this statistic.

  • plot_title (str, optional) – The title to display in the plot (showing change over time) of this statistic. If None (default), then no plot will be produced for this statistic.

  • dtype (type, optional) – The Python type that the statistic should be read as. Defaults to float.

  • dataframe_col_name (str, optional) – The column name in the builder.Builder DataFrame and lookup table to use for this statistic. If None (default), use the key_in_stats_file value.

  • plot_y_label (str, optional) – The y-axis label to assign to the plot of this statistic across profiling runs. If None (default), use the plot_title.

  • plot_svg_name (str, optional) – Name under which to save the plot svg that will be produced. If None (default), auto-generate a unique filename.

  • default_value (Any, optional) – Default value to assign to the statistic if it cannot be read or is missing from a statistics file. Defaults to None to flag missing data.

  • converter (Callable[[Any], Any], optional) – A function to apply to the value read from the statistics file, with the result saved as the value of the statistic. If None (default), the value is used as read.
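The interplay of dtype, converter and default_value described above can be illustrated with a minimal sketch. It is not the library's implementation: read_value and to_utc_datetime are hypothetical helpers, the "start_time" key is made up for illustration, and the order of operations (cast with dtype first, then apply converter) is an assumption.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Optional


def read_value(
    raw_stats: dict,
    key: str,
    dtype: type = float,
    converter: Optional[Callable[[Any], Any]] = None,
    default_value: Any = None,
) -> Any:
    """Hypothetical helper: extract one statistic from an already-parsed stats file."""
    if key not in raw_stats:
        # Missing statistic: fall back to the default (None flags missing data).
        return default_value
    value = dtype(raw_stats[key])  # assumed order: cast to dtype first...
    if converter is not None:
        value = converter(value)  # ...then apply the converter to the cast value
    return value


def to_utc_datetime(timestamp: float) -> datetime:
    """Convert a POSIX timestamp in seconds to a timezone-aware datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# "start_time" is a made-up key; "duration" and "trigger" follow the examples in this page.
raw = {"start_time": "86400.0", "duration": 90}

start = read_value(raw, "start_time", converter=to_utc_datetime)
minutes = read_value(raw, "duration", converter=lambda secs: secs / 60.0)
missing = read_value(raw, "trigger", dtype=str)  # key absent, so the default is used
```

Keeping the conversion in a small function like this is what lets the timestamp-to-datetime step happen at read time rather than as a manual step in the build.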

dtype

alias of float

property produces_plot: bool

Whether this statistic should produce a plot of its value across profiling runs.

Examples

Define a statistic whose key in the statistics files is “trigger”, should be saved as a string, and displayed in the lookup table under the heading “Triggered by”.

Statistic("trigger", dtype=str, dataframe_col_name="Triggered by")

Define a statistic containing a duration of time. The value read from the “duration” field in the statistics file is the duration in seconds, and we want it in minutes. As such, pass a lambda function to the converter field so that the conversion is applied whenever this statistic is read from the files.

Statistic("duration", dtype=float, dataframe_col_name="Session duration (min)", converter=lambda t_secs: t_secs / 60.0)

Collections of Statistics

The profiling_statistics.StatisticCollection class provides a convenient wrapper for looping through the statistics that we gather from the profiling runs.

class profiling_statistics.StatisticCollection(*stats: Statistic)

A collection of Statistic objects.

Defines convenient wrapper functions for extracting one particular attribute from each statistic in the collection, and for reading the values of all the statistics from a file using a single function.

Parameters:

statistics (List[profiling_statistics.Statistic]) – A list of statistics that should be collected, and plotted where requested.

property is_empty: bool

Return True if this instance contains no Statistics.

read_from_file(file: Path | None = None, branch: str | None = None, string: str | None = None) → List[int | float | str]

Parse the statistics in this collection from the values provided in the file, or from a string holding the file's contents. The file (or the string representing it) should be parseable as JSON.

Values are returned in the order that the .values() method returns their names. If values cannot be found, defaults are assigned (usually None to flag missing data).

Parameters:
  • file – A JSON-readable file containing values of the statistics in this collection.

  • branch – If provided, read the file from an alternative branch to the one that is currently checked out.

  • string – A string holding the contents of a JSON file, in the same format as described for file.

Returns:

A list of values that correspond to the values of the statistics in this collection, extracted from the input file.
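As a rough illustration of the string form of this method, the sketch below parses a JSON string and looks up each statistic's key, falling back to a default when the key is absent. It is a stand-in under stated assumptions rather than the actual method: read_from_string is a hypothetical function, the statistics file is assumed to be a flat JSON object, and the "sha" key is made up to show the missing-data case.

```python
import json
from typing import Any, List, Sequence, Tuple


def read_from_string(string: str, specs: Sequence[Tuple[str, Any]]) -> List[Any]:
    """Hypothetical stand-in: look up each (key, default) pair in a JSON object."""
    parsed = json.loads(string)
    # Values come back in the same order as the specs; absent keys yield the default.
    return [parsed.get(key, default) for key, default in specs]


stats_json = '{"trigger": "push", "duration": 754.2}'
specs = [("trigger", None), ("duration", None), ("sha", None)]

values = read_from_string(stats_json, specs)
```

Here the last entry of values is None, which is how missing data is flagged when no other default is configured.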

values(attribute: str) → List[int | float | str]

For each Statistic in the collection, return the value stored under the attribute provided.

Returns:

[getattr(s, attribute) for s in self.statistics]
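The attribute lookup can be sketched with simplified stand-ins for the two classes. StatSketch and CollectionSketch are hypothetical names and deliberately not the real classes; they model only the behaviour documented on this page (dataframe_col_name falling back to key_in_stats_file, produces_plot following from plot_title, and is_empty). The "cpu_time" key is made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class StatSketch:
    """Simplified stand-in for Statistic (not the real class)."""

    key_in_stats_file: str
    plot_title: Optional[str] = None
    dataframe_col_name: Optional[str] = None

    def __post_init__(self) -> None:
        # Documented fallback: use the key when no column name is given.
        if self.dataframe_col_name is None:
            self.dataframe_col_name = self.key_in_stats_file

    @property
    def produces_plot(self) -> bool:
        # A statistic is plotted only when a plot title was specified.
        return self.plot_title is not None


class CollectionSketch:
    """Simplified stand-in for StatisticCollection (not the real class)."""

    def __init__(self, *stats: StatSketch) -> None:
        self.statistics = list(stats)

    @property
    def is_empty(self) -> bool:
        return len(self.statistics) == 0

    def values(self, attribute: str) -> List[Any]:
        # Look the attribute up by name on every statistic, preserving order.
        return [getattr(s, attribute) for s in self.statistics]


collection = CollectionSketch(
    StatSketch("trigger", dataframe_col_name="Triggered by"),
    StatSketch("cpu_time", plot_title="CPU time"),
)
col_names = collection.values("dataframe_col_name")
plotted = collection.values("produces_plot")
```

Because values preserves the order of the collection, its output can be paired position by position with the list returned by read_from_file.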

Example: The STATS constant

The profiling_statistics.STATS constant is a profiling_statistics.StatisticCollection instance which defines all of the statistics we expect to find in the statistics files produced by the profiling workflow. The builder.Builder references this constant when constructing the profiling results website. If future updates add another statistic to the profiling outputs, an additional profiling_statistics.Statistic can be added to the profiling_statistics.STATS constant, and it will be automatically included in the next build of the website.