Data and Tasks Description

This page describes the data and the respective data mining tasks (Task 1, Task 2 and Task 3). To get access to the data, you must first register; then you can proceed to download it.

Data

Regretfully, the plan to provide data to the participants in CSV format wasn't achieved in due time. Therefore, the only data format, for all three tasks, will be RDF/Turtle, until the LDMC deadline. (CSV, however, remains the format for mining results for Tasks 1 and 2.) We, as LDMC organizers, confess that we greatly underestimated the effort needed for preparing and running the conversion process from RDF to CSV, which proved anything but straightforward: our part of the challenge thus really challenged us… We are, however, still working towards completing the process, in view of future similar events.

In all tasks, the experimenters should try to make use of linked data resources. Some external resources are already interlinked with the original dataset (see the data documentation). It is of course possible to heuristically link further resources from the Linked Data Cloud. Note that the provided data is not fully cleaned and may not entirely adhere to the ontologies used, especially regarding cardinalities.

The data is modelled using RDF vocabularies and ontologies, including the following:

The data format is RDF in Turtle serialization (gzipped).
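For orientation, a minimal sketch of loading the gzipped Turtle dump in Python with rdflib might look as follows (the filename is illustrative, not the actual download name):

```python
import gzip

from rdflib import Graph

# Load the gzipped Turtle dump into an in-memory RDF graph.
graph = Graph()
with gzip.open("contracts.ttl.gz", "rt", encoding="utf-8") as f:  # hypothetical filename
    graph.parse(f, format="turtle")

print(f"Loaded {len(graph)} triples")
```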

Task 1

Description: Predict the number of bidders (as an integer value). In the training dataset, the number of bidders is expressed as the value of the pc:numberOfTenders property. The prediction should be as precise as possible. Precision matters most for the lower values; e.g., predicting 2 bidders where there are 3 is a more serious error than predicting 12 bidders where there are 13. This is reflected in the evaluation measure.
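As a hedged illustration, the training values could be collected from the graph loaded above like this (the pc: namespace URI is taken from the Public Contracts Ontology and should be verified against the data):

```python
from rdflib import Namespace

# Assumed namespace of the Public Contracts Ontology; verify against the data.
PC = Namespace("http://purl.org/procurement/public-contracts#")

# Pairs of (contract URI, reference number of tenders) for training.
training_pairs = [
    (str(contract), int(n_tenders))
    for contract, n_tenders in graph.subject_objects(PC.numberOfTenders)
]
```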

Data:

Format of Results: The results for the task will be delivered in CSV format with two columns. The first column will contain the URI of an annotated public contract, and the second column will contain the predicted number of tenders for that contract as a positive integer.
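A minimal sketch of producing such a file (the predictions dict and the output filename are illustrative):

```python
import csv

# Illustrative predictions: contract URI -> predicted number of tenders.
predictions = {
    "http://example.org/contract/1": 3,
    "http://example.org/contract/2": 7,
}

with open("task1_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for uri, n_tenders in predictions.items():
        writer.writerow([uri, n_tenders])
```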

Evaluation: The principal evaluation measure at the level of an individual object will be the absolute value of the difference between the predicted value $p$ and the reference value $r$, adjusted by the reciprocal of the smaller of the two values and normalized to $[0,1]$ by a sigmoidal function:

$$\mathit{err}(p, r) = \frac{2}{1 + e^{-\frac{|p - r|}{\min(p, r)}}} - 1$$

The adjustment by the reciprocal value makes the cost of errors uneven for the same value difference: the same difference counts less for larger values than for smaller ones. The error values will be aggregated by averaging.

As we identified ex post, the dataset also contains contracts with the number of tenders equal to 0. For this reason, to avoid division by zero in the error formula, we will use a modified formula: in the denominator inside the exponent of $e$, the expression $\min(p, r)$ will be replaced by $\min(p, r) + 1$.
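Putting the two preceding paragraphs together, a sketch of the per-contract error and its aggregation could read as follows; the exact sigmoid normalization is our reconstruction of the description and should be checked against the official evaluation script:

```python
import math

def tender_error(predicted: int, reference: int) -> float:
    # The +1 in the denominator is the zero-tender fix described above.
    x = abs(predicted - reference) / (min(predicted, reference) + 1)
    # Sigmoid rescaled so that x = 0 maps to 0 and large x approaches 1.
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def aggregate_error(pairs):
    # Error values are aggregated by averaging over all contracts.
    return sum(tender_error(p, r) for p, r in pairs) / len(pairs)
```

For example, tender_error(2, 3) ≈ 0.17 while tender_error(12, 13) ≈ 0.04, matching the intent that the same absolute difference costs more at lower values.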

Task 2

Description: Classify the contracts as multi-contracts or not. A multi-contract is a contract that (often 'suspiciously') unifies two or more unrelated commodities. It is also possible to classify a contract as a borderline case. In the training dataset, the multi-contract annotation is expressed as the value of the artificially added multicontract property.

Data: Unfortunately very small, due to difficulties in the annotation process…

The data corresponds to UK public contracts, plus CPV codes and DBpedia entities.

Format of Results: The results for the task will be delivered in CSV format with two columns. The first column will contain the URI of an annotated public contract, and the second column will contain the annotation for the predicted variable, with three possible values: 0 if the contract is not a multi-contract, 0.5 if it is a borderline case, and 1 if it is a multi-contract.
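A small sketch of this encoding (the label names and variables are illustrative; the file is written just as in the Task 1 snippet):

```python
import csv

# Hypothetical label names mapped onto the required three-valued encoding.
MULTICONTRACT_ENCODING = {"not_multi": 0, "borderline": 0.5, "multi": 1}

# Illustrative classifications: contract URI -> label.
classified = {"http://example.org/contract/1": "borderline"}

with open("task2_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for uri, label in classified.items():
        writer.writerow([uri, MULTICONTRACT_ENCODING[label]])
```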

Evaluation: The evaluation measures considered are the following (without a strong bias towards any one of them):

Task 3

Description: Find (and possibly attempt to suggest explanations for) any kind of interesting hypotheses (nuggets) in the data. An example could be hypotheses related to uneven distributions of CPV codes in different geographical segments of the contracts data, but many other options are possible.
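As a hedged example of how such a hypothesis might be probed, the following SPARQL query counts contracts per (region, CPV code) pair; both predicates are assumptions loosely based on the Public Contracts Ontology and must be checked against the actual data:

```python
# Both pc:mainObject (main CPV code) and pc:location are assumed predicates;
# verify them against the dataset documentation before use.
query = """
PREFIX pc: <http://purl.org/procurement/public-contracts#>
SELECT ?region ?cpv (COUNT(?contract) AS ?n)
WHERE {
    ?contract pc:mainObject ?cpv ;
              pc:location ?region .
}
GROUP BY ?region ?cpv
ORDER BY DESC(?n)
"""
for row in graph.query(query):
    print(row.region, row.cpv, row.n)
```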

Data:

Evaluation: Interestingness of the findings (and possibly their interpretation), described in the submitted paper and judged by experts in public procurement.

Further Information

If you have any questions regarding the data mining tasks or the associated datasets, please contact Vojtěch Svátek (svatek [at] vse [dot] cz).