
How Consensus Works

Auto Consensus enables you to quantitatively measure the quality of your training data. This matters because high-quality training data leads to performant AI.

How it Works

Auto Consensus works by having more than one labeler (human or machine) label the same asset (image, text string, video, etc.). Once an asset has been labeled more than once, the results can be compared quantitatively and a consensus score is calculated automatically. Auto Consensus works in real time, so you can take immediate corrective action to improve your training data and model performance.

Project Consensus

Every asset that has been labeled more than once within a Labelbox project has a consensus score. Consensus results are shown in the Consensus chart in the project overview.

Consensus histogram shown in the Labelbox project overview.

Individual Asset Consensus

The consensus of each asset is shown in the Activity table. To view all of the label results for a particular asset, click the stack icon to filter the Activity table on that asset. Clicking one of the table rows will begin a review of all labeled instances of this asset.

Assets that have been labeled more than once have a Consensus score.

Consensus Score Calculations

The consensus score calculation compares the annotations of a labeled asset against existing labeled instances of the same asset. The set of existing labeled instances is referred to as the field.

Basic Example

To demonstrate how the consensus score is calculated, let's look at a basic example using a classification labeling task: Select the aircraft model.

The consensus score for this label is calculated against the field of existing labels. In this example, the image has been labeled by three labelers: one selected Airbus A350 and the other two selected Boeing 787.

| Labeler | Selection | Consensus |
| --- | --- | --- |
| Arian Olierock | Airbus A350 | 0% |
| Julia Bohmer | Boeing 787 | 50% |
| Taylor Grant | Boeing 787 | 50% |

For this classification task, the consensus score of each label is calculated by dividing the count of other labels that agree with it by the total count of other labels (the field). For both of the Boeing 787 selections, the consensus score is 50% because one of the two other labels agrees. For the Airbus A350 selection, the score is 0% because neither of the other labels agrees.
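As an illustration (not the Labelbox implementation itself), here is a minimal Python sketch of this calculation; the consensus_score function and the data layout are our own assumptions:

```python
def consensus_score(label, field):
    """Fraction of the other labels (the field) that agree with this label."""
    if not field:
        return None  # no other labels to compare against
    agreements = sum(1 for other in field if other == label)
    return agreements / len(field)

# Hypothetical data matching the aircraft example above.
labels = {
    "Arian Olierock": "Airbus A350",
    "Julia Bohmer": "Boeing 787",
    "Taylor Grant": "Boeing 787",
}

for labeler, selection in labels.items():
    # The field for each labeler is everyone else's selection.
    field = [s for name, s in labels.items() if name != labeler]
    print(f"{labeler}: {consensus_score(selection, field):.0%}")
# Arian Olierock: 0%, Julia Bohmer: 50%, Taylor Grant: 50%
```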

Classification Consensus Scoring

| Labeler | Question 1 | Question 2 | Question 3 | Consensus |
| --- | --- | --- | --- | --- |
| Arian Olierock | A | A | A | 50% |
| Julia Bohmer | A | A | B | 50% |
| Taylor Grant | A | B | C | 33% |

To calculate the consensus for a multi-question classification label submission, we add up the consensus score for each question and divide by the number of questions. To calculate the consensus score for a single question, we use the following equation:

$$\text{question consensus} = \frac{\text{count of other labels that agree on the question}}{\text{total count of other labels}}$$

The consensus score for a multi-question classification label submission with $N$ questions is therefore:

$$\text{label consensus} = \frac{1}{N} \sum_{i=1}^{N} \text{question consensus}_i$$

Using the equations above, we can calculate the consensus score for Arian:

$$\text{consensus}_{\text{Arian}} = \frac{1}{3} \left( \frac{2}{2} + \frac{1}{2} + \frac{0}{2} \right) = 50\%$$

We do a similar calculation for Julia and Taylor (per-question calculation detail hidden for brevity):

$$\text{consensus}_{\text{Julia}} = \frac{1}{3} (1 + 0.5 + 0) = 50\% \qquad \text{consensus}_{\text{Taylor}} = \frac{1}{3} (1 + 0 + 0) \approx 33\%$$
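The same averaging can be sketched in Python. As before, this is an illustrative reconstruction of the equations above under our own assumed data layout, not Labelbox's internal code:

```python
def question_consensus(answer, other_answers):
    """Fraction of the other labelers' answers that match this answer."""
    return sum(1 for a in other_answers if a == answer) / len(other_answers)

def label_consensus(labeler, submissions):
    """Average the per-question consensus scores for one labeler's submission."""
    answers = submissions[labeler]
    others = [submissions[name] for name in submissions if name != labeler]
    per_question = [
        question_consensus(answer, [o[i] for o in others])
        for i, answer in enumerate(answers)
    ]
    return sum(per_question) / len(per_question)

# Hypothetical data matching the three-question example above.
submissions = {
    "Arian Olierock": ["A", "A", "A"],
    "Julia Bohmer": ["A", "A", "B"],
    "Taylor Grant": ["A", "B", "C"],
}

for name in submissions:
    print(f"{name}: {label_consensus(name, submissions):.0%}")
# Arian Olierock: 50%, Julia Bohmer: 50%, Taylor Grant: 33%
```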

Segmentation Consensus Scoring

/TODO