
Releases: argilla-io/argilla

v1.1.0


1.1.0 (2022-11-24)

Highlights

Add, update, and delete rules from a Dataset using the Python client

You can now manage rules programmatically and reflect them in Argilla Datasets so you can iterate on labeling rules from both Python and the UI. This is especially useful for leveraging linguistic resources (such as terminological lists) and making the rules available in the UI for domain experts to refine them.

import pandas as pd

from argilla.labeling.text_classification import Rule, add_rules

# Read a file with keywords or phrases
labeling_rules_df = pd.read_csv("../../_static/datasets/weak_supervision_tutorial/labeling_rules.csv")

# Create a rule for each query/label pair in the file
predefined_labeling_rules = []
for index, row in labeling_rules_df.iterrows():
    predefined_labeling_rules.append(
        Rule(row["query"], row["label"])
    )

# Add the rules to the weak_supervision_yt dataset. The rules will be manageable from the UI
add_rules(dataset="weak_supervision_yt", rules=predefined_labeling_rules)
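
The release also covers updating and deleting rules from Python. A minimal sketch, assuming the companion helpers update_rules and delete_rules from the same argilla.labeling.text_classification module (the query and label here are illustrative):

from argilla.labeling.text_classification import Rule, delete_rules, update_rules

# Update the label of an existing rule, matched by its query
updated_rule = Rule(query="plz OR please", label="SPAM")
update_rules(dataset="weak_supervision_yt", rules=[updated_rule])

# Or remove the rule from the dataset altogether
delete_rules(dataset="weak_supervision_yt", rules=[updated_rule])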

You can find more info about this feature in the deep dive guide: https://docs.argilla.io/en/latest/guides/techniques/weak_supervision.html#3.-Building-and-analyzing-weak-labels

Sort by timestamp fields in the UI

Users can now sort records by last_updated and other timestamp fields, improving the labeling and review processes.

Features

  • #1929 add warning about using wrong hostnames (#1930) (a3bc554)
  • Add, delete and edit labeling rules from Python client (#1884) (d534a29), closes #1855
  • Added more explicit error message regarding dataset name validation (#1933) (c25a225), closes #1931 #1918
  • Allow sort records by event_timestamp or last_updated fields (#1924) (1c08c36), closes #1835
  • Create a contextual help to support the user in the different dataset views (#1913) (8e3851e)
  • Enable metadata length field config by environment variable (#1923) (0ff2de7), closes #1761
  • Update error page (#1932) (caeb7d4), closes #1894
  • Using new top_k_mentions metrics instead of entity_consistency (#1880) (42f702d), closes #1834

Bug Fixes

Documentation

As always, thanks to our amazing contributors!

v1.0.1


1.0.1 (2022-11-04)

Bug Fixes

Documentation

  • corrected for tutorial and api redirections (#1820) (26ccdcc)

v0.19.0

chore(release): 0.19.0

v0.18.0


0.18.0 (2022-10-05)

⚡ Highlights

Better validation of token classification records

When working with Token Classification records, misalignments between the entity spans and the provided tokens are a common problem.
Before this release, these errors were difficult to understand and fix because validation happened on the server side.

With this release, records are validated at instantiation time, with a clear error message that helps you fix or ignore problematic records.

For example, the following record:

import rubrix as rb

# Note: the tokens do not cover the trailing "!" in the text,
# while the prediction span includes it
rb.TokenClassificationRecord(
    tokens=["I", "love", "Paris"],
    text="I love Paris!",
    prediction=[("LOC", 7, 13)],
)

will give you the following error message:

ValueError: Following entity spans are not aligned with provided tokenization
Spans:
- [Paris!] defined in ...love Paris!
Tokens:
['I', 'love', 'Paris']
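
For comparison, a tokenization aligned with the text (a sketch, not from the release notes) passes validation:

import rubrix as rb

# Tokens now cover the full text, including the final "!"
rb.TokenClassificationRecord(
    tokens=["I", "love", "Paris", "!"],
    text="I love Paris!",
    prediction=[("LOC", 7, 12)],  # "Paris" spans characters 7-12
)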

Delete records by query

Now it's possible to delete specific records, either by ID or by a query using Lucene's syntax. This is useful for cleaning up and maintaining datasets:

import rubrix as rb

# Delete records by id
rb.delete_records(name="example-dataset", ids=[1, 3, 5])

# Discard records matching a query instead of deleting them
rb.delete_records(name="example-dataset", query="metadata.code:33", discard_only=True)

New tutorials

We have two new tutorials!

Few-shot classification with SetFit and a custom dataset: https://rubrix.readthedocs.io/en/stable/tutorials/few-shot-classification-with-setfit.html

Analyzing predictions with model explainability methods: https://rubrix.readthedocs.io/en/stable/tutorials/nlp_model_explainability.html

Features

Bug Fixes

Visual enhancements

Documentation

  • Add interpret tutorial with Transformers (#1728) (c3fa079), closes #1729
  • Adds tutorial about custom few-shot classification with SetFit (#1739) (4f15ee6), closes #1741
  • fixing the active learning tutorial with small-text (#1726) (909efdf), closes #1693
  • raise small-text version to 1.1.0 and adapt tutorial (#1744) (16f19b7), closes #1693
  • Resolve many typos in documentation, comments and tutorials (#1701) (f05e1c1)
  • using official token class. mapper since is compatible now (#1738) (e82fd13), closes #482

As always, thanks to our amazing contributors!

v0.17.0


0.17.0 (2022-08-22)

⚡ Highlights

Preparing a training set in the spaCy DocBin format

prepare_for_training is a method that prepares a dataset for training. Until now, it prepared the data for easily training Hugging Face Transformers.

Now, you can prepare your training data for spaCy NER pipelines, thanks to our great community contributor @ignacioct!

With the example below, you can export your Rubrix dataset into a DocBin, save it to disk, and then use it with the spacy train command.

import spacy
import rubrix as rb

# Load the annotated dataset from Rubrix
rb_dataset = rb.load("ner_dataset")

# Load a blank spaCy language model to create the DocBin, as it works faster
nlp = spacy.blank("en")

# After this line, the DocBin is stored on disk as train.spacy
rb_dataset.prepare_for_training(framework="spacy", lang=nlp).to_disk("train.spacy")

You can find a full example at: https://rubrix.readthedocs.io/en/v0.17.0/guides/cookbook.html#Train-a-spaCy-model-by-exporting-to-Docbin

Load large datasets using batches

Before this release, the rb.load method to read datasets from Python retrieved the full dataset. For large datasets, this could cause high memory consumption, network timeouts, and the inability to read datasets larger than the available memory.

Thanks to the awesome work by @maxserras, it's now possible to optimize memory consumption and avoid network timeouts when working with large datasets. To that end, you can iterate over the whole dataset in batches using the id_from parameter of the rb.load method.

An example of reading the first 1000 records and the next batch of up to 1000 records:

import rubrix as rb
dataset_batch_1 = rb.load(name="example-dataset", limit=1000)
dataset_batch_2 = rb.load(name="example-dataset", limit=1000, id_from=dataset_batch_1[-1].id)
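
To read an entire large dataset, the same pattern extends to a simple loop; a minimal sketch, assuming you keep fetching until an empty batch comes back:

import rubrix as rb

batch_size = 1000
all_records = []
last_id = None

while True:
    # Fetch the next batch, starting right after the last id seen so far
    batch = rb.load(name="example-dataset", limit=batch_size, id_from=last_id)
    if not len(batch):
        break
    all_records.extend(batch)
    last_id = batch[-1].id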

The reference to the rb.load method can be found at: https://rubrix.readthedocs.io/en/v0.17.0/reference/python/python_client.html#rubrix.load

Larger pagination sizes for faster bulk review and annotation

Using filters and search for data annotation and review, some users can quickly review dozens of records in one go. To serve those users, it's now possible to display and bulk-annotate 50 or 100 records per page.


Copy record text to clipboard

Sometimes it is useful to copy the text of a record so you can inspect it or process it with another application. Now this is possible, thanks to a feature request by our great community member and contributor @Ankush-Chander!


Better error logging for generic errors

Thanks to work done by @Ankush-Chander and @frascuchon, we now have more meaningful messages for generic server errors!

Features

  • Add new pagination size ranges (#1667) (5b4f1f2), closes #1578
  • Allow rb.load fetch records in batches passing the from_id argument (3e6344a)
  • Copy to clipboard the record text (#1625) (d634a7b), closes #1616
  • Error Logging: send error detail in response for generic server errors (#1648) (ad17631)
  • Listeners: allow using query params in the condition through search parameter (#1627) (a0a245d), closes #1622
  • prepare_for_training supports spacy (#1635) (8587808)

Bug Fixes

Documentation

Visual enhancements

You can see all work included in the release here

  • fix: Update progress bar when refreshing after adding new records (#1666) by @leiyre
  • chore: configure miniconda for readthedocs builder by @frascuchon
  • style: Small visual adjustments for Text2Text record card (#1632) by @leiyre
  • feat: Copy to clipboard the record text (#1625) by @leiyre
  • docs: Add Slack support link in README's get started (#1688) by @dvsrepo
  • chore: update version by @frascuchon
  • feat: Add new pagination size ranges (#1667) by @leiyre
  • fix: handle stream api connection errors gracefully (#1636) by @Ankush-Chander
  • feat: allow rb.load fetch records in batches passing the from_id argument by @maxserras
  • fix(Client): reusing the inner httpx client (#1640) by @frascuchon
  • feat(Error Logging): send error detail in response for generic server errors (#1648) by @frascuchon
  • docs: spacy DocBin cookbook (#1642) by @ignacioct
  • feat: prepare_for_training supports spacy (#1635) by @frascuchon
  • style: Improve card spacing (#1638) by @leiyre
  • docs: Adding Elasticsearch persistence to docker compose section (#1643) by @maxserras
  • chore: remove old rubrix client class (#1639) by @frascuchon
  • feat(Listeners): allow using query params in the condition through search parameter (#1627) by @frascuchon
  • doc: show metric graphs in documentation (#1669) by @leiyre
  • fix(docker-compose.yaml): default volume and disable disk threshold (#1656) by @frascuchon
  • fix: Encode rule name in Weak Labeling API requests (#1649) by @leiyre

v0.16.1


0.16.1 (2022-07-22)

Bug Fixes

  • 'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) (3cb4c07), closes #1631
  • Display metadata in Text2Text dataset (#1626) (0089e0a), closes #1623
  • Show predicted OK/KO when predictions exist (#1620) (ef66e9c), closes #1619

Documentation

You can see all work included in the release here

  • fix: 'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) by @dcfidalgo
  • fix: Display metadata in Text2Text dataset (#1626) by @leiyre
  • chore: set version by @dcfidalgo
  • docs: Fix typo in Getting Started -> Concepts (#1618) by @dcfidalgo
  • fix: Show predicted OK/KO when predictions exist (#1620) by @leiyre

v0.16.0


0.16.0 (2022-07-08)

Highlights

👂 Listeners: enable more interactive workflows between client and server

Listeners enable you to define functions that get executed under certain conditions when something changes in a dataset. There are many use cases for this: monitoring annotation jobs, monitoring model predictions, enabling active learning workflows, and many more.

You can find the Python API reference docs here: https://rubrix.readthedocs.io/en/stable/reference/python/python_listeners.html#python-listeners

We will be documenting these use cases with practical examples, but for this release, we've included a new tutorial for using this with active learning: https://rubrix.readthedocs.io/en/stable/tutorials/active_learning_with_small_text.html. This tutorial includes the following listener function, which implements the active learning loop:

import numpy as np
import rubrix as rb

from rubrix.listeners import listener
from sklearn.metrics import accuracy_score

# DATASET_NAME, NUM_SAMPLES, trec, active_learner and dataset_test
# are defined earlier in the tutorial

# Define some helper variables
LABEL2INT = trec["train"].features["label-coarse"].str2int
ACCURACIES = []

# Set up the active learning loop with the listener decorator
@listener(
    dataset=DATASET_NAME,
    query="status:Validated AND metadata.batch_id:{batch_id}",
    condition=lambda search: search.total==NUM_SAMPLES,
    execution_interval_in_seconds=3,
    batch_id=0
)
def active_learning_loop(records, ctx):

    # 1. Update active learner
    print(f"Updating with batch_id {ctx.query_params['batch_id']} ...")
    y = np.array([LABEL2INT(rec.annotation) for rec in records])

    # initial update
    if ctx.query_params["batch_id"] == 0:
        indices = np.array([rec.id for rec in records])
        active_learner.initialize_data(indices, y)
    # update with the prior queried indices
    else:
        active_learner.update(y)
    print("Done!")

    # 2. Query active learner
    print("Querying new data points ...")
    queried_indices = active_learner.query(num_samples=NUM_SAMPLES)
    ctx.query_params["batch_id"] += 1
    new_records = [
        rb.TextClassificationRecord(
            text=trec["train"]["text"][idx],
            metadata={"batch_id": ctx.query_params["batch_id"]},
            id=idx,
        )
        for idx in queried_indices
    ]

    # 3. Log the batch to Rubrix
    rb.log(new_records, DATASET_NAME)

    # 4. Evaluate current classifier on the test set
    print("Evaluating current classifier ...")
    accuracy = accuracy_score(
        dataset_test.y,
        active_learner.classifier.predict(dataset_test),
    )
    ACCURACIES.append(accuracy)
    print("Done!")

    print("Waiting for annotations ...")

📖 New docs!

https://rubrix.readthedocs.io/


🧱 extend_matrix: Weak label augmentation using embeddings

This release includes an exciting feature to augment the coverage of your weak labels using embeddings. You can find a practical tutorial here: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html
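
A minimal sketch of the pattern (a hedged example: the dataset name is illustrative and random vectors stand in for real sentence embeddings):

import numpy as np

from rubrix.labeling.text_classification import WeakLabels

# Build the weak label matrix from the rules defined on the dataset
weak_labels = WeakLabels(dataset="my_dataset")

# One embedding per record; in practice these come from a
# sentence-transformers model rather than np.random
embeddings = np.random.rand(len(weak_labels.records()), 384)

# Records whose nearest labeled neighbor exceeds the per-rule similarity
# threshold inherit that rule's vote, extending the matrix's coverage
weak_labels.extend_matrix(
    thresholds=[0.8] * len(weak_labels.rules),
    embeddings=embeddings,
)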

Features

Bug Fixes

Documentation

You can see all work included in the release here


v0.15.0


0.15.0 (2022-06-08)

🔆 Highlights

🏷️ Configure datasets with a labeling scheme

You can now predefine and change the label schema of your datasets. This is useful for fixing a set of labels for you and your annotation teams.

import rubrix as rb

# Define labeling schema
settings = rb.TextClassificationSettings(label_schema=["A", "B", "C"])

# Apply settings to a new or already existing dataset
rb.configure_dataset(name="my_dataset", settings=settings)

# Logging to the newly created dataset triggers the validation checks
rb.log(rb.TextClassificationRecord(text="text", annotation="D"), "my_dataset")
# BadRequestApiError: Rubrix server returned an error with http status: 400

Read the docs: https://rubrix.readthedocs.io/en/stable/guides/dataset_settings.html

🧱 Weak label matrix augmentation using embeddings

You can now use an augmentation technique inspired by https://github.com/HazyResearch/epoxy to augment the coverage of your rules using embeddings (e.g., sentence transformers). This is useful for improving the recall of your labeling rules.

Read the tutorial: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html

🏛️ Tutorial Gallery

Tutorials are now organized into categories and presented with a new gallery design!

Read the docs: https://rubrix.readthedocs.io/en/stable/tutorials/introductory.html

🏁 Basics guide

This is the first version of the basics guide, which shows you how to perform the most basic actions with Rubrix, such as uploading data or annotating records.

Read the docs: https://rubrix.readthedocs.io/en/stable/getting_started/basics.html

Features

Bug Fixes

New contributors

@RafaelBod made his first contribution in #1413

v0.14.2


0.14.2 (2022-05-31)

Bug Fixes

v0.14.1


0.14.1 (2022-05-20)

Bug Fixes