Releases: argilla-io/argilla
v1.1.0
1.1.0 (2022-11-24)
Highlights
Add, update, and delete rules from a Dataset using the Python client
You can now manage rules programmatically and reflect them in Argilla Datasets so you can iterate on labeling rules from both Python and the UI. This is especially useful for leveraging linguistic resources (such as terminological lists) and making the rules available in the UI for domain experts to refine them.
import pandas as pd

from argilla.labeling.text_classification import Rule, add_rules

# Read a file with keywords or phrases
labeling_rules_df = pd.read_csv("../../_static/datasets/weak_supervision_tutorial/labeling_rules.csv")

# Create rules
predefined_labeling_rules = []
for index, row in labeling_rules_df.iterrows():
    predefined_labeling_rules.append(
        Rule(row["query"], row["label"])
    )

# Add the rules to the weak_supervision_yt dataset. The rules will be manageable from the UI
add_rules(dataset="weak_supervision_yt", rules=predefined_labeling_rules)
You can find more info about this feature in the deep dive guide: https://docs.argilla.io/en/latest/guides/techniques/weak_supervision.html#3.-Building-and-analyzing-weak-labels
Sort by timestamp fields in the UI
Users can now sort the records by last_updated and other timestamp fields to improve the labeling and review processes
Features
- #1929 add warning about using wrong hostnames (#1930) (a3bc554)
- Add, delete and edit labeling rules from Python client (#1884) (d534a29), closes #1855
- Added more explicit error message regarding dataset name validation (#1933) (c25a225), closes #1931 #1918
- Allow sort records by event_timestamp or last_updated fields (#1924) (1c08c36), closes #1835
- Create a contextual help to support the user in the different dataset views (#1913) (8e3851e)
- Enable metadata length field config by environment variable (#1923) (0ff2de7), closes #1761
- Update error page (#1932) (caeb7d4), closes #1894
- Using new top_k_mentions metrics instead of entity_consistency (#1880) (42f702d), closes #1834
Bug Fixes
- Avoid closing the score filter when dragging the slider (#1822) (91a72c5), closes #1802
- Change method for Doc creation by spacy.Language (#1891) (6264983), closes #1890
- DAO: datasets dao filter datasets by tasks (#1934) (937b410)
- docker: Prevent wrong elastic server for wait-for-it (c6a10c7)
- Improve access to label list in Text Classification (#1916) (24729bd), closes #1804
- Improve explanation readability (#1815) (52c712e), closes #1774
- Monitoring: Serializable log middleware (#1908) (53a57f7)
- Move "Show less" button to the end of entities list (#1875) (6d796a4), closes #1779
- Remove "Help explain button" in Manage rule view (#1909) (8bc70b0), closes #1807
- Remove extra html when text is not highlighted (#1904) (7858dc5), closes #1758
- Remove extra type when highlighting the query in the text (#1863) (341c581), closes #1758
Documentation
- change iframe for mp4 (dfac8b2)
- corrected for iframe (935f586)
- Link key features (#1805) (#1809) (4c83604)
- resolved miss-direction and old naming in README.md (f45fe1e)
- Update README links linkedin and twitter (#1797) (2d4d03a)
As always, thanks to our amazing contributors!
v1.0.1
v0.19.0
chore(release): 0.19.0
v0.18.0
0.18.0 (2022-10-05)
⚡ Highlights
Better validation of token classification records
When working with Token Classification records, misalignments between the entity spans and the provided tokens are a very common problem.
Before this release, these errors were difficult to understand and fix because validation happened on the server side.
With this release, records are validated during instantiation, giving you a clear error message that helps you fix or ignore problematic records.
For example, the following record:
import rubrix as rb

rb.TokenClassificationRecord(
    tokens=["I", "love", "Paris"],
    text="I love Paris!",
    prediction=[("LOC", 7, 13)],
)
Will give you the following error message:
ValueError: Following entity spans are not aligned with provided tokenization
Spans:
- [Paris!] defined in ...love Paris!
Tokens:
['I', 'love', 'Paris']
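Under the hood, the validation boils down to checking that each span's character offsets land exactly on token boundaries. A minimal standalone sketch of that idea (spans_aligned is a hypothetical helper for illustration, not the library's actual code):

```python
def spans_aligned(text, tokens, spans):
    """Return (label, aligned?) for each (label, start, end) span."""
    # Recover the character offsets of each token within the text
    offsets, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    # A span is aligned only if it starts and ends on token boundaries
    return [(label, start in starts and end in ends) for label, start, end in spans]

# ("LOC", 7, 13) covers "Paris!", but the token "Paris" ends at offset 12
print(spans_aligned("I love Paris!", ["I", "love", "Paris"], [("LOC", 7, 13)]))
# → [('LOC', False)]
```

Trimming the span to (7, 12), i.e. "Paris" without the exclamation mark, makes it align with the tokenization.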
Delete records by query
Now it's possible to delete specific records, either by ids or by a query using Lucene's syntax. This is useful for cleanup and better dataset maintenance:
import rubrix as rb
## Delete by id
rb.delete_records(name="example-dataset", ids=[1,3,5])
## Discard records by query
rb.delete_records(name="example-dataset", query="metadata.code=33", discard_only=True)
New tutorials
We have two new tutorials!
Few-shot classification with SetFit and a custom dataset: https://rubrix.readthedocs.io/en/stable/tutorials/few-shot-classification-with-setfit.html
Analyzing predictions with model explainability methods: https://rubrix.readthedocs.io/en/stable/tutorials/nlp_model_explainability.html
Features
- API: provide a dict for record annotations/predictions (#1658) (12b0f83)
- Client: expose client extra headers in init function (#1715) (79f0529), closes #1706
- Client: improve httpx errors handling (#1662) (85da336)
- Client: validate token classification annotations in client (#1709) (936d1ca), closes #1579
- Datasets: delete records by query (#1721) (bc9685d), closes #1714 #1737
- Datasets: restrict dataset deletion only to creators and super-users (#1713) (c1bef9d), closes #1740
- Server: Add server telemetry (#1687) (d7cc006)
Bug Fixes
- 'MajorityVoter.score' when using multi-labels (#1678) (0b94c86), closes #1628
- Metadata limits: exclude subfields from mappings (#1700) (9f9650e), closes #1699
- Normalizes the UnauthorizationError for the API response (#1748) (6a68048)
- Search tag reset prior annotation (#1736) (dc0a17f), closes #1711
Documentation
- Add interpret tutorial with Transformers (#1728) (c3fa079), closes #1729
- Adds tutorial about custom few-shot classification with SetFit (#1739) (4f15ee6), closes #1741
- fixing the active learning tutorial with small-text (#1726) (909efdf), closes #1693
- raise small-text version to 1.1.0 and adapt tutorial (#1744) (16f19b7), closes #1693
- Resolve many typos in documentation, comments and tutorials (#1701) (f05e1c1)
- using official token class. mapper since is compatible now (#1738) (e82fd13), closes #482
As always, thanks to our amazing contributors!
- refactor: accept flat text as input for token classification mapper (#1686) by @Ankush-Chander
- feat(Client): improve httpx errors handling (#1662) by @Ankush-Chander
- fix: 'MajorityVoter.score' when using multi-labels (#1678) by @dcfidalgo
- docs: raise small-text version to 1.1.0 and adapt tutorial (#1744) by @chschroeder
- refactor: Incompatible attribute type fixed (#1675) by @luca-digrazia
- docs: Resolve many typos in documentation, comments and tutorials (#1701) by @tomaarsen
- refactor: Collection of changes, primarily regarding test suite and its coverage (#1702) by @tomaarsen
v0.17.0
0.17.0 (2022-08-22)
⚡ Highlights
Preparing a training set in the spaCy DocBin format
prepare_for_training is a method that prepares a dataset for training. Before this release, it prepared the data for easily training Hugging Face Transformers.
Now, you can also prepare your training data for spaCy NER pipelines, thanks to our great community contributor @ignacioct!
With the example below, you can export your Rubrix dataset into a DocBin, save it to disk, and then use it with the spacy train command.
import spacy
import rubrix as rb

# Load the annotated dataset from Rubrix
rb_dataset = rb.load("ner_dataset")

# Use a blank spaCy language model to create the DocBin, as it works faster
nlp = spacy.blank("en")

# After this line, the file will be stored on disk
rb_dataset.prepare_for_training(framework="spacy", lang=nlp).to_disk("train.spacy")
You can find a full example at: https://rubrix.readthedocs.io/en/v0.17.0/guides/cookbook.html#Train-a-spaCy-model-by-exporting-to-Docbin
Load large datasets using batches
Before this release, the rb.load method to read datasets from Python retrieved the full dataset. For large datasets, this could cause high memory consumption, network timeouts, and the inability to read datasets larger than the available memory.
Thanks to the awesome work by @maxserras, it's now possible to optimize memory consumption and avoid network timeouts when working with large datasets. To that end, a simple batch iteration over the whole dataset can be done using the id_from parameter of the rb.load method.
An example of reading the first 1000 records and the next batch of up to 1000 records:
import rubrix as rb
dataset_batch_1 = rb.load(name="example-dataset", limit=1000)
dataset_batch_2 = rb.load(name="example-dataset", limit=1000, id_from=dataset_batch_1[-1].id)
The reference for the rb.load method can be found at: https://rubrix.readthedocs.io/en/v0.17.0/reference/python/python_client.html#rubrix.load
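The id_from pattern generalizes to a loop over an arbitrarily large dataset. Below is a server-free sketch of that keyset pagination; iter_batches and fake_load are hypothetical stand-ins, with fake_load playing the role of rb.load:

```python
def iter_batches(fetch, limit=1000):
    """Yield successive batches, resuming each fetch after the last id seen."""
    id_from = None
    while True:
        batch = fetch(limit=limit, id_from=id_from)
        if not batch:
            return
        yield batch
        id_from = batch[-1]["id"]  # resume after this batch's last record

# Stand-in for rb.load: pages through an in-memory "dataset" sorted by id
records = [{"id": i, "text": f"record {i}"} for i in range(25)]

def fake_load(limit, id_from=None):
    start = 0
    if id_from is not None:
        start = next(i for i, r in enumerate(records) if r["id"] == id_from) + 1
    return records[start:start + limit]

batches = list(iter_batches(fake_load, limit=10))
print([len(b) for b in batches])  # → [10, 10, 5]
```

Because each request is bounded by limit, memory usage stays constant no matter how large the dataset grows.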
Larger pagination sizes for faster bulk review and annotation
Using filters and search for data annotation and review, some users can filter and quickly review dozens of records in one go. To serve those users, it's now possible to view and bulk-annotate 50 or 100 records on each page.
Copy record text to clipboard
Sometimes it is useful to copy the text of a record to inspect it or process it with another application. Now this is possible, thanks to the feature request by our great community member and contributor @Ankush-Chander!
Better error logging for generic errors
Thanks to work done by @Ankush-Chander and @frascuchon we now have more meaningful messages for generic server errors!
Features
- Add new pagination size ranges (#1667) (5b4f1f2), closes #1578
- Allow rb.load to fetch records in batches passing the from_id argument (3e6344a)
- Copy to clipboard the record text (#1625) (d634a7b), closes #1616
- Error Logging: send error detail in response for generic server errors (#1648) (ad17631)
- Listeners: allow using query params in the condition through search parameter (#1627) (a0a245d), closes #1622
- prepare_for_training supports spacy (#1635) (8587808)
Bug Fixes
- Client: reusing the inner httpx client (#1640) (854a972), closes #1646
- docker-compose.yaml: default volume and disable disk threshold (#1656) (05ae688), closes #1275
- Encode rule name in Weak Labeling API requests (#1649) (4634df8), closes #1645
- handle stream api connection errors gracefully (#1636) (a106ec4), closes #1559
- Update progress bar when refreshing after adding new records (#1666) (7e0d915), closes #1590
Documentation
- Add Slack support link in README's get started (#1688) (bef010c)
- Adding Elasticsearch persistence to docker compose section (#1643) (ecdc854)
- spacy DocBin cookbook (#1642) (bb98278), closes #420
Visual enhancements
- Small visual adjustments for Text2Text record card (#1632) (9c87cf1), closes #1138
- Improve card spacing (#1638) (fd4016a), closes #1624
You can see all work included in the release here
- fix: Update progress bar when refreshing after adding new records (#1666) by @leiyre
- chore: configure miniconda for readthedocs builder by @frascuchon
- style: Small visual adjustments for Text2Text record card (#1632) by @leiyre
- feat: Copy to clipboard the record text (#1625) by @leiyre
- docs: Add Slack support link in README's get started (#1688) by @dvsrepo
- chore: update version by @frascuchon
- feat: Add new pagination size ranges (#1667) by @leiyre
- fix: handle stream api connection errors gracefully (#1636) by @Ankush-Chander
- feat: allow rb.load to fetch records in batches passing the from_id argument by @maxserras
- fix(Client): reusing the inner httpx client (#1640) by @frascuchon
- feat(Error Logging): send error detail in response for generic server errors (#1648) by @frascuchon
- docs: spacy DocBin cookbook (#1642) by @ignacioct
- feat: prepare_for_training supports spacy (#1635) by @frascuchon
- style: Improve card spacing (#1638) by @leiyre
- docs: Adding Elasticsearch persistence to docker compose section (#1643) by @maxserras
- chore: remove old rubrix client class (#1639) by @frascuchon
- feat(Listeners): allow using query params in the condition through search parameter (#1627) by @frascuchon
- doc: show metric graphs in documentation (#1669) by @leiyre
- fix(docker-compose.yaml): default volume and disable disk threshold (#1656) by @frascuchon
- fix: Encode rule name in Weak Labeling API requests (#1649) by @leiyre
v0.16.1
0.16.1 (2022-07-22)
Bug Fixes
- 'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) (3cb4c07), closes #1631
- Display metadata in Text2Text dataset (#1626) (0089e0a), closes #1623
- Show predicted OK/KO when predictions exist (#1620) (ef66e9c), closes #1619
You can see all work included in the release here
- fix: 'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) by @dcfidalgo
- fix: Display metadata in Text2Text dataset (#1626) by @leiyre
- chore: set version by @dcfidalgo
- docs: Fix typo in Getting Started -> Concepts (#1618) by @dcfidalgo
- fix: Show predicted OK/KO when predictions exist (#1620) by @leiyre
v0.16.0
0.16.0 (2022-07-08)
Highlights
👂 Listeners: enable more interactive workflows between client and server
Listeners enable you to define functions that get executed under certain conditions when something changes in a dataset. There are many use cases for this: monitoring annotation jobs, monitoring model predictions, enabling active learning workflows, and many more.
You can find the Python API reference docs here: https://rubrix.readthedocs.io/en/stable/reference/python/python_listeners.html#python-listeners
We will be documenting these use cases with practical examples, but for this release, we've included a new tutorial for using this with active learning: https://rubrix.readthedocs.io/en/stable/tutorials/active_learning_with_small_text.html. This tutorial includes the following listener function, which implements the active learning loop:
from rubrix.listeners import listener
from sklearn.metrics import accuracy_score

import numpy as np
import rubrix as rb

# DATASET_NAME, NUM_SAMPLES, trec, active_learner and dataset_test
# are defined earlier in the tutorial

# Define some helper variables
LABEL2INT = trec["train"].features["label-coarse"].str2int
ACCURACIES = []

# Set up the active learning loop with the listener decorator
@listener(
    dataset=DATASET_NAME,
    query="status:Validated AND metadata.batch_id:{batch_id}",
    condition=lambda search: search.total == NUM_SAMPLES,
    execution_interval_in_seconds=3,
    batch_id=0,
)
def active_learning_loop(records, ctx):
    # 1. Update active learner
    print(f"Updating with batch_id {ctx.query_params['batch_id']} ...")
    y = np.array([LABEL2INT(rec.annotation) for rec in records])

    # initial update
    if ctx.query_params["batch_id"] == 0:
        indices = np.array([rec.id for rec in records])
        active_learner.initialize_data(indices, y)
    # update with the prior queried indices
    else:
        active_learner.update(y)
    print("Done!")

    # 2. Query active learner
    print("Querying new data points ...")
    queried_indices = active_learner.query(num_samples=NUM_SAMPLES)
    ctx.query_params["batch_id"] += 1
    new_records = [
        rb.TextClassificationRecord(
            text=trec["train"]["text"][idx],
            metadata={"batch_id": ctx.query_params["batch_id"]},
            id=idx,
        )
        for idx in queried_indices
    ]

    # 3. Log the batch to Rubrix
    rb.log(new_records, DATASET_NAME)

    # 4. Evaluate current classifier on the test set
    print("Evaluating current classifier ...")
    accuracy = accuracy_score(
        dataset_test.y,
        active_learner.classifier.predict(dataset_test),
    )
    ACCURACIES.append(accuracy)
    print("Done!")
    print("Waiting for annotations ...")
📖 New docs!
https://rubrix.readthedocs.io/
🧱 extend_matrix: Weak label augmentation using embeddings
This release includes an exciting feature to augment the coverage of your weak labels using embeddings. You can find a practical tutorial here: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html
Features
- #1561: standardize icons (#1565) (15254e7), closes #1561
- #1602: new rubrix dataset listeners (#1507, #1586, #1583, #1596) (65747ab), closes #1602
- Add 'extend_matrix' to the WeakMultiLabel class (#1577) (cf89311)
- Improve from datasets (#1567) (2b0d607)
- token-class: adjust token spans spaces (#1599) (0fb3576)
Bug Fixes
- #1264: discard first space after a token (#1591) (eff0ac5), closes #1264
- #1545: highlight words with accents (#1550) (c42e77b), closes #1545
- #1548: access datasets for superusers when workspace is not provided (#1572, #1608) (0b04bc8), closes #1548
- #1551: don't show error traces for EntityNotFoundError's (#1569) (04e101c), closes #1551
- #1557: allow text editing when clicking the "edit" button (#1558) (e751414), closes #1557
- #1574: search highlighting for a single dot (#1592) (53474a1), closes #1574
- #1575: show predicted ok/ko in Text Classifier explore mode (#1576) (ada87c0), closes #1575
- compatibility with new dataset version (#1566) (ac26e30)
Documentation
- #1512: change theme to furo (#1564, #1604) (98869d2), closes #1512
- add 'how to prepare your data for training' to basics (#1589) (a21bcf3)
- add active learning with small text and listener tutorial (#1585, #1609) (d59573f), closes #1601 #421
- Add MajorityVoter to references + Add comments about multi-label support of the label models (#1582) (ab481c7)
- add pip version and dockertag as parameter in the build process (#1560) (73a31e2)
You can see all work included in the release here
- chore(docs): remove by @frascuchon
- docs: add active learning with small text and listener tutorial (#1585, #1609) by @dcfidalgo
- docs(#1512): change theme to furo (#1564, #1604) by @frascuchon
- chore: set version by @frascuchon
- feat(token-class): adjust token spans spaces (#1599) by @frascuchon
- feat(#1602): new rubrix dataset listeners (#1507, #1586, #1583, #1596) by @frascuchon
- docs: add 'how to prepare your data for training' to basics (#1589) by @dcfidalgo
- test: configure numpy to disable multi threading (#1593) by @frascuchon
- docs: Add MajorityVoter to references + Add comments about multi-label support of the label models (#1582) by @dcfidalgo
- feat(#1561): standardize icons (#1565) by @leiyre
- Feat: Improve from datasets (#1567) by @dcfidalgo
- feat: Add 'extend_matrix' to the WeakMultiLabel class (#1577) by @dcfidalgo
- docs: add pip version and dockertag as parameter in the build process (#1560) by @frascuchon
- refactor: remove words references in searches (#1571) by @frascuchon
- ci: check conda env cache (#1570) by @frascuchon
- fix(#1264): discard first space after a token (#1591) by @frascuchon
- ci(package): regenerate view snapshot (#1600) by @frascuchon
- fix(#1574): search highlighting for a single dot (#1592) by @leiyre
- fix(#1575): show predicted ok/ko in Text Classifier explore mode (#1576) by @leiyre
- fix(#1548): access datasets for superusers when workspace is not provided (#1572, #1608) by @frascuchon
- fix(#1551): don't show error traces for EntityNotFoundError's (#1569) by @frascuchon
- fix: compatibility with new dataset version (#1566) by @dcfidalgo
- fix(#1557): allow text editing when clicking the "edit" button (#1558) by @leiyre
- fix(#...
v0.15.0
0.15.0 (2022-06-08)
🔆 Highlights
🏷️ Configure datasets with a labeling scheme
You can now predefine and change the label schema of your datasets. This is useful for fixing a set of labels for you and your annotation teams.
import rubrix as rb
# Define labeling schema
settings = rb.TextClassificationSettings(label_schema=["A", "B", "C"])
# Apply settings to a new or already existing dataset
rb.configure_dataset(name="my_dataset", settings=settings)
# Logging to the newly created dataset triggers the validation checks
rb.log(rb.TextClassificationRecord(text="text", annotation="D"), "my_dataset")
#BadRequestApiError: Rubrix server returned an error with http status: 400
Read the docs: https://rubrix.readthedocs.io/en/stable/guides/dataset_settings.html
🧱 Weak label matrix augmentation using embeddings
You can now use an augmentation technique inspired by https://github.com/HazyResearch/epoxy to augment the coverage of your rules using embeddings (e.g., sentence transformers). This is useful for improving the recall of your labeling rules.
Read the tutorial: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html
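The underlying idea can be sketched without the Rubrix client. In the simplified, hypothetical illustration below (not the library's implementation), each rule's column of votes is extended: records where the rule abstained (marked -1) inherit the vote of their most similar fired record whenever the cosine similarity clears a threshold:

```python
import numpy as np

def extend_votes(votes, embeddings, threshold=0.8):
    """Extend a (records x rules) vote matrix; -1 marks an abstained rule."""
    # Normalize embeddings so the dot product equals cosine similarity
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T
    extended = votes.copy()
    for j in range(votes.shape[1]):  # one column per rule
        fired = np.where(votes[:, j] != -1)[0]
        abstained = np.where(votes[:, j] == -1)[0]
        if len(fired) == 0 or len(abstained) == 0:
            continue
        block = sim[np.ix_(abstained, fired)]
        nearest = fired[np.argmax(block, axis=1)]   # closest fired record
        close = np.max(block, axis=1) >= threshold  # similar enough?
        extended[abstained[close], j] = votes[nearest[close], j]
    return extended

votes = np.array([[1], [-1], [-1]])  # one rule, fired only on record 0
embeddings = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
print(extend_votes(votes, embeddings).tolist())  # → [[1], [1], [-1]]
```

Record 1's embedding is nearly identical to record 0's, so it inherits the rule's vote; record 2 is orthogonal and stays abstained.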
🏛️ Tutorial Gallery
Tutorials are now organized into different categories and with a new gallery design!
Read the docs: https://rubrix.readthedocs.io/en/stable/tutorials/introductory.html
🏁 Basics guide
This is the first version of the basics guide. This guide will show you how to perform the most basic actions with Rubrix, such as uploading data or data annotation.
Read the docs: https://rubrix.readthedocs.io/en/stable/getting_started/basics.html
Features
- #1134: Allow extending the weak label matrix with embeddings (#1487) (4d54994), closes #1134
- #1432: configure datasets with a label schema (21e48c0), closes #1432
- #1446: copy icon position in datasets list (#1448) (7c9fa52), closes #1446
- #1460: include text hyphenation (#1469) (ec23b2d), closes #1460
- #1463: change icon position in table header (#1473) (5172324), closes #1463
- #1467: include animation delay for last progress bar track (#1462) (c772b74), closes #1467
- configuration: add elasticsearch ca_cert path variable (#1502) (f0eda12)
- UI: improve access to actions in metadata and sort dropdowns (#1510) (8d33090), closes #1435
Bug Fixes
- #1522: dates metadata fields accessible for sorting (#1529) (a576ceb), closes #1522
- #1527: check agents instead of labels for predicted computation (#1528) (2f2ee2e), closes #1527
- #1532: correct domain for filter score histogram (#1540) (7478d6c), closes #1532
- #1533: restrict highlighted fields (3a8b8a9), closes #1533
- #1534: fix progress in the metrics sidebar when page is refreshed (#1536) (1b572c4)
- #1539: checkbox behavior with value 0 (#1541) (7a0ab63), closes #1539
- metrics: compute f1 for text classification (#1530) (147d38a)
- search: highlight only textual input fields (8b83a82), closes #1538 #1544
New contributors
@RafaelBod made his first contribution in #1413
v0.14.2
0.14.2 (2022-05-31)
Bug Fixes
- #1514: allow ent score None and change default value to 0.0 (#1521) (0a02c70), closes #1514
- #1516: restore read-only to copied dataset (#1520) (5b9cf0e), closes #1516
- #1517: stop background task when something happens to main thread (#1519) (0304f40), closes #1517
- #1518: disable global actions checkbox when no data was found (#1525) (bf35e72), closes #1518
- UI: remove selected metadata fields for sortable fields dropdown (#1513) (bb9482b)
v0.14.1
0.14.1 (2022-05-20)
Bug Fixes
- #1447: change agent when validating records with annotation but default status (#1480) (126e6f4), closes #1447
- #1472: hide scrollbar in scrollable components (#1490) (b056e4e), closes #1472
- #1483: close global actions "Annotate as" selector after deselect records checkbox (#1485) (a88f8cb)
- #1503: Count filter values when loading a dataset with a route query (#1506) (43be9b8), closes #1503
- documentation: fix user management guide (#1511) (63f7bee), closes #1501
- filters: sort filter values by count (#1488) (0987167), closes #1484