Add `format_as`, `push_to_huggingface`, & `from_huggingface` in `FeedbackDataset` #2949

alvarobartt · 2023-05-19T07:20:06Z

Description

This PR adds the following methods to `rg.FeedbackDataset`:

`format_as` formats the records of the existing dataset in the 🤗 `datasets` format (i.e. `datasets.Dataset`); it is meant to be extended later to support other formats such as `pandas` or `numpy`
`push_to_huggingface` pushes the existing records and the dataset configuration to the HuggingFace Hub (including an auto-generated dataset card! 🎉)
`from_huggingface` is a classmethod that loads a `rg.FeedbackDataset` from the HuggingFace Hub, pulling both the dataset itself (records and annotations) and the configuration (guidelines, fields, and questions)
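
To make the `format_as("datasets")` idea more concrete, here is a minimal sketch of the row-to-column conversion that a `datasets.Dataset.from_dict` call expects. The helper name `records_to_columns` and the record shape are assumptions for illustration only, not the code in this PR.

```python
from typing import Any, Dict, List


def records_to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of row dicts into a dict of columns (hypothetical helper)."""
    # Collect every key in first-seen order so that all columns have the same length.
    keys: List[str] = []
    for record in records:
        for key in record:
            if key not in keys:
                keys.append(key)
    # Fields missing from a record are filled with None.
    return {key: [record.get(key) for record in records] for key in keys}


# Assumed record shape, for illustration only.
records = [
    {"text": "Hello", "label": "positive"},
    {"text": "Bye"},
]
columns = records_to_columns(records)
print(columns)
```

If the `datasets` package is installed, the resulting dictionary could then be passed to `datasets.Dataset.from_dict(columns)`.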

Closes #2765

Type of change

New feature (non-breaking change which adds functionality)

How Has This Been Tested

Added integration tests for push_to_huggingface and from_huggingface
Added integration tests for format_as("datasets")

Checklist

I have merged the original branch into my forked branch
I added relevant documentation
My code follows the style guidelines of this project
I did a self-review of my code
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works

alvarobartt · 2023-05-19T07:26:28Z

Here's an example of how to use the implemented features:

import argilla as rg

fds = rg.FeedbackDataset(fields=[...], questions=[...])
fds.add_records([...])
fds.push_to_huggingface(repo_id="argilla/feedback-dataset", private=True, ...) # args and **kwargs correspond to the original `push_to_hub` method from 🤗HuggingFace

import argilla as rg

fds = rg.FeedbackDataset.from_huggingface(repo_id="argilla/feedback-dataset", ...) # args and **kwargs correspond to the original `load_dataset` function from 🤗HuggingFace

alvarobartt · 2023-05-19T07:27:51Z

I wanted to clarify what we should do when the dataset to be pushed to the HuggingFace Hub has no records. Should we still upload the configuration and the dataset card? Or should we show a warning that the dataset cannot be created since there are no records? cc @frascuchon @dvsrepo

dvsrepo · 2023-05-19T08:05:50Z

> I wanted to clarify what we should do when the dataset to be pushed to the HuggingFace Hub has no records. Should we still upload the configuration and the dataset card? Or should we show a warning that the dataset cannot be created since there are no records? cc @frascuchon @dvsrepo

That's a very good question.

On the one hand, I think it's great to be able to push and read just the config.

On the other hand, what happens when people use `load_dataset` (instead of our own `from_huggingface`)? I wonder if they'll get a weird error. If the error is not weird or there's no error, then I'd say we are fine using warnings on our side as discussed below.

From our side we can always warn users both when pushing and when calling `from_huggingface`. We could even raise an exception the first time and recommend that the user pass a param (e.g., `force_whatever`) to push without records (maybe a dirty option).

All in all, what I find interesting is being able to push and read configs, so I'll let you think about whether we can cover this with these methods or whether there's a completely different workflow we can provide in the future.

alvarobartt · 2023-05-19T08:09:05Z

> > I wanted to clarify what we should do when the dataset to be pushed to the HuggingFace Hub has no records. Should we still upload the configuration and the dataset card? Or should we show a warning that the dataset cannot be created since there are no records? cc @frascuchon @dvsrepo
>
> That's a very good question.
>
> On the one hand, I think it's great to be able to push and read just the config.
>
> On the other hand, what happens when people use `load_dataset` (instead of our own `from_huggingface`)? I wonder if they'll get a weird error. If the error is not weird or there's no error, then I'd say we are fine using warnings on our side as discussed below.
>
> From our side we can always warn users both when pushing and when calling `from_huggingface`. We could even raise an exception the first time and recommend that the user pass a param (e.g., `force_whatever`) to push without records (maybe a dirty option).
>
> All in all, what I find interesting is being able to push and read configs, so I'll let you think about whether we can cover this with these methods or whether there's a completely different workflow we can provide in the future.

Thanks for the quick response! Then I guess for the moment we're safer with an exception being raised in case there are no records, and we can look into it in more detail when we tackle the from_config and save_config methods, so as to "standardize" a little bit more how we handle the configuration and, if applicable, upload just the configuration to the HuggingFace Hub 👍🏻
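
As a rough sketch of the behaviour agreed on above (raise when there are no records), something along these lines could work. The names `ensure_records` and `push` are hypothetical stand-ins, not the actual `push_to_huggingface` implementation, and no upload is performed.

```python
from typing import Any, Dict, List


def ensure_records(records: List[Dict[str, Any]]) -> None:
    """Raise if there is nothing to push (hypothetical helper, not Argilla's API)."""
    if len(records) == 0:
        raise ValueError("Cannot push a dataset with no records; add records first.")


def push(records: List[Dict[str, Any]], repo_id: str) -> str:
    """Stand-in for a push method: validates first, then reports what it would do."""
    ensure_records(records)
    # A real implementation would upload here; this sketch only describes the intent.
    return f"would push {len(records)} record(s) to {repo_id}"


print(push([{"text": "Hello"}], "argilla/feedback-dataset"))

try:
    push([], "argilla/feedback-dataset")
except ValueError as error:
    print(f"refused: {error}")
```

Raising early keeps the Hub repository from being created in a half-populated state; pushing the configuration alone can then be revisited together with `from_config` and `save_config`.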

dvsrepo · 2023-05-19T08:13:22Z

Sounds good!

codecov · 2023-05-19T08:30:30Z

Codecov Report

Patch coverage: 86.04% and project coverage change: -0.04 ⚠️

Comparison is base (7b3ee2a) 92.86% compared to head (04195bd) 92.83%.

Additional details and impacted files

@@                       Coverage Diff                       @@
##           feat/2615-instructions-task    #2949      +/-   ##
===============================================================
- Coverage                        92.86%   92.83%   -0.04%     
===============================================================
  Files                              204      204              
  Lines                            10571    10657      +86     
===============================================================
+ Hits                              9817     9893      +76     
- Misses                             754      764      +10

Flag	Coverage Δ
pytest	`92.83% <86.04%> (-0.04%)`	⬇️

Impacted Files	Coverage Δ
src/argilla/client/feedback.py	`79.67% <86.04%> (+2.64%)`	⬆️

src/argilla/client/feedback.py

frascuchon

Really nice!

Just some code refactor suggestions. Feel free to apply them or not.

Co-authored-by: Francisco Aranda <[email protected]>

alvarobartt added 10 commits May 18, 2023 12:46

Add format_as("datasets") (WIP)

79fa65e

Fix Dataset.from_dict call in format_as

b4af6d3

Fix docstring in push_to_argilla

b4a56df

Add push_to_huggingface

80fac2c

Add FeedbackDatasetConfig & upload to Hub

c841f2e

Add from_huggingface classmethod

b48d5c3

Remove TestFeedbackDataset as not required

fad8265

Slight import-restructuring

3a24be6

e.g. `hf_hub_download` cannot be mocked if imported as a top-level import
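
A likely explanation for this mocking caveat, assuming the tests patch the attribute on the library module itself, is that a top-level `from ... import ...` binds the name once at import time, so a later patch on the source module is not seen. The sketch below reproduces the effect with a hypothetical `fake_lib` module standing in for the real library.

```python
import sys
import types
from unittest import mock

# Hypothetical stand-in for a third-party library such as `huggingface_hub`.
fake_lib = types.ModuleType("fake_lib")
fake_lib.download = lambda: "real"
sys.modules["fake_lib"] = fake_lib

# Top-level style: the name is bound once, when this import statement runs.
from fake_lib import download as top_level_download


def use_top_level():
    # Calls the object that was bound at import time.
    return top_level_download()


def use_local_import():
    # Function-level import: the attribute is looked up again on every call.
    from fake_lib import download

    return download()


# Patch the attribute on the library module, as a test might do.
with mock.patch("fake_lib.download", return_value="mocked"):
    patched_top_level = use_top_level()
    patched_local = use_local_import()

print(patched_top_level)  # real
print(patched_local)  # mocked
```

An alternative to moving the import inside the function is to patch the name in the module that uses it rather than in the library that defines it.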

Add FeedbackDataset HF-related tests

d04af85

Update monkeypatch return value

2b5cf4e

It seems that the mock for `load_dataset` is not working fine, since it's being ignored for some reason

alvarobartt added the `type: enhancement` and `client` labels May 19, 2023

Merge branch 'feat/2615-instructions-task' into feat/feedback-dataset-huggingface

913a8eb

alvarobartt requested a review from frascuchon May 19, 2023 07:33

Fix unit tests due to extra self arg

fb22b94

dvsrepo removed the request for review from frascuchon May 19, 2023 08:03

Add format_as integration tests

907e5a9

Raise error in push_to_huggingface if no records

95d23c0