Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add format_as, push_to_huggingface, & from_huggingface in FeedbackDataset #2949

Conversation

alvarobartt
Copy link
Member

@alvarobartt alvarobartt commented May 19, 2023

Description

This PR adds some methods to the rg.FeedbackDataset as follows:

  • format_as to format the records of the existing dataset as 🤗datasets format (i.e. datasets.Dataset) to be extended to support other formats like pandas or numpy
  • push_to_huggingface to push the existing records and dataset configuration to the HuggingFace Hub (including an auto-generated dataset card! 🎉)
  • from_huggingface is a classmethod to load a rg.FeedbackDataset from the HuggingFace Hub, pulling the dataset itself (records and annotations) and the configuration (guidelines, fields, and questions)

Closes #2765

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested

  • Added integration tests for push_to_huggingface and from_huggingface
  • Added integration tests for format_as("datasets")

Checklist

  • I have merged the original branch into my forked branch
  • I added relevant documentation
  • follows the style guidelines of this project
  • I did a self-review of my code
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works

@alvarobartt alvarobartt added type: enhancement Indicates new feature requests client labels May 19, 2023
@alvarobartt
Copy link
Member Author

Here's an example on how to use the implemented features:

import argilla as rg

fds = rg.FeedbackDataset(fields=[...], questions=[...])
fds.add_records([...])
fds.push_to_huggingface(repo_id="argilla/feedback-dataset", private=True, ...) # args and **kwargs correspond to the original `push_to_hub` method from 🤗HuggingFace
import argilla as rg

fds = rg.FeedbackDataset.from_huggingface(repo_id="argilla/feedback-dataset", ...) # args and **kwargs correspond to the original `load_dataset` function from 🤗HuggingFace

@alvarobartt
Copy link
Member Author

I wanted to clarify what do we do when the dataset to be pushed to the HuggingFace Hub has no records, should we still upload the configuration and the dataset card? Or should we show a warning that the dataset cannot be created since there are no records? cc @frascuchon @dvsrepo

@alvarobartt alvarobartt requested a review from frascuchon May 19, 2023 07:33
@dvsrepo dvsrepo removed the request for review from frascuchon May 19, 2023 08:03
@dvsrepo
Copy link
Member

dvsrepo commented May 19, 2023

Thats

I wanted to clarify what do we do when the dataset to be pushed to the HuggingFace Hub has no records, should we still upload the configuration and the dataset card? Or should we show a warning that the dataset cannot be created since there are no records? cc @frascuchon @dvsrepo

That's a very good question.

On the one hand, I think is great to be able to push and read just the config.

On the other hand, what happens when people use the load_dataset (instead of our own from_huggingface)? I wonder if they'll get a weird error. If the error is not weird or there's no error, then I'd say we are fine using warnings on our side as discussed below.

From our side we can always warn users both when pushing and doing from_huggingface. We could even raise an exception first time and recommend the user to use a param (e.g., force_whatever) to push without records (maybe a dirty option)

All in all, the thing I find interesting is to be able to push and read configs so I let you think of we can cover this with these methods or there's a complete different workflow we can provide in the future.

@alvarobartt
Copy link
Member Author

Thats

I wanted to clarify what do we do when the dataset to be pushed to the HuggingFace Hub has no records, should we still upload the configuration and the dataset card? Or should we show a warning that the dataset cannot be created since there are no records? cc @frascuchon @dvsrepo

That's a very good question.

On the one hand, I think is great to be able to push and read just the config.

On the other hand, what happens when people use the load_dataset (instead of our own from_huggingface)? I wonder if they'll get a weird error. If the error is not weird or there's no error, then I'd say we are fine using warnings on our side as discussed below.

From our side we can always warn users both when pushing and doing from_huggingface. We could even raise an exception first time and recommend the user to use a param (e.g., force_whatever) to push without records (maybe a dirty option)

All in all, the thing I find interesting is to be able to push and read configs so I let you think of we can cover this with these methods or there's a complete different workflow we can provide in the future.

Thanks for the quick response! Then I guess for the moment we're safer with an exception being raised in case there are no records, and we can look into it in more detail when we tackle the from_config and save_config methods, so as to "standardize" a little bit more how we handle the configuration and, if applicable, upload just the configuration to the HuggingFace Hub 👍🏻

@dvsrepo
Copy link
Member

dvsrepo commented May 19, 2023

Sounds good!

@codecov
Copy link

codecov bot commented May 19, 2023

Codecov Report

Patch coverage: 86.04% and project coverage change: -0.04 ⚠️

Comparison is base (7b3ee2a) 92.86% compared to head (04195bd) 92.83%.

Additional details and impacted files
@@                       Coverage Diff                       @@
##           feat/2615-instructions-task    #2949      +/-   ##
===============================================================
- Coverage                        92.86%   92.83%   -0.04%     
===============================================================
  Files                              204      204              
  Lines                            10571    10657      +86     
===============================================================
+ Hits                              9817     9893      +76     
- Misses                             754      764      +10     
Flag Coverage Δ
pytest 92.83% <86.04%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/argilla/client/feedback.py 79.67% <86.04%> (+2.64%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@frascuchon frascuchon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice!

Just some code refactor suggestions. Feel free to apply them or not.

@davidberenstein1957 davidberenstein1957 merged commit 8139f81 into feat/2615-instructions-task May 22, 2023
@davidberenstein1957 davidberenstein1957 deleted the feat/feedback-dataset-huggingface branch May 22, 2023 09:36
@frascuchon frascuchon mentioned this pull request May 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement Indicates new feature requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow users push an feedback task dataset to the HF hub using push_to_hub
4 participants