[BUGFIX] `argilla`: Fix some `from_hub` method errors #5523
@@ -192,8 +193,11 @@ def from_hub(
    @staticmethod
    def _log_dataset_records(hf_dataset: "HFDataset", dataset: "Dataset"):
        """This method extracts the responses from a Hugging Face dataset and returns a list of `Record` objects"""
        # THIS IS REQUIRED SINCE NAME IN ARGILLA ARE LOWERCASE. The Settings BaseModel models apply this logic
        # Ideally, the restrictions in Argilla should be removed and the names should be case-insensitive
        hf_dataset = hf_dataset.rename_columns({col: col.lower() for col in hf_dataset.column_names})
Here we apply the same lowercasing logic that is used for the settings names. I only included the change here, but we could still have problems importing records whose names are not all lowercase.
Maybe the solution would be to relax this constraint on the server backend (cc @jfcalvo). Either way, I think applying `name.lower()` across all the settings models is currently a risk, and we should change it.
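One concrete form of the risk mentioned in this comment is that two columns differing only by case collapse into the same name once lowercased. The following is a minimal sketch, not part of the PR: `lowercase_column_mapping` is a hypothetical helper, and the column names are illustrative. It builds the same kind of mapping that the diff passes to `rename_columns`, but refuses ambiguous renames.

```python
from collections import defaultdict


def lowercase_column_mapping(column_names):
    """Build a {original: lowercase} mapping, rejecting ambiguous renames.

    Hypothetical helper for illustration only; it is not part of Argilla.
    """
    by_lower = defaultdict(list)
    for name in column_names:
        by_lower[name.lower()].append(name)

    # Two or more original names that map to the same lowercase name collide.
    collisions = {low: names for low, names in by_lower.items() if len(names) > 1}
    if collisions:
        raise ValueError(f"Column names collide after lowercasing: {collisions}")

    return {name: name.lower() for name in column_names}


# Illustrative column names (assumed, not read from the real dataset).
mapping = lowercase_column_mapping(["context", "question-X", "answer-Y"])
```

Under these assumptions, the resulting dictionary could be passed to `hf_dataset.rename_columns(mapping)` in place of the inline comprehension, so a collision surfaces as an explicit error rather than a failed or silent rename.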
# Description

Trying this code:

```python
import argilla as rg

client = rg.Argilla(...)
dataset = rg.Dataset.from_hub("google-research-datasets/circa")
```

I found several errors ([here is the source dataset](https://huggingface.co/datasets/google-research-datasets/circa/viewer/default/train?f[goldstandard2][value]=0)):

- The mapping for records does not match, since column names contain uppercase characters.
- Some values for the ClassValue column are unlabelled (-1), making the uncast process fail.
- Importing cached datasets without config files fails when trying to read the dataset JSON file.

So, this PR tackles all the problems described above.

**Type of change**

- Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
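The second error in the description (unlabelled values stored as -1 breaking the uncast step) can be illustrated with a small sketch. This is not the code from the PR: `uncast_label`, the `UNLABELLED` constant, and the label names are hypothetical, and the sketch assumes that -1 is the marker for a missing label and that it should map to `None`.

```python
from typing import Optional, Sequence

# Assumption: -1 marks an unlabelled example in the source dataset.
UNLABELLED = -1


def uncast_label(value: Optional[int], names: Sequence[str]) -> Optional[str]:
    """Convert an integer class id back to its name, tolerating missing labels.

    Hypothetical helper for illustration only; it is not part of Argilla.
    """
    if value is None or value == UNLABELLED:
        # Treat unlabelled values as "no value" instead of failing.
        return None
    if not 0 <= value < len(names):
        raise ValueError(f"Invalid class id {value} for {len(names)} classes")
    return names[value]


# Illustrative label names (assumed, not read from the real dataset).
names = ["Yes", "No", "Other"]
labels = [uncast_label(v, names) for v in [0, -1, 2]]
```

The design choice here is to keep genuinely out-of-range ids as errors while letting the unlabelled marker pass through as a missing value, so bad data is still caught.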