truelearn.datasets.load_peek_dataset_raw

truelearn.datasets.load_peek_dataset_raw(*, dirname: str = '.', train_limit: Optional[int] = None, test_limit: Optional[int] = None, verbose: bool = True) → Tuple[List[List[str]], List[List[str]], Dict[int, Tuple[str, str, str]]]

Download and load the raw PEEKDataset.

Examples

To load the data:

>>> from truelearn.datasets import load_peek_dataset_raw
>>> train, test, mapping = load_peek_dataset_raw(verbose=False)
>>> len(train)
203590
>>> train[0]  
['4248', '1', '1', '172.0', ..., '1']
>>> len(test)
86945
>>> test[0]  
['12730', '1', '1', '0.0', ..., '0']
>>> len(mapping)
30367
>>> mapping[0]
('https://en.wikipedia.org/wiki/"Hello,_World!"_program', '"Hello, World!" program', "Traditional beginners' computer program")

Parameters:
  • * – Used to reject positional arguments; every parameter must be passed by keyword.

  • dirname – The directory where the dataset files are located or will be downloaded to. Defaults to the current directory ('.').

  • train_limit – An optional non-negative integer specifying the maximum number of lines to read from the train file. If None, no limit is applied.

  • test_limit – An optional non-negative integer specifying the maximum number of lines to read from the test file. If None, no limit is applied.

  • verbose – If True and the file has not been downloaded yet, the function prints some information about the file being downloaded.

Returns:

A tuple of (train, test, mapping), where train and test are lists of lines and mapping is a dict mapping topic_id to (url, title, description). Each line in train and test is a list of strings, where each string is the value of one cell in the original CSV.

The data looks like this:

(
    [
        [slug, vid_id, ...],...  # follows the structure of the CSV header
    ],
    [
        ...
    ],
    {
        0: (url, title, description),...  # 0 is the topic_id (wiki id)
    }
)
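
As a rough illustration of how this structure can be consumed, the sketch below builds a tiny stand-in with the same shape and resolves a topic id to its title. The rows are shortened placeholders rather than real dataset rows, and `topic_title` is a hypothetical helper, not part of truelearn.

```python
from typing import Dict, List, Tuple

# Hand-made stand-in with the documented shape. The rows are placeholders,
# not real PEEK rows; the mapping entry is the one shown in the example above.
train: List[List[str]] = [["a", "1"], ["b", "0"]]
test: List[List[str]] = [["c", "1"]]
mapping: Dict[int, Tuple[str, str, str]] = {
    0: (
        'https://en.wikipedia.org/wiki/"Hello,_World!"_program',
        '"Hello, World!" program',
        "Traditional beginners' computer program",
    ),
}


def topic_title(
    mapping: Dict[int, Tuple[str, str, str]],
    topic_id: int,
    default: str = "<unknown topic>",
) -> str:
    """Return the title for topic_id, or default if the id is not mapped."""
    entry = mapping.get(topic_id)
    if entry is None:
        return default
    _url, title, _description = entry
    return title


print(topic_title(mapping, 0))   # title of the mapped topic
print(topic_title(mapping, 99))  # falls back to the default
```

Note that the cells in train and test are strings, so any id read from a row would need converting with `int(...)` before being used as a key in mapping.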

Raises:

TrueLearnValueError – If train_limit or test_limit is less than 0.
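
The limit semantics described for train_limit and test_limit can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not truelearn's implementation: `read_rows` is a hypothetical helper, it reads from an in-memory string instead of a downloaded file, it raises a plain `ValueError` where the real function raises TrueLearnValueError, and it does not address whether a header row counts toward the limit.

```python
import csv
import io
from itertools import islice
from typing import List, Optional


def read_rows(text: str, limit: Optional[int] = None) -> List[List[str]]:
    """Parse CSV text into rows of strings, reading at most `limit` lines."""
    if limit is not None and limit < 0:
        # Stand-in for TrueLearnValueError so the sketch has no
        # third-party dependency.
        raise ValueError(f"limit must be non-negative, got {limit}")
    reader = csv.reader(io.StringIO(text))
    # islice(reader, None) yields every row, so None means "no limit".
    return list(islice(reader, limit))


sample = "a,1,0.5\nb,2,0.25\nc,3,0.125\n"
print(read_rows(sample, limit=2))  # first two rows only
print(len(read_rows(sample)))      # all rows when limit is None
```

Checking the limit before opening the reader means a negative value is rejected before any data is read.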