{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Data Wrangling Exercises\n",
    "\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Data wrangling is the process of cleaning, transforming, and organizing data to make it more suitable for analysis. It is a critical step in any data analysis project, as it ensures that the data is accurate, consistent, and complete.\n",
    "\n",
    "These exercises are designed to provide practice in data wrangling skills using a real-world dataset. The dataset used in these exercises is the Slovenian Natural Language Inference dataset (SI-NLI), which contains labeled examples of text pairs with corresponding labels of entailment, contradiction, or neutral.\n",
    "\n",
    "The exercises cover a range of data wrangling techniques, including importing data, performing basic statistics, subsetting observations and variables, creating new variables, grouping data, and combining datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Get data\n",
    "\n",
    "1. Download SI-NLI from [link](https://www.clarin.si/repository/xmlui/handle/11356/1707).\n",
    "2. Load libraries.\n",
    "3. Import ```train.tsv``` file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Basic statistics\n",
    "\n",
    "1. How many examples are in a dataframe?\n",
    "2. How many variables are in a dataframe?\n",
    "3. Count values in the ```label``` column.\n",
    "4. Are there any missing values in the data?\n",
    "5. Count the number of missing values per column."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Subset observations and variables\n",
    "\n",
    "1. Select ```premise``` column and store it in a list.\n",
    "2. Print first 3 rows from the first 3 columns.\n",
    "3. Select ```pair_id```, ```premise```, ```hypothesis```, ```label``` columns and save them into ```train_dataset``` variable.\n",
    "4. Drop ```pair_id``` column.\n",
    "5. Convert all columns to uppercase.\n",
    "6. Replace ```_``` with ```-``` in column names.\n",
    "7. Select rows that belong to the ```neutral``` label.\n",
    "8. Select last 30 rows.\n",
    "9. Select rows with ```hypothesis``` longer than 100 characters.\n",
    "10. Select rows with ```hypothesis``` longer than 100 characters and belong to the ```neutral``` label.\n",
    "11. Select the row with the longest ```hypothesis```.\n",
    "12. Remove rows that contain ```č```, ```š```, ```ž``` in ```premise``` or ```hypothesis```.\n",
    "13. Remove rows that contain at least one missing value.\n",
    "14. Remove the column with the most missing values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Create new variables\n",
    "\n",
    "1. Create integer type variable ```vowel_count_premise``` which stores the number of vowels in a ```premise```. Repeat for ```hypothesis```.\n",
    "2. Create integer type variable with possible values ```1```, ```2```, ```3``` that counts how many annotations a single example received.\n",
    "3. Create boolean type variable ```agreement``` which reflects whether all annotators agreed on the label."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Save dataframes\n",
    "\n",
    "1. Save the original dataset to disk in a ```csv``` format.\n",
    "..."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
