Urdu

A collection of Urdu datasets for POS tagging, NER, sentiment analysis, summarization, and other NLP tasks.

View the Project on GitHub mirfan899/Urdu

Summary Dataset

This is a summarization dataset that you can use to train an abstractive summarization model. It contains 3 files, i.e. train, test and val. The data is in JSONL format.

Every line has these keys.

id
url
title
summary
text

You can read the data with pandas:

import pandas as pd
test = pd.read_json("summary/urdu_test.jsonl", lines=True)
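
If pandas is not installed, the same files can be read with the standard library. The sketch below assumes the files are UTF-8 encoded and that each line is one JSON object carrying the five keys listed above; the helper name `read_jsonl` is hypothetical and not part of the repository.

```python
import json

# Keys every record is expected to carry, per the list above.
EXPECTED_KEYS = {"id", "url", "title", "summary", "text"}


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    # Assumption: the files are UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = EXPECTED_KEYS - record.keys()
            if missing:
                raise ValueError(
                    f"line {line_number}: missing keys {sorted(missing)}"
                )
            yield record
```

For example, `for record in read_jsonl("summary/urdu_test.jsonl")` iterates over the test split one record at a time without loading the whole file into memory.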

POS dataset

A small Urdu dataset for training part-of-speech (POS) taggers. The structure of the dataset is simple, i.e. one word and its tag per line:

word TAG
word TAG

The tagset used to build the dataset is taken from Sajjad’s Tagset.
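
The `word TAG` layout can be parsed with a few lines of Python. This is a minimal sketch, assuming the file is UTF-8 encoded and that the word and tag are separated by whitespace (a space or a tab); the function names are hypothetical.

```python
from collections import Counter


def read_pos_pairs(path):
    """Return a list of (word, tag) tuples from a 'word TAG' file."""
    pairs = []
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            # The tag is the last field; everything before it is the word.
            word, tag = " ".join(parts[:-1]), parts[-1]
            pairs.append((word, tag))
    return pairs


def tag_counts(pairs):
    """Count how often each tag occurs."""
    return Counter(tag for _, tag in pairs)
```

Calling `tag_counts(read_pos_pairs(path))` gives a quick view of the tag distribution before training.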

NER Datasets

The following datasets can be used for NER tasks.

UNER Dataset

Happy to announce that the UNER (Urdu Named Entity Recognition) dataset is available for NLP apps. The following NER tags are used to build the dataset:

PERSON
LOCATION
ORGANIZATION
DATE
NUMBER
DESIGNATION
TIME

If you want to read more about the dataset, check the paper Urdu NER. The NER dataset is in UTF-16 format.
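
Because the file is UTF-16 encoded, many tools that expect UTF-8 will not read it correctly. A minimal sketch for re-encoding it as UTF-8 with the standard library follows; the function name and the output path in the usage note are hypothetical.

```python
def convert_utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8; return the number of lines written."""
    count = 0
    # The "utf-16" codec uses the byte order mark, when present,
    # to decide between little-endian and big-endian input.
    with open(src_path, encoding="utf-16") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)
            count += 1
    return count
```

For example, `convert_utf16_to_utf8("ner/uner.txt", "ner/uner_utf8.txt")` writes a UTF-8 copy next to the original file.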

MK-PUCIT Dataset

The latest dataset for Urdu NER is available. Check this paper for more information: MK-PUCIT.

The entities used in the dataset are:

Other
Organization
Person
Location

The MK-PUCIT author also provides a Dropbox link to download the data: Dropbox

IJNLP 2008 dataset

The IJNLP dataset has the following NER tags.

O
LOCATION
PERSON
TIME
ORGANIZATION
NUMBER
DESIGNATION

Jahangir dataset

The Jahangir dataset has the following NER tags.

O
PERSON
LOCATION
ORGANIZATION
DATE
TIME

Datasets for Sentiment Analysis

IMDB Urdu Movie Review Dataset.

This dataset is taken from IMDB Urdu. It was translated using Google Translate. It has only two labels, i.e.

positive
negative

Roman Dataset

This dataset can be used for sentiment analysis of Roman Urdu. It has 3 classes:

Neutral
Positive
Negative

If you need more information about this dataset, check out the link Roman Urdu Dataset.

Products & Services dataset

This sentiment analysis dataset covers various products and services and is collected from different sources such as social media and the web. It contains 3 classes.

pos
neg
neu

Daraz Products dataset

This dataset consists of reviews taken from Daraz. You can use it for sentiment analysis as well as spam or ham classification. It contains the following columns.

Product_ID
Date
Rating
Spam(1) and Not Spam(0)
Reviews
Sentiment
Features

The dataset is taken from Kaggle daraz.
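
For the spam or ham use case, the reviews can be split on the spam flag. The sketch below is an illustration only: it assumes the data is a UTF-8 CSV whose header matches the column names listed above exactly, and that the flag is stored as `0` or `1`. The function names are hypothetical.

```python
import csv

# Header text as listed above; adjust if the actual file differs.
SPAM_COLUMN = "Spam(1) and Not Spam(0)"


def load_reviews(path):
    """Read the CSV into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def split_spam(rows):
    """Split review rows into (spam, not_spam) lists using the 0/1 flag."""
    spam, not_spam = [], []
    for row in rows:
        flag = str(row[SPAM_COLUMN]).strip()
        # Assumption: spam rows are marked with exactly "1".
        (spam if flag == "1" else not_spam).append(row)
    return spam, not_spam
```

The two returned lists keep all original columns, so `Reviews` and `Sentiment` remain available for training either classifier.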

Urdu Dataset

Here is a small dataset for sentiment analysis. It has the following classification labels:

P
N
O

Link to the paper: Paper
GitHub link to the data: Urdu Corpus V1

News Datasets

Urdu News Dataset 1M

This dataset (news/urdu-news-dataset-1M.tar.xz) is taken from Urdu News Dataset 1M. I have removed unnecessary columns. It can be used for classification and other NLP tasks and has 4 classes:

Business & Economics
Entertainment
Science & Technology
Sports
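
The news datasets are shipped as compressed tar archives. A minimal sketch for unpacking one with the standard library is shown here; the function name is hypothetical, and because the names of the files inside the archives are not stated in this README, the function returns them instead of assuming them.

```python
import tarfile


def extract_archive(archive_path, target_dir):
    """Extract a tar archive (.tar.xz or .tar.gz) and return the member names."""
    # Mode "r:*" lets tarfile detect the compression (xz, gz, bz2) itself.
    with tarfile.open(archive_path, mode="r:*") as archive:
        names = archive.getnames()
        # Only extract archives from sources you trust.
        archive.extractall(path=target_dir)
    return names
```

For example, `extract_archive("news/urdu-news-dataset-1M.tar.xz", "news")` unpacks the archive into the `news` directory and returns the list of extracted paths.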

Real-Fake News

This dataset (news/real_fake_news.tar.gz) is used for classifying real and fake news; see Fake News Dataset. It contains news from the following domains.

Technology 
Education 
Business
Sports
Politics
Entertainment

News Headlines

The headlines dataset (news/headlines.csv.tar.gz) is taken from Urdu News Headlines. The original dataset is in Excel format; I’ve converted it to CSV for experiments. It can be used for clustering and classification.

RAW corpus and models

COUNTER (COrpus of Urdu News TExt Reuse) Dataset

This dataset is collected from news journalism and can be used for Urdu NLP research. For more information, see the resource COUNTER.

Urdu model for SpaCy

An Urdu model for spaCy is available now. You can use it to build NLP apps easily. Install the package in your working environment.

pip install ur_model-0.0.0.tar.gz

You can use it with the following code.

import spacy
nlp = spacy.load("ur_model")
doc = nlp("میں خوش ہوں کے اردو ماڈل دستیاب ہے۔ ")

NLP Tutorials for Urdu

Check out my articles on Urdu NLP tasks.

These articles are available on UrduNLP.

Some Helpful Tips

Download a single file from GitHub

If you only want a raw file (text or code), use the curl command with the file's raw URL, i.e.

```shell
curl -LJO https://github.com/mirfan899/Urdu/raw/master/ner/uner.txt
```

Concatenate files

```shell
cd data
cat */*.txt > file_name.txt
```

MK-PUCIT

Concatenate the MK-PUCIT files into a single file using:

```shell
cat */*.txt > file_name.txt
```

The original dataset has a bug: `Others` and `Other` are the same entity but appear as two different tags. If you want to use the dataset from the Dropbox link, use the following commands to clean it.

```python
import pandas as pd

# Column order is assumed from the "word tag" layout noted below.
# A list keeps that order; a set does not.
data = pd.read_csv("ner/mk-pucit.txt", sep="\t", names=["word", "tag"])
# Merge the duplicate label into a single tag.
data["tag"] = data["tag"].replace({"Others": "Other"})
# Save as csv or txt, as needed, by changing the extension.
data.to_csv("ner/mk-pucit.txt", index=False, header=False, sep="\t")
```

Now the csv/txt file has the format

word tag

Note

If you have a dataset (link) and want to contribute, feel free to create a PR.