Collection of Urdu datasets for POS, NER, Sentiment, Summarization and NLP tasks.
This is a summarization dataset. You can use it to train an abstractive summarization model. It contains 3 files, i.e. `train`, `test` and `val`. The data is in `jsonl` format. Every line has these keys.
id
url
title
summary
text
You can easily read the data with pandas
import pandas as pd
test = pd.read_json("summary/urdu_test.jsonl", lines=True)
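If you prefer not to depend on pandas, the same files can be read with the standard library. The sketch below is only an illustration: the helper name `read_jsonl` is hypothetical, and it assumes every line is one JSON object carrying the five keys listed above.

```python
import json

# Keys listed above for every line of the summary files
EXPECTED_KEYS = {"id", "url", "title", "summary", "text"}


def read_jsonl(lines):
    """Parse an iterable of JSON lines and check that each record has the expected keys."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = EXPECTED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_number} is missing keys: {sorted(missing)}")
        records.append(record)
    return records


# Made-up sample line, only to show the call
sample = ['{"id": 1, "url": "u", "title": "t", "summary": "s", "text": "x"}']
print(len(read_jsonl(sample)))
```

For a real file, pass an open file handle, e.g. `read_jsonl(open("summary/urdu_test.jsonl", encoding="utf-8"))`.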
Urdu dataset for POS training. This is a small dataset that can be used to train part-of-speech tagging for the Urdu language. The structure of the dataset is simple, i.e.
word TAG
word TAG
The tagset used to build the dataset is taken from Sajjad’s Tagset.
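A minimal sketch of how the `word TAG` lines could be loaded for training. It assumes that the word and the tag are separated by whitespace and that sentences are separated by blank lines; neither is stated here, so check the file before relying on it. The function name and the tags in the sample are hypothetical.

```python
def parse_pos_lines(lines):
    """Group 'word TAG' lines into sentences of (word, tag) pairs.

    Assumption: a blank line marks the end of a sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.rsplit(maxsplit=1)  # the tag is the last whitespace-separated field
        if len(parts) != 2:
            continue  # skip malformed lines that have no tag
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample with placeholder tags, only to show the call
sample = ["میں PN", "ہوں VB", "", "اردو NN"]
print(parse_pos_lines(sample))
```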
Following are the datasets used for NER tasks.
Happy to announce that the UNER (Urdu Named Entity Recognition) dataset is available for NLP apps. Following are the NER tags used to build the dataset:
PERSON
LOCATION
ORGANIZATION
DATE
NUMBER
DESIGNATION
TIME
If you want to read more about the dataset, check this paper: Urdu NER.
The NER dataset is in `utf-16` format.
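Because the file is UTF-16, opening it with the default encoding will usually fail or produce garbage. The sketch below shows one way to read it and, optionally, convert it to UTF-8. The function names are hypothetical and the paths are whatever you saved the file as.

```python
from pathlib import Path


def read_utf16_lines(path):
    """Return the lines of a UTF-16 text file without trailing newlines."""
    with open(path, encoding="utf-16") as handle:
        return [line.rstrip("\n") for line in handle]


def convert_to_utf8(src, dst):
    """Re-encode a UTF-16 text file as UTF-8."""
    text = Path(src).read_text(encoding="utf-16")
    Path(dst).write_text(text, encoding="utf-8")
```

With pandas the same idea applies: pass `encoding="utf-16"` to `pd.read_csv`.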
The latest dataset for Urdu NER is available. Check this paper for more information: MK-PUCIT.
Entities used in the dataset are:
Other
Organization
Person
Location
The MK-PUCIT author also provided a Dropbox link to download the data: Dropbox
The IJNLP dataset has the following NER tags.
O
LOCATION
PERSON
TIME
ORGANIZATION
NUMBER
DESIGNATION
The Jahangir dataset has the following NER tags.
O
PERSON
LOCATION
ORGANIZATION
DATE
TIME
This dataset is taken from IMDB Urdu. It was translated using Google Translator. It has only two labels i.e.
positive
negative
This dataset can be used for sentiment analysis for Roman Urdu. It has 3 classes for classification.
Neutral
Positive
Negative
If you need more information about this dataset, check out the link: Roman Urdu Dataset.
This dataset is collected from different sources like social media and web for various products and services for sentiment analysis. It contains 3 classes.
pos
neg
neu
This dataset consists of reviews taken from Daraz. You can use it for sentiment analysis as well as spam or ham classification. It contains the following columns.
Product_ID
Date
Rating
Spam(1) and Not Spam(0)
Reviews
Sentiment
Features
The dataset is taken from Kaggle: daraz.
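As an illustration of using the spam column, the sketch below drops spam rows before sentiment analysis. It assumes the column is named exactly `Spam(1) and Not Spam(0)` as listed above and holds the integers 0 and 1. The sample rows and the sentiment values in them are invented for the example.

```python
import pandas as pd

# Column name as listed above; adjust if the header in the file differs
SPAM_COLUMN = "Spam(1) and Not Spam(0)"


def drop_spam(reviews):
    """Keep only the rows marked as not spam (0)."""
    return reviews[reviews[SPAM_COLUMN] == 0].reset_index(drop=True)


# Made-up rows, only to show the call
sample = pd.DataFrame({
    "Reviews": ["اچھا ہے", "buy now cheap"],
    SPAM_COLUMN: [0, 1],
    "Sentiment": ["positive", "neutral"],
})
clean = drop_spam(sample)
print(clean)
```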
Here is a small dataset for sentiment analysis. It has the following classification labels:
P
N
O
Link to the paper: Paper. GitHub link to the data: Urdu Corpus V1.
This dataset (`news/urdu-news-dataset-1M.tar.xz`) is taken from Urdu News Dataset 1M. It has 4 classes and can be used for classification and other NLP tasks. I have removed unnecessary columns.
Business & Economics
Entertainment
Science & Technology
Sports
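The news datasets ship as compressed tar archives. A minimal sketch for unpacking them from Python is shown below; the function name and the target directory are hypothetical, and the names of the files inside the archive are not documented here, so the function simply returns whatever it finds.

```python
import tarfile


def extract_archive(archive_path, target_dir):
    """Extract a .tar.xz or .tar.gz archive and return the member names."""
    # "r:*" lets tarfile detect the compression (xz, gz, bz2) automatically
    with tarfile.open(archive_path, mode="r:*") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support extraction filters
            archive.extractall(target_dir, filter="data")
        else:
            archive.extractall(target_dir)
    return names
```

For example, `extract_archive("news/urdu-news-dataset-1M.tar.xz", "news/extracted")` would unpack the archive into a new `news/extracted` folder.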
This dataset (`news/real_fake_news.tar.gz`) is used for classification of real and fake news in Fake News Dataset.
The dataset contains news from the following domains.
Technology
Education
Business
Sports
Politics
Entertainment
The headlines dataset (`news/headlines.csv.tar.gz`) is taken from Urdu News Headlines. The original dataset is in Excel format; I’ve converted it to csv for experiments. It can be used for clustering and classification.
This dataset is collected from journalism and can be used for Urdu NLP research. Here is the link to the resource for more information: COUNTER.
I have added two QA datasets for anyone who wants to use them for a QA-based chatbot. QA (Ahadis): `qa_ahadis.csv` contains QA pairs for Ahadis.
The dataset `qa_gk.csv` contains general knowledge QA pairs.
An Urdu model for spaCy is available now. You can use it to build NLP apps easily. Install the package in your working environment.
pip install ur_model-0.0.0.tar.gz
You can use it with the following code.
import spacy
nlp = spacy.load("ur_model")
doc = nlp("میں خوش ہوں کے اردو ماڈل دستیاب ہے۔ ")
Check out my articles related to Urdu NLP tasks.
These articles are available on UrduNLP.
If you want to get only the raw files (text or code), use the curl command, i.e.
```shell script
curl -LJO https://github.com/mirfan899/Urdu/blob/master/ner/uner.txt
```
### Concatenate files
```shell script
cd data
cat */*.txt > file_name.txt
```
Concatenate the files of MK-PUCIT into a single file using:
```shell script
cat */*.txt > file_name.txt
```
The original dataset has a bug: `Others` and `Other` are the same entity. If you want to use the dataset
from the `dropbox` link, use the following commands to clean it.
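```python
import pandas as pd

# Column names must be an ordered list (a set has no guaranteed order);
# the order here follows the original snippet.
data = pd.read_csv('ner/mk-pucit.txt', sep='\t', names=["tag", "word"])
# Merge the duplicate `Others` tag into `Other`
data["tag"] = data["tag"].replace({"Others": "Other"})
# Save as csv or txt as needed by changing the extension
data.to_csv("ner/mk-pucit.txt", index=False, header=False, sep='\t')
```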
Now the csv/txt file has the format
word tag
If you have a dataset (link) and want to contribute, feel free to create a PR.