Lecture 1: Introduction
Date: 8th Aug, 2015
Lab/Assignment
Lab
Write regular expressions for tokenization and sentence boundary identification.
Assignment
Find a data set for your respective language on Twitter/Facebook.
Run tokenization and sentence boundary identification on the data.
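The lab above can be sketched with Python's `re` module. The boundary rule here is deliberately naive (it will mis-split abbreviations such as "Dr. Smith"), so treat it as a starting point rather than a finished solution:

```python
import re

def tokenize(text):
    # Words and numbers become one token each; each punctuation
    # mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

def split_sentences(text):
    # Boundary = ., !, or ? followed by whitespace and a capital letter.
    # This naive rule mis-splits abbreviations like "Dr. Smith".
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

print(tokenize("He was late!"))
print(split_sentences("It rained. We stayed home!"))
```

For non-English data, the capital-letter lookahead will need adapting to the script of your language.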
Lecture 2: Minimum Edit Distance - Spell Correction
Date: 22nd Aug, 2015
Lab/Assignment
Lab
Write code for MED: normal, alignment, and weighted.
Assignment
Run it on the data I provided.
More details in the mail.
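A minimal dynamic-programming sketch of the "normal" MED variant (costs follow the textbook convention of substitution = 2). The alignment version would additionally backtrack through the table, and the weighted version would replace the constant costs with, e.g., confusion-matrix weights:

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    # Classic DP table; pass sub_cost=1 for plain Levenshtein distance.
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,    # deletion
                          d[i][j - 1] + ins_cost,    # insertion
                          d[i - 1][j - 1] + sub)     # substitution/match
    return d[n][m]

print(min_edit_distance("intention", "execution"))  # 8 with sub_cost=2
```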
Lecture 3: Web Page Indexing - Basics
Date: 5th Sept, 2015
Lab/Assignment
Create a positional index file for your language.
I will give an individual query to each group, and the group has to send me the output.
Download Corpus: Choose your language. Choose any 5K docs.
English
Hindi
Bengali
Tamil
Gujarati
Marathi
Steps: Tokenize, Remove Stop Words
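One way to sketch the positional index; the stop-word list here is a placeholder, so substitute a real list for your language:

```python
from collections import defaultdict

STOP_WORDS = {"the", "is", "a", "of"}  # illustrative only

def build_positional_index(docs):
    # docs: mapping of doc_id -> raw text.
    # Returns term -> {doc_id: [positions]}, where positions are counted
    # over the tokens that survive stop-word removal.
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)
    return index

docs = {1: "the cat sat", 2: "a cat and the dog"}
index = build_positional_index(docs)
print(dict(index["cat"]))
```

Phrase queries can then be answered by intersecting posting lists and checking for consecutive positions.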
Lecture 4: Language Modelling - Smoothing - Language Identification
Date: 12th Sept, 2015
Lab/Assignment
Download the Google bigram corpus: link
Assignment 1: Predict the Next Word. Corpus: Twitter Corpus (given in the first assignment)
Steps:
- 1. Take 5000 random tweets. Divide them into 10 sets; each set will consist of 500 tweets.
- 2. Automatically delete the i-th word from each tweet in the i-th set; for example, delete the 5th word from every tweet in set 5.
- 3. Predict all the deleted words using bigrams, trigrams, and quadgrams from the Google n-grams.
- 4. Report to me the average accuracy of the experiments using bigrams, trigrams, and quadgrams.
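A sketch of bigram-based prediction. For the assignment you would load the Google bigram counts rather than counting a toy corpus as done here, and extend the same idea to trigrams and quadgrams:

```python
from collections import defaultdict

def train_bigrams(sentences):
    # Count continuations: counts[prev][next] = frequency of the bigram.
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        tokens = sent.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev_word):
    # Return the most frequent continuation, or None if the context is unseen.
    followers = counts.get(prev_word.lower())
    if not followers:
        return None
    return max(followers, key=followers.get)

counts = train_bigrams(["i like tea", "i like coffee", "i like tea a lot"])
print(predict_next(counts, "like"))  # tea
```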
Assignment 2: Create your own Language Model (bigram and trigram) from the Twitter Corpus
Steps:
- Use the same 5K tweets for training
- Use Laplace Smoothing
- Calculate perplexity on a new 5K set
- Repeat the steps of Assignment 1 and report to me the accuracy of your language model in comparison with the Google n-grams
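A minimal sketch of a Laplace-smoothed bigram model with perplexity, under the usual add-one formulation:

```python
import math
from collections import defaultdict

def train(sentences):
    bigrams, unigrams, vocab = defaultdict(int), defaultdict(int), set()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split()
        vocab.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            bigrams[(prev, nxt)] += 1
            unigrams[prev] += 1  # count of prev as a context
    return bigrams, unigrams, vocab

def laplace_prob(bigrams, unigrams, vocab, prev, word):
    # Add-one smoothing: every possible bigram gets one pseudo-count.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

def perplexity(sentences, bigrams, unigrams, vocab):
    # exp of the negative average log-probability per bigram.
    log_prob, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            log_prob += math.log(laplace_prob(bigrams, unigrams, vocab, prev, nxt))
            n += 1
    return math.exp(-log_prob / n)
```

The trigram case is analogous, with pairs of context words as keys.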
Assignment 3: Create a Language Identifier
Steps:
- Download the corpus: link. Divide it into training (60%), development (20%), and test (20%) sets.
- Create language profiles from the training set
- Understand the distance measure: tune the threshold on the development set
- Apply the same on the test set and report to me the accuracy
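One standard realization of the profile-and-distance steps is Cavnar and Trenkle's ranked character n-gram profiles with the out-of-place distance; whether that is the intended measure here is an assumption, so adapt as needed:

```python
from collections import Counter

def profile(text, n=3, top=300):
    # Ranked list of the text's most frequent character n-grams.
    padded = f" {text.lower()} "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    # Sum of rank differences; n-grams absent from the language
    # profile incur the maximum penalty.
    rank = {g: i for i, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))

def identify(text, lang_profiles):
    # The closest profile wins; a rejection threshold tuned on the
    # development set could be compared against the best distance.
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```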
Assignment 4: Create a word-level Language Identifier for Code-Mixed text
Steps:
- Collect the corpus from me via email. Divide it into training (60%), development (20%), and test (20%) sets.
- Create language profiles from the training set using character-level n-grams
- Understand the distance measure: tune the threshold on the development set
- Apply the same on the test set and report to me the accuracy
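For the word-level case, one simple sketch scores each token under per-language character n-gram models (add-alpha smoothed) and picks the best-scoring language; the toy word lists in the test below stand in for your training data:

```python
import math
from collections import Counter

def char_ngram_model(words, n=2):
    # Character n-gram counts over word-boundary-padded words.
    counts = Counter()
    for w in words:
        padded = f"#{w.lower()}#"
        counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return counts, sum(counts.values())

def score(word, model, n=2, alpha=1):
    # Add-alpha smoothed log-probability of the word's character n-grams.
    counts, total = model
    padded = f"#{word.lower()}#"
    return sum(
        math.log((counts[padded[i:i + n]] + alpha) / (total + alpha * len(counts)))
        for i in range(len(padded) - n + 1)
    )

def tag_tokens(sentence, models):
    # Label each token with the language whose model scores it highest.
    return [(w, max(models, key=lambda lang: score(w, models[lang])))
            for w in sentence.split()]
```

In practice a context-aware model (e.g. an HMM over token labels) performs better on code-mixed text, but this per-token baseline matches the profile-and-distance recipe above.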
Course Description
This is an introductory natural language processing (NLP) course. The broader goal is to understand how NLP tasks are carried out in the real world (e.g., the Web, social media) and how to build tools for solving practical text mining problems. Throughout the course, emphasis will be placed on understanding NLP concepts and on tying NLP techniques to specific real-world applications through hands-on experience. The course covers fundamental topics in statistical machine learning and touches upon social media text processing, sentiment analysis, and a bit of Big Data analysis.
Theory to be covered
Introduction: Why NLP
Regular Expression
Tokenization
Stemming
Sentence Boundary Detection
Spell Correction
Minimum Edit Distance
Language Modeling
N-Gram
Smoothing
Language Identification
POS Tagging
Text Classification
Sentiment Analysis
Dependency Parsing
Information Retrieval
Practical Aspects to be covered
WEKA: Machine Learning Toolkit
Hidden Markov Models
Naive Bayes
Support Vector Machines
Stanford NLP
Social Media and NLP
References
Text Book
Jurafsky, Dan and Martin, James, Speech and Language Processing, Second Edition, Prentice Hall, 2008.
References
Manning, Christopher and Schütze, Hinrich, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
Charniak, Eugene, Statistical Language Learning, MIT Press, 1993.
Open access study materials
link by Jurafsky and Manning, Stanford
link by Pushpak Bhattacharya, IIT Bombay
Evaluation and Grading
Project
Expect multiple mini projects on various aspects of NLP, and there will be a final project assigned group-wise.
Scores obtained in all the components of evaluation shall be totaled and the final score will be converted into letter grades (A, B, C, D, E, or NC) as per NIIT University policy.
Attendance Policy
Attendance will be taken every day, and missing class can be expected to significantly reduce your chances of success. There will be no repetition of lectures.
Missing Exams
If you miss an exam due to an unexcused absence, you will receive a grade of 0 for that quiz/exam.
If you miss an exam due to an excused absence, you must provide appropriate verification within one week of the quiz/exam. You will then be allowed to take the make-up exam at a date/time to be decided later. The make-up exam may be SIGNIFICANTLY MORE DIFFICULT than the original exam.
If you cannot be at the final exam, let me know as soon as you know.
No excuses will be entertained for the final project. If you do not work on the project or fail to submit the report, you will receive a grade of 0.
A Few Obligatory Points
You must have an NU email account: sometimes I will communicate via email. I will ask students to create a mail group for easy group communication.
By enrolling in this course, you agree to the NIIT University Policies.
Electronic Devices: Remember to turn off all electronic communication devices at the beginning of each class. I hope you will be cooperative with me and with your fellow students.