# Analysis of Police Survey Data

## Presentation for the OPCC, Norfolk

Ali Arsalan Kazmi

## Introduction

• England and Wales Police & Crime Commissioner Elections, 2012

• Norfolk Policing Survey: "Making a Difference"

• Quantitative & Qualitative questions
  • Crime, Anti-social behaviour, Customer Service, etc.
• Analysis of the survey
  • To gain information on the views of people in Norfolk
  • Information gained would assist in policy-making

## Quantitative Data

### Research Questions

• Is complaint, x, correlated with age, y?
• What's the difference between the means of various age groups of respondents?
• What is the frequency distribution of respondents from various districts?
• Is there a causal relationship between age, y, and the type of complaint submitted, x?
• Cluster Analysis: can respondents be grouped based on variables such as x or y?
• etc.

## Qualitative Data

### Research Questions

• Which are the most frequent/important words?
• What are the contents of the responses?
• Which responses are similar/dissimilar?
• Which words are related/unrelated?
• What sentiment is expressed in responses?
• etc.

> Everyday language is a part of the human organism and is no less complicated than it.
>
> Ludwig Wittgenstein

## Qualitative Data

### Challenges

• Natural Language
• Synonymy
  • Do 'theft' and 'burglary' have different meanings for our purposes?
• Polysemy
  • 'Patrol' as noun and verb
• Domain-specific words
  • 'Beat' in police terminology
• Words influenced by History/Culture

## Text Mining Framework

Obtained from Dr. Beatriz's lecture on Text Mining

## Phase 1: Text Preprocessing

Objective: Apply operations to 'reshape' data into a format suitable for Text Mining.

• Standardisation
  • To give a unified format to all data
• Stopwords Removal
  • "That", "this", "mine", "should" are not informative
  • Domain-specific stopwords
• Thesaurus
  • 'Bobby' and 'Police'
  • 'Car parks' and 'Parking'
  • 'Crime' and 'Offence'

## Phase 1: Stopwords & Thesaurus

### Example

• Document 1:

This is a vital issue and more bobbies must be on the beat. Increase the beat.

• Document 2:

Police needs to solve the problem. Car parks have been taken over by bicycles! Please increase Police beat too!

• Document 3:

I am a cyclist and I am proud of it!

## Phase 1: Stopwords & Thesaurus

### Example

• Document 1:

vital issue more police must be on beat increase beat

• Document 2:

police needs solve problem car parks taken over by bicycle please increase police beat

• Document 3:

bicyclist proud
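The preprocessing shown above can be sketched minimally in Python. The stopword list and thesaurus below are hand-built assumptions chosen to reproduce the example; a real analysis would use fuller, domain-tuned resources.

```python
import re

# Hand-built stopword list and thesaurus (assumptions for this example);
# a real analysis would use fuller, domain-specific resources.
STOPWORDS = {"this", "is", "a", "and", "the", "to", "have", "been",
             "i", "am", "of", "it", "too"}
THESAURUS = {"bobbies": "police", "bobby": "police",
             "bicycles": "bicycle", "cyclist": "bicyclist"}

def preprocess(text):
    # Standardisation: lowercase and strip punctuation.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Thesaurus: map synonyms onto a single canonical term.
    tokens = [THESAURUS.get(t, t) for t in tokens]
    # Stopword removal: drop uninformative words.
    return [t for t in tokens if t not in STOPWORDS]

print(" ".join(preprocess("I am a cyclist and I am proud of it!")))
# → bicyclist proud
```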

## Section Summary

• Text Preprocessing objectives:
  • To standardise textual data found in different formats
  • To remove irrelevant/less informative words
  • To replace synonymous words with a single word expressing the same meaning
• Text Preprocessing Techniques:
  • Standardisation of documents
  • Stopword removal
  • Thesaurus

## Phase 2: Feature Generation

Objective: Apply operations to generate representations of Textual data.

• Basic unit of representation
  • Terms (single words, word pairs, phrases)
  • Documents/responses
• Choose a matrix format to generate from the representation units
• Choose settings for the matrix
  • Binary Frequency
  • Term Frequency
  • Term Frequency × Inverse Document Frequency

## Phase 2: Term-Document Matrix (Binary)

• Term-Document Matrix
  • Binary Frequency
    • A representation that takes into account the presence/absence of Terms, using 1s and 0s.

## Phase 2: Term-Document Matrix (Binary)

| Words | Document 1 | Document 2 | Document 3 |
|---|---|---|---|
| beat | 1 | 1 | 0 |
| bicycle | 0 | 1 | 0 |
| bicyclist | 0 | 0 | 1 |
| car | 0 | 1 | 0 |
| increase | 1 | 1 | 0 |
| issue | 1 | 0 | 0 |
| ... | ... | ... | ... |
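A binary Term-Document Matrix like the one above can be built directly from the preprocessed example documents; this sketch uses a plain dictionary rather than a matrix library.

```python
# Preprocessed example documents (from the Phase 1 example).
docs = {
    "Document 1": "vital issue more police must be on beat increase beat",
    "Document 2": "police needs solve problem car parks taken over by bicycle please increase police beat",
    "Document 3": "bicyclist proud",
}

# Vocabulary: every distinct term across the collection.
vocab = sorted({w for text in docs.values() for w in text.split()})

# Binary frequency: 1 if the term is present in the document, else 0.
binary_tdm = {w: [1 if w in text.split() else 0 for text in docs.values()]
              for w in vocab}

print(binary_tdm["beat"])   # beat occurs in Documents 1 and 2
# → [1, 1, 0]
```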

## Phase 2: Term-Document Matrix (Term Frequency)

• Term-Document Matrix
  • Term Frequency
    • A representation that takes into account the total number of times a Term occurs in a Document.
    • Term importance is defined by frequency
    • More frequent words attain greater importance

## Phase 2: Term-Document Matrix (Term Frequency)

| Words | Document 1 | Document 2 | Document 3 |
|---|---|---|---|
| beat | 2 | 1 | 0 |
| bicycle | 0 | 1 | 0 |
| bicyclist | 0 | 0 | 1 |
| car | 0 | 1 | 0 |
| increase | 1 | 1 | 0 |
| issue | 1 | 0 | 0 |
| ... | ... | ... | ... |

## Phase 2: Term-Document Matrix (Tf-Idf)

• Term-Document Matrix
  • Term Frequency × Inverse Document Frequency
    • A representation that takes into account the number of times a Term occurs in a Document, as well as the total number of Documents in which it occurs.
    • Term importance is defined by a high term frequency and a low document frequency
    • Rare terms in the document collection, which possibly characterise their documents, attain the greatest importance

## Phase 2: Term-Document Matrix (Tf-Idf)

| Words | Document 1 | Document 2 | Document 3 |
|---|---|---|---|
| beat | 0.528 | 0.176 | 0 |
| bicycle | 0 | 0.4771 | 0 |
| bicyclist | 0 | 0 | 0.4771 |
| car | 0 | 0.4771 | 0 |
| increase | 0.176 | 0.176 | 0 |
| issue | 0.4771 | 0 | 0 |
| ... | ... | ... | ... |
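One common formulation of this weighting is tf × log10(N/df), sketched below on the example documents; the exact variant used to produce the table above may differ.

```python
import math

# Preprocessed example documents as token lists.
docs = [
    "vital issue more police must be on beat increase beat".split(),
    "police needs solve problem car parks taken over by bicycle please increase police beat".split(),
    ["bicyclist", "proud"],
]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)               # raw term frequency in this document
    df = sum(term in d for d in docs)  # number of documents containing the term
    return tf * math.log10(N / df) if df else 0.0

# 'bicycle' occurs once, in 1 of 3 documents: 1 × log10(3) ≈ 0.4771
print(round(tf_idf("bicycle", docs[1]), 4))
# → 0.4771
```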

## Phase 2: Term-Affiliations Matrix

• Term-Affiliations Matrix
  • Also known as a Word Co-occurrence Matrix
  • Term Frequency

## Phase 2: Term-Affiliations Matrix

| Words | beat | bicycle | bicyclist | car | increase | issue | ... |
|---|---|---|---|---|---|---|---|
| beat | 5 | 1 | 0 | 0 | 3 | 2 | ... |
| bicycle | 1 | 1 | 0 | 1 | 1 | 0 | ... |
| bicyclist | 0 | 0 | 1 | 0 | 0 | 0 | ... |
| car | 1 | 1 | 0 | 1 | 1 | 0 | ... |
| increase | 3 | 1 | 0 | 1 | 2 | 1 | ... |
| issue | 2 | 0 | 0 | 0 | 1 | 1 | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
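Co-occurrence counts can be defined at several granularities; the sketch below uses one common definition, the number of documents in which a word pair appears together (the matrix above may use a different window, e.g. sentences).

```python
from itertools import combinations
from collections import Counter

# Term sets per example document (document-level co-occurrence).
docs = [
    {"vital", "issue", "police", "beat", "increase"},
    {"police", "beat", "car", "bicycle", "increase", "problem"},
    {"bicyclist", "proud"},
]

# Count how many documents each unordered word pair co-occurs in.
cooc = Counter()
for doc in docs:
    for a, b in combinations(sorted(doc), 2):
        cooc[(a, b)] += 1

print(cooc[("beat", "police")])   # 'beat' and 'police' co-occur in Documents 1 and 2
# → 2
```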

## Section Summary

• Feature Generation objectives:
  • To generate a representation of terms (words, phrases, etc.) and responses/documents
• Representation can be of different types:
  • Term-Document Matrix
  • Term-Affiliations Matrix
  • etc.
• To assign importance to terms as required for analysis:
  • Binary
  • Term Frequency
  • Term Frequency × Inverse Document Frequency
  • etc.

## Phase 3: Feature Selection

• There may exist a large number of Terms in a Term-Document Matrix (High-Dimensionality)

| # | Word | Document 1 | Document 2 | Document 3 | ... |
|---|---|---|---|---|---|
| 1 | beat | 2 | 1 | 0 | ... |
| 2 | bicycle | 0 | 1 | 0 | ... |
| 3 | bicyclist | 0 | 0 | 1 | ... |
| 4 | car | 0 | 1 | 0 | ... |
| 5 | increase | 1 | 1 | 0 | ... |
| 6 | issue | 1 | 0 | 0 | ... |
| ... | ... | ... | ... | ... | ... |
| 4573 | zeal | 1 | 0 | 0 | ... |

• Not all of the Terms will be useful
  • Many Terms are present in only a few documents (known as Sparse Terms)
  • It is unclear how these Terms relate to other, frequently occurring ones
• A large number of Terms poses challenges to efficient computation, in addition to the Text Mining operations themselves
• Objective: Focus only on the important Terms

## Phase 3: Feature Selection

### How to measure Feature/Term Importance?

• Depends on the Text Mining operation to be applied and the type of data in the Term-Document Matrix
• Term Frequency lower bounds
  • Discard all Terms with a value less than a pre-determined lower bound
• Term Frequency × Inverse Document Frequency lower bounds
  • Discard all Terms with a value less than a pre-determined lower bound
• Preset Sparsity level for Terms
• etc.
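A Term Frequency lower bound can be applied with a one-line filter; the threshold and matrix below are illustrative assumptions.

```python
# Term-Document Matrix as term -> per-document counts (example data).
tdm = {"beat": [2, 1, 0], "bicycle": [0, 1, 0], "bicyclist": [0, 0, 1],
       "car": [0, 1, 0], "increase": [1, 1, 0], "issue": [1, 0, 0]}

LOWER_BOUND = 2   # pre-determined threshold (an assumption for this sketch)

# Keep only terms whose total frequency across the collection
# meets the lower bound; everything else is discarded as sparse.
selected = {term: counts for term, counts in tdm.items()
            if sum(counts) >= LOWER_BOUND}

print(sorted(selected))
# → ['beat', 'increase']
```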

## Section Summary

• Not all terms in a Term-Document Matrix will be important
• Feature Selection Objective:
  • To distinguish important terms from unimportant ones
• Choice of Feature Selection measure:
  • Depends on the task (Classification, Clustering) and the type of data in the Term-Document Matrix
  • Term Frequency lower bounds
  • Sparsity level lower bounds
  • etc.

## Text Mining: Research Questions

• Which are the most frequent/important words?
• What are the contents of the responses?
• Which responses are similar/dissimilar?
• Which words are related/unrelated?
• What sentiment is expressed in responses?
• etc.

## Clustering: Dendrogram

### Objectives

• Organise data
  • Use and reveal inherent relationships
  • Derive relationships using similarity measures
• Simplify data
  • Especially for large datasets

## Clustering

• Clustering algorithms detect relationships inherent in data by detecting statistical patterns
  • Hierarchical (Dendrograms)
  • Non-Hierarchical (Associative Word Clouds)
• Usually represent patterns by using distance measures
  • For example, Euclidean (ordinary) distance
  • Others: Manhattan, Mahalanobis, etc.
  • Place data in a Euclidean space
• Use Similarity measures to identify groups, intra-group and inter-group linkages
  • Cosine, Jaccard, etc.
  • Similar data tend to lie close to each other
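Cosine similarity, mentioned above, can be computed directly on term-frequency vectors; the vectors below are the document columns from the Term Frequency matrix earlier.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Term-frequency vectors for Documents 1-3 over the vocabulary
# (beat, bicycle, bicyclist, car, increase, issue).
d1 = [2, 0, 0, 0, 1, 1]
d2 = [1, 1, 0, 1, 1, 0]
d3 = [0, 0, 1, 0, 0, 0]

# Documents 1 and 2 share terms ('beat', 'increase'), Document 3 shares none.
print(cosine(d1, d2) > cosine(d1, d3))
# → True
```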

## Clustering: Results

### Dendrogram

• Hierarchies of groups identified
• Largest and smallest clusters identified
• Summary of data realised

But

• Frequencies of terms not represented
• Can become hard to interpret with large datasets

## Clustering: Results

### Associative Word Clouds

• Associative Word clouds display Term Frequency and relationships

## Section Summary

• Clustering Objectives:
  • Organise data
  • Identify inherent relationships/patterns in data using similarity measures
  • Use the relationships to group data
• After Clustering:
  • Data within a group is most similar to other data in the same group
  • Data in one group is dissimilar from data in other groups
• Clustering Graphics:
  • Dendrograms
  • Associative Word Clouds

## Qualitative Data

### Research Questions

• Which are the most frequent/important words?
• What are the contents of the responses?
• Which responses are similar/dissimilar?
• Which words are related/unrelated?
• What sentiment is expressed in responses?
• etc.

## Sentiment Analysis

• Responses may express sentiments
• What do 'negative' responses state about policing?
• What do 'positive' responses state about policing?
• What do respondents complain about?
• What do respondents praise?
• Analyse popularity of policies
• etc.

## Sentiment Analysis

### How?

1. Supervised Approach
   • Use a classifier
   • Train it
   • Test it
2. Unsupervised Approach
   • Formulate a dictionary of 'Negative' and 'Positive' words
   • Look for the dictionary words in responses
3. Unsupervised + Heuristics Approach
   • Formulate a dictionary of 'Negative' and 'Positive' words, plus grammatical rules
   • Look for dictionary words and apply the grammatical rules in responses

## Sentiment Analysis: Supervised

• Extract a sufficient number of responses from the original data
• Partition the extracted data into 'Training' and 'Testing' samples
  • $Dataset_{Complete}$ (2000 responses)
  • $Dataset_{Train}$ (1200 responses) $\subset Dataset_{Complete}$
  • $Dataset_{Test}$ (800 responses) $\subset Dataset_{Complete}$
• Label the Training and Testing samples for sentiment
• Train a classifier (a mathematical model) on the Training samples
• Test the trained classifier on the Testing samples and report its accuracy
• Deploy for use on unlabelled, new data
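The partition step above can be sketched as a shuffled split; the placeholder responses and the seed are assumptions for illustration only.

```python
import random

random.seed(0)   # reproducible shuffle for this sketch

# Placeholder data standing in for 2000 labelled responses.
responses = [f"response {i}" for i in range(2000)]
random.shuffle(responses)

train = responses[:1200]   # Dataset_Train (1200 responses)
test = responses[1200:]    # Dataset_Test (800 responses)

print(len(train), len(test))
# → 1200 800
```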

## Sentiment Analysis: Unsupervised

Generally:

• Possibly formulate different dictionaries for nouns, adjectives, etc.
• Check for presence of 'Positive' or 'Negative' words in responses
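A minimal sketch of the dictionary lookup, assuming small hand-built 'Positive' and 'Negative' word lists (a real analysis would use an established sentiment lexicon with domain-specific additions):

```python
# Hypothetical sentiment dictionaries (assumptions for this sketch).
POSITIVE = {"good", "helpful", "safe", "excellent", "praise", "proud"}
NEGATIVE = {"bad", "slow", "unsafe", "problem", "crime", "complaint"}

def score(response):
    """Count positive words minus negative words in a response."""
    words = response.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def label(response):
    s = score(response)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"

print(label("The police were helpful and I felt safe"))
# → positive
```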

## Sentiment Analysis: Unsupervised + Heuristics

• Possibly formulate different dictionaries for nouns, adjectives, etc
• Formulate and incorporate heuristics/rules for Grammar
• Check for presence of 'Positive' or 'Negative' words in responses
• Incorporate heuristics to make Sentiment tagging of words extensible
• Positive/Negative Adjectives would affect Nouns
• etc.
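One simple grammatical heuristic is negation handling: a dictionary word preceded by a negator flips polarity. The word lists below are illustrative assumptions.

```python
# Illustrative word lists (assumptions for this sketch).
NEGATORS = {"not", "never", "no"}
POSITIVE = {"helpful", "safe", "good"}

def score_with_negation(response):
    """Flip a word's polarity when it is preceded by a negator."""
    words = response.lower().split()
    total = 0
    for i, w in enumerate(words):
        if w in POSITIVE:
            polarity = 1
            if i > 0 and words[i - 1] in NEGATORS:
                polarity = -1   # e.g. 'not helpful' counts as negative
            total += polarity
    return total

print(score_with_negation("the officers were not helpful"))
# → -1
```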

## Sentiment Analysis: Results

Results of Sentiment Analysis + Phrase clouds = Negative Phrase Cloud

## Sentiment Analysis: Results

### Negative Phrase Cloud

• Highlights phrases occurring in 'Negative' responses
• Scales phrase size according to frequency

## Sentiment Analysis: Results

### Negative Phrase Cloud

• Council Tax
• Anti-Social behaviour

## Section Summary

• Sentiments may be expressed in responses
• Sentiment Analysis Objective:
  • Identify sentiments from responses
• Identification can be useful in many ways:
  • Identify complaints
  • Identify praise
  • Measure the popularity of policies
• Techniques for Sentiment Analysis:
  • Supervised
  • Unsupervised
  • Unsupervised + Heuristics

## Further Research in Text Mining

Utilise additional data sources:

• Social media accounts (activity varies)
• Local newspapers' data

## Further Research in Sentiment Analysis

• Combine demographics with Sentiment Analysis
  • Is there a relationship between Age and Sentiment?
  • Can clusters be formed on Demographics + Sentiment?
• How do Sentiments change over a period of time, x?
  • Is there a trend in the shift of Sentiments?
  • Do Sentiments change due to the introduction of policies?
• Can ideological differences be identified?
• Can political affiliations be determined?
• etc.

## Further Research in Social Network Analysis

• Identify types of users connected to OPCC on Twitter
• Combine with Sentiment Analysis
• Opinion Diffusion
• etc.

## A Prototype Textual Analytics Application

• Supports basic Text Mining operations:
  • Data Pre-processing
  • Feature Generation & Weighting
  • Clustering
  • Associative Word Clouds
  • Network of Words

## A Prototype Textual Analytics Application

• Limitations:
  • Memory constraints
  • Only one corpus per analysis
  • Approximately 2,000 words/terms supported
  • 3,000-4,000 documents supported
  • Requires R and Java
  • Other Text Mining operations not supported