Skip to article frontmatterSkip to article content

Agriculture Parcels Analysis

1Land Cover Classification:

This study consists of 5 mauzas (Revenue Estates) of tehsil Pabbi, district Nowshera, Khyber Pakhtunkhwa Pakistan. These five reveue states are :

  1. Khushmaqam
  2. Khudrezi
  3. Chowki Mamraiz
  4. Chand Bibi
  5. Amankot

The original land use mentoined in the Landuse_Ma columns of the massavi is out dated and donot match with the ground realities. The land use and total number of parcels in the given dataset are

Landuse_Maparcel_counttotal_area_acre
Agriculture57593878.35
Built up2473.87
Graveyard2925.20
Other229130.85
Road/Streets172207.78
Stream302102.74
Total65154419

Some of the anomalies in this dataset are as follow:

  • The type mentioned in Land use column as agricultue has been converted into builtup, and therefore, not suitable for crop classification dataset.
  • The type mentioned in Land use column as agricultue has now both builtup and agriculture, should be cleaned to make it suiutable for crop classification.
  • Some of the on ground agriculture parcels has been recorded as other in Land use column maynot participate in crop classificaiton dataset as per this dataset.

1.1Purpose

The purpose of this exercise is to get clean agriculture parcels dataset for crop classificaiton using machine learning algorithms like random forest, SVM etc.

1.2Data Cleansing and Preperation Methodology:

Since the agriculture type in the Land use column does not give us a clear idea of the current land type therefore, we have to use other means to computer the current land use type. Therefore, as first step we will filter the 5,759 agriculter parcels to be included in the study. Since our focus is on agriculture and crops detection therefore, the rest of the parcels will filterout from this study and we will clean the 5,759 agricultuer parcels. The data pipeline shows step by step processing of model.

Fig: Model Pipeline

Imagery Acquisition & Processing

To support evidence-based land-use planning and crop monitoring at scale, we leveraged Google Earth Engine to ingest and process high-resolution Sentinel-2 surface reflectance data (COPERNICUS/S2_SR_HARMONIZED) over the period September 2022 – March 2025. A total of 419 scenes with < 10 % cloud cover were retrieved, quality-filtered, and aggregated into a median composite. This “mean of means” composite serves both to reduce atmospheric noise and to stabilize temporal variability.


Key Spectral Indices

We derived a suite of biophysical indicators—commonly endorsed for such studies worldwide and sustainable land management—to characterize vegetation health, water presence, and urban development:

  1. Normalized Difference Vegetation Index (NDVI)
  2. Normalized Difference Built-Up Index (NDBI)
  3. Normalized Difference Water Index (NDWI)
  4. Built-Up Index (BUI)
  5. Urban Index (UI)
  6. GLCM Texture on NDVI (IDM)
    • The inverse difference moment of the NDVI band, capturing spatial heterogeneity—valuable for distinguishing crop patterns from uniform built-up zones.
Additional Composite Metrics
  • Mean Spectral Bands (B2, B3, B4, B8, B11): Averaged across all 419 scenes.
  • Pixel-Count Indices:
    • Agriculture Percentage
    • Built-Up Percentage

These metrics were computed at the parcel level and exported as a geospatial attribute table to classify the parcels into agriculture and non-agriculture parcels.


Sample Label Data Generation:

A rule-based function high_confidence_label was used to assign preliminary land-use labels—Agriculture, Non-Agriculture, or Uncertain labels to each parcel in a GeoDataFrame based on spectral indices and pixel-based metrics. The logic aims to isolate only the most confidently classified samples for supervised learning.

  • A parcel is labeled “Agriculture” if it exhibits a high NDVI value (≥ 0.3), a very low Urban Index (UI ≤ 0.05), and a high proportion of agricultural pixels (≥ 60%), indicating healthy vegetation with minimal built-up interference.
  • It is labeled “Non-Agriculture” if the Urban Index is high (≥ 0.25) or the percentage of built-up pixels exceeds 50%, suggesting significant urban or impervious cover.
  • All other parcels are considered “Uncertain” and are not labeled in this phase to avoid introducing noise into the training dataset. This approach facilitates the creation of a reliable, high-quality labeled subset, which can be used to train a more robust machine learning model to classify the rest of the parcels.The output of this moduel is as follow:
Total Size of Filtered dataset: 5759
 Hybrid Labels Assigned to: 3074
Total Size of Uncategorized dataset: 2685
Split Details of Label dataset:
hc_label
Agriculture        1993
Non-Agriculture    1081
Name: count, dtype: int64

Landuse Classification using Machine Learning:

In the high confidence classification approach we have used human element to set conditions based on the feature set for the classification of land use. It was necessary at that stage because we were not having a reliable label dataset to directly feed to the machine. So utilizing the hybrid classification approach we generated a reliable(to a greater extent) labeled dataset of the agriculture and non-agriculture land use classes to process and make it ready for machine learning. We have to remove the geometry column and other unnecessary columns from our dataset to train the model on pure features dataset without any reference to geometry of the these land classes. The clean dataset is not divided into X,y for features and labels as

X = gdf[['B2','B3','B4','B8','B11', 'NDVI','NDBI', 'BUI', 'UI', 'NDWI']]

y = gdf['landuse_class_encoded']

Now the dataset is ready we divided into 30% -70% for training and testing of the model.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
Training and Testing the Model:

An instance (rf_model) of the RandomForestClassifire model is created with the following parapmeters

rf_model = RandomForestClassifier(n_estimators=100, class_weight='balanced' ,max_depth=15, min_samples_split=2,random_state=42)
rf_model.fit(X_train, y_train)

rf_model is then trained on the training dataset (30%) of the whole data. Once the model is trained then it is the time for prediction. Now the model is provided with the X_test features to make prediction.

y_pred = rf_model.predict(X_test)

print(f"Total Available Dataset: {len(labeled_dataset)}")
print(f"Train Dataset 70%: {len(X_train)}")
print(f"Test Dataset 30%: {len(X_test)}")

This y_pred will be used to generate pixels based classification of land use. The output is as follow:

Total Available Dataset: 3074
Train Dataset 70%: 2459
Test Dataset 30%: 615

1.3Performance Evaluation :

Classification report and Confusion Matrix were calculated to print and analyze the performance. The classification report shows an 84% f1-score.

Classification Report:

                 precision    recall  f1-score   support

Non-Agriculture       1.00      0.99      1.00       216
    Agriculture       1.00      1.00      1.00       399

       accuracy                           1.00       615
      macro avg       1.00      1.00      1.00       615
   weighted avg       1.00      1.00      1.00       615

Confusion Matrix:
 [[214   2]
 [  0 399]]

1.4Model Application:

The model trained on the supervised dataset in this phase will be used to predict agricuture and non-agriculture parcels of the uncategorized dataset 2685 that were initially leftout by the hybrid approach. The model was used to predict the probabilities of agriculture and non-agriculter parcels based on the featureset.

# Predicting the unlabeled Parcels

X_unlbl = unlabeled_dataset[['B2', 'B3', 'B4', 'B8', 'B11', 'NDVI', 'NDBI', 'BUI', 'UI','NDWI']].values

probs = rf.predict_proba(X_unlbl)[:,1]
unlabeled_dataset['rf_prob_ag'] = probs

#Hard-Classify at 0.5 (threshold)

unlabeled_dataset['rf_label'] = np.where(probs>=0.5, 'Agriculture', 'Non-Agriculture')

As a result we got 2131 agriculture parcels and 554 non-agriculture parcels outof the total 2685 parcels.

Total Dataset Size: 2685
Agri Parcels: 2131
Non-Agri Parcels: 554
Unassigned Parcels: : 0

1.5End Product:

The final output is a complete labeled dataset (5759) of land cover parcels which is obtained by merging the hybrid labels (3074) and machine predicted labels (2685) and are more closely related to ground realities. Therefore, the 5759 parcels which are recorded as agriculture in the land record has been reduced to 4124 agriculture parcels whereas 1635 parcels have been converted to non-agriculture mostly builtup parcels which is 28% of the original agricultur parcels.

The output dataset has been displayed on the interactive map with relavent layers for analyis.

1.6Uses and Wayforward

  • Crop classification
  • Urban Planning
  • Land Use Policy.
import geopandas as gpd
url='https://github.com/code4geoai/pabbi/releases/download/5/Pabbi_AgriVsNonAgri.smart.geojson'
smart_gdf = gpd.read_file(url)
smart_gdf.head()
Loading...
agriculture = smart_gdf[smart_gdf['hc_label']== 'Agriculture']
nonagriculture = smart_gdf[smart_gdf['hc_label']== 'Non-Agriculture']

pred_agri = smart_gdf[smart_gdf['rf_label']== 'Agriculture']
pred_nonagri = smart_gdf[smart_gdf['rf_label']== 'Non-Agriculture']
import leafmap.foliumap as leafmap
m = leafmap.Map(center=(37.5, 70), zoom=6 , draw_control=False,
    zoom_control=False,
    fullscreen_control=False,
    measure_control=False,
    scale_control=False)
m.add_basemap('Esri.WorldImagery')

# 1. Create a minimal GeoDataFrame with only geometry + Target
legend_dict = {
    'Hybrid Predicted Agri':               '#00AA00',  # dark green
    'Hybrid Predicted Non-Agri':           '#00008B',  # darkblue
    'Machine Predicted Agri':              '#00FF00',  # light green
    'Machine Predicted Non-Agri':          "#81C2D7"  # light blue
    }

popup_gdf = smart_gdf[['geometry', 'target_label']].copy()

# 2. Add your styling layers *without* popups
layers_lsit = [ 
    (agriculture,    'Hybrid Predicted Agri',             legend_dict['Hybrid Predicted Agri']),
    (nonagriculture,  'Hybrid Predicted Non-Agri',        legend_dict['Hybrid Predicted Non-Agri']),
        ]

for gdf, name, color in layers_lsit:
    # 2.1 Add the layer to the map   
    m.add_gdf(
        gdf,
        layer_name=name,
        style={
            'color':    color,
            'fillColor':color,
            'fillOpacity':0.3,
            'weight':   1
        },
        zoom_to_layer=True,
        info_mode='off'   # make sure no popups for these layers
    )

# Adding the predicted Agriculture and Non-Agriculture parcels
m.add_gdf(
    pred_agri,
    layer_name='Machine Predicted Agri',
    style={
        'color': 'lightgreen',
        'fillColor': 'lightgreen',
        'fillOpacity': 0.4,
        'weight': 1
    }, info_mode='off'
)
m.add_gdf(
    pred_nonagri,
    layer_name='Machine Predicted Non-Agri',
    style={
        'color': 'lightblue',
        'fillColor': 'lightblue',
        'fillOpacity': 0.4,
        'weight': 1
    }, info_mode='off'
)


# 4. Legend & Layer Manager
m.add_legend(
    title='Agriculture vs Non-Agriculture',
    legend_dict=legend_dict,
    position='bottomleft'
)

m
Loading...