Tools to Capture and Convert the Web

Refining Scraped Data

While many of the other articles deal with how to extract data, this article explains how the extracted data can be refined so that only the desired information remains. To do this, the special Criteria methods are used. In all of the following examples the data is extracted from an HTML table, but it could be extracted from a variety of different sources (divs, spans, images etc.) as long as each source yields the same number of items.

Example table: book list

Below is the table being scraped in this example. It consists of four columns: Title, Author, Book Age and Status.

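| Title | Author | Book Age | Status |
|---|---|---|---|
| How to Garden | John | 5 | Published |
| How to use a Camera | Sarah | 0 | Incomplete |
| How to use a Camera | Sarah | 0 | Incomplete |
| Astronomy made easy | Dominic | 1 | Under review |
| How to Iron | Paul | 1 | Under review |
| How to Draw | Mike | 3 | Published |
| How to use a PC | Rachel | 4 | Published |
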
var titles = Page.getTagValues({"position":1,"tag":{"equals":"td"},"parent":{"tag":{"equals":"tr"}}});
var authors = Page.getTagValues({"position":2,"tag":{"equals":"td"},"parent":{"tag":{"equals":"tr"}}});
var ages = Page.getTagValues({"position":3,"tag":{"equals":"td"},"parent":{"tag":{"equals":"tr"}}});
var statuses = Page.getTagValues({"position":4,"tag":{"equals":"td"},"parent":{"tag":{"equals":"tr"}}});
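
Because every later step relies on the columns lining up row for row, it can be worth checking their lengths before any criteria are set. The sketch below is plain JavaScript and assumes the scraped columns behave like ordinary arrays; `sameLength` is a hypothetical helper, not part of the scraper API, and the sample arrays are typed in by hand.

```javascript
// Hypothetical helper: true when every column holds the same number of records.
function sameLength(columns) {
  return columns.every(function (column) {
    return column.length === columns[0].length;
  });
}

// Hand-typed stand-ins for scraped columns.
var sampleTitles = ["How to Garden", "How to Draw"];
var sampleAuthors = ["John", "Mike"];
var sampleAges = [5]; // one value is missing, so rows no longer line up

var aligned = sameLength([sampleTitles, sampleAuthors]);
var misaligned = sameLength([sampleTitles, sampleAuthors, sampleAges]);
```

Here `aligned` is true and `misaligned` is false; in the second case, applying a criterion across the columns would pair values from different rows.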

Scraped data often needs to be refined so that it contains only the information required. This is where the Criteria functions are used. For instance, if only published books are required, you would restrict the statuses column above to Published and then apply that change to the other columns, as shown below.

Criteria.create();
statuses = Criteria.equals(statuses, "Published");
titles = Criteria.apply(titles);
authors = Criteria.apply(authors);
ages = Criteria.apply(ages);

When using Criteria methods to reduce the data, all criteria must be set on one single column at a time, before the apply method is used on the other columns that need the corresponding records removed. Once that is complete, Criteria.create() has to be called before criteria are set for another column. For this reason it is best practice to call Criteria.create() before any other Criteria methods.

In the example the statuses column has been restricted to only include Published. The Criteria.apply method then removes the corresponding records from the three other columns, which keeps all of the columns consistent. Remember that the apply method is only useful if the different columns contain the same number of records.
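
One way to picture this bookkeeping is as a list of surviving row indexes that is computed from one column and then reused for the others. The plain JavaScript sketch below imitates that idea only; it is not the scraper's implementation, `keepWhere` and `applyKept` are hypothetical helpers, and the sample values are typed in by hand.

```javascript
// Hand-typed stand-ins for two scraped columns (same length, same row order).
var titles = ["How to Garden", "How to use a Camera", "How to Draw"];
var statuses = ["Published", "Incomplete", "Published"];

// Hypothetical helper: returns the indexes of the rows whose value passes the test.
function keepWhere(column, test) {
  var kept = [];
  for (var i = 0; i < column.length; i++) {
    if (test(column[i])) kept.push(i);
  }
  return kept;
}

// Hypothetical helper: keeps only the rows listed in `kept`, in order.
function applyKept(column, kept) {
  return kept.map(function (i) { return column[i]; });
}

// Decide which rows survive using one column, then reuse that decision elsewhere.
var kept = keepWhere(statuses, function (s) { return s === "Published"; });
statuses = applyKept(statuses, kept);
titles = applyKept(titles, kept);
```

After this runs, both arrays hold two records and row 0 of one still corresponds to row 0 of the other, which is the consistency the apply step is meant to preserve.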

Criteria can also be combined to restrict the data in multiple ways. The example below restricts the book age column to books that are more than one but less than five years old by using the Criteria.greaterThan() and Criteria.lessThan() methods.

Criteria.create();
ages = Criteria.greaterThan(ages, 1);
ages = Criteria.lessThan(ages, 5);
titles = Criteria.apply(titles);
authors = Criteria.apply(authors);
statuses = Criteria.apply(statuses);
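
As a worked check of what this combination should leave behind, the plain JavaScript sketch below applies the same two tests to the ages from the example table. It assumes both comparisons are strict (an age of exactly 1 or exactly 5 is excluded), which is inferred from the method names rather than confirmed; the variable names ending in `Sample` or starting with `filtered` are illustrative only.

```javascript
// Values copied by hand from the example table, in row order.
var titlesSample = ["How to Garden", "How to use a Camera", "How to use a Camera",
                    "Astronomy made easy", "How to Iron", "How to Draw", "How to use a PC"];
var agesSample = [5, 0, 0, 1, 1, 3, 4];

// A row survives only if it passes every criterion (logical AND).
var keptRows = [];
agesSample.forEach(function (age, i) {
  if (age > 1 && age < 5) keptRows.push(i);
});

// Reuse the surviving indexes for each column so rows stay aligned.
var filteredAges = keptRows.map(function (i) { return agesSample[i]; });
var filteredTitles = keptRows.map(function (i) { return titlesSample[i]; });
```

Under that assumption only the rows with ages 3 and 4 remain, that is "How to Draw" and "How to use a PC".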

Sometimes the data contains duplicates that need to be removed. To do this, use the Criteria.unique method.

Criteria.create();
titles = Criteria.unique(titles);
authors = Criteria.apply(authors);
ages = Criteria.apply(ages);
statuses = Criteria.apply(statuses);
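
A rough model of this deduplication in plain JavaScript is sketched below. It assumes the first occurrence of each title is the one that is kept, which the article does not state, and it uses hand-typed sample arrays rather than scraped data.

```javascript
// Hand-typed sample columns containing one duplicated row.
var dupTitles = ["How to Garden", "How to use a Camera", "How to use a Camera", "How to Iron"];
var dupAuthors = ["John", "Sarah", "Sarah", "Paul"];

// Record the index of the first row in which each title appears (assumed behaviour).
var seenTitles = {};
var firstRows = [];
dupTitles.forEach(function (title, i) {
  if (!Object.prototype.hasOwnProperty.call(seenTitles, title)) {
    seenTitles[title] = true;
    firstRows.push(i);
  }
});

// Apply the same surviving indexes to every column.
var uniqueTitles = firstRows.map(function (i) { return dupTitles[i]; });
var uniqueAuthors = firstRows.map(function (i) { return dupAuthors[i]; });
```

Note that uniqueness is judged on the title alone, so two different books that happened to share a title would also collapse into one row.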

The Criteria.unique call above removes any rows that are duplicates based on the title column. The next method is Criteria.remove, which removes items from a column if their values are found in the array passed as a parameter.

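var authorsToRemove = ["Mike","Rachel"];
Criteria.create();
// Set the criterion on the authors column itself...
authors = Criteria.remove(authors, authorsToRemove);
// ...then remove the corresponding records from the other columns.
titles = Criteria.apply(titles);
ages = Criteria.apply(ages);
statuses = Criteria.apply(statuses);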

Here all records whose value in the authors column is Mike or Rachel are removed. The apply method then removes the corresponding records from the other columns.
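
The effect can be sketched in plain JavaScript as an exclusion list checked against one column. This is an illustration only: it assumes exact, case-sensitive matching, which the article does not confirm, and the names `authorSample`, `ageSample` and `excluded` are hypothetical.

```javascript
// Hand-typed sample columns in row order.
var authorSample = ["John", "Sarah", "Dominic", "Paul", "Mike", "Rachel"];
var ageSample = [5, 0, 1, 1, 3, 4];
var excluded = ["Mike", "Rachel"];

// Keep the indexes of rows whose author is NOT in the exclusion list.
var remainingRows = [];
authorSample.forEach(function (author, i) {
  if (excluded.indexOf(author) === -1) remainingRows.push(i);
});

// Reuse those indexes so the other column drops the same rows.
var remainingAuthors = remainingRows.map(function (i) { return authorSample[i]; });
var remainingAges = remainingRows.map(function (i) { return ageSample[i]; });
```

With this sample data the last two rows are dropped from both columns, leaving four authors and their four matching ages.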