Database Advance Search Development

TEM 194 Spring 2012 Fadi Al-Ayed

**1.0 - Overview of the Database Approaches**

A database is a structured collection of data serving one or more uses, typically in digital form, and maintained to some standard of quality (accuracy, resilience, usability, and availability). There are two broad types of databases: structured and semi-structured. Some organizations use structured (relational) databases such as Oracle, Microsoft SQL Server, and MySQL, which are queried with SQL (Structured Query Language). Other organizations use semi-structured databases such as XML (Extensible Markup Language) databases, which are also known as non-relational databases. Amazon, for instance, deals with a complex semi-structured database. XML is a markup language designed for transporting, validating, sharing, and storing data.



**2.0 - Motivation**

Many online shopping sites, such as Amazon, do not offer advanced search features that let clients narrow down their results. In practice, many clients do not know which precise keywords will lead them to the result they want. This is one of the reasons that retrieving data from the system is slow and returns more data than is needed. To resolve this, I propose an XML keyword and key-phrase search engine with advanced search functions, so that clients can find what they want quickly. The search engine supports the Boolean operators AND, OR, and NOT:

 * AND: retrieves data that matches all of the input keywords.
 * OR: retrieves data that matches at least one of the input keywords.
 * NOT: retrieves data that does not contain particular keywords.

**2.1 - Examples**

Assume there are three keywords: Acer, Asus, and Sony. In the first example, clients who want [Acer] OR [Asus] have to search the computers section for [Acer] and then search it again for [Asus]. The output is all the tuples that match Acer plus all the tuples that match Asus. The problem is that the clients must enter their keywords twice, and the two result lists may contain duplicate tuples. The solution is a single search for [Acer OR Asus], which returns the same final result without repeated searches, because the clients can look for multiple keywords simultaneously.

In the second example, if the clients enter two search terms such as [Acer] AND [Asus], the results may be limited to the tuples in which both keywords appear close to each other. In that case the final result may miss tuples that the clients want, for instance when one search term appears at the beginning of a tuple and the other appears at the end. The solution is a search for [Acer AND Asus] that returns every tuple containing both keywords, including those in which the keywords are not next to each other. All the results that the clients want are therefore included.

In the third example, the clients want to search for an item whose details do not include a certain keyword. They have to enter their main keyword, such as [Acer], and then inspect each result to check that it does not contain the keyword they do not want. For instance, when the clients query [Acer] NOT [21-Inch], they receive all Acer computers in every screen size and have to browse until they find an item that does not include [21-Inch]. The solution is a search engine that handles the NOT operator, so that a search for [Acer] NOT [21-Inch] returns all the tuples that include the keyword [Acer] but do not include [21-Inch].
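The three examples above can be sketched with plain set operations. This is only an illustration under stated assumptions: the product tuples, the `PRODUCTS` table, and the helper names are hypothetical and are not taken from the actual system.

```python
import re

# Hypothetical product tuples, made up for illustration only.
PRODUCTS = {
    1: "Acer laptop 15-Inch",
    2: "Acer monitor 21-Inch",
    3: "Asus laptop 15-Inch",
    4: "Sony monitor 21-Inch",
    5: "Acer dock for Asus laptop",
}

def tokenize(text):
    """Lower-case the text and split it into keyword tokens."""
    return set(re.findall(r"[a-z0-9\-]+", text.lower()))

def matching_ids(keyword):
    """Ids of all tuples that contain the keyword anywhere in the tuple."""
    kw = keyword.lower()
    return {pid for pid, text in PRODUCTS.items() if kw in tokenize(text)}

def search_or(a, b):
    # Set union: a tuple matching both keywords is returned only once.
    return matching_ids(a) | matching_ids(b)

def search_and(a, b):
    # Set intersection: the keywords do not have to be adjacent.
    return matching_ids(a) & matching_ids(b)

def search_not(a, b):
    # Set difference: tuples with the first keyword but without the second.
    return matching_ids(a) - matching_ids(b)

print(sorted(search_or("Acer", "Asus")))      # [1, 2, 3, 5]
print(sorted(search_and("Acer", "Asus")))     # [5]
print(sorted(search_not("Acer", "21-Inch")))  # [1, 5]
```

Tuple 5 shows both points made in the examples: it appears once in the OR result even though it matches both keywords, and it is found by AND even though the two keywords are not next to each other.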

**2.2 - Formalization**

The problem is to develop techniques for building an XML keyword search engine that supports advanced operators. The techniques used are the DOM (Document Object Model) API for parsing and XML node indexing for data retrieval. There are two main types of XML parsing API: SAX and DOM. Each has several pros and cons. In this project I used DOM parsing because it supports XPath, while SAX does not. Moreover, since Amazon has a large dataset, DOM parsing facilitates navigation by exposing the document as a tree.

**DOM Parsing Tree Structure Example**
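A minimal sketch of how a DOM tree can be walked and its nodes indexed by keyword, using Python's standard `xml.dom.minidom`. The catalog document, the element names, and the `build_index` helper are assumptions made for illustration. Note that `minidom` does not evaluate XPath itself, so the XPath-style positional paths below are built by hand during the walk.

```python
from xml.dom import minidom

# Hypothetical catalog document; element names are assumptions.
XML = """<catalog>
  <computer><brand>Acer</brand><screen>21-Inch</screen></computer>
  <computer><brand>Asus</brand><screen>15-Inch</screen></computer>
</catalog>"""

def build_index(node, path, index, outline, depth=0):
    """Depth-first walk over the DOM tree.

    Records an indented outline of the element tree and a
    keyword -> list of element paths index built from text nodes.
    """
    outline.append("  " * depth + node.tagName)
    seen = {}  # per-tag sibling counter for XPath-style positions
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            seen[child.tagName] = seen.get(child.tagName, 0) + 1
            child_path = f"{path}/{child.tagName}[{seen[child.tagName]}]"
            build_index(child, child_path, index, outline, depth + 1)
        elif child.nodeType == child.TEXT_NODE and child.data.strip():
            # Index every word of the text under the enclosing element.
            for word in child.data.split():
                index.setdefault(word.lower(), []).append(path)
    return index, outline

doc = minidom.parseString(XML)
index, outline = build_index(doc.documentElement, "/catalog[1]", {}, [])

print("\n".join(outline))
print(index["acer"])  # ['/catalog[1]/computer[1]/brand[1]']
print(index["asus"])  # ['/catalog[1]/computer[2]/brand[1]']
```

With such an index, a keyword lookup becomes a dictionary access instead of a scan of the whole tree, and the stored paths identify which node matched.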

**2.3 - Solution Requirements**

The solution requires the clients to input at least one keyword and one XML file to run the program. The implementation reduces the users' results in two steps. When the clients input keywords, the system lists the XML files that match the keywords. The clients then select one of the listed documents to see more details about the data matching the keywords.
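The two-step flow described here (list the matching files, then show details for the selected one) might look roughly like the following sketch. The in-memory `FILES` table stands in for real XML files, and the function names and the rule that a file matches when it contains any of the keywords are assumptions, not the actual implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory "files"; a real run would read them from disk.
FILES = {
    "acer.xml": "<item><brand>Acer</brand><size>21-Inch</size></item>",
    "asus.xml": "<item><brand>Asus</brand><size>15-Inch</size></item>",
    "mixed.xml": "<items><item><brand>Acer</brand></item>"
                 "<item><brand>Asus</brand></item></items>",
}

def words_in(xml_text):
    """All lower-cased words found in the text content of a document."""
    root = ET.fromstring(xml_text)
    words = set()
    for text in root.itertext():
        words.update(w.lower() for w in text.split())
    return words

def list_matching_files(files, keywords):
    """Step 1: names of the files that contain at least one keyword."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    wanted = {k.lower() for k in keywords}
    return sorted(name for name, xml_text in files.items()
                  if wanted & words_in(xml_text))

def show_details(files, name, keywords):
    """Step 2: (tag, text) pairs of matching elements in the chosen file."""
    wanted = {k.lower() for k in keywords}
    root = ET.fromstring(files[name])
    return [(el.tag, el.text.strip()) for el in root.iter()
            if el.text and wanted & {w.lower() for w in el.text.split()}]

print(list_matching_files(FILES, ["Acer"]))        # ['acer.xml', 'mixed.xml']
print(show_details(FILES, "mixed.xml", ["Acer"]))  # [('brand', 'Acer')]
```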


**3.0 - Experimental Results**

**3.1 - Metrics**

This approach uses a couple of metrics to evaluate how well the solution works. The first metric is the number of data files returned for each query. For a query using the AND Boolean operator, the number of files returned is expected to be no greater than for a query on any one of the keywords or phrases alone. For a query using the OR Boolean operator, the number is expected to be no smaller than for a query on any one of the keywords or phrases alone. For a query using the NOT Boolean operator, the expected result is the number of data files that contain the first keyword or key phrase minus the number of data files found using the AND Boolean operator. The OR operator should therefore return the most results, the NOT operator is expected to return the second most, and the AND operator the fewest.
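As a small worked check of this metric, the sketch below counts the results of each operator for two hypothetical keyword result sets and verifies the expected relationships. The file names and the `result_counts` helper are made up for illustration.

```python
def result_counts(files_a, files_b):
    """Number of files returned by each query form for keywords A and B."""
    a, b = set(files_a), set(files_b)
    return {
        "A": len(a),
        "B": len(b),
        "AND": len(a & b),
        "OR": len(a | b),
        "NOT": len(a - b),  # A NOT B
    }

# Hypothetical result sets for two keywords.
acer_files = {"f1", "f2", "f3", "f4"}
asus_files = {"f4", "f5"}

counts = result_counts(acer_files, asus_files)
print(counts)  # {'A': 4, 'B': 2, 'AND': 1, 'OR': 5, 'NOT': 3}

# AND returns no more files than either keyword alone.
assert counts["AND"] <= min(counts["A"], counts["B"])
# OR returns at least as many files as either keyword alone.
assert counts["OR"] >= max(counts["A"], counts["B"])
# NOT equals the first keyword's count minus the AND count.
assert counts["NOT"] == counts["A"] - counts["AND"]
```

The first three relationships hold for any pair of result sets. The ordering between NOT and AND depends on the data: if every file that contains the first keyword also contains the second, NOT returns nothing while AND does not.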

Several queries were performed on the system. The queries were tested using different combinations of the Boolean operators.




 * || [[image:2.jpg]] ||  || [[image:3.jpg]] ||

**3.3 - Algorithm Complexity Analysis**
 * || [[image:4.jpg]] ||
 * ||^  ||   || [[image:5.jpg]] ||

Complexity analysis for the Boolean operators:

AND: $O(|S_{\text{ANDResult}}| \cdot k\,d \log|S| + |S_1|^2)$

NOT: $O(|S_{\text{NOTResult}}| \cdot k\,d \log|S| + |S_1|^2)$

OR: $O(|S_{\text{ORResult}}| \cdot k\,d \log|S| + |S_1|^2)$

Overall complexity for (AND, NOT, OR): $O(|S_3| \cdot k\,d \log|S| + |S_1|^2)$

where $|S_1|$ ($|S|$) is the size of keyword lists $S_1$ through $S_k$.


**4.0 - Future Work**

The system could be enhanced further by adding auto-complete to the keyword search. This could be accomplished by using a query log to find the most frequently used queries and to determine which of them are closest to what the clients have typed. Adding error tolerance would improve the system further. To achieve this, a dictionary can be built from the words that appear often in the data.

**For Example:**

 * { } => Clients input

__Auto-complete:__

 * Query-1: {Ac_} the method has to automatically finish the keyword search as {Acer}
 * Query-2: {Son_} the method has to automatically finish the keyword search as {Sony}
 * Query-3: {App_} the method has to automatically finish the keyword search as {Apple}
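One possible way to drive auto-complete from a query log is to rank the logged queries that start with the typed prefix by how often they were used. The log contents and the `autocomplete` function below are hypothetical and only sketch the idea.

```python
from collections import Counter

# Hypothetical query log; a real system would load past client queries.
QUERY_LOG = [
    "acer", "acer", "acer laptop", "sony", "sony",
    "sony tv", "apple", "acer", "account",
]

def autocomplete(prefix, log, limit=3):
    """Most frequently logged queries that start with the typed prefix."""
    prefix = prefix.lower()
    counts = Counter(q.lower() for q in log)
    candidates = [q for q in counts if q.startswith(prefix)]
    # Most frequent first; ties are broken alphabetically.
    candidates.sort(key=lambda q: (-counts[q], q))
    return candidates[:limit]

print(autocomplete("Ac", QUERY_LOG))   # ['acer', 'account', 'acer laptop']
print(autocomplete("Son", QUERY_LOG))  # ['sony', 'sony tv']
print(autocomplete("App", QUERY_LOG))  # ['apple']
```

The first suggestion in each case corresponds to the completions listed above: {Acer}, {Sony}, and {Apple}.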


__Error Tolerance:__

 * Query-1: {Acre} the error tolerance has to fix this input to the right keyword search as {Acer}
 * Query-2: {Sonoy} the error tolerance has to fix this input to the right keyword search as {Sony}
 * Query-3: {Appel} the error tolerance has to fix this input to the right keyword search as {Apple}
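Error tolerance could be sketched as follows: build a dictionary from the words that occur often in the data, then map a misspelled input to the closest dictionary word by edit distance. The sample data, the thresholds `min_count` and `max_distance`, and the function names are assumptions chosen for illustration. Levenshtein distance is used here as one common choice, not necessarily the one the system would adopt.

```python
from collections import Counter

def edit_distance(a, b):
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def build_dictionary(texts, min_count=2):
    """Words that appear at least min_count times in the data."""
    counts = Counter(w.lower() for t in texts for w in t.split())
    return {w for w, c in counts.items() if c >= min_count}

def correct(word, dictionary, max_distance=2):
    """Closest dictionary word, or None if nothing is close enough."""
    word = word.lower()
    if word in dictionary:
        return word
    # Sorting first makes the choice deterministic when distances tie.
    best = min(sorted(dictionary), key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_distance else None

# Hypothetical data from which the dictionary is built.
DATA = ["Acer laptop", "Acer monitor", "Sony TV",
        "Sony camera", "Apple phone", "Apple tablet"]
DICTIONARY = build_dictionary(DATA)

print(correct("Acre", DICTIONARY))   # acer
print(correct("Sonoy", DICTIONARY))  # sony
print(correct("Appel", DICTIONARY))  # apple
```

A swapped pair of letters such as {Acre} counts as two edits under plain Levenshtein distance, which is why `max_distance` is set to 2 in this sketch.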